Preppin' Data in Snowflake: Week 2, 2021

by Harry Beardon

This blog aims as a walkthrough of how to complete the Preppin' Data Week 2, 2021 challenge using SQL queries in Snowflake.

Here is the input data:

The challenge requires two different outputs, so this post will be split up into two parts to address them both

Output 1.

Here is the desired output for Output 1:

We need to:

1.       Extract the Brand name from the Model column

2.       Bring in original fields needed in the output

3.       Find the total quantity sold for each Bike type and Brand

4.       Find the total order value for each bike type and brand

5.       Find the average Bike Value per Brand and Type

1. Extract the Brand name from the Model column

SELECT
REGEXP_REPLACE(REGEXP_REPLACE(COLLATE("Model", ''), '[0-9]', ''), '/','') as "Brand"

Here I have chained REGEXP_REPLACE functions which will search for a specified pattern and replace them with your desired text. I have had to chain two of these - one removes the numbers from our model field ([0-9]). The other removes the '/' which was left over after the first REGEXP_REPLACE. If they’re not chained they will each return their own column. We can follow this by giving our newly generated column an alias using the AS function - lets call this “Bikes.

2. Bring in original fields needed in the output

, "Bike Type"

We need to bring in other fields that we don’t need to manipulated but need in our output, in this case we only have one; "Bike Type"

3. Find the total quantity sold for each Bike type and Brand

, SUM("Quantity") as "Quanity Sold"

We need to return the total Quantity sold for each combination of bike type and brand. To do this we can wrap our Quantity field in the SUM() function and give this the alias "Quantity Sold"

4. Find the total order value for each bike type and brand

, SUM("Quantity"*"Value per Bike") as "Order Value"

This one can be easy to make mistakes on and explode the values in your resulting table. What we want to do is find the order value at the row level of our data, and then find the total number this totals to. To do this, we simply want to multiply our Quantity field by the Value per bike, which will allow the calculation to happen at the row level. To get the total sum from there we just wrap this entire calculation in one SUM() function. We can give this the alias "Order Value".

5. Find the average Bike Value per Brand and Type

, ROUND(AVG("Value per Bike"),1) as "Avg Bike Value Sold per Brand & Type"
FROM "HARRY B | WK2 PD 2021"
GROUP BY "Bike Type", "Brand"

As the granularity of our output table is at the Brand and Bike type level, this calculation is as simple as wrapping the Value per Bike field in an AVG() function. I have also then wrapped this calculation in a ROUND() function, specified to one decimal place to clean the number up a little.

Following this step, we should also produce a FROM query to tell Snowflake which table it is querying to produce our desired output. This should be the input data for Week 2 you uploaded to your snowflake warehouse. We also need a GROUP BY function afterwards, grouping on our fields that aren't aggregations - Bike Type and Brand

Here is the entire query for output 1:

Output 2.

Here is the desired output for Output 1:

We need to:

1.       Extract the Brand name from the Model column

2.       Bring in original fields needed in the output

3.       Find the total order value for each Brand and Store

4.       Find the total quantity sold for each Brand and Store

5.       Find the average days taken to ship from date of order

1. Extract the Brand name from the Model column

SELECT DISTINCT
REGEXP_REPLACE(REGEXP_REPLACE(COLLATE("Model", ''), '[0-9]', ''), '/','') as "Brand"

Repeat the step used in Output 1- but instead now use SELECT DISTINCT to only return unique results. We don’t want any duplication for store or brands.

2. Bring in original fields needed in the output

, "Store"

We need to bring in other fields that we don’t need to manipulated but need in our output, in this case we only have one; "Store"

3. Find the total order value for each Brand and Store

, SUM("Quantity"*"Value per Bike") as "Total Order Value"

This calculation is the same as our one in Output 1 for bike type and brand . As with that one, we want it to first happen at the row level of our original table before being summed.

4. Find the total quantity sold for each Brand and Store

, SUM("Quantity") as "Total Quanity Sold"

This calculation is also the same as the one used in Output 1 for total quantity.

5. Find the average days taken to ship from date of order

, ROUND(AVG(DATEDIFF(day, "Order Date", "Shipping Date")),1) as "Average Days to Ship"
FROM "HARRY B | WK2 PD 2021"
GROUP BY "Store", "Brand"

As with Output 1 - Following this step, we should also produce a FROM query to tell Snowflake which table it is querying to produce our desired output. This should be the input data for Week 2 you uploaded to your snowflake warehouse. We also need a GROUP BY function afterwards, grouping on our fields that aren't aggregations - Store and Brand in this case.

Here is the entire query for Output 2: