Preppin' Data in Snowflake: Week 1, 2021

by Harry Beardon

This blog aims as a walkthrough of how to complete the Preppin' Data Week 1, 2021 challenge using SQL queries in Snowflake.

Here is the input:

We need to:

  1. Extract the year quarter from the Date field

2. Splitting the 'Store - Bike' column into two separate columns for Store and Bike

3. Rename incorrectly spelled values in the newly created Bike field

4. Add original fields needed in the output that don't need manipulating

5. Extract the day of month from the Date field

6.  Remove sample orders from the Orders field (Order ID 1-10)

to return the desired output which should look like so:

  1. Extracting the year quarter from the Date field

We need to start our SQL query with SELECT to specify which columns we want to include. On a new line, our first column we'll specify is Quarter(“Date”) which will return the year quarter for our "Date" column.  We can follow this by saying "as “Quarter”" to give this newly generated column a new alias known as Quarter.

2. (& 3.) Splitting the Store - Bike column into two separate columns  for Store and Bike (and renaming incorrectly spelled values)

To do this, we will use a Snowflake-specific funciton called SPLIT_PART. This function returns a column for each part of the split, specific by the user which part of the split they want to return. We will repeat this function twice to return two different parts of the split for Store and Bike.

Function 1.

SPLIT_PART("Store - Bike", '- ', 1)  as "Store" will return the first part of the split which is our store, so we can give this the alias “Store”

Function 2.

SPLIT_PART((REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(COLLATE("Store - Bike", ''),'R.*d','Road'),'Mo.*n','Mountain'),'G.*l','Gravel')), ' - ', 2) as "Bike"

Here I have chained REGEXP_REPLACE functions which will search for a specified pattern and replace them with your desired text. I have had to chain three of these to get the correct spelling of “Mountain” “Gravel” and “Road in one column. If they’re not chained they will each return their own column. We are then wrapping this in a SPLIT_PART function to only return the bike type, not the store by returning the second part of our SPLIT_PART. We can give this the alias “Bikes”

4. Adding original fields needed in the output

From there we need to bring in other fields that we don’t need to manipulate but need in our output, such as “Order ID”, “Customer Age” etc.

5. Extracting the day of month from the Date field

The output also has the day of the month extracted from the date field. To do this I wrapped the Date column in a DAY() function and gave this the alias “Day of Month”

Following this step, we should also produce a FROM query to tell Snowflake which table it is querying to produce our desired output. This should be the input data for Week 1 you uploaded to your snowflake warehouse.

6. Removing sample orders from the Orders field (Order ID 1-10)

The last step is to remove the first 10 Orders as these were test orders. To do this I wrote a WHERE clause specifying to return only Order ID's that are greater than 10.

That should give you the desired output in snowflake!

Here is the entire query:

And here is the results window in Snowflake: