The Data School - Strolling with SQL: Solving Preppin' Data Challenges (part 4)

In this blog, I’ll be working through the Preppin’ Data challenge called New Customers. You can find it here. So far, this challenge has been the most fun! This challenge involves a lot of string manipulation, stacking several tables on top of each other and then changing the shape of the resulting table. Let’s get right into it!

This blog will consist of screenshots of my queries, outputs and descriptions of what the queries are doing.

Let’s take a look at what we have to work with. We are given a whopping 12 tables that we need to consolidate somehow. After that, we need to transform that table in order to create the relevant fields we’re looking for. Below is a picture of the input data showcasing a sample from January.

In our case, since we’re working with Snowflake, our tables aren’t expressed in the form of Excel sheets. No problem! To consolidate these tables, we will need to use the UNION function (and a lot of them!)

Here’s the SQL query that handles that. This CTE is aliased as data and has been referred to as the “unioned output” throughout this article.

As you can see above, both the SQL query and the result table have been condensed. Both this output and the desired output, which will be shown after this paragraph, have 990 records.

You may have noticed the Month_Part field. As the name suggests, I’ll be using it to create a date field. Let’s compare this output with the desired output which comes from the challenge post and is pictured below.

At least for me, the biggest difference between the two tables is that the original input has Demographic and Value fields whereas the desired output does not. When I alluded to the need to transform the shape of the data, this discrepancy was what I was referring to. In other words, the only way for us to get the Account Type, Date of Birth and Ethnicity fields is by transforming the unioned output (data CTE).

As for the Joining Date field, this can be constructed by concatenating the following elements Joining Day (where zero-padding is applied where appropriate, i.e. 01 or 09), Month Part and 2023.

With all of that said, let’s start diving into how the various fields were made! Let’s begin with the most straightforward one which is Joining Date. The SQL query is pictured below.

As you can see in the output above, the Joining Date field is probably the easiest field to make because it doesn’t require transforming the unioned output. This is because none of the necessary elements for the field is contained with the Demographic-Value field pair.

In order to make this field, I used the REGEXP_REPLACE and CONCAT functions. The string that I want to replace is the concatenation of Joining Day, /, Month_Part, /, 2023. At first blush, you might be wondering why I would need to change this concatenated string. Let’s look at a supplemental query that illustrates the earlier concatenation. This will make it obvious why the string needs to be modified.

The data of the Original_Joining Date represents the first clause of the REGEX_REPLACE function. In other words, that field is the original concatenation that needs to be modified further. This is what I was referring to earlier about how the Joining Day field has to be zero-padded. The second clause, which is the RegEx string pattern of ‘^[1-9]\/’, is responsible for capturing all of the single digit Joining Day values. The final clause, which is the replacement clause, is simply the logic that adds the zero-padding to the aforementioned captured values. This is handled by another CONCAT function.

Moving on, let’s take a look at the handling of the process of transforming the unioned output. Again, this is critical in order to properly construct the Account Type, Date of Birth and Ethnicity fields. Originally, I had tried using the PIVOT function to try to separate the elements from the Demographic field and align them with their respective elements from the Value field. Unfortunately, I couldn’t figure this out. So, instead, I decided to create three CTEs which represent each of those aforementioned fields. It is through those CTEs that I handle the logic of reshaping the unioned output.

To see this logic in action, let's dive into each CTE. First up is the dob CTE which handles the creation of the Date of Birth field.

As you can probably imagine, this was the most involved part of the challenge. This is for the following reasons:

1) day-values and month-values less than 10 are not zero-padded (the original Date of Birth field from the Demographic-Value field pair has dates like 6/6/1991, which does not conform to the desired output of 06/06/1991)

2) the original Date of Birth Field is formatted incorrectly. It is initially formatted as MM/DD/YYYY while the desired output format is DD/MM/YYYY

To reconcile these issues, I used these functions: CASE, CONCAT, REPLACE, TO_NUMBER and SUBSTR. Originally, I was trying to use some RegEx functions but I ran into a wall. Who knows, I might revisit this section of my SQL code and revise it to incorporate RegEx.

There are two CASE statements that are responsible for creating the properly cleaned and formatted Date of Birth field, which are:

1) the first CASE statement creates a dob_month field, which distinguishes between months prior to October (line 13 of the query) and months between October to December (line 15 of the query).

2) the second CASE statement creates a dob_day field, which distinguishes between day values that are below 10 and greater than 10 GIVEN the corresponding dob_month value of that row. This translates to four conditions:

a) when dob_month is between October and December AND the day-value is less than 10 — then dob_day is any day prior to the 10th of October, November and December — (line 20 of the query)

b) when dob_month is prior to October AND the day-value is greater than 10 — then dob_day is any day after the 10th of any month before October — (line 22 of the query)

c) when dob_month is prior to October AND the day-value is less than 10 — then dob_day is any day prior to the 10th of any month before October — (line 24 of the query)

d) when dob_month is between October and December AND the day-value is greater than or equal to 10 — then dob_day is on the 10th, or any day after, of October, November and December — (line 26 of the query) — Just a note: I made this conditional statement inclusive because I noticed that I was getting a null Date of Birth value because one of the days were structured like 10/10.

The year value (dob_year) is just taken from the Value element that corresponds to the Demographic element of ‘Date of Birth’. This explains why the WHERE statement is used. So, the year value is just taking the last four characters of that aforementioned Value element, which will always be a valid year value.

Finally, once all of the relevant date parts are created and formatted, the Date of Birth field can be created by concatenating said parts in the correct order.

Moving on, for the remaining two CTEs, which are for Ethnicity and Account Type, aliased as ethn and acc_type, respectively — their queries and results are essentially the same. I am just pulling the ID field, Value field and using a WHERE statement to properly align the chosen Demographic element with the desired Value element. In other words, the two CTEs are just working with data that corresponds to them.

The WHERE statement for each of the CTEs is critically important. If that statement doesn’t exist in each of the CTEs, then we will get a result like the following. To be clear, both results are based on the ethn CTE but everything I’ve said about this CTE applies also to the other CTEs as well.

The actual result we want, which includes the appropriate WHERE statement, is pictured below.

To summarize, I am able to “transform the shape” of the unioned output by properly aligning the Value elements to just correspond to the chosen Demographic elements. Doing so means that I’m creating the desired fields of Ethnicity, Account Type and Date of Birth without having to use the PIVOT function.

We are finally ready to join the data CTE (referred to as “unioned output”) with the dob, ethn and acc_type CTEs. After that, we can pull the specific columns of interest. With all of this, we have the last CTE, known as final_data, which matches up with the desired output from the challenge post.

Unfortunately, I cannot show the complete SQL query (it’s about 150 lines!), so here’s the SQL query for the final_data CTE and its output, instead. Also, I wasn’t sure how to order the result so that it perfectly matches with the desired output from the challenge post. So, I settled with just explicitly showing the results for each of the IDs. Everything seems to check out!

This was a phenomenal challenge because of all the moving parts. I think there are plenty of ways to optimize the query (reduce it from ~150 lines) and to utilize more REGEX functions to potentially help with reducing nested functions. Finally, I wonder if instead of making three CTEs (dob, ethn, acc_type), limiting them with WHERE statements, and then joining them with the unioned output (data CTE)– if it would be possible, easier and more performant to use a PIVOT function.

Let me know how you work through this challenge and how many lines your query is!