The Data School - Strolling with SQL: Solving Preppin' Data Challenges (part 6)

In this blog, I’ll be working through the Preppin’ Data challenge called “Is it the teacher or the student? Part 1”. Try it out here! I’ll be using the incredible SQL to solve this challenge!

This blog will consist of screenshots of my queries, outputs and descriptions of what the queries are doing.

Initially, this challenge appeared deceptively simple but ended up being quite the brain racker! This had to do with the need to unpivot one of my CTEs but I’ll say more about that later.

For this challenge, we need to join two tables, Student Information and Student Results. After that, we need to average the scores in the English, Psychology and Economics fields and group them by the Class field. Then we need to find the lowest average score across the three subjects and relate those scores to their respective classes as shown in the output table below. Overall, I used four CTEs to solve this challenge.

Let’s take a look at what the input tables are.

The output table is pictured below.

Here’s the query and output that represents the join between the two input tables. This is our first CTE which is called data.

Now we can build out the second CTE, averages, which calculates the average score across the English, Psychology and Economics fields and groups this result by the Class field.

To make my life easier later down the line, I rounded each of the average test scores using the ROUND function. One thing to note for this CTE is that it’s crucial that we also bring in the Class field and group by it. If not, we get the following supplemental output that is too aggregated.

In case the text in the above image is too small, they show an average score of 73.8, 62.3 and 55.5 for the subjects of English, Psychology and Economics. These values are not the ones we’re looking for. Instead, we are looking for the lowest average score. So, we need the Class field so that the result for the averages CTE is six rows instead of one (this will also be relevant later on).

Moving on, the below query and output shows us the min_averages CTE, which results in only one row with the correct values that we’re looking for.

As you can see, in contrast to the aforementioned supplemental output (what would happen if we didn’t group the averages CTE by the Class field), the values shown here of 68.2, 52.4 and 59.6 for the subjects of English, Economics and Psychology respectively are correct.

With that, we can now make the unpivoted_data CTE, which is just the min_averages CTE but unpivoted.

What the UNPIVOT function does is to transform the shape of the table. Like I mentioned earlier, the min_averages CTE has three fields and one row. The fields represent each of the subjects and the single row represents the score for each subject. However, through the UNPIVOT function, the unpivoted_data CTE now has two fields and three rows. The fields are Subject and Grade while the three rows represent each of the scores for the aforementioned subjects. In other words, we took the three fields from the min_averages CTE and transformed to be rows in the unpivoted_data CTE.

Finally, we can bring everything together with the code between lines 36 and 45 shown in the query below.

And voila, we were able to replicate the desired output for this challenge!

The reason why this challenge was difficult for me was because, at one point, I tried nesting two aggregate functions. It looked something like this MIN(AVG(english)) AS “min_avg_english_score” — but this is completely wrong and would return an error.

The second issue I was running into was properly setting up the UNPIVOT function as shown in the query for the unpivoted_data CTE. I had to read through the documentation of that function a few times, but I think, at least for this problem, I figured out what the syntax of the function amounts to in the output as shown above.

UNPIVOT(grade FOR subject IN ("lowest_avg_english_score", "lowest_avg_psychology_score", "lowest_avg_economics_score"))

“Grade” is the name of the field that will contain each of the lowest average scores.

“Subject” is the name of the field that contains aliased values for each of the scores. So, while the actual value of MIN(“avg_english_score”) is 68.2, its aliased value is “lowest_average_english score”.

One last thing that’s worth mentioning is the need for the WHERE statement. What it’s doing is limiting the result to only show the rows where grade values (which are lowest average scores across the three subjects) are equal to the average scores for the three subjects. To illustrate what this logic is actually doing, let’s take a look at a supplemental output which is returned when we remove the WHERE statement.

As you can in the output above, we are showing the lowest average scores for the three subjects across every class. The null values in the Subject field occur because of the logic of the CASE statement from lines 37 to 40 in the aforementioned query. In other words, when the average score for each of the subjects from the averages CTE is equal to the values in the Grade field from the unpivoted CTE (which represents the lowest average scores for the three subjects) — return the name of each of the subjects. The reason why the six classes are duplicated three times (for a total of 18 rows) is because there are three values in the Grade field. We know there are six classes because we use the Class field from the averages CTE.