The Journey from Prep to Desktop: Preppin’ Data 2020 Week 49 Challenge – First Week at the DS

by Simon Evans

Reflecting on a busy but thoroughly enjoyable first week in training at the DS, I thought I would write a blog on the first project DS22 group 2 were assigned with.  This week focused heavily on using Data Preparation Tools (Alteryx and Tableau Prep). We were tasked with choosing a challenge of our choice (either an Alteryx weekly challenge or a Preppin’ Data Challenge) and presenting this back to Coach Carl.

I decided to choose the Preppin’ Data Week 49 challenge from 2020, and used Tableau Prep as my tool. The challenge can be found here: 2020: Week 49 (preppindata.blogspot.com). The challenge involved preparing the data in order to rank NBA basketball teams in the 2018/19 regular season, and find a story within the data, such as Carl did here Tableau Public. I also created a viz in Tableau desktop which I will come to later!

The challenge in Tableau Prep was to get from this:

to this:

Part 1:

Input data from all tabs in the workbook

Rather than inputting all tabs from the Excel file individually, Prep offers a neat little trick where you can use a “wildcard” match. By leaving this Matching Pattern section blank, Prep will input all the data from each of the month sheets, in effect creating a “union” without having to put a union step in manually (as long as the column headers are the same in each month sheet).

Part 2: Determine who won each game

Before working out who won each game, there was quite a bit of cleaning to do. In this “cleaning step” Prep allows you to see how many changes you have made to the data, in this case I made 12, which included removing unwanted rows (As we were only after “regular” season data, there were rows in the April sheet which contained playoff games, so these could be removed), removing unwanted fields (7 in this case – things like the attendance wouldn’t help us answer the question at hand!), and I re-named a few fields so they would make  more sense to me (Home, Away Team). I also had to validate the points for each team using a google search to make sure these matched up – i.e. I wanted to make sure the score was Indiana 111 – 102 Detroit on 1st April 2019, so I could name these points fields accordingly.

Now I was in a position to work out who won each game by putting another clean step into the flow and using an IF Calculation

What I am telling Prep here, is to compare the home and away team points, and if the home points is greater than the away points, then give me the home team in a new column, but if not, give me the away team. This field can aptly be called “Game Winner”.

Now I had the winner of each game, I could insert another clean step and remove the Home and Away Team Points as these fields are now redundant.

Part 3: Work out the order of each team’s games

In order to work out the order of each team’s games, I was going to have to change the data shape (I would need both the Home team and Away team as a combined field. This can be done by inserting a pivot and selecting “Columns to Rows” for the Home and Away Team. Now I have my teams in one field and whether they played home or away becomes a field of its own also.

The next step was to create a game number per team field, in order to work out the order of the games they played chronologically by inserting a clean step and creating another calculated field.

This calculation is telling Prep, that for each team, order entries by date, and rank in ascending order (so that the earliest game would get a game number of 1, and the last game chronologically would get a game number of 82).

This step can be validated by confirming there are 82 unique values for the new game number per field calculation in this pane.

I could now insert another clean step and further clean the data by removing the home or away field as this was now redundant.

Part 4: Work out each team’s cumulative wins after each of their 82 games

In order to work out the cumulative wins for each team I would need the game winner field to be in numerical form (in order to sum the amount of wins), however currently we have the winner in a string form (i.e. the team written as text). Therefore, another IF calculation can be used that says IF the Team field matches the Game Winner field, then give me a “1”, or else give me a “0”.

The string field for game winner could now be removed, as we have the win in numerical form.

Actually calculating the cumulative wins for each team was particularly challenging, so I had to sound out Prep expert Tom Prowse for some help at this point (Thanks Tom!).

Because multi-row calculations cannot be performed outright in Prep, in order to create a running sum of wins, it was necessary to perform a self-join on the data, such that for each row it has the values from all prior rows.

Here the join clause, is keeping all game number rows from the right side of the join and only game numbers equal or less from the left.

The next stage was to aggregate the number of wins for each team using the aggregation step.

Part 5: Compare the teams to each other

- Rank 1 is the most wins by that game number

- Rank 30 is the least wins by that game number

- In event of a tie, rank the team alphabetically (A-Z)

Ranking is fairly straight forward in Prep compared to other multi row functions, so lastly it was straightforward enough to insert a ranking calculation to match the above criteria.

This also left me with my required output to take straight into Tableau desktop!

Part 6: Visualising this data in Tableau desktop

Because the data looked at ranks of teams over the course of the season, I thought it would be nice to visualize this as a “Bump chart”. When I built this out for all teams, this was the view:

The result of course is chaos! Therefore I had to think how I could look at a subset of this data to tell a story.

I played around with the idea of “team consistency”, i.e. which team has been most consistent throughout the season, therefore I had a look at the Standard Deviation of each teams rank (how much variation has there been around each teams rank over 82 games). The lower the standard deviation, the less variation there has been.

I thought of this as being a kind of consistency index and it might be interesting to compare the most consistent team (The Milwaukee Bucks) and the least consistent team (The Memphis Grizzlies), which lead to the following view:

I’m glad to say there was definitely a story there and a link to the viz is here

Preppin Data 2020 Week 49 Visualisation - Simon Evans | Tableau Public