Understanding student dropouts through Sankey Charts

by Emily Chen

Note to readers: I’ve been asked to change a couple of visuals here so it might look a bit different from the previous version. I’ve replicated the method below with a dataset on student results from Harvard’s most popular online class – Introduction to Computer Science and Programming.

————————————————-

Last week I worked on revisualizing a dashboard showing a student cohort’s progression through university. The client, a higher education institution, was interested in finding answers to some big questions: Which students drop out? Why do they drop out? How can we prevent this?

Their current way of visualizing student progression was stacked bars with all cohorts in the view. The difficulty with this view is the lack of attribution as a student moves between different stages. A more insightful view is a Sankey chart. It’s not for the faint of heart: it breaks easily and doubles the size of your dataset and the number of marks, so I had lots of performance issues during the week. I’m hoping to provide a bit more context to Olivier Catherin’s fantastic article on building Sankey charts with polygons. If you’ve got a large dataset, I highly recommend this method.

Final Output as a Sankey chart with the Harvard Online Class dataset

Sankey Chart replacement

 

Step 1: Build a basic Sankey with the Superstore dataset.

I found Chris Love’s article extremely helpful in understanding the underlying mechanics of a Sankey diagram. TL;DR: You’ll need to duplicate your dataset and stack the two copies on top of each other, with an additional column that labels each copy ‘dummy’ or ‘real’ (arbitrary names), so the Sankey lines have a beginning and an end point. Then you’ll use binning to help plot the points along the curve, and the exponential function (e) to calculate and draw the Sankey’s signature curve.
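To make the curve idea concrete, here is a minimal Python sketch of an S-shaped path between a start and an end position. It assumes the curve is a logistic (sigmoid) function of t, and that t runs from -6 to 6 in steps of 0.25; both the range and the names `sigmoid`, `curve_points` and `T_VALUES` are my own illustrative choices, not taken from Chris’s workbook.

```python
import math

def sigmoid(t: float) -> float:
    """Logistic function: close to 0 for very negative t, close to 1 for very positive t."""
    return 1.0 / (1.0 + math.exp(-t))

def curve_points(start: float, end: float, t_values: list) -> list:
    """Return (t, y) pairs tracing an S-curve from `start` to `end`."""
    return [(t, start + (end - start) * sigmoid(t)) for t in t_values]

# Assumption: 49 evenly spaced t values from -6 to 6 (the "bins").
T_VALUES = [-6 + 0.25 * i for i in range(49)]

# One line that starts at vertical position 1.0 and ends at 4.0.
points = curve_points(start=1.0, end=4.0, t_values=T_VALUES)
```

Each t value becomes one mark on the line, which is why the number of marks grows so quickly with this method.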

If your dataset is rather small (maybe 5,000 records), then this method will suffice. However, with 28K unique students I needed to build the Sankey with polygons instead. What’s the difference? Much faster performance, as it calculates the min and max points of a stream instead of plotting individual marks as Chris’s method above does.
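As a rough illustration of the min/max idea, the sketch below builds one polygon per stream from its lower and upper edges, rather than one line per student. This is my own simplified reading of the polygon approach, not Olivier’s actual calculations; the function names and the t range are assumptions.

```python
import math

def sigmoid(t: float) -> float:
    """Logistic function used to shape the S-curve."""
    return 1.0 / (1.0 + math.exp(-t))

def flow_polygon(start_min, start_max, end_min, end_max, t_values):
    """Return polygon vertices for one stream.

    The lower edge is traced left to right and the upper edge right to
    left, so the vertices form a closed outline of the band.
    """
    lower = [(t, start_min + (end_min - start_min) * sigmoid(t)) for t in t_values]
    upper = [(t, start_max + (end_max - start_max) * sigmoid(t)) for t in t_values]
    return lower + upper[::-1]

# Assumption: the same 49 t values are shared by every stream.
t_values = [-6 + 0.25 * i for i in range(49)]

# A stream occupying 0..2 on the left pillar and 5..6 on the right pillar.
polygon = flow_polygon(0.0, 2.0, 5.0, 6.0, t_values)
```

The number of vertices depends only on the number of t values and streams, not on how many students sit inside each stream.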

Bare bones version of a Sankey by Chris Love

Sankey Diagrams - Chris Love's method

Step 2: Build a Sankey with the polygon method.

My final output was mostly based on Olivier Catherin’s article, with a slightly different approach to the data prep. I found that doing the union in Tableau’s custom SQL still gave me performance issues, so I ended up doing as much work as possible in Alteryx. The workflow gets a bit crazy, as I needed the data to be ‘short and wide’ rather than ‘long and tall’ so I could produce the viz.

Short and Wide:

Student ID   Year 1        Year 2        Year 3        Year 4
ABCD         Full Credit   Full Credit   Full Credit   Graduation
EFGH         Full Credit   Full Credit   Full Credit   Graduation

 

Long and Tall:

Student ID   Year     Status
ABCD         Year 1   Full Credit
ABCD         Year 2   Full Credit
ABCD         Year 3   Full Credit
ABCD         Year 4   Graduation
EFGH         Year 1   Full Credit
EFGH         Year 2   Full Credit
EFGH         Year 3   Full Credit
EFGH         Year 4   Graduation
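For readers without Alteryx, the reshaping from ‘long and tall’ to ‘short and wide’ can be sketched in plain Python using the two example students above. This is only an illustration of the pivot, not the Alteryx workflow itself; `pivot_wide` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# The 'long and tall' example rows: (student ID, year, status).
long_rows = [
    ("ABCD", "Year 1", "Full Credit"), ("ABCD", "Year 2", "Full Credit"),
    ("ABCD", "Year 3", "Full Credit"), ("ABCD", "Year 4", "Graduation"),
    ("EFGH", "Year 1", "Full Credit"), ("EFGH", "Year 2", "Full Credit"),
    ("EFGH", "Year 3", "Full Credit"), ("EFGH", "Year 4", "Graduation"),
]

def pivot_wide(rows):
    """Collapse one row per student-year into one dict per student."""
    wide = defaultdict(dict)
    for student_id, year, status in rows:
        wide[student_id][year] = status
    return dict(wide)

wide = pivot_wide(long_rows)

# With one row per student, both ends of a year-to-year path sit side by side.
flows_y3_y4 = Counter((row["Year 3"], row["Year 4"]) for row in wide.values())
```

Having both the start and end status on the same row is what makes it straightforward to count how many students follow each path between pillars.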

 

My Alteryx workflow: I’ve put the tools into containers with descriptions of the data prep for my handoff to the client. The first 3 containers describe pivoting the data to be ‘short and wide’, so I’m going to show you the last 2 steps, since these are the pieces that helped with performance in my final output.

Alteryx workflow and a description of the tool actions:

Alteryx Model MDX

 

Joining against the ‘Sankey Model’ in Alteryx.

The first box describes the dummy and real values I mentioned earlier. The second box joins my original dataset against the binning values (called t values) that help plot my points on the Sankey diagram. In Chris’s version, the values in this table are created in Tableau (see ‘Step 2: Densification in Bins’), but I’ve done it in Alteryx to ease the strain on Tableau. Since the t values are static, I’ve joined my data against the table that can be downloaded here from Data+Science (using the second sheet, called ‘Model’).
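Conceptually, this join pairs every data row with every t value. The Python sketch below shows that idea with a made-up stand-in for the ‘Model’ sheet; the real sheet’s columns and values are not reproduced here, and the field names and the 49 t values are assumptions for illustration only.

```python
from itertools import product

# Two example students in 'short and wide' form (field names are hypothetical).
students = [
    {"student_id": "ABCD", "year_1": "Full Credit", "year_2": "Full Credit"},
    {"student_id": "EFGH", "year_1": "Full Credit", "year_2": "Full Credit"},
]

# Stand-in for the static model table: one row per t value.
model = [{"t": -6 + 0.25 * i} for i in range(49)]

# Cross join: every student row is repeated once per t value.
densified = [{**student, **point} for student, point in product(students, model)]
```

The row count after the join is the number of data rows multiplied by the number of t values, which is why the dataset grows so quickly at this stage.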

Join against polygon table

Reducing the size of my dataset through random sampling

After the union and the join against the t-values table, I ended up with 5M+ records. Chris had the excellent suggestion to reduce my dataset through random sampling by keeping only students whose IDs ended in ‘0’. This way I get roughly 1/10th of the values, but we’re still able to see the overall behavior patterns in the Sankey.
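The sampling rule is simple enough to sketch in a few lines of Python. It assumes IDs end in a digit and that the last digit is unrelated to student outcomes, so the filter behaves like a random 10% sample; the zero-padded IDs and the name `sample_by_last_digit` are made up for the example.

```python
def sample_by_last_digit(rows, digit="0"):
    """Keep only rows whose student ID ends in the given digit."""
    return [row for row in rows if str(row["student_id"]).endswith(digit)]

# Hypothetical IDs '00000' to '00999' standing in for real student IDs.
rows = [{"student_id": f"{i:05d}"} for i in range(1000)]

sample = sample_by_last_digit(rows)
```

Because the filter works on the student ID, every record belonging to a sampled student is kept, so each student’s full path through the years stays intact.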

randomization

Step 3: Cracking open the Superstore Sankey by Olivier Catherin

Look at this magnificent Sankey chart! My favorite part is the separation between the categories at each pillar, a necessity given the multiple stages and layers. The separation within each pillar is one of the polygon dimensions on size, so it’s pretty easy to implement. Note that it’s nested table calculations that drive the separation in the curves/paths, so make sure you’ve got the addressing and partitioning in the correct order. The year-to-year curves (Year 1 to Year 2 vs Year 2 to Year 3) have the same order in the addressing, but it’s the partitioning that changes, which makes sense since each year-to-year curve has a different start and end year. Make sure you’ve got a ton of patience and a good friend to look it over for you. I had 6 years of pathing and was a bit cross-eyed by the end of it!


 

Overall, this chart was one of my most rewarding projects at the Data School. Not only was it an opportunity to learn how to implement probably one of the most technical visuals in the Tableau sphere, it was also the perfect way to address the client’s deliverable. If you’ve got any questions on how to build it, feel free to give me a ping 😀
