NHS data – A lesson in Alteryx & the Pareto Principle

by Samuel Shurmer

So today was the first day of Dashboard week, the cohort-ly challenge week, where we are given a day to prep and make a dashboard, present it (from the previous day), and then write a blog about what we learnt (every day) – so here is my first account.

This morning we were quite excited to get going on dashboard week, Hanna even described it as “going on holiday”, how wrong she was (well for us, maybe she thinks differently). Andy set us these requirements (lifted from his blog which can be found here) –

“For day 1, the team will be using data about the NHS from data.gov.uk. Specifically, they will be looking at spending over £25,000 at NHS Trusts. Using this search, which results in 391 data sources, they will need to download and combine all of the data across the 20 pages of search results. Each NHS Trust has multiple files as well.”

However, this search tool was not all that it seemed. After many hours of attempting to understand why we were getting very different results to the actual desired amount (391 across 20 pages, we were getting more than 2000 off 64 pages, which shouldn’t be happening) we discovered that the download tool was causing confusion by “Encoding the URL” with a tick box that was adding characters, and causing the website to search all files, not just those 391.

After a quick lunch I was back looking at these problems and had finally managed to get the links of over 1500 different data sets, now to download them and start to build a dashboard. HOWEVER, this was not the case, many of these links were .pdf, .zip, & .html and therefore unusable. Eventually I narrowed it down to csv & xlsx files, which still had issues with many files missing or hiding behind expired security certificates causing a catastrophic failure of my download tool (it would have been better to use a batch macro that would have ignored these failures, in hindsight). In the end we agreed to look at just one NHS trust; I chose to look at Tameside & Glossop in Manchester. After quickly wildcard unioning my data, I was finally in Tableau.

With much of the data I attempted to look at trends that could be found.

Many holes here

Upon inspection I noticed that the Pareto Principle applied to this data and set about showing how similar the trust’s spending was to the 80/20 principle.

Looks like one to me

So, all in all, I managed to put together a decent sized dashboard in about an hour, but I learnt the dangers of not understanding certain elements of html encoding and how the download tool works. Hopefully, tomorrow I will get to spend more time creating an actual dashboard and not attempting to spend my day in Alteryx. Though they do say, 80% of our time is on Alteryx and 20% on Tableau. Maybe this Pareto guy was on to something…