"You're gonna need a bigger boat": Using Exasol and Tableau for a 550 million rows database

by Pablo Sáenz de Tejada

So we just finished week 6 in the Data School, which means a third of or training period. Actually I have a strange mix of feelings: On the one hand we are learning so much that it seems like we have been working with our coaches for 4 or 5 months, and on the other hand the mix of learning, preparing the projects and having a good time everyday makes it difficult to realize that we are close to the second half of the training.

Apart from our weekly project, which I will explain in more detail, this weeks was focused in three main topics:

  • Advanced Table Calcs and LOD Calcs with Craig Bloodworth.
  • Improving viz design skills with Laszlo Zsom.
  • Learning SQL with Robin Kennedy.

If you have not been able to see in detail the potential of the LOD Calcs introduced in Tableau 9.0, I recommend you this post: Top 15 LOD Expressions to understand this new feature and its possibilities.

For this weeks projects, we had to analyze GP prescriptions data in England during the last 5 years: more than 550 million rows of data! So, as the famous quote from Jaws says: You’re gonna need a bigger boat, and we have to use Exasol to start analyzing that big amount of data and optimize our first approach to the database. It is important to highlight that all this data refers to all prescriptions for all type of diseases, so after a first approach to the data, and searching for additional secondary information sources, I decided to focus my analysis on the prescription of products for Dementia. Why? Because as shown by the World Health Organization, Dementia will be the at the end of this year the fourth main death cause in high income countries, and the third one in 2030, causing this year more than 700.000 deaths in these countries and will be the death cause with higher along with lower respiratory infections.

In addition, focusing on a specific disease allowed me to have a more feasible amount of data considering the time available to the project without loosing data quality (even had more than 2 million prescriptions to analyze), focus on a very interesting topic and blend the information with other resources from the WHO and Eurostat about death causes projections in high income countries and death causes in UK.

With these considerations, I used Exasol to see if the data available for Dementia would be enough, filter the data I was interested in and analyze 4 main topics:

– Projection of death causes in high income countries by the WHO.
– Trends in number of deaths caused by Dementia in the UK and analysis by age.
– Total number of items prescribed for Dementia in England in the last years and the Net Ingredient Costs (NIC). With a deeper analysis by brand vs generics, by each concrete drugs and geographical analysis.
– Create a simple calculator to see if we will recover the investment made in a new Dementia treatment (and how much we will gain or lose) before the patent expires based on the investment to develop the product, the patent duration, product price and an average market share.



You can see the complete visualization in my Tableau Public Profile. Specially interesting is the impact of a patent expiration in the Net Ingredient Costs (see the case of Aricept, with a huge decline in NIC while the total number of items continues increasing after its patent expired. And also how some drugs are very geographically concentrated, specially brands compared to generics. It was a great experience to analyze a so huge dataset, I hope we have another project like this soon!