Day 2 of dashboard week – we were tasked to use the PDF results from a 2019 Diamond League Athletics meet
This was going to be a tough day as we have never parsed data from a PDF file and been told it is a ‘fun’ task! We were given a blog to follow and the hints below….
- Use lots of Regex
- There are alternating table structures.
I was tasked with extracting the results from the Diamond League in Stockholm. The requirements were to gather at least the Name/Nationality/Year of Birth/Result and not use a PDF connector.
First I had to make sure we had the R package tools installed for Alteryx. The R package pdftools can extract text from PDF. This allows us to include R code into our workflow.
My Alteryx Workflow –
There were quite a few tools involved to get the data in the format I wanted as each event had a different table structure. The first challenge was to be able to group each athlete to the event they competed in. To do this, I did a lot of filtering to get rid of excessive rows and then a multi-row formula tool. This assigned the event to each individual row, as seen below.
I then needed to ensure the data was all in the correct columns for each event. For example, some events there was a column ‘reaction time’ where as the others didn’t have this.
After some more cleansing I eventually got my output table! Success!
My Dashboard –
My final dashboard is quite simple again today! Unfortunately I found some issues with the format of the times in some events when I added my data to Tableau. It was a little too late in the day to go back into Alteryx to fix this but this is my dashboard from Day 2!