Dashboard Week - Day 2 - Diamond League Athletics PDF Parsing

by Georgie Grgec

Day 2 of dashboard week – we were tasked to use the PDF results from a 2019 Diamond League Athletics meet

This was going to be a tough day as we have never parsed data from a PDF file and been told it is a ‘fun’ task! We were given a blog to follow and the hints below….

  1. Use lots of Regex
  2. There are alternating table structures.

I was tasked with extracting the results from the Diamond League in Stockholm. The requirements were to gather at least the Name/Nationality/Year of Birth/Result and not use a PDF connector.

First I had to make sure we had the R package tools installed for Alteryx. The R package pdftools can extract text from PDF. This allows us to include R code into our workflow.

Snippet of PDF Results table

My Alteryx Workflow

There were quite a few tools involved to get the data in the format I wanted as each event had a different table structure. The first challenge was to be able to group each athlete to the event they competed in. To do this, I did a lot of filtering to get rid of excessive rows and then a multi-row formula tool. This assigned the event to each individual row, as seen below.

I then needed to ensure the data was all in the correct columns for each event. For example, some events there was a column ‘reaction time’ where as the others didn’t have this.

After some more cleansing I eventually got my output table! Success!

My Dashboard –

My final dashboard is quite simple again today! Unfortunately I found some issues with the format of the times in some events when I added my data to Tableau. It was a little too late in the day to go back into Alteryx to fix this but this is my dashboard from Day 2!

Avatar

Georgie Grgec

Sun 08 Sep 2019

Wed 21 Aug 2019

Mon 05 Aug 2019