For our last training day and as part of our dashboard week, we were expecting a nice and easy challenge to end on a good note but unfortunately we do not live in Alice in wonderlands world and the task of the day was difficult to say the least.
Description and Requirements: We were asked to retrieve data from a PDF file containing around 250 tables relating to Star wars movies polls. The polls group all the star wars movies and characters and asked their favourite movie or character to different demographics. For instance, age groups, political affiliation, gender or education. We needed to split into two groups, one for the movies and one for the characters. Then we had to clean the data and import them in tableau to create an insightful dashboard.
Challenges faced: After looking at the PDF we decided to split the teams from the get go and I was assigned to the movie analysis team. The main and obvious challenge was to get the data from the PDF file as we had never done that before. We used an online PDF to excel converter to get the tables but this methods didn’t come without its drawbacks. Indeed, the excel tables came into a strange format with multiple headers and sub-headers for the movie titles, the cells divided in a strange way and the number ‘1’ was imported as a symbol in all the excel sheets.
We first tried to clean the file in Prep by using the ‘clean with data interpreter’ option but we realized that it was probably not the way to go. Consequently we decided to take two different routes to clean and prepare the data in order to give us a escape plan if one of the method led to a deadlock. One route through a pre-cleaning in excel and a preparation in Alteryx and the other route using only Alteryx with Ollie’s PDF input tool to start the preparation.
I worked with Rahul on the first method with a preparation of the data that we initially got in excel. We first selected the tables needed for the movie analysis. The preparation took a while but eventually we were able to send a pretty cleaned into Alteryx for the final cleaning steps.
The Alteryx workflow was a challenge as well as surprisingly there were some discrepancies between the data that were in the excel sheets and what we had after importing the file in Alteryx. For instance the values in percent in the excel sheets appeared as decimal in Alteryx (ex: 10% -> 0.1), some parenthesis in excel became ‘-‘ and some values were concatenated in the process.
First we fixed these issues with a couple of multi-row formulas (for percentage and ‘-‘ signs) and used regex to parse the concatenated fields. Then we transposed the data to get our field names and values, we kept only the rows with the number of respondents and got rid of the percentage values and cross-tab to get our final cleaned table.
Dashboard creation: A problem that we encountered was that if we used all the movie tables that we had previously selected we ended up with more than 80 fields so we decided to used only one table to create our dashboard. We only kept fields relating to education, age, gender, political affiliation and ideology. At this point I had really limited time so I created a simple dashboard composed of 3 bar charts showing the favourite movie by age category, political affiliation and education level.
Final thoughts: This challenge wasn’t fun, Andy lied to me. I’m happy the dashboard week is over…