DS25 Dashboard Week Project #3: Star Wars Statistics

by Jesus Esquivel Roman

The Project

The topic is the demographics of Star Wars fans. The source is a 268-page PDF from Morning Consult containing polling data about Star Wars, along with all kinds of demographics about the people who responded. There is a section for each movie and for each character.

Requirements

  1. Split into two groups: half will look at the movies, half will look at the characters
  2. Get the data from the PDF
  3. Prep the data
  4. Write a blog post
  5. Create something interesting to present

Steps

1. Identified a method to extract the data from the PDF and read it into Alteryx and Prep

2. Used an online converter (https://www.ilovepdf.com/) to convert the PDF to Excel

3. Downloaded and installed the Alteryx PDF Input tool

PDF Input tool

4. Explored how the Alteryx PDF Input tool works in order to build the workflow

5. Parsed and tokenized the useful data

Final Alteryx Workflow
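
For readers who want to see the idea behind the parse and tokenize step outside Alteryx, here is a minimal Python sketch. It is not the workflow used in this project. The sample row is made up to resemble a line from a polling table, and the names SAMPLE_ROW, PAIR_PATTERN and tokenize_row are hypothetical. The real rows in the PDF may be laid out differently, so the pattern would likely need adjusting.

```python
import re

# Hypothetical line of text, made up to resemble one row of a polling table.
# The real rows in the Morning Consult PDF may be laid out differently.
SAMPLE_ROW = "Gender: Male 45% (120) 30% (80) 25% (67) 267"

# A percentage followed by a count in parentheses, e.g. "45% (120)".
PAIR_PATTERN = re.compile(r"(\d+)%\s*\((\d+)\)")


def tokenize_row(row: str) -> dict:
    """Split one text row into a label and a list of (percent, count) pairs."""
    first = PAIR_PATTERN.search(row)
    if first is None:
        # No percentage/count pair found, so treat the row as a non-data row
        # (for example a header line) and return it with no values.
        return {"label": row.strip(), "values": []}
    # Everything before the first pair is treated as the row label.
    label = row[: first.start()].strip()
    values = [(int(pct), int(count)) for pct, count in PAIR_PATTERN.findall(row)]
    return {"label": label, "values": values}


parsed = tokenize_row(SAMPLE_ROW)
print(parsed["label"])
print(parsed["values"])
```

Treating everything before the first match as the label keeps the sketch independent of how many words the demographic label contains.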

Challenges

  • Downloading and installing the 'pdftools' Alteryx tool. I had to ask for help installing the files because they had been downloaded to a different path
  • Reading the PDF tables into Alteryx. The PDF Input tool returns single rows containing all the text from the PDF, as shown below
Output of PDF Input Tool
  • The headers differ between the PDF tables and in some cases are split across two separate rows, as shown in the image above, so it was challenging to adjust the headers dynamically
  • Parsing or tokenising data using RegEx
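
To illustrate the header challenge above, here is a small Python sketch of one way to merge a header that is split across two rows. It is only an illustration under assumptions, not how the Alteryx workflow handles it. The header cells are made up, the function name merge_header_rows is hypothetical, and the sketch assumes each row has already been split into cells.

```python
from itertools import zip_longest


def merge_header_rows(top, bottom):
    """Combine two header rows column by column into one list of names."""
    merged = []
    # zip_longest pads the shorter row with "" so no column is dropped.
    for upper, lower in zip_longest(top, bottom, fillvalue=""):
        parts = [part.strip() for part in (upper, lower) if part.strip()]
        merged.append(" ".join(parts))
    return merged


# Hypothetical header cells, made up for illustration only.
top_row = ["Demographic", "Very", "Somewhat", "Total N"]
bottom_row = ["", "favorable", "favorable", ""]

headers = merge_header_rows(top_row, bottom_row)
print(headers)
```

Because empty cells are skipped, the same function also works for tables whose header fits on a single row: passing an empty list as the second row returns the first row unchanged.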