My task for our DS project this week was to use the Tableau Public API to create a dashboard of content for as many people that have been through the DS as possible (coaches included if possible). It’s fair to say I experienced a number of challenges along the way.
Firstly I found the tableau public API from here https://datavis.blog/2019/05/13/tableau-public-api/. This worked fine when using my own name. However I wanted more than just my name – I needed to find a way of getting the tableau public URL of all DSers. I decided to scrape the DS website in order to grab each DSers individual DS website profile page. From their individual pages, I knew I could grab their tableau public URL through a second web scrape.
However, my issues arrived at the web scraping. The DS website has been refreshed recently and currently some things are not available. Firstly I couldn’t grab the names or the DS URL for any DS’er. So I decided to grab a list of names from a tableau workbook I had found on our tableau server instead. I copied these into a simple excel document, tidied it up in Alteryx and also formatted the names so they were in the form of their DS URL page. From this point, I attempted my second web scrape – another issue! The tableau public profile link didn’t pull through into Alteryx. The LinkedIn and Twitter URLs did but not the Tableau public which was frustrating.
Plan C at this point was to re-format the names into a the Tableau Public API style URL so I could just grab each individual’s tableau public information like that. This worked but obviously names vary – some are double barreled and some might have accents, and some might not be written in the same way as they have their name on Tableau Public. This meant that some errors came out, which at this point I just decided to filter out. Another issue here is name duplication. For example, Simon Evans on our cohort didn’t pull through because a different Simon Evans who signed up in 2017 did. More errors that can’t easily be determined without going through each individually. Therefore my final product also had errors sadly.
Back to the API - the Tableau Public API is very useful and easy to use and looks like this:
https://public.tableau.com/profile/api/profile.name/workbooks?count=100&index=0. In this URL, you swap the profile.name with your tableau public name to pull through the information you want. The count is the number of dashboards it’ll return. At around 300, this seems to error so for a profile with more than 300 visualizations, you’ll need to set up a few URLs to grab. I set my count = 250. For my cohort of individuals, Andy Kriebel was the only one who surpassed 300 (he has 1041). What I would have done given more time, is created an iterative macro to pull through all his others but for the sake of time limits, I manually added a few extra rows for his data to ensure I could pull through all his visualisations.
To briefly run you through the workflow, to grab an API you require both the download tool and JSON parse tool in alteryx. The JSON parse tool needs to point at the DownloadData field and the rest is pretty much done for you. At this point I removed the errors I spoke about above (apart from the Name duplication issue). You can see below in the DownloadData fields where errors have occurred so I filtered those out.
Next, I used a text to columns tool to split out the number that precedes each field dimension. This will allow me to group on both the URL and the number when performing a cross tab to make each of the field names column headers. Having used the crosstab, I now have one row per visualisaton. The rest of my workflow involved a bit of cleaning up of field headers, and parsing out the name of each author so I had it as a separate field.
Outputting my data to a hyper file and bringing it into tableau allowed me to create the below!
A note that when you get into tableau, the date fields will be a mess – they are brought into alteryx as this huge number. The blog here helped me format calculations that magically turn the numbers into dates:
Calculation: date(dateadd('second', int(int([First Publish Date per Document])/1000), #1970-01-01#)).
Link to the tableau public workbook is here.