Importing a page code in Alteryx: Two methods

by Naledi Hollbruegge

Did you know you can import a website’s source code through a Text Input Tool?

I just finished an Alteryx project to download how MPs voted on various motions throughout the years. This extensive database is provided by www.publicwhip.org.uk in json format (see here) and you can also look at the data on www.theyworkforyou.com. This is a group of fantastic websites with really interesting data that they make available for free.

I used one of these URLs as a placeholder to figure out the structure of the workflow, parsing the json and getting it into a format that Tableau can process. The second step was then to get the other jsons as well, by finding a way to change out the number in the URL that defines the json (http://www.publicwhip.org.uk/data/popolo/363.json). There are over 100 URLs, so I didn’t want to copy them manually. The numbers also don’t follow a logical or sequential order and range from 363 to 6758. I had a look at the source code (ctrl+u) of the page where all these are listed  and figured if I can import that to Alteryx I could just strip out the json numbers and add them to the base URL.

Source code

Source code

I first tried copying the code (ctrl + c) and then pasting it to a Text Input tool (ctrl + v). This however just pasted everything to the first page break.

Paul Houghton gave me a few tips on how to do it properly. One solution can be to paste the text to sublime first, which will in some case deal with the line breaks.

An easier way is to use the paste option in the text input tool rather than ctrl + v. Now Alteryx displays the text correctly and splits it to multiple rows.

paste

Paste method

Another way is to just copy and paste the website URL. Then use a Download tool, which will get the entire source code. This is the approach that I took. I then used a RegEx tool to just extract the json numbers, as these appeared twice (…href=”811.json”>811.json=…) I did not just look for \d{3,4}.json but included the <a href=” as well, so my Regular Expression was (<a href=”\d{3,4}.json). With a Formula tool I then replaced the <a href=” part of the output with my initial input, which is just the base URL. The result is that I have 120 different URLs that I can now download and process. If new jsons are added to the website in the future this approach will get these automatically, which wouldn’t be the case if I had chosen to paste the data directly. However, in many cases the paste approach will be absolutely sufficient and might be better suited to your specific project.

source code

URL method