Web scraping is the process of extracting information from websites. Sounds simple...
It is important to point out that for scraping to be legal, the website needs to agree to share the data it has published. Check the FAQs, Terms of Use, Privacy Policy and Copyright notices. As it turns out, many websites do not want to share their data, so the simple task of finding a website with interesting data became slightly more difficult.
In the end, after a few failed attempts, I ended up choosing a Wikipedia entry with a table.
Steps for Web Scraping:
- Find a website - one that is happy to share its data
- Drag a Text Input onto the Canvas and paste the website's URL
- Drag and connect a Download Tool - point it at the URL field from the Text Input
- Drag a Select Tool - press Cache and Run
- Add a Data Cleansing Tool - get rid of all white space and tabs
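Outside Alteryx, the same pipeline can be sketched in Python using only the standard library. This is a minimal sketch, not the workflow from the post: the inline HTML snippet stands in for the page the Download Tool would fetch, and the `strip()` call plays the role of the Data Cleansing Tool.

```python
from html.parser import HTMLParser

# Stand-in for the page the Download Tool would fetch.
HTML = """
<table>
  <tr><th> Country </th><th> Population </th></tr>
  <tr><td>  Poland\t</td><td> 38,000,000 </td></tr>
</table>
"""

class TableParser(HTMLParser):
    """Collects the text of each table cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            # Data Cleansing step: strip white space and tabs.
            cleaned = data.strip()
            if cleaned:
                self._row.append(cleaned)

parser = TableParser()
parser.feed(HTML)
print(parser.rows)  # [['Country', 'Population'], ['Poland', '38,000,000']]
```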
Up to this point everything is fairly easy; however, now the real fun starts. We need to transform the data that was parsed from the Internet.
My initial approach was to use the Multi-Row Formula Tool to try to add headers to different parts of the parsed data. A few IF and even more ELSEIF statements later, it turned out that my approach was wrong. Some parts of the data had an identical structure but described different information - different columns.
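For readers unfamiliar with Alteryx, a Multi-Row Formula computes each row's value with reference to neighbouring rows, such as `[Row-1:Header]`. The header-tagging idea I was attempting can be sketched in Python; the data and the upper-case rule here are made up purely for illustration:

```python
# Hypothetical parsed rows: section headers in upper case, values below them.
rows = ["HEIGHTS", "180", "175", "WEIGHTS", "80", "72"]

# Multi-Row-Formula-style pass: each value row inherits the last header seen,
# mimicking IF IsUpper([Row]) THEN [Row] ELSE [Row-1:Header] ENDIF.
tagged, header = [], None
for row in rows:
    if row.isupper():   # the IF branch: this row starts a new section
        header = row
    else:               # the ELSEIF/ELSE branch: carry the header downward
        tagged.append((header, row))

print(tagged)
# [('HEIGHTS', '180'), ('HEIGHTS', '175'), ('WEIGHTS', '80'), ('WEIGHTS', '72')]
```

The approach breaks down exactly as described in the post: when two sections share an identical structure, no row-level rule can tell their columns apart.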
What I was supposed to do was explained by Robbin in his blog post - although even that didn't solve all my issues entirely, and if you check his solution, you will notice that it gets pretty complicated.
All my efforts weren't in vain, as I got plenty of practice with the Multi-Row Formula, but I didn't manage to scrape the website. The easy way to do it would be to use Google Sheets and the IMPORTHTML function - it can easily import a table from a Wikipedia page.
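In Google Sheets the formula takes a URL, the type of element to import ("table" or "list"), and the 1-based index of that element on the page. The URL below is a placeholder, since the post doesn't name the actual Wikipedia entry:

```
=IMPORTHTML("https://en.wikipedia.org/wiki/Example", "table", 1)
```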
