A Beginner's Guide to Web Scraping using Alteryx

I want to start this blog with the disclaimer that before yesterday I had pretty much no idea what web scraping was, so I certainly didn't know how to do it. When I say that this is a beginner's guide, I mean it! If you want an in-depth, technically rigorous guide to web scraping, this probably isn't the right blog for you.

What is web scraping?

Web scraping is a way to extract data from websites. Before learning how to do it, I'd always imagined the process to be complicated and intricate. After all, if it were really easy, wouldn't everyone be doing it, given the interesting insights you could find? I was definitely wrong. With Alteryx's help, the process is actually quite easy.

You might have seen something similar to the image below before. I definitely had, usually after accidentally clicking a button or pressing a keyboard shortcut by mistake, at which point a window appears on the right of the browser. In simple terms, you can think of this as the code that underlies the website you're interacting with. If you want to see a version of it for yourself, just right-click on the page now and click 'Inspect' at the bottom of the menu.

If this seems like gibberish to you then no worries, it is to me too!

If you want to go deeper still (which we do if we want to gain insights from web scraping), we can instead right-click and choose the option above 'Inspect', which says 'View page source'. The page source for the same website used in the image above (a website designed specifically as a demo for web scraping) looks something like this:

These are the first 48 lines of the source code, which for this website is actually 2240 lines long. Even without knowing the language (page source is mostly HTML), you can start intuitively picking up what certain lines are doing as you continue to explore the code.

So how do we do it?

To download this source code, Alteryx has a tool built for the job: the Download tool. By supplying the URL of the website we want to scrape in the Download tool's configuration, the source code is downloaded as a string when we run the workflow.
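For anyone curious what that step looks like outside Alteryx, here is a rough Python sketch of the same idea: give it a URL and get the page source back as one string. This is not how the Download tool works internally, just an equivalent using Python's standard library. The function name fetch_page_source and the example URL in the comment are my own placeholders, not anything from the workflow above.

```python
from urllib.request import urlopen


def fetch_page_source(url, timeout=10):
    """Download the page at `url` and return its source code as a string.

    `fetch_page_source` is a hypothetical helper name used for illustration.
    """
    with urlopen(url, timeout=timeout) as response:
        raw_bytes = response.read()
        # Use the character set the server reports; assume UTF-8 if none is given.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw_bytes.decode(charset, errors="replace")


# Example usage (placeholder URL; swap in the site you want to scrape):
# source = fetch_page_source("https://example.com")
# print(len(source.splitlines()), "lines of source code")
```

Just like the Download tool's output, the result is a single long string, which is why a second parsing step is needed before the data becomes useful.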

Once this download is done, congratulations, you have successfully scraped data from the web! If you want to actually gain further insight from the extracted data, that's where you can use RegEx to parse the desired segments from the code.
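To give a feel for what parsing with RegEx means, here is a small Python sketch. The HTML snippet, the class names "title" and "price", and the function name parse_items are all made up for illustration; a real page's source will look different, so the patterns would need adjusting to match it.

```python
import re

# A made-up fragment standing in for downloaded page source (an assumption,
# not taken from the demo website).
SAMPLE_HTML = """
<html>
  <body>
    <h3 class="title">First item</h3>
    <p class="price">10.00</p>
    <h3 class="title">Second item</h3>
    <p class="price">12.50</p>
  </body>
</html>
"""

# Each pattern captures only the text between a specific opening and closing tag.
TITLE_PATTERN = re.compile(r'<h3 class="title">(.*?)</h3>')
PRICE_PATTERN = re.compile(r'<p class="price">([0-9]+\.[0-9]{2})</p>')


def parse_items(html):
    """Return (title, price) pairs found in the HTML string."""
    titles = TITLE_PATTERN.findall(html)
    prices = [float(price) for price in PRICE_PATTERN.findall(html)]
    # Pair each title with the price in the same position.
    return list(zip(titles, prices))


items = parse_items(SAMPLE_HTML)
```

The idea carries over to Alteryx's RegEx tool: describe the text surrounding the value you want, and capture just the part in the brackets.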

Author:

James Charnley
