Alteryx WebScrapping - Groupon Top Activities #Alteryx

This blog outlines how to use webscrapping in Alteryx to find Groupon's top things to do. Below there is an overview of the workflow developed:

Alteryx Groupon Top Activities WebScrapping Workflow

Step 1: Decide how many pages from the website you want to analyse

In this first step, the URL of Groupon's top things to do webpage was added in a text input tool. A generate rows tool was used to create rows from 1 to 10 to represent the pages to look at. Both outputs were appended together and a formula tool was used to replace Groupon's base url with the corresponding page (e.g. for page 8, replace https://www.groupon.co.uk/browse/london?category=things-to-do&page=1 with https://www.groupon.co.uk/browse/london?category=things-to-do&page=8).

See outcome below:

Step 2: Download HTML

Use Alteryx download tool to download the HTML for each record. You do not need to configure this tool, but it is recommended to cache and run from this point to avoid refreshing the page every time that the workflow is run.

Note: You may not always get the entire HTML code if the webpage does not load the entire HTML code in time after the page is first refreshed. This might be due to JavaScript code in the back which triggers the HTML to load on the page after a certain event is initialised or called. Sometimes you may even have to adjust the throttle records in the connection tab (or use a throttle tool) to define the number of rows per minute that can pass to downstream tools (e.g. if the throttle is 60 rows per minute, at this rate, 10 records in this example will pass through the tool in about 10 seconds).

See outcome below:

Use the cell viewer in results or a Browse tool to view (or copy) the HTML code of the website

Step 3: Clean the data

Deselect the fields that you do not need and clean the data (in this case the HTML code from DownloadData field) to remove leading or trailing spaces, line breaks or nulls if any.

Step 4: RegEx and parse desired data values

In this example, since we want to visualise the top activities from Groupon's top things to do first 10 pages, we need to select and only display the data values that we are interested in. The first unlabelled RegEx tool is used to tokenize the data which contains these values of interest from the HTML code. The rest of the RegEx tools are used to parse these values of interest (e.g. activity, description, location, price and the link to the deal as shown in the image above).

Note: There might be more than one way of doing this step. Feel free to try different alternatives or improve your RegEx skills following this example.

See outcome below:

Outcome with desired information for each activity (after cleaning the data)

Step 5: Clean and output the data

Clean and output the data to obtain the results outline in the last image from the previous steps. Filter out the nulls, remove trailing and leading spaces, deselect undesired fields and lastly use an output tool to output the data in a desired format (e.g. csv, xls, hyper, etc.)

Author:

Jesus Esquivel Roman

View Profile