Today I had a humbling experience, trying to web-scrape just one table from a Wikipedia page. Although I eventually got the table in a format I could use, it has been close to three hours of bother. It has left me with a profound gratitude for APIs (faults and all), and any easily-available data sources.
Here's a basic list of how I failed, and what you should avoid in the future:
1. Choose any website you want
The choice of website should be based on the usefulness of the information, but that is not the only consideration. Whether web-scraping is permitted, how accessible the HTML is, and whether the data is actually stored in the HTML are all questions to ask when looking for a source. If the data is rendered by JavaScript, scraping the raw HTML will not yield it, and websites will often protect their data from scraping if the information is interesting.
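Checking whether scraping is permitted usually starts with the site's robots.txt. The post uses Alteryx, but the same check can be sketched in Python's standard library; the robots.txt content below is a made-up example, not any real site's policy:

```python
from urllib import robotparser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_text: str, url: str, agent: str = "*") -> bool:
    """Check whether a robots.txt policy permits fetching a URL."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)

print(is_allowed(ROBOTS_TXT, "https://example.com/wiki/Some_page"))  # True
print(is_allowed(ROBOTS_TXT, "https://example.com/private/data"))    # False
```

For real use you would fetch the site's actual robots.txt (e.g. `parser.set_url(...)` plus `parser.read()`) rather than pasting its text inline.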
2. Don't cache and run
If you are using Alteryx to web-scrape, caching your HTML download request can save you hundreds of calls to the website. Given that not every website is happy with web-scraping, and may have protections against DDOS attacks, this ensures that you do not lose access to the data mid-development. Just make sure the download succeeds, and then you can focus on parsing the HTML into a more useful format.
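In Alteryx this is the Cache and Run option on the Download tool's container; outside Alteryx, the same idea is a few lines of Python. This is a minimal sketch, and the `html_cache` directory name is just an assumption:

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen

CACHE_DIR = Path("html_cache")  # hypothetical cache location

def fetch_cached(url: str) -> str:
    """Fetch a URL, but reuse a local copy if one already exists,
    so development reruns never hit the website again."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = urlopen(url).read().decode("utf-8")
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Every parsing experiment after the first call then reads from disk, which is exactly the "do not lose access mid-development" insurance described above.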
3. Try to process all the data at once
As the Alteryx tools pile high, you may consider (several times) whether there is a simpler way to process everything in one handy Regex tool. This is unlikely to be true. HTML is complicated, and may have been written in stages by different people. As such, irregularities are part of the charm, and it should be considered time well spent to find different methods to retrieve all the data within a field.
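The alternative to one giant Regex tool is to walk the HTML structurally, one tag at a time. As a hedged sketch of that idea in Python's standard library (the table markup here is invented for the example):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect <td>/<th> text row by row, instead of attempting
    one regex over the whole page."""
    def __init__(self):
        super().__init__()
        self.rows = []       # completed rows
        self._row = []       # cells of the row in progress
        self._cell = []      # text fragments of the cell in progress
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

# Hypothetical table fragment.
html = "<table><tr><th>City</th><th>Pop</th></tr><tr><td>Oslo</td><td>700k</td></tr></table>"
parser = TableExtractor()
parser.feed(html)
print(parser.rows)  # [['City', 'Pop'], ['Oslo', '700k']]
```

Because each tag is handled separately, irregularities (a missing cell, a stray `<b>` inside a value) degrade gracefully instead of breaking one master expression.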
4. Never use a complicated Alteryx tool
In most cases, simplicity is the key to good data analysis. Occam's razor is a regular reference point for training complex models, with the belief that a simple solution that mostly answers a problem is better than a complex solution that totally answers the problem.
However, HTML is not simple to begin with, and using tools that you are less comfortable with can make processing it much simpler. Multi-Row Formula and RegEx are vital for extracting individual components from the HTML, while Cross Tab and Transpose ensure that the Multi-Row Formula can reference the right strings.
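The RegEx step above is about small, targeted expressions per field rather than one pattern for the page. A rough Python equivalent, with an invented cell value standing in for scraped markup:

```python
import re

# Hypothetical cell markup from a scraped table.
cell = '<td class="population"><b>697,010</b><sup>[3]</sup></td>'

# A small expression aimed at one component: the digits (and
# thousands separators) sitting directly between two tags.
number = re.search(r">([\d,]+)<", cell).group(1)
clean = int(number.replace(",", ""))
print(clean)  # 697010
```

Each field gets its own simple pattern, which is far easier to debug when one row of the table turns out to be formatted differently from the rest.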
Hopefully this helps you! Just remember that web-scraping is hard, and it's perfectly fine to ask for help.
