Today we had the opportunity to use RegEx once again. How we managed to live so long without knowing and using it is a question for another day. Suffice it to say that, for me, despite my aversion to cheese, the day was worth it for the data prep part alone.
We were briefed in the morning: get these data and do something with them. What follows is my journey.
Steps I followed:
- Go to the page, download the HTML and spend some time exploring it. These are my findings:
- The website has the option to list cheese names by country, alphabetically, etc.
- The alphabetical section contains a button for each initial letter. The code below shows a consistent pattern (that sounds promising).
- When you click on one of the letter buttons, you are taken to another page that lists N cheeses with their pictures (N can be set in the URL).
- Within each of the pages, this is how the individual cheese pages and names are displayed (and, below, my RegEx to parse them)
- Within each of the cheese pages, there is some useful information. We were mostly interested in the characteristics of each cheese and, luckily, they also followed a consistent pattern:
- So, to summarize:
- One landing page with a list of alphabetical links
- Within each of those alphabetical links, X pages with N cheese links each.
- Within each cheese page, Y characteristics.
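To make that structure concrete, here is a minimal Python sketch of the two parsing steps: pulling cheese links and names out of a listing page, and pulling "label: value" characteristics out of a cheese page. The sample HTML, the patterns and the function names are all assumptions made up for illustration; they are not the site's real markup, nor the RegEx I used in Alteryx.

```python
import re

# Hypothetical markup: a stand-in for a listing page, NOT the site's real HTML.
SAMPLE_LISTING = """
<div class="cheese-item"><h3><a href="/abbaye-de-belloc/">Abbaye de Belloc</a></h3></div>
<div class="cheese-item"><h3><a href="/ackawi/">Ackawi</a></h3></div>
"""

# Hypothetical markup: a stand-in for the characteristics list on a cheese page.
SAMPLE_CHEESE_PAGE = """
<ul class="summary-points">
<li><p>Country of origin: France</p></li>
<li><p>Texture: creamy and firm</p></li>
</ul>
"""

# Capture the relative link and the display name of each cheese.
CHEESE_LINK = re.compile(r'<h3><a href="(?P<url>/[^"]+/)">(?P<name>[^<]+)</a></h3>')

# Capture "Label: value" pairs from the characteristics list.
CHARACTERISTIC = re.compile(r'<li><p>(?P<label>[^:<]+):\s*(?P<value>[^<]+)</p></li>')


def parse_cheese_links(html):
    """Return (url, name) tuples for every cheese link found in a listing page."""
    return [(m.group("url"), m.group("name").strip()) for m in CHEESE_LINK.finditer(html)]


def parse_characteristics(html):
    """Return a {label: value} dict for every characteristic found in a cheese page."""
    return {
        m.group("label").strip(): m.group("value").strip()
        for m in CHARACTERISTIC.finditer(html)
    }


links = parse_cheese_links(SAMPLE_LISTING)
traits = parse_characteristics(SAMPLE_CHEESE_PAGE)
print(links)
print(traits)
```

The point is only that a consistent pattern in the page makes each step a single expression; the real patterns depend on whatever the downloaded HTML actually looks like.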
Alteryx & RegEx:
Armed with this information I went to Alteryx to…
- Obtain the list of cheese pages:
- For each individual letter (that immediately sounds like a BATCH macro to me), download the page and keep running WHILE there are pages left (that “while” should immediately make you think of an ITERATIVE macro!).
Warning! Remember to update the iteration limit and the URL (I was missing 400 records for the better part of the morning until I realized why, and I only noticed because @George mentioned it!)
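For readers who do not use Alteryx, the batch-plus-iterative idea can be sketched in Python as two nested loops. Everything here is hypothetical: `collect_pages`, `fake_fetch`, the `max_iterations` parameter and the "stop on an empty response" rule are assumptions chosen for illustration, and the fake site stands in for real downloads.

```python
from typing import Callable, Dict, List


def collect_pages(
    letters: List[str],
    fetch: Callable[[str, int], str],
    max_iterations: int = 100,
) -> Dict[str, List[str]]:
    """Outer loop plays the batch macro (one run per letter);
    inner loop plays the iterative macro (keep going while pages remain)."""
    pages: Dict[str, List[str]] = {}
    for letter in letters:
        pages[letter] = []
        page = 1
        # The cap mirrors an iteration limit: hit it and the loop stops silently.
        while page <= max_iterations:
            html = fetch(letter, page)
            if not html:  # assumption: an empty response means no pages are left
                break
            pages[letter].append(html)
            page += 1
    return pages


# Stand-in for a real download: letter "a" has 3 pages, letter "b" has 1.
FAKE_SITE = {"a": ["a1", "a2", "a3"], "b": ["b1"]}


def fake_fetch(letter: str, page: int) -> str:
    """Return the fake page content, or an empty string past the last page."""
    listing = FAKE_SITE.get(letter, [])
    return listing[page - 1] if page <= len(listing) else ""


full = collect_pages(["a", "b"], fake_fetch)
capped = collect_pages(["a", "b"], fake_fetch, max_iterations=2)
print(full)
print(capped)
```

With the cap set to 2, the third page of "a" never gets downloaded and nothing complains, which is the same kind of quiet record loss the warning above is about.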
Everything ready for…
All that effort to download cheese information. I detest cheese with all my might. Just thinking of spending hours analyzing information about textures, colours, aromas… dreadful! To save my sanity, I decided to sidestep all those characteristics and focus only on “Name” and country of origin… And here it is, my particular version of “Choose your own Cheese” 3*(RegEx+1):
When I say “By Character” I mean that. Literally.
Total RegEx = 8