A hard start to dashboard week. DS15 were set to web scrape ASOS’ website for data on their new items for both male and female clothing lines. (See the blog post!)
Today I learned a lot about web scraping. In particular, when you use an iterative macro to paginate through several pages, be aware early on that it MAY not return data cleanly from every page. There may be inconsistencies in the HTML that throw your workflow out; I fell foul of this today. Building your flow with this in mind from the start will help you build something dynamic and versatile.
Another very important, yet easy to overlook, aspect of this: make sure your iterative macro's loop output feeds data back into the macro input in a form it can digest. Field names and data structure must be the same.
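For anyone who thinks better in code, here is a rough Python sketch of that check. It is only an analogy, not part of the Alteryx macro itself: the function name and the example field names (`url`, `page`) are made up for illustration.

```python
def check_loop_schema(input_fields, loop_fields):
    """Compare the fields the macro input expects with the fields looped back.

    Returns a list of human-readable problems; an empty list means the
    loop output can be fed straight back into the input.
    """
    problems = []
    missing = [f for f in input_fields if f not in loop_fields]
    extra = [f for f in loop_fields if f not in input_fields]
    if missing:
        problems.append("missing from loop output: " + ", ".join(missing))
    if extra:
        problems.append("unexpected in loop output: " + ", ".join(extra))
    # Same names in a different order can still trip up a positional union.
    if not missing and not extra and list(input_fields) != list(loop_fields):
        problems.append("same fields, different order")
    return problems


# Hypothetical field names, purely for illustration.
print(check_loop_schema(["url", "page"], ["url", "page_number"]))
```

An empty list back means the two ends of the loop agree; anything else is worth fixing before running more iterations.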
The final major stumbling block: make sure your iteration mechanism is actually working. I was seeing iterations, but the data was not being unioned properly at the end. Check that you have configured everything properly, especially the inputs and outputs.
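The same idea can be sketched in Python as a plain loop. This is a minimal illustration under my own assumptions, not how the macro is built: `fetch_page` is a hypothetical function that returns the rows for one page (and an empty list once the pages run out), and the fake site below stands in for real requests.

```python
def run_iterations(fetch_page, max_iterations=50):
    """Call fetch_page(1), fetch_page(2), ... and union the results.

    Also records how many rows each iteration produced, so it is easy to
    see whether iterations are running but returning nothing.
    """
    all_rows = []
    rows_per_iteration = []
    for page in range(1, max_iterations + 1):
        rows = fetch_page(page)
        rows_per_iteration.append(len(rows))
        if not rows:
            # An empty page is treated as the stop condition.
            break
        all_rows.extend(rows)
    return all_rows, rows_per_iteration


# Stand-in for a real site: two pages of products, then nothing.
FAKE_SITE = {1: [{"brand": "A"}, {"brand": "B"}], 2: [{"brand": "C"}]}


def fake_fetch(page):
    return FAKE_SITE.get(page, [])


rows, counts = run_iterations(fake_fetch)
# The unioned output should contain exactly what the iterations produced.
assert len(rows) == sum(counts)
```

Comparing the per-iteration counts with the size of the final output is a quick way to spot the "iterations but no data" problem.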
Here is my final iterative macro for scraping all pages of new male clothing from ASOS:
I used a mixture of text to columns and regex to bring in details of clothing brand, style, price and general info. Not all of these fields were returned correctly. This was essentially the main time sink for the day: checking my regex and scratching my head. “Why do I get more fields on later iterations?” “Why do I get seemingly different fields in the same column?” Eventually I ironed these problems out and developed a macro which worked for all pages.
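One way to guard against fields shifting between iterations is to make the parser always return the same set of fields, and to flag rows that do not match instead of letting them spill into other columns. Here is a small Python sketch of that idea. The product text layout (`brand | style | £price`) is invented for the example and is not ASOS's actual HTML.

```python
import re

FIELDS = ("brand", "style", "price")

# Hypothetical layout: "<brand> | <style> | £<price>"
PRODUCT_RE = re.compile(
    r"^(?P<brand>[^|]+?)\s*\|\s*(?P<style>[^|]+?)\s*\|\s*£(?P<price>\d+(?:\.\d{2})?)$"
)


def parse_product(text):
    """Parse one product line into a row with a fixed set of fields.

    Rows that do not match keep every field as None and are marked
    parsed=False, so the column layout never changes between pages.
    """
    row = dict.fromkeys(FIELDS)
    row["raw"] = text
    match = PRODUCT_RE.match(text.strip())
    row["parsed"] = match is not None
    if match:
        row["brand"] = match.group("brand")
        row["style"] = match.group("style")
        row["price"] = float(match.group("price"))
    return row


good = parse_product("Nike | Running shorts | £25.00")
bad = parse_product("Sale banner with no price")
# Both rows have identical field names, whether or not the regex matched.
assert good.keys() == bad.keys()
```

Keeping the raw text alongside the parsed fields makes it easier to go back later and work out why a particular product failed.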
Once I’d made the male clothing macro, I simply switched the URL in the initial text input to the female version, which was, unsurprisingly, structured in the same way. It worked well.
I then brought these two macros together into a single workflow, where I union their outputs and write the result to a CSV file.
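As a rough Python equivalent of that last step, the sketch below unions two lists of rows and writes them out as CSV. The `gender` column and the sample rows are my own additions for illustration; the real workflow does this with Alteryx tools.

```python
import csv
import io


def union_rows(male_rows, female_rows):
    """Stack both sets of rows, tagging each with where it came from."""
    combined = []
    for gender, rows in (("male", male_rows), ("female", female_rows)):
        for row in rows:
            combined.append({"gender": gender, **row})
    return combined


def write_csv(rows, handle):
    """Write rows to an open text handle, one column per field seen."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in a blank when a row is missing a field.
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample rows; an in-memory buffer stands in for a real file.
male = [{"brand": "Nike", "price": 25.0}]
female = [{"brand": "Monki"}]
buffer = io.StringIO()
write_csv(union_rows(male, female), buffer)
print(buffer.getvalue())
```

To write an actual file, the buffer could be swapped for `open("some_name.csv", "w", newline="")`, where the file name is whatever suits the workflow.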
I am missing a couple of fields, and some data is in the wrong fields, on around 5% of the products. Not ideal. I’m sure this can be cleaned up; however, it will require taking the macro apart and sorting out the use of regex.
Below is the resulting dashboard. It’s quite cool to see that the male clothing is made up of a larger number of cheaper items, judging particularly by the histogram and the BANs. I look forward to presenting this tomorrow and getting onto the next challenge!
Thank you to Jonathan, Hesham, Tom and anyone I’ve missed for their help today!