Day 3 of dashboard week required us to web scrape data from Wikipedia to collect a list of UNESCO heritage sites and their associated countries. We were then required to visualise this data in Power BI. Easy right?
Of all the days of dashboard week, web scraping was the day i feared the most. While i find the concept of web scraping pretty cool, i am not the most confident person at regex and knew that this would be a difficult challenge. The web scraping was split into two sections – first to attain a list of the countries and their UNESCO heritage site counts, and then to individually get a list of the sites within each country (this data would be taken from the linked country Wiki page.
Andy helpfully reminded us that there had been many blogs posted regarding web scraping Wikipedia, and i found a very helpful one by Robin Vernooij that dealt with a similar topic. By following this, i managed to successfully attain the first table of data regarding the country list and sites. The flow for this is below:
To get the second part of the challenge, i first input the country names created from the above flow into a formula tool to dynamically update URL names (for each individual country). Then i tried to implement a revised version of the flow on the downloaded information from these URLS. However this is where i ran into trouble – i had site names mixed up into different columns, locations where my site names should be and a whole load more junk. See below:
I tried to fix this (using every tool OTHER than regex it seemed) by filtering, trying to clean the data and by moving data from null columns into its correct place. However, after a significant portion of time spent on this i would not have time to get a complete list that would be useful. In addition i had duplicates of UNESCO sites across multiple countries and had no idea why this had occurred.
Knowing i then had to take this data into PowerBi i decided to stop here and start my visualisation. My final flow looked as such:
For my visualisation, i knew i wanted to bring in additional data to add some more depth to the analysis. I decided to bring in tourism statistics and country size (size later scrapped) to see if there was any relationship to the number of UNESCO sites a country has. This data i cleaned and joined onto my main UNESCO dataset by country.
Using PowerBi, i didnt actually find it that awkward – there were things (like using parameters) that seemed to have no obvious clear solution, and due to time constraints there were certain things i couldn’t do. While visualizing the data i saw there appeared to be some correlation to higher tourism in countries with more UNESCO sites and so this is the story i went with. There were some interesting outliers like India and Russia. I realise that this apparent correlation may be skewed by the population size of a country, and if i had more time i would created a weighted tourism value to compare. My finished dashboard is below: