DWD3: Captain's Log

by Chris Meardon

9:30

Previous failures:

  1. Not finding a direction to take the project in quickly enough
  2. Spending time on things that weren't needed (this is basically 1.)
  3. Not writing a title for the blog
  4. Not adding a feature image to the blog
  5. Failing to embed the dashboard in the blog
  6. Spending the middle 5 hours of the day stressed and unhappy
  7. Failing to parse data correctly into categories (day 1, look at the SIC and NAIC codes)
  8. Getting over-ambitious (resulting in something rough around the edges in terms of typos, bad tooltips and poor labelling)
  9. Not listening to all of the presentations because I was writing this

Previous wins:

  1. I did some successful web-scraping quickly and efficiently (even if I couldn’t use it in the end)
  2. Created a fun dashboard (day 2)
  3. Used a dark map with bright green highlight colours which people can get behind (day 1)
  4. Accessed the API with relative ease
  5. Got a good idea what comes out of the API
  6. Used a font that isn’t a Tableau one!

First of all, let’s talk about titles. This blog is not a tutorial on how to do anything, nor is it a blog about the topic of UNESCO World Heritage Sites; it is a log of my thoughts on dashboard week day three. We are told to write a blog and this is the style I’m going for. It does not feel worth the time to write it in a way that is valuable to others, so I am not going to write it that way. Hence the not-all-that-descriptive title.

9:35 – Call/presentations with Andy over

9:55 – We have the challenge

Web scraping and PowerBI, using data from Wikipedia on UNESCO World Heritage Sites.

10:30 – I’ve got some juicy web scraping happening now

Currently I’ve got a flow creating a row in Alteryx for each row in the Wikipedia table. It is dynamic in the sense that it will adapt to any number of rows and columns in the Wikipedia table.

We take the URL in the Text Input tool:

https://en.wikipedia.org/wiki/World_Heritage_Sites_by_country

Then we download it with the Download tool, which brings back the HTML of the page.

Now we need to find the Wikipedia table and parse it out. Regex can be used to do this efficiently. The expression finds all of the content enclosed in table tags and, importantly, does this lazily so that we get each table separately. I used tokenize to split each of these tables onto a new row. Notice that at this stage I have gotten rid of anything non-tabular.

A few tables come back, so I filtered out anything that isn’t the right kind of table based on its CSS class.
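
The steps so far can be sketched in Python (my own illustration of what the regex is doing, not the actual workflow; the snippet of HTML is made up, not real Wikipedia markup):

```python
import re

# Tiny stand-in for the downloaded page HTML (the real flow downloads
# the World_Heritage_Sites_by_country page first).
html = """
<table class="infobox"><tr><td>not wanted</td></tr></table>
<table class="wikitable sortable"><tr><td>Albania</td><td>4</td></tr></table>
"""

# Lazy match: .*? stops at the FIRST closing tag, so each table is
# captured separately (this is what tokenizing achieves in Alteryx).
tables = re.findall(r"<table.*?</table>", html, flags=re.DOTALL)

# Keep only the tables whose class attribute marks them as data tables.
wikitables = [t for t in tables if re.search(r'class="[^"]*wikitable', t)]
print(len(tables), len(wikitables))  # 2 tables found, 1 kept
```

The lazy quantifier is the important part: a greedy `<table.*</table>` would swallow everything from the first opening tag to the last closing tag as one giant match.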

Then we can use some more regex to split each table into its rows.

10:45 – Having some regex trouble breaking up these rows into several columns

10:47 – Ah, don’t worry, I was just missing something

I have a third regex tool parsing out each of the original columns in the wikitable so that I have a column for each of them in Alteryx. I just grab everything between the td tags, as this is one cell in the table.
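
Roughly, in Python (again my own sketch, with made-up sample data rather than the real table):

```python
import re

# One scraped wikitable, shrunk down to two rows of sample data.
table = ('<table class="wikitable">'
         '<tr><td>Albania</td><td>4</td></tr>'
         '<tr><td>Algeria</td><td>7</td></tr>'
         '</table>')

# Split the table into its rows, then each row into its td cells --
# the same job the second and third regex tools do in the workflow.
rows = re.findall(r"<tr.*?</tr>", table, flags=re.DOTALL)
cells = [re.findall(r"<td.*?>(.*?)</td>", row, flags=re.DOTALL)
         for row in rows]
print(cells)  # [['Albania', '4'], ['Algeria', '7']]
```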

10:56 – Table headers

I have done something similar for the table headers, treating them separately, creating a new column for each one, and now I’ll make them the headers of the table in Alteryx.
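
Something like this, to sketch it (the header names are hypothetical, since I’m not copying the real table here):

```python
import re

# Sample wikitable with a header row followed by one data row.
table = ('<table class="wikitable">'
         '<tr><th>Country</th><th>Sites</th></tr>'
         '<tr><td>Albania</td><td>4</td></tr>'
         '</table>')

rows = re.findall(r"<tr.*?</tr>", table, flags=re.DOTALL)

# Headers live in th tags, data in td tags; zipping them together is
# the equivalent of promoting the first row to column names.
headers = re.findall(r"<th.*?>(.*?)</th>", rows[0], flags=re.DOTALL)
records = [dict(zip(headers,
                    re.findall(r"<td.*?>(.*?)</td>", row, flags=re.DOTALL)))
           for row in rows[1:]]
print(records)  # [{'Country': 'Albania', 'Sites': '4'}]
```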

10:58 – Oh wait a minute, what are we doing

Stay focused on the task, Chris.
Just having a quick read through the challenge blog post again to check what we are meant to be doing, because I have already forgotten.

All of the information about UNESCO World Heritage Sites can be found on Wikipedia. They will need to scrape the list of countries from this page plus scrape the information about the sites within each country from their respective links.

Once they have all of the data, they will need to create their dashboard in PowerBI. If they want, they can explore the data in Tableau, but their final viz needs to be in PBI.

Andy
  • So I have got all the information from that table on Wikipedia ✔
  • Scrape info from the links about each country
  • Visualise in PBI

Okay I’m on track.

11:22 – Getting the links for the second web scrape almost done
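
The link-grabbing step looks something like this in Python (sample anchors invented for illustration; the real hrefs come out of the first table’s cells):

```python
import re
from urllib.parse import urljoin

# Country links on Wikipedia are relative hrefs inside the table cells.
cell_html = ('<td><a href="/wiki/Albania">Albania</a></td>'
             '<td><a href="/wiki/Algeria">Algeria</a></td>')

# Pull out each href and resolve it against the site root to get a
# full URL the next download step can use.
base = "https://en.wikipedia.org"
links = [urljoin(base, href)
         for href in re.findall(r'href="([^"]+)"', cell_html)]
print(links[0])  # https://en.wikipedia.org/wiki/Albania
```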

12:29 – Had a fantastic lunch from Leon for £3.40 (50% off with an app)

Now back to creating a macro to download the information about each of the countries.
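
The macro boils down to "run the same scrape once per country link". A rough Python equivalent (my own sketch; `scrape_wikitables` and the pluggable `fetch` argument are inventions for illustration, not part of the actual workflow):

```python
import re
from urllib.request import urlopen

def scrape_wikitables(url, fetch=None):
    """Download one country's page and pull out its data tables --
    roughly what each run of the batch macro does."""
    # fetch is pluggable so the function can be exercised without
    # touching the network; the default downloads the page for real.
    fetch = fetch or (lambda u: urlopen(u).read().decode("utf-8"))
    html = fetch(url)
    tables = re.findall(r"<table.*?</table>", html, flags=re.DOTALL)
    return [t for t in tables if re.search(r'class="[^"]*wikitable', t)]

# Batch-macro pattern: the same logic applied to every country link.
# results = {url: scrape_wikitables(url) for url in country_links}
```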

13:07 – The team are looking stressed

There’s a lot of struggle with obtaining the ‘second’ table. The different countries use different formats and some have done odd things with their tables. I’m making progress, but it is kind of slow. Having to make something quickly in PBI is not something I am looking forward to.

13:26 – Noticed that some countries’ lists of sites are aggregated by region

For example, the African countries are all in one table but the UK has its own. This may not be an issue as I could remove duplicates later, but we will see.

13:33 – Yikes, this is hard to make dynamic enough to work for all of the tables.

I feel there is a decision to make: either don’t complete the task and ignore the second table, focusing instead on a good visualisation in PBI, or struggle through this macro, maybe finish the data side of the task, and then have less time for PBI.

Neither is a great option and I’m not sure which outcome the boss would prefer. I guess I could ask.

13:40 – Hold up

I re-read the challenge and it’s very clear that only once all the data is gathered should a dashboard be created. Hence I will struggle through this nightmare workflow.

14:11 – Things areeeeee bad

It is turning out to be very challenging to dynamically download all of the tables for each of the regions/countries. I will keep going until 3 and then give up. Two hours in PBI may not be enough time to make something fantastic, but I’ll try to make something.

14:32

Turns out there are lists of all the sites on each continent, which would be way easier to use than each country’s format, but this is outside the task. Maybe I should just change the task slightly and extract all of those? It’s quite late in the day though, so it may not be worth doing. Also, there are no latitudes and longitudes for the sites, which is sad, but these could be collected by following the link for each site.

14:45 – feature image?

Yeah, so I’ve made a feature image, which was fun. In other news, I think I will give up on Alteryx and make something in PBI.

This is what I have made so far:

  • A workflow to get all the countries in the first list dynamically (i.e. it works if new columns or rows are added). It ends by running the macro mentioned below.
  • A macro to go to a Wikipedia page and download all the content in its table. It doesn’t work that well, but I got pretty far with making it nice and dynamic for all the different table formats.

I’m basically binning the macro and going to make something with the data from the first workflow (and removing the macro from it).

15:13 – PowerBI and I do not understand each other

Our personalities are clashing and we are struggling to move past this. PBI wants clear direction which I just don’t have, but we will have to hold it together for the sake of the numbers.

My neck hurts
… again

15:27 – an analytical masterpiece

I have no idea what I’m doing in this alien software

16:04 – PBI makes a little more sense but not much

The wonderful Alice Haslett gave me some advice and ideas on what to do. I’m going to keep working on this and try to make it look better.

16:26 – It’s something

16:43 – It’s something with a title

It’s simple and it’s simply done

Let’s reflect

Did I overcome any previous failures today?

  1. Not finding a direction to take the project in quickly enough
    • Hmmm I suppose I didn’t do this today
  2. Spending time on things that weren't needed (this is basically 1.)
    • I don’t think I did this today
  3. Not writing a title of the blog
    • A* Success
  4. Not adding a feature image of the blog
    • Nailed it
  5. Failing to embed the dashboard in the blog
    • N/A it’s PowerBI
  6. Spending the middle 5 hours of the day stressed and unhappy
    • Today I was a bit happier, but I spent the last 2-3 hours unhappy and my neck started to hurt towards the end of the day.
  7. Failing to parse data correctly into categories (day 1, look at the SIC and NAIC codes)
    • I also had some parsing issues today, but I did far more successful parsing, with 14 or so regex tools.
  8. Getting over-ambitious (resulting in something rough around the edges in terms of typos, bad tooltips and poor labelling)
    • I don’t think I was over-ambitious today reeeealllly, but I did fail to finish something. It would have been hard to foresee the challenges though, and I don’t think it caused me many issues.

Today’s wins:

  1. Got some lovely regex written and working
  2. Scraped the web and felt comfortable doing it
  3. That’s about it
  4. Created a blog with more reflection in it

New failures:

  1. A dashboard that has very little analytics
  2. Could not perfect a macro for dynamically processing Wikipedia tables
  3. Hurt my neck, again
  4. Maybe could have done with more time to learn Power BI

Thanks for coming

Chris Meardon

Fri 27 Mar 2020

