Today was our second day of #Dashboardweek, and the task this time was to look at the historic results on the London Marathon website. Because the results go all the way back to 1981, Andy limited our search to only those entries that start with the first two letters of my surname: PA.
Whilst yesterday’s data prep focused on using APIs, today we had to do web scraping. The site posed two challenges: 1) a dropdown to select each year and 2) results split across multiple pages. The rest of the data, however, was nicely structured in a table.
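To give an idea of how the two challenges combine, here is a minimal Python sketch of the loop: for every year, keep requesting pages until one comes back empty. This is not the workflow I built. The names `scrape_all`, `fake_fetch` and `FAKE_SITE` are made up for illustration, and the fake fetch function stands in for a real request to the site.

```python
from typing import Callable, Iterator, List, Tuple


def scrape_all(
    years: List[int],
    fetch_page: Callable[[int, int], List[str]],
) -> Iterator[Tuple[int, int, str]]:
    """Yield (year, page, row) for every row, moving to the next year
    as soon as a page comes back empty."""
    for year in years:
        page = 1
        while True:
            rows = fetch_page(year, page)
            if not rows:  # empty page: no more results for this year
                break
            for row in rows:
                yield year, page, row
            page += 1


# Hypothetical stand-in for the website: (year, page) -> rows on that page.
FAKE_SITE = {
    (1981, 1): ["row a", "row b"],
    (1981, 2): ["row c"],
    (1982, 1): ["row d"],
}


def fake_fetch(year: int, page: int) -> List[str]:
    """Pretend HTTP call; a real version would request the page and return its rows."""
    return FAKE_SITE.get((year, page), [])


results = list(scrape_all([1981, 1982], fake_fetch))
```

Passing the fetch function in as a parameter keeps the paging logic separate from the request itself, so the loop can be checked without touching the real site.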
I must confess I struggled a bit at first, since it had been a few weeks since I last did any web scraping, but with a bit of revision and Gregg’s help I was able to get back on track without losing too much time. The workflow I created uses a similar structure to yesterday’s but adds a few Regex Parse and Tokenize steps to extract the different results.
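For anyone who prefers code, the tokenize-then-parse idea can be sketched in Python: one regular expression splits the table into rows, and a second one pulls the cells out of each row. The HTML snippet, the runner names and the three-column layout below are invented for the example and are not the real page.

```python
import re

# Invented sample; the real table has its own markup and columns.
SAMPLE_HTML = """
<table>
<tr><td>1</td><td>PALMER, A.</td><td>02:45:10</td></tr>
<tr><td>2</td><td>PATEL, B.</td><td>02:50:33</td></tr>
</table>
"""

ROW_RE = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)
CELL_RE = re.compile(r"<td>(.*?)</td>", re.DOTALL)


def parse_rows(html: str) -> list:
    """Tokenize the HTML into rows, then parse each row into named fields."""
    parsed = []
    for row_html in ROW_RE.findall(html):  # tokenize: one match per table row
        cells = CELL_RE.findall(row_html)  # parse: one match per cell
        if len(cells) != 3:  # skip rows that don't fit the assumed layout
            continue
        place, name, time = cells
        parsed.append(
            {"place": int(place), "name": name.strip(), "time": time.strip()}
        )
    return parsed


rows = parse_rows(SAMPLE_HTML)
```

Regular expressions are fine for a tidy, predictable table like this one; for messier pages an HTML parser would be a safer choice.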
Building the dashboard
Having finished the data prep, I started thinking about my dashboard. My first thought was to create a scatterplot that resembles a race, using a technique I had used during one of our client projects. To achieve this, I had to plot all the points and jitter them so that they spread out according to their time. I felt the chart needed more context, so I looked up the men’s winning time for each year and added it as a reference line. However, I noticed some of my points were going past the line; these came from the “wheelchair” race, so I filtered them out.
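The jitter idea is easy to show outside the dashboard too. In this rough Python sketch, the finishing time decides the position along the "race" axis and a random number spreads the points on the other axis so they don't sit on top of each other. The function names, the sample times and the fixed seed are my own assumptions for the example.

```python
import random


def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' finishing time into total seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def jitter_points(times: list, seed: int = 0) -> list:
    """Return (x, y) pairs: x is the finishing time in seconds,
    y is a random offset between 0 and 1 used only to spread the points."""
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    return [(to_seconds(t), rng.uniform(0.0, 1.0)) for t in times]


# Invented finishing times, just to have something to plot.
points = jitter_points(["02:45:10", "03:10:00", "04:05:30"])
```

The random axis carries no meaning, so it is usually best to hide its header and labels in the final chart.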
Because runners didn’t have a unique ID, I wasn’t able to track their performance across the different years. This also posed a problem because there were many runners with the same name in the same race. Therefore, I had to create a unique identifier by combining each name with the respective finishing place.
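As a small illustration of the identifier, here is how the same combination could look in Python. The records are invented, and prefixing the year is an extra assumption of mine: name plus place is enough within a single race, but the same name and place could reappear in another year.

```python
# Invented records: same name twice in one race, and again in another year.
records = [
    {"year": 1990, "name": "PALMER, A.", "place": 120},
    {"year": 1990, "name": "PALMER, A.", "place": 457},
    {"year": 1991, "name": "PALMER, A.", "place": 120},
]


def runner_key(record: dict, include_year: bool = True) -> str:
    """Build an identifier from name and finishing place,
    optionally prefixed with the race year (my assumption)."""
    parts = [record["name"].strip(), str(record["place"])]
    if include_year:
        parts.insert(0, str(record["year"]))
    return "|".join(parts)


keys = [runner_key(r) for r in records]
keys_without_year = [runner_key(r, include_year=False) for r in records]
```

With the year included all three records get a different key; without it, the 1990 and 1991 entries with place 120 end up sharing one.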
In the end, I decided to add some extra information about the average speed of the racers. I also included a small section with the fastest record, the most repeated name and the racer with the best finishing position. This was the result:
If you have any questions or comments, feel free to use the box below or contact me on Twitter @DiegoTParker