DS23: First project and presentation

by Joselito Bondoc

After a couple of days of being introduced to data prep. and using Alteryx, we received our first project. Carl tasked us with:

  • finding a new dataset to supplement our original application to the Data School
  • sketch out the dataset structure we want to create
  • produce a list of changes required to obtain it
  • add the new data source to the original data set using Alteryx
  • enhancing our visualisation for bonus points

In our presentations, we needed to show our:

  • plan
  • Alteryx workflow
  • improved analysis

My original viz application

Figure a. The first part of my original viz. Tableau public link here.

My original application was on the Covid-19 outbreak, specifically focussing on South Korea. The dataset used was a snapshot of a Covid-19 dataset on 8th March 2020, which contained the running total of confirmed, recovered and deceased cases for different countries.

My ideas and supplementary data

In Figure a. we can see that South Korea had a steep increase in the number of confirmed cases near the end of February and had the highest number of cases by 8th March 2020 (excluding China). I wanted to find out how the trend of cases changed after this time period. For example, does South Korea still have the highest number of confirmed cases? I was also interested in finding out how vaccinations are affecting the recovery of countries from the virus.

The datasets

Plan and Alteryx workflow

Figure b. Desired dataset structure

Plan for Covid-19 cases dataset

  • convert date to Date data type
  • filter rows with null Countries
  • convert required data values to Double data type
  • aggregate (sum) data values at the Country and Date level (as some Countries had no data at the State/Province column)
  • validate data types

Plan for Covid-19 vaccinations dataset

  • convert date to Date data type
  • drop unnecessary columns
  • validate data types

A full join is then used using both datasets to achieve the structure shown in Figure b.

Figure c. Alteryx workflow

Issues encountered

  • Some rows in the vaccinations dataset had null values in the Daily vaccinations column, these rows were removed using the Filter tool
  • My Full Join wasn't working as I wanted initially, which was because the Country and Date columns of the two datasets did not have the same field names. This meant the Union tool was misaligning the columns when using auto config by name. This was fixed by manual configuration.

Improving the viz

Figure d. Improvements to my original viz. Top chart: Running total of confirmed cases for the majority of 2020. Bottom chart: Running total of people vaccinated.

Using the supplementary datasets, we can see that South Korea was able to control the spread of the virus more effectively throughout the rest of 2020, which we can see from the relatively flat line. By the end of 2020, South Korea no longer had the highest number of recorded confirmed cases. From the bottom chart (Figure d.), we can also see that South Korea started vaccinations around 2 months later compared to other countries.

What I learnt

  • How to do a Full Join in Alteryx and a better understanding of the Union tool
  • After receiving feedback from Carl and Andy: I learnt that, depending on context, using raw numbers (ie. actual number of cases) is the more correct choice for analysis, rather than post-processing those numbers (ie. % difference, which was what I had initially in the top chart of Figure d.)

What I can improve

  • Find out the average number of vaccinations done each day
  • How does this number relate to the country's population (how long would it take to vaccinate the whole country?)
  • Is there a trend between the number of confirmed cases and average number of daily vaccinations?

Thumbnail photo by Markus Spiske on Unsplash.