Tackling Chinese Pollution - A Data School Application

by Peter Gamble-Beresford

As the deadline for DS4 applications approaches, I thought it would be fitting to revisit my own Data School application and blog about the idea and how I ended up with the finished piece. In the first week of the Data School we did makeovers of our applications, so in the next blog post I’ll cover what changed from this version in the makeover.

The application process is quite simple (build a Tableau Public viz and send it in), but it’s advisable to spend a bit of time refining it to produce something really good; it is a job application after all. My application took me the best part of a weekend to complete, and most of that time was spent finding and cleaning up the data. Top tip: if you are short on time or ideas, use a ready-made dataset. There are tons here (under sample data sets), which should give you ideas and free up time to spend creating things in Tableau. It would have saved me a considerable amount of time (!) but I was pretty focused on one idea and wanted to make it work.

The Idea
I had decided on investigating air pollution reports in Chinese cities, as my girlfriend spent a while living in China, and she reported that the Chinese government tends to under-report air pollution statistics while an American report of air pollution is more trusted. I thought it would be interesting to see if the data supported this theory at all.

The Data
I was able to source the US Embassy’s Chinese pollution data directly from its website. Unfortunately, the Chinese data was not so simple to acquire. After a few hours of frustration I was able to use a VPN to access the Chinese Ministry of Environmental Protection’s website, and used import.io to scrape the data I needed from many pages of web data. Import.io isn’t designed to work over a VPN, so there were some issues that left my data set only about 70% complete. Rerunning the process got me another 10%, but I admit the final 20% of the data could only be collected by copying and pasting from the webpage into Excel!

Thankfully I now had two complete data sets, but they differed: the Chinese data was given as a daily Air Quality Index (AQI) score, while the US data gave the hourly absolute concentration of particulate pollutants in μg/m³. I found the formula that is used in the US to convert these concentrations into an hourly AQI, applied it in Excel, and then took a daily average to make the two datasets comparable. After many hours’ work I had only just managed to get the datasets looking similar, and had no idea if there was even a decent story to be told!
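For anyone who would rather script this step than do it in Excel, here is a minimal Python sketch of the same idea. It is not the spreadsheet I actually used, and it rests on a few assumptions: that the pollutant is PM2.5, that the conversion is the US EPA's piecewise-linear AQI formula with the PM2.5 breakpoint table that applied before the 2024 revision, and that rounding the reading to one decimal place is close enough (the official method truncates). The function names and the sample readings are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Assumed US EPA PM2.5 breakpoints (pre-2024 table):
# (conc_low, conc_high, aqi_low, aqi_high), concentrations in ug/m3.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(conc):
    """Convert one PM2.5 reading (ug/m3) to an AQI value by linear interpolation."""
    if conc < 0:
        raise ValueError("concentration cannot be negative")
    c = round(conc, 1)  # simplification: the official method truncates instead
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    return 500  # readings beyond the table are capped at the top of the scale


def daily_mean_aqi(readings):
    """Average hourly AQI values per calendar day.

    `readings` is an iterable of (datetime, concentration) pairs.
    """
    by_day = defaultdict(list)
    for timestamp, conc in readings:
        by_day[timestamp.date()].append(pm25_to_aqi(conc))
    return {day: round(mean(values)) for day, values in by_day.items()}


# Made-up sample: two hourly readings on the same day.
sample = [
    (datetime(2015, 1, 1, 0), 12.0),
    (datetime(2015, 1, 1, 1), 35.4),
]
print(daily_mean_aqi(sample))
```

Averaging the hourly AQI values (rather than averaging the concentrations first and converting once) mirrors the order described above; the two approaches can give slightly different daily figures because the conversion is not a single straight line.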

The Visualisation
As soon as I had the data in Tableau and started to build some basic visualisations, the story was clear. Over two years of data there are clear discrepancies between the two reports of pollution, and I now needed a good way to show it. Here it is (click here or on the image to see the interactive version):

Chinese Air Quality

I wanted the visualisation to be very interactive. The first chart lets the user check every data point, comparing each pollution measure for every day of data available, with the added context of the pollution level shown by the point colour.

The lower chart summarises all the data points to show the conclusion clearly: the Chinese pollution reports are almost always below the American reports. Finally, the right-hand side has the practical elements: a filter which changes the charts for the selected city, and some shapes which use tooltips to explain the methodology, the data sources and some contact information too.

It’s not perfect! There is a null in the filter, the colours in the legend aren’t ordered properly, and tons of other things jump out at me as not great now, but as a first attempt these things don’t matter too much as they can be resolved after some feedback. Overall I was satisfied with my first Tableau Public visualisation, as I had been using Tableau in my previous job to do very basic things, so this was a big step up and I learned tons doing it. Moreover, I learned lots from revisiting this and re-doing it – more on that in the next blog post.

Finally, some advice if you are applying or considering applying:
1) If you are stuck for ideas, or you can’t get the data you need, use data that’s already available. There’s plenty to choose from and it will save you valuable time.
2) Ask for help and feedback! I left it too late before the deadline to fully take advantage of this – anyone at the Data School will be willing to help you!
3) Use Tableau Public to grab ideas. I downloaded about 100 workbooks during the whole application process. Though some may be super complex, many will help you learn something new, no matter how small, so get familiar with how other people have achieved certain things in Tableau and you’ll get ahead faster.

Feel free to get in touch! Good luck!