Today, DS22 were tasked with using Snowflake for the first time. We had to retrieve data from the World Bank’s API, parse it successfully, upload the data to Snowflake, and create a dashboard...all by 3pm.
In terms of data prep, working out how to configure the API was particularly challenging since there was an issue of pagination i.e. all the records I needed were split across several web-pages.
For my first API call, I chose to look at the amount of hospital beds per 1000 people for each country. No authentication was required, and the base URL looked something like this:
This was an example of finding different populations for different countries, however by changing the ‘SP.POP.TOTL’ to a different indicator code, I was able to retrieve a list of countries and their hospital bed statistics throughout the years.
To combat the pagination I did the following:
Each API call states the number of pages there are, so I parsed out the maximum number of pages using Regex.
I then generated rows from 1 to the maximum number and manually created the URL for each subsequent API call, which I then fed back into another download tool.
This then gave me a separate chunk of XML for each page of data, which I could then parse out using the XML parse tool:
I converted this flow into a macro and repeated it for different indicators i.e. Nurses/1000 people and Doctors/1000 people - I also joined on some GDP data.
Connecting to Snowflake
Connecting to snowflake required me to read a few blogs and get a bit of help from some colleagues but this blog is the most helpful: https://www.theinformationlab.co.uk/2019/08/23/reading-and-or-writing-to-snowflake-from-alteryx/
- To upload something to snowflake you essentially need to 1) create an account, create a database in snowflake 2) download the snowflake driver 3) create an ODBC DNS connection 4) output your data using the bulk upload option
Creating a ODBC DNS Connection
Go to your windows menu and select ODBC Data Sources, click Add > SnowflakeDSIIDriver
Fill in your username, password, the name of the database you just created, Schema should be set to PUBLIC, and warehouse should be set to COMPUTE_WH. Your server is the link you see in your search bar when you’re logged into snowflake on the web i.e. XXX.snowflakecomputing.com without the https://
Uploading as Bulk
In your output data tool, choose Data Sources and click Bulk underneath the Snowflake heading, you should then be prompted to set this up:
Just log in here as usual, and click run. The flow should then post the table to your Snowflake database.
I decided to look at the top 10 countries in terms of GDP and how they performed amongst each other in terms of the availability of hospital beds, with the ability to see insights for doctors, and nurses too: