I recently used Import.io to scrape some data on US craft beer brewers. It’s a great tool for scraping data without having to know a single bit of code, so I thought I’d share it with you guys. Although this particular example does contain some data that Import.io is unable to extract, but we’ll look into that more later on in the post.
1. Copy and paste the site URL https://www.brewersassociation.org/statistics/by-state/ into the white box and click ‘Try it out’
Import.io automatically picks up the different elements of the underlying site code and turns it into a table of results. But you’ll notice that the headers are not that useful for identifying what the numbers mean. So we’ll need to clean those up.
2. Select the website view and click on one of the column names that need updating and see what part of the website is highlighted. You can then use this to update the column to something that will makes sense out of context.
3. Now all the columns are updated. Click done and you can run your extractor and select download as CSV.
***Wait there’s some data missing***
After consulting a number of people with more coding knowledge than me, it became evident that extracting the data from the source code would be fairly difficult and given that there are only 50 states to get the data for it would be quicker to type out the data manually. I’ll be adding this in at the end.
4. The data needs cleaning as we need to remove the word “(Ranks ” from the columns containing rank values.
I did this in Alteryx uisng the following workflow.
- First I input the CSV file that we have downloaded from our Import.io extractor.
- Then I have used the select tool to delete some columns I decided I didn’t need and renamed a column containing the names of the State to state
- Next I used the data cleansing tool on all the fields that contain the ‘(Rank ‘
- Setup the configuration to remove anything that wasn’t a number
- And then output the data as a Tableau Data Extract.
Don’t have Alteryx?
You can also do the same thing in Excel using the the Right function for each Rank column. Say column C contains your Rank for the number of breweries your formula would be =RIGHT(C2, 2) and you’d need to copy that down for all the rows of data.
Once you’ve done that you ready to go and get creative in Tableau!