(This blog post assumes you are familiar with the Download tool and web-scraping basics. If there is interest in an explanation of web scraping, please let me know and I’ll try to write a sensible blog on that as well.)
Hi everyone! In my latest post, I described how to schedule an Alteryx workflow so that the data used in Tableau is automatically updated. I figured this out during a project Andre set up for DS13. Another part of that project was actually getting the data from KNMI (the Royal Netherlands Meteorological Institute). Let’s take a look at what the website looks like:
After some scrolling, you’ll notice a table with some links. If you right-click on the web page, you can inspect it (at least in Chrome):
This will open the developer tools panel on the right-hand side, where you can inspect the underlying HTML of the website:
At the very top of the developer tools there’s a button that finds HTML code based on the mouse location: if you hover over an element on the page, it gets highlighted in the code on the right. Using this, we identified the underlying links, which turned out to point to zip files.
So how do you get all of those files loaded as a data source without downloading them manually? Well, let’s see.
Step 1. Download one of the files
First of all, you need to find out what the file looks like inside: what kind of data it holds, how it is stored, and what might be of interest. Simply click on one of the links; it should trigger the download of the underlying file. Then unzip this file and have a look inside.
In our case, it turned out to be a single csv file per folder. Both the folder and the file inside were named etmgeg_[station number]. From this I knew I needed the station numbers and the links.
You can also download one or two more just to make sure the format is the same.
Step 2. Get the links and other needed information
This requires a bit of data parsing in Alteryx. When you use the Download tool to get the underlying HTML, it is returned as a string (or at least, configure the Download tool to do so):
The parsing of that string depends on the website you are accessing. In my case, I used the RegEx tool and tokenized (split to rows) the parts of the HTML code that were between <tr> and </tr> using:
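As a side note, the same tokenize step can be sketched in Python. The HTML fragment below is made up for illustration; the real KNMI markup will differ:

```python
import re

# A made-up fragment standing in for the KNMI page's HTML table
html = (
    "<table>"
    "<tr><td>260</td><td><a href='etmgeg_260.zip'>Download</a></td></tr>"
    "<tr><td>310</td><td><a href='etmgeg_310.zip'>Download</a></td></tr>"
    "</table>"
)

# Tokenize: split the HTML into one row per <tr>...</tr> block,
# like the RegEx tool in Tokenize mode
rows = re.findall(r"<tr>(.*?)</tr>", html, flags=re.DOTALL)
print(rows)
```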
Then, I parsed the station number:
This resulted only in three-digit numbers. If my new station number field had a Null value, that row didn’t contain the data I needed, so I could discard it.
Finally, I parsed the zip URL:
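Again, purely as an illustration outside Alteryx, here is a hedged Python sketch of both parses, station number and zip URL, run on made-up row strings (the actual patterns in my workflow may differ):

```python
import re

# Made-up row strings, standing in for the tokenized <tr> contents
rows = [
    "<td>Header</td><td>Links</td>",  # a header row: no station number
    "<td>260</td><td><a href='etmgeg_260.zip'>Download</a></td>",
    "<td>310</td><td><a href='etmgeg_310.zip'>Download</a></td>",
]

parsed = []
for row in rows:
    # Station number: the first three-digit number in the row
    station_match = re.search(r"\b(\d{3})\b", row)
    station = station_match.group(1) if station_match else None  # Null in Alteryx terms
    # Zip URL: the href ending in .zip
    url_match = re.search(r"href='([^']*\.zip)'", row)
    url = url_match.group(1) if url_match else None
    parsed.append((station, url))
print(parsed)
```

Rows where both values come back as None are the ones you would filter out.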
OK, so now I have pretty much all I need from the website’s HTML.
Step 3. Download the zip files as temporary files
For this, I used the links parsed in Step 2 and a Download tool with its output configured to a temporary file:
The output of the Download tool now returns the filepath of the temporary file it saved the zip to. Instead of the normal DownloadData field, there is a DownloadTempFile field. This lets you build the exact filepath of the csv file within the downloaded zip file.
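To make the idea concrete, here is a small Python sketch of what the Download tool does in this configuration: fetch a link and save the result to a temporary file whose path you keep. To stay runnable offline, the sketch "downloads" from a file:// URL pointing at a sample zip it creates itself:

```python
import os
import tempfile
import urllib.request
import zipfile

# Build a small sample zip that stands in for one of the KNMI zip files,
# so the sketch works offline instead of hitting a real web link
src = os.path.join(tempfile.gettempdir(), "etmgeg_260_src.zip")
with zipfile.ZipFile(src, "w") as zf:
    zf.writestr("etmgeg_260.txt", "# station 260 daily data\n")

# "Download" it to a temporary file, like the Download tool with its
# output configured to a temporary file; the resulting path is what
# the DownloadTempFile field would hold
dest = os.path.join(tempfile.gettempdir(), "etmgeg_260_download.zip")
urllib.request.urlretrieve("file:" + urllib.request.pathname2url(src), dest)
print(dest)
```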
Step 4. Create filepaths for the files within downloaded zip files
For this, just use a Formula tool. In my case, the formula looked as follows:
[DownloadTempFile] + '|||etmgeg_' + toString([Station]) + '.txt'
DownloadTempFile is the location of the temporary file; '|||' specifies a location within that zip file (you can use the same syntax for accessing specific Excel sheets in Alteryx) and is followed by the name of the file inside that you want to access. In my case, it was etmgeg_[station number].txt.
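In other words, the formula is plain string concatenation. The equivalent in Python, with an example temp path made up for illustration:

```python
# Replicating the Alteryx formula in Python:
# [DownloadTempFile] + '|||etmgeg_' + toString([Station]) + '.txt'
download_temp_file = r"C:\Temp\x1a2b3c.zip"  # example value of DownloadTempFile
station = 260

full_path = download_temp_file + "|||etmgeg_" + str(station) + ".txt"
print(full_path)  # C:\Temp\x1a2b3c.zip|||etmgeg_260.txt
```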
Step 5. Dynamic Input
This tool is extremely helpful; you can also use it to load multiple Excel sheets, for example.
In the configuration, you need to set up the file template. The file you downloaded earlier comes in handy now: make sure the folder is unzipped and that you can easily access the file(s) inside.
In the configuration pane on the left, you should see something like this:
Hit Edit… in the ‘Input Data Source Template’ section. This will open a pop-up window like this:
At the top, you need to specify a file with the exact same format as the files you will be loading. If any files differ in their number of columns, they will be skipped entirely. The template file needs to be in an unzipped folder and easily accessible.
Then you can configure the options as normal. In our case, we had to start the data import at the 48th line to skip the headings preceding the data itself:
Hit OK and continue with the configuration. Make sure Read a List of Data Sources is checked. In the Field drop-down, choose the field with the previously created file path, and set the action to Change Entire File Path.
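Behind the scenes, for each path the Dynamic Input tool effectively opens the zip, picks the named file inside, and starts reading at the line you configured. A Python sketch of that idea, using a small made-up zip it builds itself:

```python
import io
import zipfile

# Build an in-memory zip standing in for one downloaded KNMI file:
# 47 lines of descriptive header, then the actual data
buf = io.BytesIO()
header = "".join(f"# header line {i}\n" for i in range(1, 48))
data = "STN,YYYYMMDD,TG\n260,20230101,52\n"
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("etmgeg_260.txt", header + data)

# Open the member inside the zip and drop the first 47 lines,
# like starting the data import at the 48th line
with zipfile.ZipFile(buf) as zf:
    with io.TextIOWrapper(zf.open("etmgeg_260.txt")) as f:
        lines = f.readlines()[47:]
print(lines[0])  # the column headings row
```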
That’s it! Now run your workflow to make sure everything works. You may need to make some corrections; it all depends on how the data is structured.
Just so you know what such a workflow may look like, here’s part of mine:
You could use this technique even when you’re not downloading the zip files: use the Directory tool to get all the file names and continue from Step 4 (creating the filepaths). I haven’t tried it that way myself, but it seems it should work.
Thanks for reading and let me know if you have any questions. Byeee!