A Typical API Alteryx Workflow

by Philip Mannering

API stands for Application Programming Interface and is used ubiquitously to send and receive data. In practice, when using programmable APIs the request (or ‘call’) is simply a URL and the response is often in JSON format. Alteryx makes short work of obtaining data using a web API and parsing it into something useful. Furthermore, the workflow (sequence of tools) is very similar when using any API.

For example, the workflow in obtaining last month’s Earthquake data might look like this,

…and the workflow for finding aeroplanes looks like,

…and to search for movies,

As you can see, they all use roughly the same tools. The only difference is the order of tools (which is largely inconsequential and mostly comes down to preference) and a few tools at the end of the workflow used for data specific formatting, for example in changing headers or filtering out certain metadata.

Here is the generic tool-by-tool process involved in obtaining nicely formatted data. In this example I use The Open Movie Database API from http://omdbapi.com/?

  1. Text Input

The first (and often the most difficult) step is to copy and paste the correct URL that will be used to request for data. All API URLs differ and it is often necessary to ‘hack around’ for documentation and examples on how to use it. Moreover, some APIs require an API key which may or may not be free, but often requires some sort of sign up. I recommend using an API that is well documented if you are starting out. The omdbapi.com API is reasonably straight forward to use and many of the calls can be made without an API key.

For example, you can search by film title,

http://omdbapi.com/?t=bambi to return data on the movie Bambi (1942) in JSON format. You can quickly test this by copying and pasting the URL into a browser.

Or you can search by IMDb ID (the unique identifier for movies listed on IMDb),

http://omdbapi.com/?i=tt0137523 to return data on the movie Fight Club (1999).

And if you wanted the longer description of the movie and to return the result in XML format you could use,

http://omdbapi.com/?i=tt0137523&plot=full&r=XML

Notice how the root URL is always the same: ‘http://omdbapi.com/’ and all the name=value parameters come after a ‘?’ and are separated by a ‘&’. This is important. Some parameters are optional. In this case, the user can choose whether to include the plot length and format of the return. Omitting these will default on a short plot summary and JSON format.

Finally, you can use,

http://omdapi.com/?s=batman to search for all movie titles containing ‘Batman’ in the title.

Whichever URL we decide on will be simply copy and pasted into the first row of text input. Rename the field name ‘url’. The configuration for Text Input should look like,

This is still just a text string. Use the download tool to retrieve the data.

  1. Download

In this case, we only have the one field, ‘url’, which will be selected automatically. An API key can be incorporated in two ways. Either it can be included in the url field in the previous step (something like http://www.omdapi.com/?s=batman&api_key=dx2f39r298yh2r9) or can be included as a separate column with column header ‘api_key’ and value equal to your key. Then you need to tick the api_key box on the Payload tab as so,

After running the download we’ll have all the data. Except it will be crammed into one single cell. We can expand it out into a tabular format by using the JSON Parse tool.

  1. JSON Parse

The JSON Parse tool is straight forward to use. Just select the field that contains all the JSON data – this is the text with all the nested curly braces. Unless manually renamed, the field will be called DownloadData. Select ‘Output values into a single string field’ unless you want to separate out all the data types in to different columns. After running this we our data will be vertically aligned in one single column like,

There are often other columns of repeated data like the ‘url’ and ‘DownloadHeaders’, but we are more interested in these two columns shown above. From here on it simply a matter of formatting to get our data in a logical table format.

  1. Text to Columns

We can see that we have several useful fields all joined together in the ‘JSON_Name’ column. The easiest way to separate these is to use a text-to-column tool. Select JSON_Name as the field to split, use a full stop as the delimiter, and choose how many columns you wish to split to. This will be the maximum number of delimiters in a single row +1. For example, we would to split ‘Search.0.Title’, ‘Search.0.Year’, etc. into three columns, which then automatically get named JSON_Name1, JSON_Name2 and JSON_Name3 (unless specified otherwise in the configuration). Now we have columns for the data/metadata, groupings and headers.

  1. Filter

I’ve used a select tool to get rid of some of the non-essential fields.

Next filter out some of the metadata. Our JSON_ValueString may have a combination of the data results and the metadata such as the success or failure of the API call, or the number of results returned. Your metadata and data should be distinguished by one of the columns. In this case, JSON_Name1 is ‘Search’ for the data and something different for the metadata. So the Basic Filter is simply ‘JSON_Name1 Equals Search’ (or [JSON_Name]=’Search’ as a custom filter).

This filter is very useful as not only does it separate out the main data, but the metadata will be channelled out the F tab. This can be useful when keeping track of page numbers or total number of records when running iterative macros.

  1. Cross Tab

The crosstab is a crucial part in putting our data in a logical table format. At the moment we have,

And we want all the matching data to be in their own separate columns. The crosstab tool gives the user the option to select three different fields: Group Data by these Values, New Column Headers and Values for New Headers.

Sometimes I fill out the last two first, as they are easier to place. The ‘New Headers’ must be JSON_Name3. This should be clear because every row will have a Title, Year, imdbID, Type and Poster. These are the headers! The Values for New Headers are the entries such as ‘Batman Begins’, ‘2005’ and ‘tt0372784’. Again this should be clear because these are our data values. The Group by these Values can be a little more confusing. These are what determines the row. As shown in the image above, the zeros (coloured red) all contain data in the first row, the ones (coloured purple) all contain data for the second row, etc. If you want to keep other fields from disappearing through this process (like the url column) this can also be ticked in the ‘Group by these Values’. Alteryx will group on the most granular grouping from the fields that have been ticked.

After applying the cross tab we get,

Which has provided us with exactly what we wanted. Also note that the number of records has been reduced by a factor of five since that’s how many columns headers were originally stacked on top each other.

  1. Clean-up

Finally, the data just needs a bit of clean up before browsing or exporting. I like to get rid of some extraneous fields using the Select tool, use an Auto Field to automatically set datatypes and then sort by ID number or Year. A Data Cleansing tool could also be advantageous though most web APIs provide clean datasets.

8. Output Data

Finally export as a spreadsheet or a .tde file and you’re ready to viz it in Tableau!

Although some of these last steps might vary depending on your API, but broadly speaking all Alteryx workflows that acquire data via an API follow the same general process: 1. Set the URL, 2. Download, 3. Parse the received JSON, 4. Split up, 5. Filter out metadata, 6. Restructure to wide table, 7. Format for use and 8. Export.

 

Avatar

Philip Mannering

Tue 23 May 2017

Fri 21 Apr 2017

Wed 19 Apr 2017