Fuzzy Matching in Alteryx

The Fuzzy Match Tool in Alteryx can be a very powerful tool when you try to match two data streams but the matching condition is not always met hundred percent and therefore a join doesn’t work. With Fuzzy Matching it is possible to find matches that are similar to each other but don’t match exactly. This can be very helpful when you try to join two data streams with a metric like company names, addresses, etc. Especially when the data comes from different sources, the names of the same thing may differ.

Let’s take a look at an example:

Imagine you had two tables containing company names, one containing “apple” and the other one “apple inc.” and you want to join these together. We can quickly see that the same company is meant here, but a regular join would not work since the characters do not exactly match. With Fuzzy Matching, we can make this work.

Unlike the Join Tool, the Fuzzy Matching Tool can only use one data stream and therefore the data has to be unioned beforehand. It also always requires an ID, to identify the different records. If there no ID is present it has to be added with the Record ID Tool before unioning the data streams.

Figure 1: Joining similar names together using Fuzzy Matching. The three additional Join Tools are required since the Fuzzy Match Tool only outputs the IDs and not the Company Names.

The setup for the Fuzzy Matching Tool can be very tricky as you are able to modify the matching process in many different ways. But luckily Alteryx has some standard options to choose from, which are optimized for certain situations. In our case setting the matching style to “Company Names” is an easy option. Here Alteryx is using word frequency statistics of data about US companies to identify which words should match. If we are using non-American companies the matching results tend to be not that good.

The Merge/Perge Modes are similar, the Merge Mode just uses two different IDs instead of a single one. For our example, we can work with the Purge Mode that uses just one (unioned) ID. At the “Match Threshold” setting we can edit the matching accuracy that is required for the Tool to accept a match. The standard 80% is often a good fit, but you might want to modify it if you notice that you lose too much correct data or that you end up with too many wrong matches.

This is also the biggest problem when using the Fuzzy Matching Tool. You might get the wrong results. If a certain error margin is ok for your use case you just need to modify the match threshold so you are happy with the results. But if your data needs to be perfectly merged you have to manually check the output to control for errors. This can be quite annoying, but you are still much faster than trying to combine the data manually.

Often much data preparation is required to improve the Fuzzy Tool’s performance. Therefore removing punctuation, and abbreviations like “Inc.” or “AG” can be a good idea. Via trial and error, you can identify which method works best for your dataset.

With this short example of just one single company, we can see the benefits of working with Fuzzy Matching quite easily. Now imagine how powerful this tool is when you have a table with thousands of companies, all written differently.

Author:

Frederik Egervari

View Profile