Alteryx Week Tip #1: Using Formula instead of Data Cleansing to clean your data

by William Guicheney

Hello, everyone!

As my teammates and I get our hands dirty with Alteryx for the first time, I thought this would be the perfect time to start a blog series dedicated to short tips you can use on the fly. I have to say I am brand new to Alteryx – I’ve spent more time on it today than I have ever before – so I’m not promising anything out of this world, but hopefully, the newbies out there will find it useful too! That being said, let’s get started.

The Problem

 

Today, we’re going to be looking at two tools: Data Cleansing and Formula. The Data Cleansing tool is absolutely amazing: you can throw pretty much any ugly dataset at it, full of blank spaces and random punctuation, and it will immediately remove them for you and pop out a beautiful set of clean numbers. After using it a couple of times today, however, I ran into two simple problems:

  • Data Cleansing is a very heavy tool that demands quite a bit of processing power, making it quite slow. I found that even simple workflows working with small datasets were grinding to a halt (a minute feels very long when you’re staring in silence at the computer screen).
  • Secondly, and almost more importantly, Data Cleansing is not a very fine tool. In the example we will be looking at today, you will see how using the Cleansing tool can sometimes remove punctuation that you might have wanted to keep – periods in decimal number for example – making your data useless by the end of the process.

Fear not, however, as you can easily get around these issues with a very simple tool: Formula.

The example we will be looking at in this case study is pretty straightforward. We will start off with a .csv file where all the data is stored in one column called Field_1, and progressively try and break it down into columns and rows in order to create a simple dataset that contains information on countries and their key development indicators. We will be focusing on the initial clean up part, which will play a key role in ensuring we obtain the correct dataset in the end.

Start by taking a look at the original data.

As you can see, all the data is currently stored in one column, Field_1. On the screenshot you can see the different variables we will be trying to extract: Country Name, Country Code, Indicator Name, Indicator Value. The main issue we will be dealing with is cleaning the data without removing all of the punctuation so that the decimal numbers inside the future Indicator Value column are not transformed into integers. If we were to run this workflow and use the Data Cleanser to remove the punctuation, our final output  – after separating the data into a series of rows and columns – would look like this.

Even though this dataset looks all clean, you will notice that all the punctuation in the Value column has been removed, even though many of these numbers should be decimals.

The Solution

An easy and quick fix for this issue, that will also speed up your workflow by a factor of 5-10, is the Formula tool. You can use it to remove all the unnecessary punctuation from your Field_1 column, but keep the punctuation you do need! All you’re going to want to do is replace the Data Cleanser tool with the Formula tool in your workflow and input the following command:

  • ReplaceChar ([Column Name], ‘ ” ‘, ‘ ‘)
  • This is simply going to tell Alteryx to remove all the quotation marks from the column you have specified. In my example, the column name is Field_1. Check the screenshot below to see how it looks!

If I run my workflow now with the Formula tool instead of the Data Cleanser, the final data output I obtain looks like this.

As you can see, the numbers in the Value column are not integers anymore, but decimals! I can finally export this dataset and see what I can learn from it using Tableau.

 

Thank you for tuning in, you can expect more great Alteryx tips coming soon!