Introduction to RegEx in Alteryx

DS32 spent their first day of week 5 going through RegEx (regular expression) in Alteryx.

So what exactly is RegEx?

In short, a RegEx is a sequence of characters that a regular expression engine uses to match strings or pieces of strings in a text field. Most patterns use normal ASCII - including letters, digits, punctuation and other symbols on your keyboard like %#$@! - but it can also use unicode characters to match any type of international text. RegEx is particularly useful in extracting information from text such as code, log files, spreadsheets, or even documents.

This blog post aims to explain the four main uses of RegEx in Alteryx – Matching, Parsing, Replacing, and Tokenizing. You can choose which method you want to use by changing the output method in the configuration window (see below) of the RegEx tool.

RegEx Match

This method of the RegEx tool returns a TRUE or FALSE statement (a Boolean value) – it returns a TRUE if the searched text matches the regular expression’s pattern and a FALSE if it doesn’t. I will go through an example below – in this case we are looking for those names that start with an M and those that do not end with a space.

The regular expression will look like this:

^M.+\S$

Definitions:

^ = start anchor

M is a specific character

. = any single character

+ = one or more

\S = not space

$ = end anchor

With these definitions in mind, this expression states that the string should start with a capital letter M, followed by one or more ‘+’ single characters ‘.’ and end with no space ‘\S$’.

In the Alteryx window below, we can see that this expression returns the rows that start with an ‘M’ and do not end in a space.

RegEx Parse

This method is probably the most common way of using regular expressions. This method allows you to extract and separate strings from the original text input into new columns – it also allows you to set the Name, Type, and Size of the new columns in the Output Columns table.

In the example below, we want to extract the age number from the approximate age field and change this to an integer data type.

The regular expression will look like:

(\d+)

Definitions:

(…) = capture group

\d = single digit

+ = one or more

This expression captures any digits '\d' that are one or more '+'.

In the Alteryx window below, we can see that the regular expression has extracted the age number only. To change this to an integer data type you need to change the type in the output columns window.

RegEx Replace

This method allows you to replace an expression or part of an expression that you searched for within your regular expression and replace it with another string (replacement text).

In the example below, we want to replace ‘Miss’ and ‘Mrs’ with ‘Ms’.

The regular expression will look like:

Miss|Mrs

Definitions:

|= OR operator

Miss/Mrs are specific characters

This expression finds 'Miss' or 'Mrs' and replaces it with 'Ms'.

In the Alteryx window below, we can see that the regular expression has replaced ‘Miss’ and ‘Mrs’ with ‘Ms’.

RegEx Tokenize

This method splits the incoming data using a regular expression - Alteryx searches the string input for each instance of the RegEx, placing each into a separate column or row. This option is similar to the Text To Columns tool, except instead of matching and removing what you do not want, you match on what you want to keep.

In this example, we want to create three columns ‘Title’, ‘First Name’ and ‘Surname’. The regular expression will look like this:

[A-Z]\w+

Definitions:

[A-Z] = letter from A-Z

\w = alphanumeric and _

+ = one or more

This expression states that all substrings should start with a capital letter [A-Z], followed by one or more ‘+’ alphanumeric characters ‘\w’. This will not capture the quotation marks and will therefore skip the nicknames.

Using this configuration and selecting the number of columns to split to 3, creates the 3 new columns we need.

Further resources:

The following websites are useful for furthering your understanding of RegEx:

www.regexone.com - an introductory course on RegEx
www.regex101.com/www.regexr.com - both good online testers of RegEx

Author:

Imogen Emmett

View Profile