Alteryx: Name Standardisation, Removal of Irremovable Trailing Whitespaces and Calculation of Z-Scores.

by Harry Cooney

For the final day of our first week at the Data School we had to improve our application visualisations and be ready to present by 3 O’clock. Having spent the first week learning the basics of Alteryx our main task was to find an additional (messy) dataset to join with our original. My original visualisation (found here) looked at the history of the Premier League. It was broad but provided few obvious insights. Therefore, I wanted to focus on a specific question in the updated version. Nick suggested a comparison between data from actual league results and the video game FIFA and I was immediately struck with the memory of a heated childhood argument. Everton (the team I support) had the season before finished higher than Liverpool (our rivals). Yet when FIFA 06 came out Liverpool were rated better. Scandalous! Granted, they had won the Champions league that year (the most prestigious non-international footballing competition), but I was determined to settle the dispute once and for all. I had my specific question to answer but first had to join the two datasets.

I ran into three main challenges in Alteryx when trying to join the datasets. The first was a lack of consistent team names both between and within datasets. For example Tottenham being called either ‘Spurs,’ ‘Tottenham Hotspur,’ or ‘Tottenham.’ This first challenge was painstakingly overcome using the formula tool with a multitude of IF statements. However, I now know that there is a fuzzy match function which is better suited to this task.

The second challenge in Alteryx came when 4 teams in one season would not join with each other despite renaming via the IF statements. This was because of trailing whitespaces making the names inconsistent between datasets. IF statements including the space did not succeed in removing it and neither did the data cleansing function. In the end the following formula was used to eliminate the whitespace: Regex_replace([Team],'[^\x20-\x7E]’,”). This formula removed the ‘%20’ used in URL’s which must have been left over in the data.

The final challenge was finding a way to compare the premier league data to the FIFA data. Points acquired in the season compared to FIFA’s overall team rating score are different measures using different scales, so had to be normalised. This was achieved by finding the standard deviation from the mean for each metric, allowing comparison on an equal scale. To do this on Alteryx I used opened the Z-score macro found in the directory: “C:\Program Files\Alteryx\bin\RuntimeData\Macros\Predictive Tools\Supporting_Macros.” Following the steps in the macro, the data was transposed then summarised to give a mean and standard deviation for each metric. This summarised data was then joined back with the transposed data. The mean was then taken from each value before dividing the result by the standard deviation using the formula function. This was then cross tabbed to give the data back its original structure but now including standardised scores for actual league performance and FIFA team rating to be compared by.

The Alteryx issues were resolved so I outputted the data and built the following visualisation. It seemed my suspicions were wrong and that FIFA were in fact not biased, with Liverpool and Everton being overrated and underrated both roughly 50% of the time. However years as a bitter Evertonian could not permit me to stop there. These findings were the result of a single mean and standard deviation for each dataset. Perhaps it would be more accurate to get a mean and standard deviation on a season by season basis. This was achieved by summarising by year when getting the means and standard deviations during the calculation of Z-scores. The findings from this analysis were more promising, with Liverpool being overrated 67% of the time and having a sum standard deviation from the mean across all seasons much higher than Evertons.

So what does this research prove? Is there serious bias on the part of the designers of FIFA? Am I an overzealous bitter researcher with an agenda? I can only say one thing for sure; I now know how to use Alteryx to quickly match similar sounding names across datasets, remove seemingly undeletable whitespaces and normalise two separate metrics to allow for comparison, and I can use this blog for future reference in case I forget.