A smart way to investigate your data in Alteryx

by Simona Loffredo

This week's project was really nice: I had the chance to learn more about Alteryx and its tools. I had to prepare a one-hour training module in Alteryx on “CREATING DISTRIBUTIONS”.

So… I started investigating this topic and ended up structuring the training session into 3 parts:

  1. Data investigation tools – FIRST INVESTIGATION
  2. Predictive Grouping – CREATE CLUSTERS (Benedetta already explained this in this blog post)
  3. Data investigation tools – INVESTIGATE YOUR RESULTS

But what are these data investigation tools? Actually, I had never heard about them before this week, and this is one of the reasons I really love Alteryx: every time I need to do something really specific, I end up discovering new tools and features, and it is amazing!

These tools are based on R and they are really useful for a first data investigation. Before a process or analysis takes place, users need to understand the details of their data.

Let’s look into some of these tools together:

data_investigation_1

  • Association Analysis: this tool generates a correlation matrix report (one static and one interactive), determining which fields in a database have a bivariate association with one another. It is possible to set a target field (if there is a particular field of interest). I would suggest investigating a wider range of fields first, in order to get a general overview of the correlations. In the example below, I just deselected Latitude and Longitude and used the Pearson product-moment correlation:

Association Analysis

Here is the interactive report:

matrix1

If you click on one of the small squares, it will update a scatter plot on the right side of the report, as below:

scatter_plot

Red means a strong positive linear correlation (y increases as x increases), while blue means a strong negative linear correlation (y decreases as x increases).
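For readers who like to see the numbers behind the coloured squares, here is a minimal sketch of a Pearson correlation matrix in plain Python. It is not what Alteryx runs internally (the tool is R-based); the function names `pearson` and `correlation_matrix` are my own, and the values are made up for illustration. Only the field names Mean_Age and Sum_Residents come from my dataset.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Correlation of every field with every other field (dict of dicts)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical values, for illustration only.
columns = {
    "Mean_Age": [30, 35, 40, 45],
    "Sum_Residents": [400, 300, 200, 100],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

In this toy data Sum_Residents falls in a straight line as Mean_Age rises, so that pair would be a dark blue square in the report, while the diagonal (each field against itself) is always 1.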

  • Distribution Analysis: this tool lets you fit one or more distributions to the input data and compare them. The comparison is based on statistical significance (p-values). I selected the Mean_Age field and all the available distributions, in order to see whether at least one of them fits it:

distribution_analysis

Below is the report:

distributionchi_square

If we look at the Chi-Square test, it is possible to deduce that the Normal, Lognormal and Gamma distributions fit the data fairly well.
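To give an idea of what a Chi-Square goodness-of-fit test does, here is a rough sketch in plain Python for the Normal case only. It is an assumption-laden illustration, not the tool's actual procedure: I fit the Normal by mean and standard deviation, use equal-probability bins, and compute the p-value with a small series expansion. The names `chi2_sf` and `normal_chi_square` are hypothetical, and the sample is synthetic.

```python
import math
from statistics import NormalDist, mean, stdev

def chi2_sf(x, dof):
    """Upper-tail probability P(X > x) of a chi-square distribution,
    using the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def normal_chi_square(values, n_bins=5):
    """Fit a Normal to the values and test the fit with a chi-square statistic."""
    fitted = NormalDist(mean(values), stdev(values))
    # Bin edges at quantiles of the fitted Normal, so every bin
    # expects the same number of records.
    edges = [fitted.inv_cdf(i / n_bins) for i in range(1, n_bins)]
    observed = [0] * n_bins
    for v in values:
        observed[sum(v > e for e in edges)] += 1
    expected = len(values) / n_bins
    stat = sum((o - expected) ** 2 / expected for o in observed)
    dof = n_bins - 1 - 2  # two parameters (mean, stdev) were estimated
    return stat, chi2_sf(stat, dof)

# Synthetic "ages": 50 evenly spaced quantiles of a Normal(40, 5).
ages = [NormalDist(40, 5).inv_cdf((i + 0.5) / 50) for i in range(50)]
stat, p_value = normal_chi_square(ages)
print(f"chi-square = {stat:.3f}, p-value = {p_value:.3f}")
```

A large p-value means the test found no evidence against the fitted distribution, which is how I read the report above: the distributions with the higher p-values are the plausible candidates.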

  • Histogram: shows the empirical distribution of a single numeric field by showing the frequencies of records falling in a set of contiguous value ranges. I selected the field Sum_Residents, broke the histogram into 10 parts (it is also possible to leave this on “auto”) and plotted a smoothed density curve:

histogram

Below is the view:

histo2
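The counting behind a histogram is simple enough to sketch in a few lines of Python. This is only an illustration of the idea with 10 equal-width bins: the `histogram` function is my own name, the resident counts are invented, and the smoothed density curve is left out.

```python
def histogram(values, n_bins=10):
    """Count how many values fall into each of n_bins equal-width ranges."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical, cannot build bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land one past the end, so clamp it
        # into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical Sum_Residents values, for illustration only.
residents = [12, 15, 18, 22, 25, 25, 31, 40, 47, 52, 58, 63, 70, 88, 112]
edges, counts = histogram(residents, n_bins=10)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {'#' * count}")
```

Changing `n_bins` is the equivalent of changing the number of parts in the tool's settings: fewer bins give a smoother but coarser picture, more bins show more detail but also more noise.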

All this information can be used to perform predictive analysis, such as predictive grouping (clustering). Below is my workflow:

wf clustering

Below are the settings of the K-Centroids Cluster Analysis tool (I selected all the fields apart from Latitude and Longitude and divided the data into 4 clusters; do not forget to check the report and to use the Append Cluster tool):

clustering2
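If you are curious about what a centroid-based clustering does under the hood, here is a bare-bones k-means sketch in plain Python. I am assuming the k-means variant here, which may not match the method selected in the tool's settings, and this is certainly not the tool's implementation. The function name `k_means`, the toy points and the choice of 2 clusters are all mine.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, n_iter=20, seed=0):
    """Very small k-means: returns one cluster label per point and the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct records
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids

# Two obvious groups of made-up records (two numeric fields each).
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

The returned `labels` list plays the role of the Cluster field that the Append Cluster tool adds to each record.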

After a process or analysis takes place, users need to investigate the results. I found some of the data investigation tools very useful.

As input data, I used the output of the previous predictive grouping workflow. I used a Select tool to change the type of the Cluster field to a string (it was a double before). Why? So that the data investigation tools can treat it as a categorical field!

investigations2

  • Plot of means: the tool takes a numeric or binary categorical field as a response field along with a categorical field and plots the mean of the response field for each of the categories of the categorical field. I selected Mean_Age as response field and Cluster as categorical field to group by as below:

plot of means

It is possible to observe how the mean age varies across the 4 different clusters:

plot of means 2
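The calculation behind a plot of means is just a group-by followed by an average. Here is a small sketch in plain Python, including the cast of Cluster from a number to a string that I did with the Select tool. The function name `mean_by_category` and the records are invented for illustration; only the field names come from my workflow.

```python
from collections import defaultdict

def mean_by_category(records, category_field, response_field):
    """Mean of the response field for each value of the categorical field."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for record in records:
        key = record[category_field]
        totals[key] += record[response_field]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Hypothetical records: Cluster arrives as a double, as in my workflow.
records = [
    {"Cluster": 1.0, "Mean_Age": 30.0},
    {"Cluster": 2.0, "Mean_Age": 45.0},
    {"Cluster": 1.0, "Mean_Age": 34.0},
    {"Cluster": 2.0, "Mean_Age": 47.0},
]

# Same idea as the Select tool step: turn the numeric label into a string
# so that it is treated as a category rather than a quantity.
for record in records:
    record["Cluster"] = str(int(record["Cluster"]))

means = mean_by_category(records, "Cluster", "Mean_Age")
print(means)
```

Each entry of `means` corresponds to one point in the plot: the category on the x axis and the mean of the response field on the y axis.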

  • Frequency table: the tool helps users understand the contents of their data. For each field selected by the user, a frequency table is produced: a summary of the data with frequency counts and percentages for each value in the field. In my case, I wanted to look at the Cluster field:

frequency table1

Below is the interactive report:

cluster
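A frequency table is also easy to reproduce by hand. The sketch below counts each value of a field and turns the counts into percentages. The function name `frequency_table` and the cluster labels are made up for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, percent) rows, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [
        (value, count, 100 * count / total)
        for value, count in counts.most_common()
    ]

# Hypothetical Cluster labels, one per record.
clusters = ["1", "2", "1", "3", "1", "2", "4", "1"]
table = frequency_table(clusters)
for value, count, percent in table:
    print(f"Cluster {value}: {count} records ({percent:.1f}%)")
```

This is a quick way to check whether the clusters are balanced or whether one of them has swallowed most of the records.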

I hope you enjoyed it.

A presto!