Data investigation tool!

Today we looked at the data investigation tools in order to prepare for the Alteryx Advanced certification. Here are some short summaries of what we learned.

Field summary tool:

As the name suggests, this tool provides a concise summary of the selected columns. There are three outputs: the 'o' output gives you each column in your data as a row, with remarks to help you clean up the data; the 'r' output gives you a report of the distributions; and the 'i' output is an interactive version of that report. It is good for finding missing data in a column, and for spotting Boolean fields, since you can see which fields have only 2 distinct values. If the data is big, this tool gives you the ability to sample it.
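To make the idea concrete outside Alteryx, here is a minimal Python sketch of the kind of per-field summary described above. It is not how the Field Summary tool is implemented; the function name `summarize_fields`, the sample rows and the wording of the remarks are all made up for illustration.

```python
def summarize_fields(rows):
    """Return one summary dict per column: missing count, distinct count, remarks."""
    columns = list(rows[0].keys()) if rows else []
    summary = []
    for col in columns:
        values = [row.get(col) for row in rows]
        # Treat None and empty strings as missing (an assumption for this sketch).
        present = [v for v in values if v is not None and v != ""]
        distinct = set(present)
        remarks = []
        if len(present) < len(values):
            remarks.append("has missing values")
        if len(distinct) == 2:
            remarks.append("two distinct values, possible Boolean")
        summary.append({
            "field": col,
            "missing": len(values) - len(present),
            "distinct": len(distinct),
            "remarks": "; ".join(remarks),
        })
    return summary


# Hypothetical sample data: one row per house.
rows = [
    {"id": 1, "sold": "Y", "price": 250000},
    {"id": 2, "sold": "N", "price": None},
    {"id": 3, "sold": "Y", "price": 310000},
    {"id": 4, "sold": "N", "price": 275000},
]

for line in summarize_fields(rows):
    print(line)
```

Each column becomes one row of the summary, which mirrors the shape of the 'o' output: the `sold` field is flagged as a possible Boolean and `price` as having missing values.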

Frequency table tool:

This tool goes through every field and finds every unique value in it; in the output you get those values sorted by how often they appear in the data. It is a very slow tool, which is something to be careful of. I do not believe it is among the most commonly used tools, but it is also useful for cleaning your data.
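The same idea can be sketched in a few lines of Python using `collections.Counter`. This is only an illustration of what a frequency table contains, not the tool's own logic; `frequency_table` and the sample data are hypothetical names chosen for the example.

```python
from collections import Counter


def frequency_table(rows, field):
    """Return (value, count, percent) tuples for one field, most frequent first."""
    values = [row.get(field) for row in rows]
    counts = Counter(values)
    total = len(values)
    return [
        (value, count, round(100 * count / total, 1))
        for value, count in counts.most_common()
    ]


# Hypothetical sample data with a repeated category.
rows = [
    {"city": "London"},
    {"city": "Leeds"},
    {"city": "London"},
    {"city": "York"},
    {"city": "London"},
    {"city": "Leeds"},
]

for value, count, percent in frequency_table(rows, "city"):
    print(value, count, percent)
```

Running this per field on a wide dataset means a full pass over every column, which gives a feel for why the tool can be slow.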

Pearson and Spearman correlation tools:

These tools let you measure the correlation between two measures using two different methods. It is important to know that neither method will help you find non-monotonic relationships: you need a single trend, either rising or falling! The Pearson tool can also calculate covariance. Covariance depends on the units of the fields, so it tells you the direction of the relationship but makes its strength hard to judge. Pearson measures linear relationships, while Spearman works on ranks and can therefore deal with non-linear but monotonic relationships such as exponential ones, possibly giving you a fuller understanding of the relationships between fields.
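A small Python sketch can show these differences on toy data. It uses the textbook definitions (sample covariance, Pearson as normalised covariance, Spearman as Pearson on average ranks); the function names and data are invented for the example, and Alteryx may differ in details such as tie handling.

```python
import math


def covariance(x, y):
    """Sample covariance; its size depends on the units of x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def pearson(x, y):
    """Pearson correlation: covariance rescaled to the range -1..1."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))


x = list(range(1, 11))
exponential = [math.exp(v) for v in x]      # monotonic but not linear
parabola = [(v - 5.5) ** 2 for v in x]      # falls, then rises: non-monotonic

print("exponential:", pearson(x, exponential), spearman(x, exponential))
print("parabola:   ", pearson(x, parabola), spearman(x, parabola))
# Changing the units of y rescales the covariance but not the correlation.
scaled = [1000 * v for v in exponential]
print("covariance: ", covariance(x, exponential), covariance(x, scaled))
print("pearson:    ", pearson(x, exponential), pearson(x, scaled))
```

On the exponential data Spearman sees a perfect monotonic trend while Pearson is noticeably lower, and on the parabola both come out at roughly zero even though the two fields are clearly related.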

Association Analysis tool:

This tool is used in predictive analysis. You can select a target field, for example the price of a house, and see how strongly the other variables are correlated with it. You can use Pearson, Spearman or Hoeffding's D statistics; this gives you a correlation analysis against the target, including p-values for judging statistical significance. There's also an interactive report with a heatmap of the correlations between fields.
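As a rough illustration of "correlation against a target with a p-value", here is a standard-library Python sketch. It only covers Pearson, and it estimates the p-value with a permutation test, which is a stand-in chosen so the example needs no extra libraries and is not a claim about how Alteryx computes it. The house data and all function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of shuffles of y whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


def associate(fields, target):
    """Return (field, r, p) for each field, strongest association first."""
    results = []
    for name, values in fields.items():
        r = pearson(values, target)
        p = permutation_p_value(values, target)
        results.append((name, r, p))
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical house data: price is the target field.
price = [200, 250, 300, 350, 400, 450, 500, 550]
fields = {
    "area": [50, 62, 71, 85, 95, 108, 120, 131],
    "age": [40, 5, 33, 12, 50, 8, 27, 19],
}

for name, r, p in associate(fields, price):
    print(f"{name}: r={r:.3f}, p={p:.4f}")
```

A small p-value suggests the association with the target is unlikely to be due to chance alone, which is how the p-values in the tool's report are typically read.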

I believe the correlation and association analysis tools are the most useful, as they provide us with information we would not be able to obtain otherwise. To me, the summary and frequency tools are useful mainly for data exploration when you have a large number of fields.

Author:
Jules Claeys