Student's t-test is one of the most commonly used statistical tests for comparing the means of two data sets. Wikipedia has more detailed information, so we will not go into depth here.
Note: the assumptions of the test should always be validated; we will assume they all hold for this example.
The data set we will use is a dummy data set that I created. The numerical columns are generated by random sampling from Normal distributions with means of 1, 10 and 100 respectively, each with a standard deviation of 1.
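For anyone who wants to build a similar data set outside Alteryx, here is a minimal Python sketch of the sampling described above. The three product labels, the row count, the seed and the function name are all assumptions made for illustration; only the column names and the Normal parameters come from the description.

```python
import random


def make_dummy_rows(n_rows, seed=0):
    """Build rows with one categorical column and three numerical columns."""
    rng = random.Random(seed)
    # Hypothetical product labels; the real data set may have more or fewer.
    products = ["Product_1", "Product_2", "Product_3"]
    rows = []
    for _ in range(n_rows):
        rows.append({
            "Name": rng.choice(products),        # random label, so group sizes differ
            "numerical_1": rng.gauss(1, 1),      # Normal(mean=1, sd=1)
            "numerical_2": rng.gauss(10, 1),     # Normal(mean=10, sd=1)
            "numerical_3": rng.gauss(100, 1),    # Normal(mean=100, sd=1)
        })
    return rows


rows = make_dummy_rows(300)
print(rows[0])
```

Because each row picks its product at random, the groups end up with unequal sample sizes, much like real sampling.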
Let's have a quick look at the data set first. We will concentrate on the columns Name and numerical_1, and use the Summarize tool to check the mean and standard deviation of each product.
As the results table above shows, every product has a mean of about 1 and a standard deviation of about 1, as expected. The products also have different sample sizes, which is often the case in real-life sampling as well.
With the results known, we can look at how to use different tools for the analysis.
Alteryx has built-in tools for hypothesis testing (often macros that call out to R). In this blog we will demonstrate how to use them.
As the name suggests, the Test of Means tool in the Predictive tools performs the t-test for us.
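The same group-by summary can be sketched in plain Python. This is only an illustration of what the Summarize step computes: the helper name, the simulated rows and the product labels are assumptions, not part of the Alteryx workflow.

```python
import random
import statistics
from collections import defaultdict


def summarize_by_group(rows, group_key, value_key):
    """Return count, mean and sample standard deviation for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        name: {
            "count": len(values),
            "mean": statistics.mean(values),
            "sd": statistics.stdev(values),  # sample sd (n - 1 denominator)
        }
        for name, values in groups.items()
    }


# Simulated stand-in for the blog's data: Normal(1, 1) values, random labels.
rng = random.Random(42)
rows = [
    {"Name": rng.choice(["Product_1", "Product_2", "Product_3"]),
     "numerical_1": rng.gauss(1, 1)}
    for _ in range(300)
]

summary = summarize_by_group(rows, "Name", "numerical_1")
for name in sorted(summary):
    stats = summary[name]
    print(name, stats["count"], round(stats["mean"], 3), round(stats["sd"], 3))
```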
The response field is the numerical measure that we want to compare, and the field for group identifier holds the categories we want to compare. The t-test is performed pairwise, so we need to choose a reference level within the categorical variable. Here we choose Product_1 purely for demonstration.
Generally we can just check whether the p-value is less than 5% (significant) or not. In this case, the results indicate that there is no significant difference between the product groups, as we expected.
Note: the interpretation of the p-value is: under the assumption that there is no difference between Product_1 and Product_2, the probability of obtaining a result at least as extreme as the observed one is 56.8% (i.e. quite common and not unexpected, so there is no need to challenge the assumption that they are the same).
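To see where such a p-value comes from, here is a rough standard-library Python sketch of a two-sample t-test. It assumes Welch's unequal-variance form, which may not be the exact variant the Test of Means tool runs, and it gets the p-value by numerically integrating the t density with Simpson's rule. The function names and the two small samples are made up for illustration.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the density from 0 to |t|."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1 - 2 * (total * h / 3))


def welch_t_test(a, b):
    """Return (t statistic, degrees of freedom, two-sided p-value)."""
    na, nb = len(a), len(b)
    va = statistics.variance(a) / na  # squared standard error of mean(a)
    vb = statistics.variance(b) / nb  # squared standard error of mean(b)
    t_stat = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t_stat, df, two_sided_p(t_stat, df)


# Made-up samples standing in for two product groups.
group_a = [1.2, 0.8, 1.9, 0.4, 1.1]
group_b = [0.9, 1.5, 0.3, 1.4, 1.0, 0.7]
t_stat, df, p_value = welch_t_test(group_a, group_b)
print(t_stat, df, p_value)
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

The p-value is the area in both tails of the t distribution beyond the observed statistic, which is exactly the "at least as extreme" probability described in the note above.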
We can also visualise the data using the Plot of Means tool.
After connecting the data, we can just use the default settings.
As before, we choose our response variable and the categorical variable to group by. For the Error Bars we choose Standard error: because the groups have different sample sizes, the standard error (the standard deviation divided by the square root of the sample size) accounts for sample size in a way the standard deviation alone does not.
As the plot above shows, even though the means (black dots) differ for each product, their error bars overlap once the variation is taken into account, so statistically there is insufficient evidence to suggest they are different.
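As a small illustration of why the standard error suits groups of different sizes, the Python sketch below computes it as the sample standard deviation divided by the square root of the sample size. The helper name and the two toy samples are assumptions for demonstration only.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample sd / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Two toy samples with a similar spread but very different sizes.
small_sample = [0, 2] * 5    # 10 observations
large_sample = [0, 2] * 50   # 100 observations

print(statistics.stdev(small_sample), standard_error(small_sample))
print(statistics.stdev(large_sample), standard_error(large_sample))
```

The standard deviations of the two samples are close, but the larger sample has a much smaller standard error, so its error bar around the mean is narrower.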
To demonstrate a real difference, we need to compare the different numerical variables (reminder: they are generated with different means). Let's consider Product_1 only first.
Once the data is pivoted, we have the numerical measures (1-3) available to compare. We repeat the same process as before.
With a p-value of less than 5%, there is sufficient evidence to suggest that numerical_1 is different from numerical_2 for Product_1.
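The pivot step reshapes the data so that the column name itself becomes the group identifier. A minimal Python sketch of that wide-to-long reshape is below; the helper name, the output field names "measure" and "value", and the two sample rows are hypothetical and may not match what the Alteryx workflow produces.

```python
def wide_to_long(rows, id_key, value_keys):
    """Turn one row per record into one row per (record, measure) pair."""
    long_rows = []
    for row in rows:
        for key in value_keys:
            long_rows.append({
                id_key: row[id_key],
                "measure": key,      # former column name, now the group identifier
                "value": row[key],   # the number to compare between groups
            })
    return long_rows


# Two made-up wide rows for Product_1.
wide = [
    {"Name": "Product_1", "numerical_1": 0.9, "numerical_2": 10.2, "numerical_3": 99.5},
    {"Name": "Product_1", "numerical_1": 1.3, "numerical_2": 9.7, "numerical_3": 100.8},
]

long_rows = wide_to_long(wide, "Name", ["numerical_1", "numerical_2", "numerical_3"])
for row in long_rows:
    print(row)
```

After this reshape, "measure" plays the role of the group identifier field and "value" the response field.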
Visually we can see that numerical_1 has a mean of 1 and numerical_2 has a mean of 2 as noted before, hence when they are compared, there is a significant difference.
Note: all numerical measures are generated with a standard deviation of 1, to keep their variation the same for ease of comparison.
Note: when there are multiple levels within the group-by categorical variable, it is often better to use ANOVA instead of pairwise t-tests, or to apply an adjustment to the p-values, to control the compounding Type I error. Those options are not available in Alteryx yet, though.
Hopefully this blog offers some insight into how to perform a basic statistical test to enhance your data analysis skills. Some basic statistical knowledge helps in understanding the Data Investigation and Predictive tools within Alteryx, but it is also something you can learn through practice.
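To make the compounding concrete, the Python sketch below computes the chance of at least one false positive across several tests, under the simplifying assumption that the tests are independent, and applies a Bonferroni adjustment as one common example of a p-value correction. The function names and the three p-values are invented for illustration.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


for n_tests in (1, 3, 10):
    print(n_tests, round(familywise_error_rate(0.05, n_tests), 4))

# Hypothetical p-values from three pairwise comparisons.
raw = [0.01, 0.04, 0.5]
print(bonferroni_adjust(raw))
```

Bonferroni is deliberately conservative; it is shown here only because it is the simplest adjustment to reason about.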