A quick introduction to principal component analysis (PCA), with a small data set example.
More and more data is being collected in all kinds of disciplines (science, engineering, marketing, environment, politics, etc.).
Extracting correlations or trends from these data sets is becoming difficult and therefore, in the long run, a lot of this data might go to waste. Dimensionality reduction methods such as Principal Component Analysis (PCA) allow you, as the name implies, to reduce the dimensionality of the data set so that correlations and trends are easier to spot!
PCA is a commonly used unsupervised analytical method to assess large data sets. In short, PCA reduces the data set in dimensionality by finding directions through the data set where the variation is greatest, where each direction is referred to as a principal component.
The first principal component accounts for the largest share of the variation in the sample, and each following component accounts for a decreasing amount of variance. When the variables are strongly correlated, you can often describe even a massive data set with 2 to 3 principal components that together account for most (say ~90%) of the variance within the data set.*
You can then check how each data point within your data set sits on these principal components in order to find patterns or clusters, i.e., how your data points are correlated or anti-correlated.
I always like to refer to the example below to help explain what PCA does. Imagine you have a data set with 3 possible axes (for example, yearly country data on liquor consumption, life expectancy and heart disease rates).
Apart from plotting any 2 of the axes at a time, it becomes difficult to relate them to one another. What PCA does is try to turn this 3D data set around to find directions through the data set (principal components) that can describe the whole data set with fewer dimensions.
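To make the "turning the data set around" idea concrete, here is a minimal sketch in Python with NumPy. It is not what Alteryx runs internally, and the data is synthetic (randomly generated, not the country data): three measured variables driven by two hidden factors, so two principal components should capture nearly all the variance. The names `factors`, `mixing` and `scores_2d` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 3-D data (NOT the country data): two hidden factors drive
# three measured variables, plus a little noise.
n = 200
factors = rng.normal(size=(n, 2))
mixing = np.array([[1.0, 0.8, -0.6],
                   [0.2, -0.5, 0.9]])
X = factors @ mixing + 0.05 * rng.normal(size=(n, 3))

# Centre each column, then take the SVD.
# The rows of Vt are the principal directions through the data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of the total variance captured by each component.
explained = S**2 / np.sum(S**2)

# Scores: coordinates of every data point on the first two components.
scores_2d = Xc @ Vt[:2].T

print("proportion of variance:", np.round(explained, 3))
print("scores shape:", scores_2d.shape)
```

The 3D cloud is now described by two coordinates per point, and the third component only carries the small amount of added noise.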
1. Principal Components in Alteryx
For this example I will use a small data set to walk you through the PCA in Alteryx. It is a data set published in Time Magazine, 1996 (Jan) and contains wine, liquor and beer consumption (L per year) as well as the average life expectancy and heart disease rates (cases per 100,000 people per year) for ten countries.
Try plotting this in a few ways (scatter plots, heat maps, side-by-side bars) and you will see it isn’t all that easy to find immediate correlations.
Next step is to import it into Alteryx:
- Add the predictive tool: Principal Components (predictive tools have to be installed inside Alteryx Designer: options – download predictive tools).
- Select the fields to carry out the PCA on, i.e. the fields you want to investigate for correlation.
- Apply scaling. This is important to standardise fields that are measured in different units.
- Select the number of principal components you want to add to the report as a plot. Stick to two in this case.
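Why the scaling step matters can be sketched in a few lines of Python. This is an illustration on synthetic data, not the Alteryx implementation: two made-up columns in very different units, with PCA run once on centred data and once on data scaled to unit variance. The helper `first_pc` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic columns in very different units (values are made up).
n = 100
small_units = rng.normal(loc=5.0, scale=1.0, size=n)     # e.g. litres per year
large_units = rng.normal(loc=200.0, scale=50.0, size=n)  # e.g. cases per 100,000
X = np.column_stack([small_units, large_units])

def first_pc(data):
    """Return the first principal direction of already-centred data."""
    _, _, Vt = np.linalg.svd(data, full_matrices=False)
    return Vt[0]

centred = X - X.mean(axis=0)
scaled = centred / X.std(axis=0, ddof=1)  # unit variance per column

pc1_unscaled = first_pc(centred)
pc1_scaled = first_pc(scaled)

print("PC1 without scaling:", np.round(pc1_unscaled, 3))
print("PC1 with scaling:   ", np.round(pc1_scaled, 3))
```

Without scaling, PC1 points almost entirely along the column with the largest numbers, simply because of its units. After scaling, both columns get an equal say.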
Run the workflow and let’s start by having a look at the report Alteryx made. I would like to start with the component summary.
- Remember we talked about principal components capturing a certain % of variance in the data? This part shows you how much each principal component has captured! In this case PC1 captures 46% of the variance (proportion of variance) and the cumulative % of just two PCs is 78%! Isn’t that amazing! Alteryx stops at PC5 because we selected five fields: there can be at most as many principal components as fields, and together they describe all of the variance.
- The Component Loadings give you, for each component (PC1, PC2, PC3, etc.), a single value for each of the fields we selected for the PCA.
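The two parts of the report can be reproduced in spirit with a short Python sketch. It assumes that the loadings are the eigenvectors of the correlation matrix of the scaled data, which is the usual textbook definition. The 10 x 5 data here is random placeholder data, so the printed percentages will not match the 46% and 78% from the report.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder data: 10 rows (think "countries") by 5 columns (think "fields").
X = rng.normal(size=(10, 5))

# Standardise: centre each column and scale it to unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix, sorted from the largest
# eigenvalue to the smallest.
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, loadings = eigvals[order], eigvecs[:, order]

# Component summary: proportion of variance and cumulative proportion.
proportion = eigvals / eigvals.sum()
cumulative = np.cumsum(proportion)

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion={p:.3f} cumulative={c:.3f}")

# Component loadings: one column per PC, one row per field.
print(np.round(loadings, 3))
```

Five fields give exactly five components, and the cumulative proportion reaches 1 at PC5.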
Next is when the magic happens: each country gets a score for each principal component, based on these loadings, which Alteryx outputs to the data set.
But, but, but… how?! (you might wonder).
Let’s take France, which has a PC1 score of 1.395. This score is determined by multiplying each PC1 loading by the original (scaled) value of the liquor, wine and beer consumption, life expectancy and heart disease rate in France, and summing those up.
Thus, Score =
(scaled liquor consumption) * -0.346 (PC1 of liquor)
+ (scaled wine consumption) * 0.445 (PC1 of wine)
+ (scaled beer consumption) * -0.07 (PC1 of beer)
+ (scaled life expect.) * 0.585 (PC1 of life exp.)
+ (scaled heart disease rate) * -0.578 (PC1 of heart disease)
The same is done for each other country and repeated for PC2.**
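The score calculation is just a weighted sum, which a few lines of Python can show. The PC1 loadings are the ones quoted above. The scaled values are hypothetical placeholders (Alteryx does not output the scaled numbers), so the result is not France's actual score of 1.395.

```python
# PC1 loadings as quoted in the text above.
pc1_loadings = {
    "liquor": -0.346,
    "wine": 0.445,
    "beer": -0.07,
    "life_expectancy": 0.585,
    "heart_disease": -0.578,
}

# HYPOTHETICAL scaled values for one country, for illustration only.
# These are NOT the real scaled values for France.
scaled_values = {
    "liquor": -0.5,
    "wine": 2.0,
    "beer": -0.4,
    "life_expectancy": 0.9,
    "heart_disease": -1.2,
}

def pc_score(scaled, loadings):
    """Sum of (scaled value * loading) over all fields."""
    return sum(scaled[field] * loadings[field] for field in loadings)

score = pc_score(scaled_values, pc1_loadings)
print(f"PC1 score: {score:.4f}")
```

Swapping in the PC2 loadings gives the PC2 score for the same country.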
2. PCA Data Interpretation
Now the fun starts, data interpretation!
I purposely haven’t shown the bottom half of Alteryx’s PC report yet, as I find it easier to go through it step by step rather than looking at a Biplot straight away.
For this reason I exported the Loadings and Scores into Tableau to quickly visualise the results. So first, the Loadings: how do life expectancy and heart disease rates relate to liquor, wine and beer consumption in these countries?
PC1 is shown on the x-axis and PC2 on the y-axis. Focusing on PC1 first, life expectancy and heart disease rates are on opposite ends. This is an expected result, as surely heart disease isn’t great for one’s life expectancy.
Interestingly, Wine sits close to life expectancy on the x-axis, whilst liquor leans closer towards the heart disease rates. Beer is close to 0 on PC1, but the variation is better described on PC2 (y-axis) where it sits on the opposite side of the axis compared to liquor.
This indicates a negative correlation between beer and liquor consumption in this particular data set! If a country drinks more liquor, one might expect less beer consumption and vice versa. Life expectancy and heart disease rates have almost the same loading on PC2, which indicates that PC2 doesn’t describe any further correlation between these two (likely already captured by PC1).
Next, we take a look at the PC Score plots, where we visualise the countries (keep in mind that PC Scores have a different scale from the PC Loadings, as explained above).
The first thing I always tend to look at is clustering of samples. If points cluster, it means these data points are similar. In this case there is no strong clustering observed (we have very few data points to start with), but we do see some nice outliers, like Russia, Czech and France. This means they show particularly different behaviour in the data compared to the other countries.
Simply put, the higher a country’s score on a PC, the more strongly it follows the pattern that PC describes, and if we put the Loadings and Scores side by side or in a bi-plot (dual axis) you can see the correlations visually. The loadings give you the direction in which each variable (wine, beer, liquor, etc.) points. In the case of life expectancy, for example, the higher the PC1 score of a country, the higher its life expectancy tends to be.
Below on the left, Alteryx outputs it in a bi-plot showing the directions of the PC Loadings, and the PC Scores of each country as a dot. I prefer to plot them as symbols in this case, to make interpretation a bit easier.
The top left quadrant indicates that Russia has a high liquor consumption and if we look back at the numbers, we see this is correct. Interestingly, Russia also scores low on PC1, the direction in which heart disease goes! Both these observations indicate that high liquor consumption and heart disease might be correlated.
Inversely, both France and Italy (with high wine consumption captured in the top right quadrant) score high on PC1, where life expectancy sits! This possibly confirms the old tale: “a glass of wine a day keeps the heart diseases away”.
Lastly, Czech enjoys its beers, as shown in the bi-plots, but goes less in the direction of heart disease.
That’s it for now. I hope that this example using a small data set gives you an idea of what PCA does and what it is capable of, and that it will inspire you to explore it for your own benefit!
Do note that the Principal Components tool is limited in terms of scaling and outputs (in case you are familiar with PCA). I reckon this is done on purpose to provide an easy-to-use tool with the basics properly fixed. You can always import your own PCA script through the R extension in Alteryx, or pre-process your data (mean-centre your data, etc.) prior to using the tool and un-tick scaling.
**Details on PCA and scaling units:
Please be aware that the unit-variance scaled numbers for each country are not output by Alteryx. Alteryx uses the prcomp function of R and its built-in scaling option (scale. = TRUE or FALSE).
Fellow Data Analyst, Tan Thiam Huat, Kelvin, kindly reached out to me to discuss this topic as he found that in Python the sklearn.preprocessing.scale() function resulted in slightly different PCA Loadings and Scores. This stresses the importance of documentation in this field of analytics. Feel free to check out his blog for more information on such topics.
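One known difference between the two tools, offered here as a possible contributor and not as a confirmed explanation of what Kelvin observed: R scales with the sample standard deviation (divisor n - 1), whereas sklearn.preprocessing.scale() uses the population standard deviation (divisor n). The sketch below reproduces both conventions with NumPy on made-up numbers.

```python
import numpy as np

# Made-up values, for illustration only.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

centred = x - x.mean()

# Sample standard deviation (n - 1 divisor), the convention used by R.
scaled_sample = centred / x.std(ddof=1)

# Population standard deviation (n divisor), the convention used by
# sklearn.preprocessing.scale().
scaled_population = centred / x.std(ddof=0)

# The two scaled versions differ by a constant factor of sqrt(n / (n - 1)).
ratio = x.std(ddof=1) / x.std(ddof=0)
print("ratio of standard deviations:", ratio)
print("sqrt(n / (n - 1)):           ", np.sqrt(n / (n - 1)))
```

With only ten countries that factor is about 1.05, enough to shift the scores noticeably. It may not be the whole story, so it is worth checking which convention a tool uses before comparing numbers.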
*Further reading on principal component analysis:
- R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000.
- I. Jolliffe, Principal Component Analysis, 2nd ed., Springer, New York, 2002.
- M. Ringnér, Nat. Biotechnol. 2008, 26, 303–304.
- Scientific spectroscopy use case: https://onlinelibrary.wiley.com/doi/abs/10.1002/chem.201705349
- Or. Just. Google. PCA, Google inc., 2018.