Alteryx Predictive - Linear Regression

by James Charnley

Predictive modelling can be daunting when you're just beginning to learn about it and its underlying logic. In my case, at least, the buzzword "machine learning" sent pictures of Blade Runner and rogue AI running through my mind.

When I first decided I wanted to work in data, I started with an R coding course that focused on coding a number of these models, so when I came to the predictive tools in Alteryx, I wasn't starting from nothing. I was also pleasantly surprised by how much of the heavy lifting is done for you. That's because the Linear Regression tool is a macro, the terrifying inside of which can be seen below:

You don't need to know how this works to use the Linear Regression tool in Alteryx. Nested within the macro are tools that run R code and actually build the model so you don't have to. All we have to do in Alteryx is know how to configure it.

The goal of linear regression is simply to predict the value of a dependent variable based on a set of independent variables. Thanks to that macro, the configuration pane we interact with in Alteryx looks like this:

This example configuration pane is working off maybe the most famous popular culture example of linear regression there is: the baseball story that ended up being made into a movie, Moneyball (incidentally one of my favourite movies). The story revolves around using linear regression on baseball stats to find the independent variables that matter most for predicting a dependent variable: wins.

You can name the model whatever you like (just don't use any spaces).

Our target variable is the dependent variable we're trying to predict, so in our baseball example, that's wins.

To predict wins using our independent variables, just tick the relevant boxes. How do you know which independent variables to choose? That can be trickier and requires further investigation. I'll write another blog on that soon.
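If you're curious what "build the model" roughly means once a target and some predictors are picked, here is a minimal Python sketch of ordinary least squares. To be clear, this is not what the Alteryx macro runs (that is R code I haven't reproduced here). The function names, the two predictors (runs scored and runs allowed) and the numbers are all made up for illustration. I generated the toy wins from wins = 81 + 0.1 * runs_scored - 0.1 * runs_allowed, so the fit should simply recover those values.

```python
def fit_linear_regression(rows, target):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows   -- list of predictor tuples, one per observation
    target -- list of observed values of the dependent variable
    Returns [intercept, coef_1, ..., coef_k].
    """
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + list(r) for r in rows]
    n_cols = len(X[0])

    # Build the augmented matrix [X'X | X'y].
    A = []
    for i in range(n_cols):
        row = [sum(x[i] * x[j] for x in X) for j in range(n_cols)]
        row.append(sum(x[i] * y for x, y in zip(X, target)))
        A.append(row)

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n_cols):
        pivot = max(range(col, n_cols), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n_cols):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]

    return [A[i][n_cols] / A[i][i] for i in range(n_cols)]


def predict(coefs, row):
    """Apply the fitted intercept and coefficients to one observation."""
    return coefs[0] + sum(c * x for c, x in zip(coefs[1:], row))


# Hypothetical toy data: (runs_scored, runs_allowed) per team-season.
stats = [(700, 650), (750, 700), (800, 640), (650, 720), (720, 690)]
# Toy target built from wins = 81 + 0.1 * runs_scored - 0.1 * runs_allowed.
wins = [86, 86, 97, 74, 84]

coefs = fit_linear_regression(stats, wins)
predicted_wins = predict(coefs, (760, 680))
```

Ticking a box in the configuration pane is the equivalent of adding another column to each tuple in stats. Real data won't sit on a perfect line like this toy set does, which is where the reporting outputs come in.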

The O output is the actual model, which you can connect to further tools in the future (such as the Score tool). The R output is a summary report that includes plots. The I output is another nice reporting output that gives you further statistics such as R-squared and RMSE.
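For anyone wondering what those two statistics measure, here is a small Python sketch using their textbook definitions. The function names and the actual/predicted numbers are invented for illustration, and I'm assuming the plain definitions here; the Alteryx report may calculate or adjust them slightly differently.

```python
import math


def r_squared(actual, predicted):
    """Share of the variance in the actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical wins and model predictions for four teams.
actual_wins = [90, 80, 100, 70]
predicted_wins = [88, 82, 97, 73]

r2 = r_squared(actual_wins, predicted_wins)   # 1 - 26/500 = 0.948
error = rmse(actual_wins, predicted_wins)     # sqrt(26/4), about 2.55 wins
```

An R-squared close to 1 means the predictions track the actual values closely, while RMSE tells you roughly how many wins the model is off by on average.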