Alteryx stats tools for beginners
This is part three of a five-part series introducing the stats tools available in Alteryx, explaining the tools and indicating when their use is appropriate.
Time series analysis
Time series analysis is used to determine the trend over time in a measure and to predict how that trend will continue. It is related to regression analysis, but the models covered here predict a measure primarily from its own past values.
Time series analysis requires a regular time point (e.g. a year, month or minute) and a measure that has a value for each time point. If time points or values are missing, it will not work properly.
If values are missing, careful consideration should be given to how best to fill them in, be it with time-period averages or simply a repetition of the previous time point's value.
The Alteryx tool best used for this is the TS Filler tool.
Make sure to configure it to the time period being used and the correct increment (e.g. every month or every six months).
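For readers who want to see the gap-filling idea outside of Alteryx, here is a minimal Python sketch of the "repeat the previous value" option. It is not how the Alteryx tool works internally; the function names and the sample sales figures are made up for illustration, and it assumes a monthly series keyed by the first day of each month.

```python
from datetime import date

def add_months(d, n):
    """Return the first day of the month n months after d."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

def fill_monthly_gaps(series):
    """Forward-fill missing months in a {date: value} mapping.

    Assumes every key is the first day of a month (hypothetical layout).
    """
    months = sorted(series)
    filled = {}
    current = months[0]
    last_value = series[current]
    while current <= months[-1]:
        # Use the observed value if present, otherwise repeat the previous one.
        last_value = series.get(current, last_value)
        filled[current] = last_value
        current = add_months(current, 1)
    return filled

# Illustrative data only: March is missing.
sales = {
    date(2023, 1, 1): 100.0,
    date(2023, 2, 1): 110.0,
    date(2023, 4, 1): 130.0,
}
filled_sales = fill_monthly_gaps(sales)
```

Filling with a time-period average instead would only change the line that picks the replacement value.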
It is best practice to summarise the data to the level of granularity of the time points being used and to sort the data to make sure the time points are in order.
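As a rough illustration of what summarising and sorting means here, the Python sketch below rolls daily records up to one total per month and returns them in time order. The function name and the sample records are hypothetical, and summing is an assumption; an average or a count may suit your measure better.

```python
from collections import defaultdict
from datetime import date

def summarise_by_month(records):
    """Sum (date, value) records to one total per month, sorted by time."""
    totals = defaultdict(float)
    for day, value in records:
        # Key every record by the first day of its month.
        totals[date(day.year, day.month, 1)] += value
    return sorted(totals.items())

# Illustrative data only, deliberately out of order.
daily = [
    (date(2023, 2, 3), 5.0),
    (date(2023, 1, 15), 10.0),
    (date(2023, 1, 20), 2.5),
]
monthly = summarise_by_month(daily)
```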
The latest 20% of the data should then usually be kept aside for testing the models against, to see which model is most accurate (this is known as holdout validation).
This can be done by sorting the data according to time, ascending, giving the data a record ID and then filtering out the last few time points to use later for verification.
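The same split can be sketched in a few lines of Python. This is only an illustration of the logic (sort by time, then hold back the most recent 20%); the function name and the sample points are made up.

```python
def holdout_split(points, holdout_fraction=0.2):
    """Split (time, value) pairs into training and holdout parts.

    The most recent holdout_fraction of points is held back for verification.
    """
    ordered = sorted(points, key=lambda p: p[0])
    n_holdout = max(1, round(len(ordered) * holdout_fraction))
    cutoff = len(ordered) - n_holdout
    return ordered[:cutoff], ordered[cutoff:]

# Illustrative data only: ten time points supplied in descending order.
points = [(t, float(t * 2)) for t in range(10, 0, -1)]
train, holdout = holdout_split(points)
```

Unlike a random split, the holdout set here is always the end of the series, because the aim is to test how well a model predicts time points it has not yet seen.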
Choosing a model
The next stage is to build several models based on the 80% sample and to compare their forecasts to the known values in the 20% that was kept aside.
Within Alteryx, there are two models in the Time Series tool folder: ETS and ARIMA. See here for an in-depth description of how both models calculate time series.
Put simply: ETS models should be used if there is seasonality in the data, and ARIMA models should be used if the data appears to have a cyclical pattern that is not seasonal (i.e. the cycles vary in length over time).
Seasonality does not have to correspond to the seasons of the year: data could be at the hourly or minute level and still show seasonal patterns.
It is best to run both to see if one is superior to the other, but keep in the back of your mind the model you expect to explain the data best, based on the above criteria; this can help to spot if the configuration of the model might not be best suited to the situation.
Both models have identical configuration: select the target field from the dropdown and set the target field frequency, which is the time interval your analysis runs at.
To test which model is best, union the O outputs of both models and input this into the M input of the Time Series Compare tool, with the 20% verification data attached to the D input.
This will give an assessment of the mean error of the models and some graphical representations of the time series it has predicted. Choose the model with the lowest error value, or reconfigure the model you expect to be better until it has a better fit to the data trend.
The Time Series Compare tool returns information about how well each model fits the verification data.
The O output gives a table of error measurements for each of the models:
The RMSE (root mean square error) is generally a good single indicator of a model's error, so the model with the lowest RMSE should usually be chosen.
The R output gives a report style representation of the error measurements and a chart:
The I output gives an interactive graphical representation of the time series forecasts against the actual, with the ability to zoom in on a specific range:
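To make the RMSE figure less of a black box, here is a small Python sketch of how it is calculated and used to pick between two models. The model names and all of the numbers are invented for illustration; in practice the actual values would be the verification data and the forecasts would come from the ETS and ARIMA models.

```python
import math

def rmse(actual, forecast):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))

# Illustrative holdout values and forecasts from two hypothetical models.
actual = [10.0, 12.0, 14.0, 16.0]
forecasts = {
    "model_a": [11.0, 12.0, 13.0, 16.0],
    "model_b": [8.0, 12.0, 16.0, 18.0],
}

scores = {name: rmse(actual, values) for name, values in forecasts.items()}
best = min(scores, key=scores.get)  # name of the model with the lowest RMSE
```

Because the errors are squared before averaging, a few large misses push the RMSE up more than many small ones.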
Forecast with the model
Once you are happy with your model, you can use it to predict future trends.
I like to create a new model using the entire data set, including the 20% that was held back for verification (remove the filter that reduces your sample).
Connect the O output of this model to the TS Forecast tool.
You can adjust the confidence interval values here and set the number of time periods into the future you want to forecast. The further into the future you forecast, the wider the confidence intervals will become.
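To see why the intervals widen, consider the simplest possible forecast, where every future value is predicted to equal the last observed value (a random-walk assumption). This is not the calculation ETS or ARIMA use, only a sketch of the general behaviour: under this assumption the uncertainty grows with the square root of the number of periods ahead. The function name and the input numbers are hypothetical.

```python
from statistics import NormalDist

def naive_forecast_intervals(last_value, residual_sd, horizon, confidence=0.95):
    """Intervals for a naive (random-walk) forecast, one per period ahead.

    Assumes independent, normally distributed one-step errors with standard
    deviation residual_sd, so the h-step standard deviation is
    residual_sd * sqrt(h).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    intervals = []
    for h in range(1, horizon + 1):
        half_width = z * residual_sd * h ** 0.5
        intervals.append((last_value - half_width, last_value + half_width))
    return intervals

# Illustrative numbers only.
intervals = naive_forecast_intervals(last_value=100.0, residual_sd=5.0, horizon=3)
widths = [upper - lower for lower, upper in intervals]
```

Raising the confidence level has a similar effect to forecasting further ahead: both make every interval wider.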
The O output gives a table of data:
The R output gives a report style summary table of the data, with a plotted graph above:
The I output gives an interactive graph, allowing you to zoom in on a time period:
This is the final entry in a five part series of Alteryx stats blog posts. I hope you have found this entry, or the entire series, useful. Links to the other posts can be found above.