An Alteryx Tip for Big Data: The Sample Tool vs. Input Tool Sampling

by Jess Hancock

Working with a really big dataset? Swearing your way through run times that languish into the minutes? This was me on Tuesday, until I found out that the Input Tool has a sampling feature.



But how does this differ from the Sample Tool itself?

1. Sample from the start to reduce run times
This inconspicuous little line in the Input Tool’s configuration allows you to specify the number of records to bring into Alteryx in the first place, which means the data you’re working with begins small. The Sample Tool, as pictured below, requires the full dataset to be loaded in and then processed according to its configuration.
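If it helps to see the idea outside of Alteryx, here is a rough Python analogy (not how Alteryx works internally, just a sketch). The helper names make_csv, CountingReader, limit_at_input and load_then_sample are made up for illustration. It counts how many rows actually get pulled from the source when you stop at the input versus loading everything and sampling afterwards.

```python
import csv
import io
from itertools import islice


def make_csv(n_rows):
    """Build an in-memory CSV with one 'id' column, standing in for a big file."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id"])
    for i in range(n_rows):
        writer.writerow([i])
    buf.seek(0)
    return buf


class CountingReader:
    """Wrap a row iterator and count how many rows were pulled from it."""

    def __init__(self, rows):
        self._rows = iter(rows)
        self.rows_read = 0

    def __iter__(self):
        return self

    def __next__(self):
        row = next(self._rows)
        self.rows_read += 1
        return row


def limit_at_input(source, n):
    """Analogy for Input Tool sampling: stop reading after n records."""
    reader = CountingReader(csv.DictReader(source))
    sample = list(islice(reader, n))
    return sample, reader.rows_read


def load_then_sample(source, n):
    """Analogy for a downstream Sample Tool: read everything, then keep the first n."""
    reader = CountingReader(csv.DictReader(source))
    everything = list(reader)
    return everything[:n], reader.rows_read


early_sample, early_rows_read = limit_at_input(make_csv(1000), 10)
late_sample, late_rows_read = load_then_sample(make_csv(1000), 10)
print(early_rows_read, late_rows_read)
```

Both approaches end up with the same ten records; the difference is how much data had to be read to get there.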



2. The sample is taken in order of appearance
As this is a process that doesn’t load in and look at your full data, you can’t specify a sample type. It simply skims off the top N records specified. This can become problematic when you want to work with a sample in your final outcomes, as the top N records are much less likely to be representative of the whole than, say, a “1 in N chance” sample from the Sample Tool.
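To make the representativeness point concrete, here is a small Python sketch, assuming a dataset that happens to be sorted (a common reason the top of a file looks nothing like the rest). The functions first_n and one_in_n_chance are hypothetical stand-ins for the two sampling styles, not Alteryx code.

```python
import random


def first_n(records, n):
    """Top-N sampling: take records in order of appearance."""
    return records[:n]


def one_in_n_chance(records, n, seed=None):
    """Keep each record independently with probability 1/n."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < 1 / n]


def mean(values):
    return sum(values) / len(values)


# Assumed example data: 1,000 values sorted ascending.
records = list(range(1000))

top = first_n(records, 100)
spread = one_in_n_chance(records, 10, seed=0)

print(mean(records))  # mean of the full data
print(mean(top))      # mean of the first 100 records only
print(mean(spread))   # mean of a sample drawn from across the whole dataset
```

The first 100 records only ever see the smallest values, so their mean sits far below the true one, while the random sample draws from the whole range.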

Want the best of both?
You can always add a Sample Tool, then Cache and Run the Workflow. You may get one longer run time prior to caching, but after that a local file will be used which references the sample exclusively, rather than bringing in all the data afresh with each run.



Note that the Input Tool before the Sample Tool is also cached. The caching process can be thought of as ‘freezing’ the data at the point of the cached tool, which naturally covers all the tools that came before it.
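The general pattern behind this (pay once, then reuse a local copy) can be sketched in Python. This is only an analogy under my own assumptions: the function cached_sample, the JSON cache file and the stand-in loader are all hypothetical, and none of it reflects how Alteryx stores its cache.

```python
import json
import os
import tempfile


def cached_sample(cache_path, load_full, take_sample):
    """Return (sample, source_was_read).

    On the first call the full source is loaded and sampled, and the sample
    is written to cache_path. Later calls read the cached sample and never
    touch the source.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f), False
    sample = take_sample(load_full())
    with open(cache_path, "w") as f:
        json.dump(sample, f)
    return sample, True


def load_full():
    """Stand-in for a slow read of the full dataset."""
    return list(range(1000))


def every_tenth(rows):
    """Stand-in for a sampling step applied after the full load."""
    return rows[::10]


cache_path = os.path.join(tempfile.mkdtemp(), "sample_cache.json")

first_run, read_on_first = cached_sample(cache_path, load_full, every_tenth)
second_run, read_on_second = cached_sample(cache_path, load_full, every_tenth)
print(read_on_first, read_on_second)
```

The first run does the slow full read; the second gets the identical sample straight from the local file.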

On this note: it’s worth mentioning that the Input Tool itself has a caching option. Don’t use it for speed. In most cases it makes no difference: all Alteryx will do is keep bringing in the data in order to re-cache it.



The conclusion?
The Sample Tool is safer and more versatile when you’re not battling with a big dataset, or when the sample needs to be representative. Use the Input Tool sampling feature to create and test a flow that would otherwise be slow and difficult, but remember to remove it before you start drawing any conclusions from your data, and do a final run to pull in the rest of the records. A cached Sample Tool can be a great compromise.

Happy Data-ing!