Data Prep and Analysis Series -Standardize vs Normalize Data Part 2 (Tableau+R)

‌Continue from part 1, this post will show how to do the calculation in Tableau. It is possible to do the calculation in one calculated field, though a good advice given from Andy Kriebel, break down the complicated calculation into easy steps, it is easier to follow and debug.

There is also optional section to show how to leverage the power of R to do calculation, with the help of different R packages, a lot of calculation can be done more efficiently through RServer.

‌Standardization:

For our data, we need to calculate two things before we can use the formula, the average/mean and standard deviation of the data.

Note: Numerical 1 is a column of random value follows N(1,1) distribution

Note: {} is used to fix calculation on all without being breakdown by other dimension, LOD calculation is outside the scope of this post

Note: the technical term for the calculated value is z-score/value

We can combine the previous calculation in one calculated field

It is always good practice to check the calculation in a crosstab to make sure it make sense.

Notice mean and sd is constant for all entries

Note: Check is {SUM(N_1_standardization)}, which should be 0 as the sum of z-score should 0 regardless the data, hence it is used here.

We will do similar steps for Numerical 3 and produce the following scatter plot (as shown in part 1 of the blog)

Notice Tableau auto adjusted the axis ranges for the axis

Normalization:

The method to normalize is going through identical steps as standardization, except with different calculated values.

Or we can do in one calculated field

Let's do a scatter plot with our normalized the data

We can see (upto a scaling of the axis), both produce the almost identical results for us here.

Note: Both variables are generated by normal distribution, hence no outliers, otherwise they may look very different.

Optional:

Standardization with R:

Want to know how to connect Tableau with R?

General structure: Output_real_value("R code", input)

Standardization is a very common process, hence there is already a built-in function scale, the two true conditions ask the value to be calculated with mean=0 and sd=1.

We see they product identical results.

Normalization in R:

Given it is extremely effected by outliers, statistician never really use it, hence there is no default function for it, though the user can define one function themselves to do normalization on the data.

It is always important to check outlier before doing scaling.

Once outliers are handled, then depends on the analysis to perform or algorithm to be used, scaling methods are flexible.

Within machine learning (collection) packages such as caret, tidymodels, there are options within the data prep steps for normalization.

What happen if the data is really skewed or there are outliers?

Go check other blog posts in the series!‌

Source data and R script for data generation:‌

Author:

Liu Zhang

View Profile