Data Prep and Analysis Series - Standardize vs Normalize Data, Part 1

by Liu Zhang
Photo by Mika Baumeister / Unsplash

Can you really see it?

Scaling is a common step in data processing. Quite often numerical data sit on very different ranges of values, e.g. 250 days worked vs £25,000 paid, which differ by two orders of magnitude (a factor of 10^2). If we plot such data as collected, we are likely to get something like the following:

Scatter plot of unscaled data whose axes differ in magnitude

The plot is accurate, but it does not present the information well. Can you tell from it that the y-axis values in fact vary as much as the x-axis values?


A confusing name

With the advance of big data and machine learning over the past decade, (feature) scaling has become an essential data prep step rather than the optional one it used to be. Computing on numerical data in its original scale can significantly affect the speed and memory requirements of an algorithm when the data set is likely to involve millions or even billions of rows. In preparation, the data are therefore either standardized or normalized, the two most common scaling methods.

It doesn't help that statistics also has a thing called the standard normal distribution. As a result, the terms standardize and normalize are often used interchangeably in practice. They do differ in calculation and properties, though for a predictive model the choice between them matters little when the data set is reasonably well structured.

Standardization:

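We convert our data through a z-transformation (the one behind the standard normal): subtract the mean from each value and divide by the standard deviation, z = (x - mean) / sd. The result has mean 0 and standard deviation 1, and for roughly bell-shaped data the values mostly fall between -3 and +3; anything beyond that range is a possible outlier.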

Now the values on both axes vary on a similar scale
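
A minimal Python sketch of the z-transformation, for readers who want to try it. The figures below are made up for illustration and are not the data set behind the plots; the helper name standardize is likewise hypothetical. It uses the sample standard deviation (n - 1 divisor), which is an assumption about the convention used.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(x - mu) / sigma for x in values]


# Hypothetical figures, NOT the author's data
days_worked = [180, 210, 250, 230, 195]
pay_gbp = [18000, 21500, 25000, 23000, 19500]

z_days = standardize(days_worked)
z_pay = standardize(pay_gbp)

# Both columns are now centred on 0 with unit spread,
# so they can share one pair of axes.
print([round(z, 2) for z in z_days])
print([round(z, 2) for z in z_pay])
```

After the transformation the two columns no longer differ by orders of magnitude, which is what makes the second scatter plot readable.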

In some cases we may not want negative input values, the data may be highly skewed, or the values must be bounded within a finite range. In those cases, we have another choice.

Normalization:

The min-max formula, x' = (x - min) / (max - min), converts all values to between 0 and 1. It is very easy to compute (only the minimum and maximum are needed, so it is typically cheaper than standardization), has modest memory requirements, and makes no assumptions about the distribution of the data. Unfortunately it is extremely sensitive to outliers: one extreme value in a million is enough to throw off its performance.

The good case - no outlier
The bad case - with outlier
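
The two cases above can be reproduced with a short Python sketch of min-max scaling. The pay figures and the single extreme value are invented for illustration, and the helper name normalize is hypothetical.

```python
def normalize(values):
    """Min-max scale values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; min-max scaling is undefined")
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical figures, NOT the author's data
pay_gbp = [18000, 21500, 25000, 23000, 19500]

# Good case: values spread across the whole 0..1 range
clean = normalize(pay_gbp)

# Bad case: one extreme value stretches the range,
# squeezing every ordinary value towards 0
with_outlier = normalize(pay_gbp + [2_500_000])

print([round(v, 3) for v in clean])
print([round(v, 4) for v in with_outlier])
```

In the second list the outlier maps to 1 and every ordinary value lands below 0.01, so the differences between them all but disappear.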

As mentioned above, when the data are well structured, either method is fine. But what if they are not? Go check the other blog posts in the series!


Source data and R script for data generation:

Dummy Table – Google Drive

Thu 22 Jul 2021
