No, seriously, what's the difference? Is loading a stage or a process? Can a data pipeline be cyclic, and if so, where would the "ELTLTLT..." end?
In the mid-2010s, there was a paradigm shift from traditional ETL processes to ELT, and since then our colloquial use of the terms "Extract," "Load," and "Transform" has gotten somewhat confusing.
Let's recap.
ETL = Extract, Transform, Load. It implies a data pipeline wherein data is extracted from some source onto a machine, transformed somehow on that machine, then loaded into storage. ETL was the standard for data engineering for a long time. It was relatively simple to do "on-prem," that is, on an in-house server owned and operated by the company or user.
ELT = Extract, Load, Transform. Same steps, different order: instead of transforming the data before storing it, basically everything gets stored first and transformed afterward.
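To make the ordering difference concrete, here's a toy sketch in Python. Everything in it is a hypothetical stand-in - the real "warehouse" would be a database or cloud platform, not a list:

```python
# Toy sketch of the two orderings; extract/transform/load are hypothetical
# stand-ins for real pipeline steps, not any particular library.

def extract():
    # Pretend this pulled raw rows from a source system.
    return [{"user": "a", "amount": 10}, {"user": "a", "amount": 10},
            {"user": "b", "amount": 5}]

def transform(rows):
    # De-duplicate, standing in for real transformation logic.
    return [dict(t) for t in {tuple(sorted(r.items())) for r in rows}]

def load(rows, warehouse):
    warehouse.extend(rows)

# ETL: transform on "your" machine, store only the shrunken result.
etl_warehouse = []
load(transform(extract()), etl_warehouse)

# ELT: store everything raw first, transform it later where it's stored.
elt_warehouse = []
load(extract(), elt_warehouse)
elt_warehouse[:] = transform(elt_warehouse)
```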
The key difference is how much data you're okay with storing. Storing data costs money. As the world has become obsessed with data, so too have our species' data integration processes been obliged to keep up.
Data transformation usually involves filtering, de-duplication, and/or aggregation. These processes shrink the size of the data. Keeping the data relatively small was a necessity when you were working with your own servers on-prem. There's literally limited space: the number of servers in your office or home is however many it is, and you can't store more data than fits on them.
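As a rough illustration (the column names and numbers are made up, and pandas is used purely for convenience), each of those steps cuts the row count down:

```python
import pandas as pd

# Hypothetical raw extract: six event rows, with duplicates and noise.
raw = pd.DataFrame({
    "user":   ["a", "a", "a", "b", "b", "c"],
    "amount": [10,  10,  3,   5,   5,   0],
})

deduped = raw.drop_duplicates()             # de-duplication: 6 -> 4 rows
filtered = deduped[deduped["amount"] > 0]   # filtering:      4 -> 3 rows
aggregated = filtered.groupby("user", as_index=False)["amount"].sum()  # 3 -> 2 rows

print(len(raw), "->", len(aggregated))  # 6 -> 2
```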
But of course, businesses want to grow. As they grow, they collect more data. So, more servers are needed. Setting up an extra server, then another extra server, then another, plus all those security and maintenance costs... that's starting to get overwhelming. Not to mention expensive.
That's where cloud solutions come in. Because businesses like Google, Amazon, and Microsoft are so large, they can sell storage at a much lower cost than it would take to store the same data yourself.
So, roughly speaking, it was cloud storage becoming the standard that "switched" the data integration paradigm from ETL to ELT.
"Just store everything" became possible, so that's what we started doing, worrying about the transformation processes afterward. Many data storage platforms like Snowflake allow for transformation processes to occur right there on the platform, further eliminating the need to think of the transformation step separately.
Now, with that out of the way, let's talk about why these terms have become confusing to the point of being nearly useless.
Say I'm working on a pipeline that takes data from an API, saves it in an S3 bucket, then moves it over to Snowflake for storage and transformation. There has to be a machine doing a process at every step along the way. Say the API is queried via a Python script that runs as an AWS job. At the point after the data has been taken from the API but before it's put into the S3 bucket, where is it?
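Here's a sketch of that first hop (the endpoint, bucket, and key are hypothetical). Notice that between the API call and the S3 upload, the data exists only in this script's memory, on whatever machine happens to be running it:

```python
import json
import boto3
import requests

# Hypothetical API endpoint.
resp = requests.get("https://api.example.com/v1/records", timeout=30)
resp.raise_for_status()

records = resp.json()  # <-- right here: the data lives only in this
                       #     process's memory. It isn't "loaded" anywhere yet.

# Hypothetical bucket and key.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-raw-bucket",
    Key="raw/records.json",
    Body=json.dumps(records).encode("utf-8"),
)
```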
It doesn't pop out of existence and then back in. It's being held somewhere in the interim - that's how computers work. Some might call this "staging," but I disagree. "Staging" has a real-world use case: storing data somewhere easily viewable, for safekeeping, before it gets used further down the pipeline. What I'm talking about is a distinct machine process that seems to have no colloquial name.
When the data is in this place - some hidden corner of an AWS server - it's there. It's nowhere else. So, when the process is about to complete and the data is being moved from this spot into the S3 bucket... is that loading?
Or is it... still being extracted?
It's tempting to say it's being loaded. But this is still in the "extraction" step of the data integration process as we've defined it. So, what gives?
There's a lot to say about this particular topic. My colleague Jeffrey Thompson plans to write his own blog post on exactly this question - when he does, I'll link it here.