Recently, some of us at The Information Lab US have been diving into dbt with a weekly lunch-and-learn. When I joined, I knew dbt was a buzzword in data engineering, but I didn't know what it was or why it was useful, so I did some research and created this 1-pager.

What is dbt?
dbt stands for Data Build Tool.
It’s an open source data engineering tool that uses SQL to transform data. It’s built to work with large cloud data warehouses such as Amazon Redshift, Snowflake, and Databricks, and it helps organize data transformation workflows with a focus on dependency management, modular and reusable code, version control, and documentation. It’s designed for ELT pipelines (in which data is loaded into the database before transformation), which are more modern than ETL pipelines (where data is transformed before being loaded into the database).
It can be installed locally and run from the command line or an editor like VS Code (dbt Core), or used through dbt Cloud, which offers a web-based GUI (graphical user interface).
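For a sense of what that looks like in practice, here is a minimal sketch of a dbt model. A model is just a SELECT statement saved as a .sql file, which dbt materializes as a view or table in the warehouse. The source, table, and column names below are hypothetical, not from a real project.

```sql
-- models/staging/stg_orders.sql
-- A minimal dbt model: a plain SELECT saved as a .sql file.
-- source() is a real dbt function; it assumes a raw table has been
-- declared in a sources .yml file (names here are made up).
select
    order_id,
    customer_id,
    cast(order_date as date) as order_date,
    amount
from {{ source('shop', 'raw_orders') }}
```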
Why use dbt?
- Uses SQL => low barrier to entry
- Creates modular, reusable code
  - You create “models” (individual SQL SELECT statements) that you can save, reuse, and combine
  - Models can reference one another, creating dependencies (see the sketch after this list)
- Version control with git ensures changes are trackable and reversible
- Automatically manages dependencies, building models in the correct order
- Deployment and scheduling
- Automated documentation and testing
- Scales well for large data warehouses
- Supports incremental models so only new data needs to be processed (see the sketch after this list)
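To make the dependency and incremental ideas concrete, here is a sketch of a second model that builds on the staging model above. The names are hypothetical, but ref(), config(), and is_incremental() are real dbt constructs: ref() declares the dependency, and the is_incremental() block restricts processing to new rows on subsequent runs.

```sql
-- models/marts/fct_daily_orders.sql
-- Hypothetical sketch: depends on stg_orders via ref(), and is
-- configured as incremental so only new data is processed.
{{ config(materialized='incremental', unique_key='order_date') }}

select
    order_date,
    count(*)    as order_count,
    sum(amount) as total_amount
from {{ ref('stg_orders') }}
{% if is_incremental() %}
  -- on incremental runs, only process rows newer than what's already built
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
group by order_date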
Why not just use raw SQL or run Alteryx workflows?
- Modularity of dbt makes it easier to change and reuse parts of workflows
- Git version control in dbt makes it easier to collaborate
- Automatic dependency management makes it simple to schedule workflows properly
- dbt pushes transformations down to the data warehouse, so it scales to data volumes that Alteryx workflows struggle with
- dbt has built-in testing to ensure data quality, which raw SQL and Alteryx lack natively (see the test sketch after this list)
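As a rough illustration of that last point: alongside dbt’s generic YAML tests (such as not_null and unique), you can write “singular” tests as plain SQL files. Running dbt test executes the query, and the test fails if it returns any rows. The model and column names here are hypothetical.

```sql
-- tests/assert_no_negative_amounts.sql
-- Hypothetical singular test: dbt runs this query and the test
-- fails if any rows come back (i.e., any negative amounts exist).
select *
from {{ ref('stg_orders') }}
where amount < 0
```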
Where does dbt fit in the data lifecycle?
dbt handles the transformation step, typically in an ELT context. Raw data is extracted from the source and loaded into a database such as Snowflake. dbt then connects to the database to transform and clean the data, preparing it for BI reporting tools such as Power BI or Tableau. The results of this transformation are also stored in the database.
Example
At TIL, we have raw data from FreeAgent concerning hours logged by each consultant for each client on different projects. We can load that raw data into our Snowflake database. Then we can connect dbt to Snowflake and use SQL within dbt to transform the data into a usable format. These transformed tables are also stored in Snowflake. Then we can connect Tableau to Snowflake to use these tables in a data visualization.
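Sticking with that example, here is a hedged sketch of what one such transformation model might look like. The staging model name and column names are assumptions for illustration, not our actual FreeAgent schema.

```sql
-- models/marts/fct_consultant_hours.sql
-- Hypothetical sketch of the TIL example: aggregate FreeAgent time
-- entries into hours per consultant, client, and project, ready for
-- Tableau. (model and column names are assumptions)
select
    consultant_name,
    client_name,
    project_name,
    sum(hours_logged) as total_hours
from {{ ref('stg_freeagent_timeslips') }}
group by consultant_name, client_name, project_name
```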