A core concept in data preparation is understanding data sources and architecture. A key part of this is understanding fact and dimension tables, which are used to build schemas like the star schema and snowflake schema. These models are designed to make data analysis faster and more intuitive by organising data in a structured way.
So, what are Fact and Dimension Tables?
In reality, data rarely comes from a single, perfectly prepared source. Instead data comes from various sources and has to be connected. Fact and dimension tables provide a framework for these connections, separating the data into two distinct types.
Fact Tables: These tables hold the records of the data set. Typically storing quantitative data, or metrics with keys that link to the dimension table. See example below
Fact Table: Sales
SaleID | DateID | ProductID | CustomerID | Quantity | TotalPrice |
---|---|---|---|---|---|
101 | 20230101 | 501 | 12 | 2 | 25.98 |
102 | 20230102 | 502 | 13 | 1 | 99.99 |
103 | 20230103 | 501 | 12 | 3 | 38.97 |
... | ... | ... | ... | ... | ... |
Dimension Tables: These tables hold the details about the categorical fields in the data set. The data is usually qualitative, or attributes that provide context to the metrics in the fact table. See example below
Dimension Table: Product
ProductID | ProductName | Category | Price |
---|---|---|---|
501 | T-Shirt | Apparel | 12.99 |
502 | Headphones | Electronics | 99.99 |
... | ... | ... | ... |
Why Use Them?
The separation of data into fact and dimension tables is used to build efficient data models. This structure helps to reduce data redundancy, improve query performance, and make it easier for people to understand and analyse the data.