If you’re new to data analytics, many of the terms you come across may seem similar, but they have very different meanings. Truthfully, there are a lot of terms that start with the word 'data', which can get confusing, especially at the beginning of your data analytics career (I know it got a little confusing for me). So I decided to write this 'data dictionary' blog post to explain the differences between the various terms that have the word 'data' in them, and to help anyone who may be feeling confused or overwhelmed.
Why does understanding terminology matter?
In the data world, knowing the difference between concepts can help you make sense of how data is collected, stored and analysed. It also helps you improve your ability to work with data and communicate effectively with others in the field.
Data sources
Data sources are the starting point of the data lifecycle. A data source can be any system, file or device that records data and provides raw information that can later be stored in databases and analysed.
Data storage systems
Database - A database is a structured system for storing, organising and managing data. Unlike data sources, which just collect raw data, a database organises its data into tables with rows and columns, which makes the data easy to retrieve, update and analyse.
Data Warehouse - A data warehouse pulls together structured, organised data from multiple sources, making it ready for analysis and reporting. Think of this as a warehouse that stores multiple databases.
Data Lake - A data lake has multiple data sources pouring data into it, in a variety of formats. Data lakes are useful when a company, for example, doesn't know exactly how it will use the data just yet but wants to keep it for future analysis, as data lakes can handle massive amounts of data. To access this data, a data scientist would come and extract what they need from it.
Data Lakehouse - A data lakehouse is a hybrid between a data lake and a data warehouse. It lets us store raw data, like a data lake, but also organises and processes data for analysis, like a data warehouse. Structured and unstructured data can coexist in one system, which simplifies the architecture and lets users query the data directly.
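To make the storage terms above a little more concrete, here is a minimal sketch in Python. It is only an illustration, not how a real warehouse or lake is built: the two "data sources", the orders table and the load_orders helper are all made up for this example, a plain dictionary stands in for the data lake, and an in-memory SQLite database stands in for the structured side.

```python
import csv
import io
import json
import sqlite3

# Two hypothetical data sources: a CSV export and a JSON event feed.
csv_source = "order_id,customer,amount\n1,Ana,20.50\n2,Ben,35.00\n"
json_source = '[{"order_id": 3, "customer": "Ana", "amount": 12.25}]'

# Data lake stand-in: keep the raw payloads exactly as they arrived,
# in whatever format each source produced.
data_lake = {"orders.csv": csv_source, "orders.json": json_source}

# Database stand-in: an in-memory SQLite table with rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)


def load_orders(lake):
    """Parse each raw file in the lake into (order_id, customer, amount) rows."""
    rows = []
    for record in csv.DictReader(io.StringIO(lake["orders.csv"])):
        rows.append(
            (int(record["order_id"]), record["customer"], float(record["amount"]))
        )
    for record in json.loads(lake["orders.json"]):
        rows.append((record["order_id"], record["customer"], record["amount"]))
    return rows


# Pull data from both sources into one structured table.
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", load_orders(data_lake))
conn.commit()

# A reporting-style question: total spend per customer across both sources.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

The dictionary keeps everything in its original format, which is the data lake idea, while the SQL query only works because the data was first organised into rows and columns, which is the database and warehouse idea.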
Making Sense of Data (terms)
Understanding the differences between data sources, databases, data warehouses, data lakes and data lakehouses is really important for navigating the data landscape, as each term has its own meaning and its own role in data collection and storage. By demystifying these terms, you’re better equipped to see how they fit together in real-world scenarios and understand how data flows through various systems, from its initial collection to its transformation and eventual analysis.