Data Lakes vs Data Warehouses

by Penny Richmond

A data warehouse is a central repository of integrated data from one or more disparate sources *. That is to say, it is a place where data from different databases is stored after being neatly processed for a specific purpose or subject. It's where companies keep their data assets (e.g. customer data, sales data, etc.), and the data is ready to be used for analysis. A data warehouse may contain historical data as well as current data.

A data lake is a large vat of structured and/or unstructured raw data that has yet to be processed, and that may or may not be useful at some point in the future. Data may be stored at any scale.

6 crucial differences:

  1. The state of the data
    In a data warehouse the data is already structured and processed, ready for analysis.
    A data lake can contain any sort of data, structured or unstructured, in whatever form it arrives.
  2. Schema
    A data lake is "schema-on-read"; a schema (structure / logical configuration) is applied to the data as we pull it out of the data lake.
    A data warehouse is "schema-on-write"; the data is put into a schema before it is written into the database.
  3. Users
    Data warehouses are more suited to business professionals who want to access pre-processed data in order to optimise their decision making.
    Data lakes are suited to those doing raw data analysis, which usually requires data scientists or data analysts.
  4. Cost
    Storing data in a data warehouse is expensive, because the software used by data warehouses and the maintenance of that software are both costly. However, less money is spent on the storage itself.
    Data lakes are cheaper overall, but more of the money goes on storage, as the data lake needs to be able to handle vast amounts of unstructured data.
  5. Change-ability
    Data warehouses are non-volatile and inflexible; once the data is in, it is difficult to change or delete.
    Data lakes are much more flexible; the data can be easily changed or manipulated.
  6. Security
    Data warehouses are generally considered more secure, largely because they have been around a lot longer and their security practices have had more time to mature.
    Data lakes are relatively new and are often built using open source technologies. They are generally considered less secure than data warehouses.
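
To make the schema difference in point 2 a bit more concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: the sample records, the `sales` table and the function names `write_to_warehouse` and `read_from_lake` are all made up for this example, an in-memory SQLite database stands in for a real warehouse, and a plain list of JSON strings stands in for a real lake.

```python
import json
import sqlite3

# Hypothetical raw events, as they might land in a data lake: no enforced shape.
RAW_EVENTS = [
    '{"customer": "alice", "amount": "19.99"}',
    '{"customer": "bob", "amount": 5, "note": "gift"}',
    '{"customer": "carol"}',
]


def write_to_warehouse(conn, raw_events):
    """Schema-on-write: each record must fit the table before it is stored."""
    conn.execute(
        "CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rejected = []
    for line in raw_events:
        record = json.loads(line)
        try:
            row = (record["customer"], float(record["amount"]))
        except (KeyError, TypeError, ValueError):
            # Records that do not fit the schema never reach the table.
            rejected.append(line)
            continue
        conn.execute("INSERT INTO sales VALUES (?, ?)", row)
    return rejected


def read_from_lake(raw_events):
    """Schema-on-read: lines are kept as they are; a shape is applied on the way out."""
    for line in raw_events:
        record = json.loads(line)
        amount = record.get("amount")
        yield {
            "customer": record.get("customer"),
            "amount": float(amount) if amount is not None else None,
        }


conn = sqlite3.connect(":memory:")
rejected = write_to_warehouse(conn, RAW_EVENTS)
warehouse_rows = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
lake_rows = list(read_from_lake(RAW_EVENTS))

print("warehouse:", warehouse_rows)
print("rejected on write:", rejected)
print("lake, shaped on read:", lake_rows)
```

In this sketch the warehouse side turns away the incomplete record at write time, whereas the lake side keeps everything and leaves it to the reader to decide what to do with the missing amount.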

*Wikipedia, soz.