With big data, data warehouses, data lakes, and other fairly new technology terms in circulation, many people are confused about how they differ. Today, we are going to look at the differences between a data lake and a data warehouse. Do you know which one you need?
First, let’s define the terms. James Dixon, the founder and CTO of Pentaho, coined the term data lake in 2010, with this description: “If you think of a data mart as a store of bottled water – cleaned and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake and the various users of the lake can come to examine, dive in, or take samples.”
The term data warehouse is generally credited to William H. Inmon, who began using it in the 1970s. Inmon, known as the Father of Data Warehousing, described a data warehouse as being “a subject-oriented, integrated, time-variant and nonvolatile collection of data that supports management's decision-making process.” Now, let’s break down the differences between the two.
A data warehouse is a carefully designed data store that organizes data upon entry. This enables consistent and predictable analysis over pre-categorized structures. Data warehouses tend to emphasize organized or structured data over semi-structured and unstructured data. The data in a warehouse is usually organized using multi-dimensional schemas in order to streamline the execution of queries, reports, dashboards, and advanced analytical models.
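To make "organizes data upon entry" concrete, here is a minimal sketch of a tiny dimensional schema using Python's built-in sqlite3 module. The table names (dim_product, fact_sales), the columns, and the sample rows are hypothetical and chosen only for illustration; a production warehouse would run on a dedicated platform with a far richer model.

```python
import sqlite3


def build_warehouse():
    """Create an in-memory database whose structure is fixed before any data arrives."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension table: descriptive attributes used to slice the facts.
        CREATE TABLE dim_product (
            product_id INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            category   TEXT NOT NULL
        );
        -- Fact table: one row per sale, linked to the dimension.
        CREATE TABLE fact_sales (
            sale_id    INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
            sale_date  TEXT NOT NULL,
            amount     REAL NOT NULL CHECK (amount >= 0)
        );
        """
    )
    return conn


def load_sample_rows(conn):
    """Insert hypothetical rows; each one must fit the declared columns."""
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?, ?)",
        [(1, "Espresso", "Coffee"), (2, "Latte", "Coffee"), (3, "Green tea", "Tea")],
    )
    conn.executemany(
        "INSERT INTO fact_sales (product_id, sale_date, amount) VALUES (?, ?, ?)",
        [(1, "2024-01-05", 3.0), (2, "2024-01-05", 4.5), (3, "2024-01-06", 2.5)],
    )
    conn.commit()


def sales_by_category(conn):
    """A repeatable report: total sales per product category."""
    query = """
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.category
        ORDER BY p.category
    """
    return conn.execute(query).fetchall()


def rejects_malformed_row(conn):
    """Return True if a row with a missing amount is refused at write time."""
    try:
        conn.execute(
            "INSERT INTO fact_sales (product_id, sale_date, amount) VALUES (?, ?, ?)",
            (1, "2024-01-07", None),
        )
    except sqlite3.IntegrityError:
        conn.rollback()
        return True
    return False
```

Because the structure is enforced when rows are written, the same report can be rerun with predictable results, and a row that does not fit the model is rejected before it reaches any dashboard.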
A data lake holds a mix of structured, semi-structured, and unstructured data. For example, transactions, spreadsheets, documents, images, and social media content may all be stored in the data lake. The data lake may be fed using traditional-style batch jobs or by connecting it to real-time data feeds. A data lake combines massive storage capacity for any type of data in any format with the processing power to transform and analyze that data. In other words, it is a free-for-all storage reservoir.
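By contrast, a data lake accepts data as-is and leaves interpretation to whoever reads it later. The sketch below imitates that with a plain Python dictionary standing in for object storage. The keys, payloads, and the read_orders function are all hypothetical; a real lake would sit on a distributed file system or object store, but the idea of storing first and applying structure at read time is the same.

```python
import csv
import io
import json


def ingest(lake, key, payload):
    """Store raw bytes under a key. Nothing is validated or reshaped on entry."""
    lake[key] = payload


def sample_lake():
    """Fill a hypothetical lake with objects in three different formats."""
    lake = {}
    ingest(lake, "orders/2024-01-05.json",
           b'{"order_id": 1, "amount": 3.0, "note": "walk-in"}')
    ingest(lake, "orders/2024-01-06.csv",
           b"order_id,amount\n2,4.5\n3,2.5\n")
    ingest(lake, "images/storefront.png", b"\x89PNG\r\n")  # opaque binary data
    return lake


def read_orders(lake):
    """Interpret raw objects as (order_id, amount) rows only when a reader asks."""
    rows = []
    for key in sorted(lake):
        payload = lake[key]
        if key.endswith(".json"):
            record = json.loads(payload)
            rows.append((int(record["order_id"]), float(record["amount"])))
        elif key.endswith(".csv"):
            text = io.StringIO(payload.decode("utf-8"))
            for record in csv.DictReader(text):
                rows.append((int(record["order_id"]), float(record["amount"])))
        # Anything else (images, free text) stays in the lake untouched.
    return rows
```

Calling read_orders(sample_lake()) returns the three orders while the image remains stored and simply unused by this particular reader. A different reader could later apply a different structure to the same raw objects.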
Data lakes and data warehouses each have their own jobs, and each does its job very well. The best one for you is determined by your company’s needs. Because a data warehouse organizes data upon entry, it supports steady, predictable analysis across categorized structures. Running the same standard queries and reports across uniform datasets is essential to many enterprises. Thus, data warehouses provide value that cannot be replaced by data lakes.
Conversely, getting a data lake to function like a reporting-friendly data warehouse is challenging. The open source tools frequently associated with data lakes are generally neither as easy to use nor as sophisticated as the more mature tools developed for structured data warehouses.
A data lake is not a data warehouse. The two were developed for different purposes, and each works best when used for what it was designed to do. If your enterprise already has a well-established data warehouse, you may want to consider adding a data lake alongside it if the need arises. If you need a storage reservoir that can hold any data, organized or unorganized, and keep it until you need it, then a data lake is the one for you.
With constantly evolving technology as well as advancements and developments in software specifically aimed at making data warehouses faster, more reliable, and more scalable, it will be very interesting to see what the future holds for data warehouses and data lakes.