Let’s first define what a data lake is.
James Dixon, the founder and CTO of Pentaho, has been credited with coming up with the term. This is how he describes a data lake:
“If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in or take samples.”
When data is locked away in separate, disconnected systems, the resulting problems are often referred to as information siloing.
PricewaterhouseCoopers mentioned that data lakes could "put an end to data silos." In their study on data lakes, they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository."
The term "data warehouse" was coined by William H. Inmon in the 1970s. Inmon, known as the Father of Data Warehousing, described a data warehouse as being “a subject-oriented, integrated, time-variant and non-volatile collection of data that supports management's decision-making process.”
Emmett Torney of DATUM said, "Smart devices, hyper connectivity, supercomputing and cloud are quickly changing the world we live in and the way companies conduct business. All of these technological drivers are being fuelled by one important asset: Data."
In her article "Data Lake vs Data Warehouse: Key Differences", Tamara Dull, Director of Emerging Technologies at SAS Institute, shares key differences between a data warehouse and a data lake.
| DATA WAREHOUSE | Versus | DATA LAKE |
|---|---|---|
| Structured, processed | DATA | Structured, semi-structured, unstructured, raw |
| Expensive for large data volumes | STORAGE | Designed for low-cost storage |
| Less agile, fixed configuration | AGILITY | Highly agile, configure as required |
| Business professionals | USERS | Data scientists, et al. |
A key difference is that a data warehouse only stores data that has been modelled and structured, while a data lake takes all data in its original form and stores it all: structured, semi-structured and unstructured. That includes data that would be useful to analyse today, in the future or perhaps never at all. Many companies are still analysing structured data, but “newer” data sources such as text data, streaming data and geospatial data are becoming part of an evolving data landscape.
"Data performance issues caused by centralizing data in an enterprise data warehouse have led to the creation of data marts, which solve performance problems by spreading the BI processing across multiple data stores," said Colin White, president of DataBase Associates Inc. and founder of BI Research. "It is often quicker and easier to build a data mart than to incorporate additional data into the enterprise data warehouse and then build the data mart from the data warehouse."
The problem with data marts is that organizations often build them directly from business transaction databases, rather than the enterprise data warehouse.
First and foremost, we have to give data a formal shape and structure before we can load it into a data warehouse. In other words, we need to model it. That’s called schema-on-write. With a data lake, however, you just load in the raw data as-is, and you give it shape and structure only when you’re ready to use it. That’s called schema-on-read. These are two very different approaches. Schema-on-write also means that the models have to be very well constructed, for if a model is not applicable, the final results can be worthless or even have negative consequences.
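The contrast between schema-on-write and schema-on-read can be illustrated with a small sketch. It is not taken from any particular warehouse or lake product: the sample records, the `sales` table and the function names are all hypothetical, and an in-memory SQLite database merely stands in for a data warehouse.

```python
import json
import sqlite3

# Hypothetical raw events, as they might arrive from a source system.
RAW_EVENTS = [
    '{"customer": "alice", "amount": "19.99", "country": "IE"}',
    '{"customer": "bob", "amount": "5.00"}',
    '{"customer": "carol", "amount": "12.50", "device": "mobile"}',
]


def load_schema_on_write(raw_lines):
    """Warehouse style: the table structure is fixed first, then rows must fit it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "customer TEXT NOT NULL, amount REAL NOT NULL, country TEXT NOT NULL)"
    )
    rejected = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            row = (record["customer"], float(record["amount"]), record["country"])
        except KeyError:
            # The record does not match the model, so it cannot be loaded.
            rejected.append(record)
            continue
        conn.execute(
            "INSERT INTO sales (customer, amount, country) VALUES (?, ?, ?)", row
        )
    return conn, rejected


def total_by_country_on_read(raw_lines):
    """Lake style: raw lines stay untouched; structure is applied at query time."""
    totals = {}
    for line in raw_lines:
        record = json.loads(line)
        country = record.get("country", "unknown")
        totals[country] = totals.get(country, 0.0) + float(record["amount"])
    return totals


if __name__ == "__main__":
    conn, rejected = load_schema_on_write(RAW_EVENTS)
    print(conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0], "rows loaded")
    print(len(rejected), "records rejected by the model")
    print(total_by_country_on_read(RAW_EVENTS))
```

In this sketch the warehouse path rejects the two records that lack a `country` field, while the lake path keeps every record and decides how to treat the missing field only when the question is asked.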
Processing technologies like open-source Hadoop make it possible to manage far larger quantities of data. One of the primary features of big data technologies like Hadoop is that the cost of storing data is relatively low compared to a data warehouse. There are two key reasons for this. First, Hadoop is open source software, so the licensing and community support are free. Second, Hadoop is designed to be installed on low-cost commodity hardware. Hadoop uses a computational paradigm named MapReduce, introduced by Google, to divide an application into many small fragments of work, each of which may be executed on any node in a cluster. For example, Visa was able to reduce processing time for two years’ worth of data (73 billion transactions) from one month to 13 minutes using Hadoop.
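The MapReduce idea described above can be sketched in a few lines of plain Python. This is a single-process illustration of the pattern only, not Hadoop's API: the transaction data and all function names are hypothetical, and on a real cluster each fragment would be handed to a different node rather than processed in a loop.

```python
from collections import defaultdict

# Hypothetical (merchant, amount_in_cents) transactions.
TRANSACTIONS = [
    ("grocer", 1200),
    ("cafe", 450),
    ("grocer", 800),
    ("bookshop", 1500),
    ("cafe", 350),
]


def split_into_fragments(records, fragment_size):
    """Divide the input into small, independent fragments of work."""
    return [
        records[i:i + fragment_size]
        for i in range(0, len(records), fragment_size)
    ]


def map_fragment(fragment):
    """Map step: emit (key, value) pairs for one fragment."""
    return [(merchant, amount) for merchant, amount in fragment]


def shuffle(mapped_fragments):
    """Group all emitted values by key, across every fragment."""
    grouped = defaultdict(list)
    for pairs in mapped_fragments:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(records, fragment_size=2):
    fragments = split_into_fragments(records, fragment_size)
    # Each fragment is independent, so on a cluster these calls could run in parallel.
    mapped = [map_fragment(fragment) for fragment in fragments]
    grouped = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(TRANSACTIONS))
```

Because each fragment is mapped independently, the result does not depend on how the input is split, which is what allows the work to be spread over many commodity machines.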
We know that a data warehouse is highly structured, and that its schema is defined before data is stored. The data in a traditional data warehouse is cleansed, whereas the data in a data lake is typically raw. While this structure makes the data warehouse a powerful storage option, it also makes changes within the data warehouse difficult. That’s why the increasing demand for self-service business intelligence and modern BI makes a data lake highly attractive.
A well-designed archive can improve data protection and restore operations, ease search and e-discovery efforts, and save money by intelligently moving data off expensive primary storage systems. A data warehouse is a highly structured repository, but changing that structure can be time-consuming. As a flexible, open source data storage technology, Hadoop offers improved processing at just five percent of the cost of relational database technology. A data lake, on the other hand, lacks the structure of a data warehouse, which gives developers and data scientists the ability to easily configure and reconfigure their models, queries and apps on-the-fly.
Apache Hadoop, a grid technology, is increasingly popular for storing massive amounts of data. By default, Hadoop runs in non-secure mode. When service-level authentication is turned on, Hadoop end-users must be authenticated by Kerberos, the popular computer network authentication protocol. A data lake is not a data warehouse. Each is optimized for a different purpose, and the goal is to use each one for what it was designed to do.
Data warehouses are made up of data that has already been integrated, but they are limited in that they have trouble hosting data from unstructured sources, such as data collected from product sensors, social media and other non-traditional sources.
Writer Amber Lee Dennis notes, "Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of data lake) are relatively new. Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake. It should be noted, however, that there’s a significant effort being placed on security right now in the big data industry. It’s not a question of if, but when."
"For a long time, the rally cry has been BI and analytics for everyone," said Dennis. A data lake is much more agile than a data warehouse, making it ripe for modern BI systems. Structured data is easily ordered and processed within the data lake, resulting in analyzed data that users can sift through quickly to gain insight. Data lakes also encourage self-service data discovery.
The BI ecosystem, made up of an enterprise data warehouse, a data lake and potentially a discovery platform that facilitates analytics across the architecture, will determine what data and what analytics are used and where they are executed.
In short, the data lake has nearly unlimited potential but requires transformations before it can deliver insights; a data warehouse requires significant investment in advance, yet in return makes data easy to analyze for anyone with the skills required to query it. A data lake is a low-cost alternative for data storage for companies that want to use external data, and it can pull directly from hundreds, if not thousands, of external data sources.
These days, technology and skills become obsolete or redundant very quickly. Digital technologies evolved from web technologies but now take the form of enterprise applications. They are needed for integration, ERP, SCM, CRM, e-commerce, cloud, ETL, LOB, IaaS, social networks, mobile devices and the Internet of Things (IoT), to name a few.
What is bringing disruption to the space is automation. According to Cambridge Semantics, "Using tools from Oracle, IBM, Teradata and Microsoft, setting up, maintaining and evolving data warehouses has always required vast, expensive resources and infrastructure. Nobody ever really wants to create a new one."
Eran Levy of Sisense noted, "A data lake can be used for sandboxing, allowing users to experiment with different data models and transformations, before setting up new schema in a data warehouse. It can also serve as a staging area from which to supply data to a data warehouse to then produce cleansed data with known value." Moreover, a data lake can contain any type of data – clickstream, machine-generated, social media, external data and even audio, video and text.
Traditional data warehouses are limited to structured data. Presently, the business world urgently needs quick access to new data: data coming from outside the organization, and unstructured data, which makes up approximately 75% of the information in an enterprise. Cambridge Semantics says this need has combined with "the decreasing costs of data storage in recent years and the emergence of big data and NoSQL toolsets, in which enterprises began turning to data lakes as an alternative to the challenges of creating yet another data warehouse."
Remember, a data lake is not a data warehouse. Each is optimized for a different purpose, and the goal is to use each one for what it was designed to do. By using each option where it fits, enterprises and organizations can get the best of both solutions.