Demystifying Data Lake Architecture
According to Gartner, 80% of successful CDOs will have value creation or revenue generation as their Number 1 priority through 2021.
To create maximum value out of an organization’s data landscape, traditional decision support system architectures are no longer adequate. New architectural patterns need to be developed to harness the power of data. To fully capture the value of big data, organizations need flexible data architectures and the ability to extract maximum value from their data ecosystem.
The Data Lake concept has been around for some time now. However, I have seen organizations struggle to understand it, as many are still boxed into the older paradigm of Enterprise Data Warehouses.
In this article, I will deep-dive into the conceptual constructs of a Data Lake and lay out an architecture pattern for it.
Let’s start with what we already know.
Traditional Data Warehouse (DWH) Architecture:
The traditional Enterprise DWH architecture pattern has been used for many years. There are data sources; data is extracted, transformed and loaded (ETL), and along the way we impose some structure, cleanse the data and so on. We predefine the data model in the EDW (a dimensional or 3NF model) and then create departmental data marts for reporting, OLAP cubes for slicing and dicing, and self-service BI.
This pattern is quite ubiquitous and has served us well for a long time now.
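To make the flow concrete, here is a minimal ETL sketch in Python with pandas. The file name, column names and target table are hypothetical; the point is simply that structure and cleansing are imposed before the data lands in the warehouse.

```python
import pandas as pd

# Extract: pull a raw export from the source system (hypothetical CSV extract).
orders = pd.read_csv("orders_export.csv")

# Transform: cleanse and conform the rows to the predefined dimensional model.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders = orders.dropna(subset=["customer_id", "product_id", "amount"])

# Load: populate the fact table, keyed against pre-built dimension tables.
fact_sales = orders[["order_date", "customer_id", "product_id", "amount"]]
fact_sales.to_parquet("fact_sales.parquet", index=False)
```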
However, there are inherent challenges in this pattern that prevent it from scaling in the era of Big Data. Let’s look at a few of them:
- Firstly, the philosophy we work with is that we need to understand the data first: what the source system structure is, what kind of data it holds, what the cardinality is, how we should model it based on the business requirements, whether there are any anomalies in the data, and so on. This is tedious and complex work. I used to spend at least 2–3 months in the requirement analysis and data analysis phase, and EDW projects span anywhere from a few months to a few years. And all of this rests on the assumption that the business knows its requirements.
- We also have to make choices and compromises about which data to store and which to discard. A lot of time is spent upfront deciding what to bring in, how to bring it in, how to store it, how to transform it and so on. Less time is spent actually performing data discovery, uncovering patterns, or creating new hypotheses that add business value.
Definition of Data:
Let us now discuss briefly how the definition of data has changed. The four Vs of big data are now very well known: volume, velocity, variety and veracity. Let me put some context around them:
- Data volumes have exploded since the iPhone revolution. With around 6 billion smartphones in use, a staggering amount of new data is created every year.
- Data is not just at rest. There is streaming data and IoT-enabled connected devices, with a plethora of data emanating from multiple fronts.
- It is also about the variety of data. Video feeds and photographs are all data points now that demand to be analyzed and exploited.
- With the explosion of data also comes the challenge of data quality. Deciding which data should be trusted and which should not is a bigger challenge in the big data world.
In short, the definition of analysable data has changed. It is no longer just structured corporate data but all kinds of data, and the challenge is to mash it all together and make sense of it.
Moore’s Law:
Since 2000 there have been tremendous changes in processing capability, storage and the corresponding cost structure, following what we call Moore’s Law. Key points:
- Processing capability has increased by around 10,000 times since 2000, which means the ability to analyse more data efficiently has increased accordingly.
- The cost of storage has also come down considerably: since 2000 it has fallen more than 1,000 times.
The Data Lake Analogy:
Let me explain the concept of Data Lake using an analogy.
Visiting a large lake is always a very pleasant experience. The water in the lake is in its purest form, and different people perform different activities on the lake. Some people are fishing, some are enjoying a boat ride, and the lake also supplies drinking water to people living in Ontario. In short, the same lake serves multiple purposes.
With the changes in the data paradigm, a new architectural pattern has emerged: the Data Lake Architecture. Like the water in the lake, data in a data lake is in the purest possible form. And just as the lake caters to different people, whether they want to fish, take a boat ride or draw drinking water, a data lake architecture caters to multiple personas. It gives data scientists an avenue to explore data and create hypotheses. It gives business users an avenue to explore data. It gives data analysts an avenue to analyze data and find patterns. And it gives reporting analysts an avenue to create reports and present them to stakeholders.
The way I compare a data lake to a data warehouse or a data mart is this:
A data lake stores data in its purest form, caters to multiple stakeholders, and can also be used to package data in a form that end users can consume. A data warehouse, on the other hand, holds data that has already been distilled and packaged for defined purposes.
Conceptual Data Lake Architecture:
Having explained the concept, let me now walk you through a conceptual architecture of a data lake. Here are the key components. We have our data sources, which can be structured or unstructured. They all feed into a raw data store that ingests data in the purest possible form, i.e. with no transformations. It is cheap, persistent storage that can hold data at scale. Then we have the analytical sandbox, which is used for understanding the data, creating prototypes, performing data science and exploring the data to build new hypotheses and use-cases.
Then we have a batch processing engine that processes the raw data into something users can consume, i.e. a structure that can be used for reporting to the end user; we call this the processed data store. There is also a real-time processing engine that takes streaming data and processes it. All the data in this architecture is cataloged and curated.
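As a minimal sketch of what "no transformations" means in practice, here is how a raw-zone landing step might look in Python. The folder layout and source names are assumptions on my part; the essential point is that files are copied into the raw data store exactly as received, partitioned only by ingestion date.

```python
import datetime
import pathlib
import shutil

def land_raw(source_file: str, source_name: str, lake_root: str = "datalake/raw") -> pathlib.Path:
    """Copy a source file into the raw zone untouched, partitioned by ingestion date."""
    today = datetime.date.today()
    target_dir = pathlib.Path(lake_root) / source_name / f"ingest_date={today:%Y-%m-%d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    # No parsing, no cleansing, no modelling: the file lands as-is.
    return pathlib.Path(shutil.copy2(source_file, target_dir))

# Example (hypothetical file): land_raw("crm_customers_20210101.json", "crm")
```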
Let me walk you through each component group in this Architecture.
Lambda:
The first component group caters to processing data. It follows an architecture pattern called the Lambda Architecture. Basically, a Lambda architecture takes two processing paths: a batch layer and a speed layer. The batch layer stores data in the rawest possible form, i.e. the raw data store, while the speed layer processes the data in near real time. The speed layer also stores data into the raw data store and may hold transient data before loading it into the processed data stores.
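Here is a minimal sketch of the two paths in PySpark, assuming a hypothetical orders feed. The batch layer periodically rebuilds a processed view from everything in the raw store, while the speed layer handles the same events as they arrive (a Kafka source is assumed here, which also requires the Spark–Kafka connector). Paths, topic and broker names are illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer: recompute the processed view from the full raw data store.
raw_orders = spark.read.json("datalake/raw/orders/")            # schema-on-read
daily_sales = (raw_orders
               .groupBy("order_date", "product_id")
               .agg(F.sum("amount").alias("total_amount")))
daily_sales.write.mode("overwrite").parquet("datalake/processed/daily_sales/")

# Speed layer: process the same events in near real time as they stream in,
# landing them in the raw zone as well.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")     # hypothetical broker
          .option("subscribe", "orders")
          .load())
(stream.selectExpr("CAST(value AS STRING) AS order_json")
       .writeStream
       .format("parquet")
       .option("path", "datalake/raw/orders_stream/")
       .option("checkpointLocation", "datalake/checkpoints/orders_stream/")
       .start())
```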
Analytical Sandboxes:
Analytical sandboxes are one of the key components of a data lake architecture. They are the exploratory areas for data scientists, where they can develop and test new hypotheses, mash up and explore data to form new use-cases, and create rapid prototypes to validate those use-cases and work out how to extract value for the business.
It’s the place where data scientists can discover data, extract value and help transform the business.
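As a small illustration, a sandbox session might look like the following pandas sketch, pulling a slice of raw clickstream data alongside a processed table to test a quick hypothesis. All paths and column names here are hypothetical.

```python
import pandas as pd

# Pull a slice of raw events and a curated table into the sandbox for exploration.
clicks = pd.read_json("datalake/raw/web_clicks/ingest_date=2021-01-01/events.json", lines=True)
daily_sales = pd.read_parquet("datalake/processed/daily_sales/")

# Hypothesis: sessions with more page views convert at a higher rate.
sessions = clicks.groupby("session_id").agg(page_views=("page_url", "count"),
                                            converted=("converted", "max"))
print(sessions.groupby("page_views")["converted"].mean())
```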
Cataloging and Governance:
Data cataloging is an important principle that has been constantly overlooked in traditional business intelligence. In the big data landscape, cataloging is the most important aspect to focus on. Let me first give an analogy to explain what cataloging is; I do this exercise with my customers to get the point across.
I show them a painting and ask them to guess its potential value without providing any catalog information; the answers range from $100 to $100,000. When I provide the catalog information, the guesses come much closer to the actual value. The painting, by the way, is ‘The Old Guitarist’, painted by Pablo Picasso in 1903. Its estimated value is more than $100 million.
A data catalog is very similar. Different data nuggets have different value, and this value varies based on the lineage of the data, its quality, its source of creation and so on. Data needs to be cataloged so that a data analyst or a data scientist can decide for themselves which data point to use for a specific analysis.
Catalog Map:
The catalog map lays out the potential metadata that can be cataloged. Cataloging is the process of capturing valuable metadata so that it can be used to determine the characteristics of the data and to decide whether or not to use it. There are basically two types of metadata: business and technical. Business metadata deals with definitions, logical data models, logical entities and so on, whereas technical metadata captures the metadata related to the physical implementation of the data structure, including things like the database, quality score, columns, schema, etc.
Based on the catalog information, an analyst can choose to use a specific data point in the right context. Let me give you an example. Imagine that a data scientist wants to do an exploratory analysis of Inventory Turnover Ratio, and the way it is defined in the ERP and in an inventory system differs. If the term is cataloged, the data scientist can decide, based on the context, whether to use the column from the ERP or from the inventory system.
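A minimal sketch of what a catalog entry could capture is shown below, using a plain Python dataclass. The fields and the two Inventory Turnover Ratio entries are illustrative, not a prescription for any particular catalog tool.

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    # Business metadata: what the data means and who owns it.
    term: str
    definition: str
    business_owner: str
    # Technical metadata: where and how it is physically stored.
    source_system: str
    table: str
    column: str
    quality_score: float   # e.g. 0.0 - 1.0, from data profiling
    lineage: str

# Two entries for the same business term, cataloged from different systems.
itr_erp = CatalogEntry(
    term="Inventory Turnover Ratio",
    definition="Cost of goods sold divided by average inventory (finance view)",
    business_owner="Finance",
    source_system="ERP", table="FIN_KPI", column="INV_TURNOVER",
    quality_score=0.95, lineage="Derived nightly from GL postings")

itr_inv = CatalogEntry(
    term="Inventory Turnover Ratio",
    definition="Units shipped divided by average units on hand (operations view)",
    business_owner="Supply Chain",
    source_system="InventorySystem", table="STOCK_METRICS", column="TURNOVER_RATIO",
    quality_score=0.80, lineage="Computed weekly from warehouse snapshots")
```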
Key Differences between a Data Lake and an EDW:
Here are the key differences:
- First, the philosophy is different. In a data lake architecture, we load the data first, in its raw form, and then decide what to do with it. In a traditional DWH architecture, we must first understand the data, model it and only then load it.
- Data in a data lake is stored in raw form, whereas data in a DWH is stored in a structured form. Remember the lake and the distilled water.
- A data lake supports all kinds of users.
- Analytics projects are really agile projects: once you see the output, you think more and want more. Data lakes are agile by nature. Since they store all the data along with its catalog, new requirements can be accommodated quite easily as they emerge.
Data Lake Architecture on Azure:
Cloud platforms are best suited to implementing the Data Lake Architecture. They offer a host of composable services that can be woven together to achieve the required scalability. Microsoft’s Cortana Intelligence Suite provides components that can be mapped to each part of the Data Lake Architecture.
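As an illustration, one possible mapping of the conceptual components onto Azure services from that suite is sketched below. This is an assumption on my part about which service plays which role, not a fixed prescription; most components have more than one viable choice.

```python
# One possible (illustrative) mapping of data lake components to Azure services.
azure_mapping = {
    "Raw data store":           "Azure Data Lake Store / Blob Storage",
    "Batch processing engine":  "Azure Data Factory + HDInsight (or Databricks)",
    "Real-time processing":     "Azure Event Hubs + Stream Analytics",
    "Processed data store":     "Azure SQL Data Warehouse",
    "Analytical sandbox":       "Azure Machine Learning / HDInsight notebooks",
    "Cataloging & governance":  "Azure Data Catalog",
    "Reporting & consumption":  "Power BI",
}

for component, service in azure_mapping.items():
    print(f"{component:<26} -> {service}")
```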
Key Takeaways:
- Data Lakes represent a paradigm shift for Big Data architecture.
- Data Lakes cater to all kinds of data, store data in raw form, serve a spectrum of users and enable faster insights.
- Meticulous data cataloging and governance are key to a successful data lake implementation.
- Cloud platforms offer an end-to-end solution for implementing a data lake architecture in an economical and scalable way.