Making the Data Lakehouse Real yet Effective
Once upon a time, decisions were supported by decision support systems called Data Warehouses.
They neatly organized structured data into schemas: stars, snowflakes, and normal forms.
Then came the 2010s. Three things culminated in a perfect storm:
- The definition of data changed. New types of data appeared that had no good name; they came to be called unstructured or semi-structured data.
- Cloud computing matured. Its innovations made a real economic impact: the cost of storing data plummeted.
- Processing capabilities rose. We could now process more data at a fraction of the cost.
Among all this development came the yellow elephant called Hadoop — a distributed computing paradigm that enabled storing and processing unstructured data.
The Hadoop ecosystem evolved rapidly, and a new paradigm for storing and processing unstructured data emerged: the Data Lake.
The Data Warehouse and the Data Lake co-existed, but the co-existence brought its own challenges. Data Warehouses could not cope with the demand for producing the required insights. Data Lakes soon became data swamps due to the lack of structure and governance. The technological disparities kept growing.
In early 2020, experts pitched a new term: what if the best of both worlds were merged into a new architecture pattern? It was called the Data Lakehouse architecture pattern.
This article will explore the components of this new architectural pattern.
What Is a Lakehouse Architecture?
The figure above depicts the pros and cons of a Data Warehouse and a Data Lake.
The Data Lakehouse strives to combine the resilience of a Data Warehouse with the flexibility of a Data Lake. Data Warehouses were created to support Business Intelligence: churning out reports, populating downstream data marts, and supporting self-service BI are their critical use-cases. The Data Lake was conceptualized to exploit data for data science: creating and testing hypotheses through rapid experimentation and exploiting semi-structured/unstructured data are its critical use-cases.
What if one could combine the pros of a Data Warehouse and a Data Lake while minimizing the cons of each? The result is the Data Lakehouse paradigm.
There are five essential features in a Data Lakehouse paradigm:
- It supports the analysis of both structured and unstructured data.
- It is apt for both analysts and data scientists. It can support reporting and machine learning/AI-related use-cases.
- It is governable to prevent it from becoming a swamp.
- It has a robust security architecture ensuring the right access to the right stakeholders with proper security frameworks implemented around the data it stores.
- Lastly, it needs to scale cost-effectively.
Conceptual Data Lakehouse Architecture
Now that the features of a Data Lakehouse are defined, let us drill down into its conceptual architecture.
The figure above depicts a conceptual architecture of the Data Lakehouse. It shows the core components that weave together to form the new architectural paradigm.
- The data can be structured or unstructured. It can be at rest, as in a relational database, or in motion, as in constant feeds. The goal is to convert this data into insights.
- Data ingestion services, as the name suggests, ingest the data into the data lake. Many types of ingestion patterns, catering to both batch and stream loading, can fulfill these services. As a rule of thumb, the data should be ingested with no transformations.
- The ingested data is stored in a zone of the lake called the raw data zone, also known as the bronze layer. The data structure follows the source data structure and de-couples the source from downstream transformations.
- The data processing service processes the data from the raw data zone. It performs operations like cleansing, joining, and applying complex business logic, and prepares the data in a format that can further be used for downstream analysis, i.e., AI or BI.
- Data is also periodically deposited in an interim layer called the cleansed data zone, a.k.a. the silver layer, which eliminates the need to duplicate the same processing multiple times. The final processed data is stored in the processed data zone, a.k.a. the gold layer.
- The data so far resides in the lake, where it can serve many use-cases ranging from ad-hoc analysis to machine learning and reporting. However, data lakes may not be conducive to structured reporting or self-service BI. Data Warehouses excel at such requirements, so the serving data store plays the role of a Data Warehouse.
- Data cataloging services ensure that all the source system data, the data in the data lake and the data warehouse, the data processing pipelines, and the outputs extracted from the data lakehouse are appropriately cataloged. This prevents the data lakehouse from becoming a swamp. Think of data cataloging services as the Facebook of data, where you can get information about all the data lakehouse contents, the relationships between the data, and the lineage of transformations the data has gone through.
- Analytics services are used for multiple purposes. Data scientists can create analytical sandboxes to perform experimentation and hypothesis testing. Data analysts can create sandboxes for firing quick queries and performing ad-hoc analysis on the data. AI/ML services run and maintain the models. BI services provide users with self-service BI capabilities and rich visualizations.
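To make the zone flow concrete, here is a minimal Python sketch of a bronze/silver/gold pipeline. The paths, file format, and record shape are illustrative assumptions, not part of the pattern itself; a real lakehouse would use a distributed engine and an open table format rather than local JSON files.

```python
import json
from pathlib import Path

# Hypothetical zone layout; directory names mirror the medallion layers.
LAKE = Path("lake")
BRONZE, SILVER, GOLD = LAKE / "bronze", LAKE / "silver", LAKE / "gold"

def ingest(records, source):
    """Land source records in the bronze (raw) zone with no transformations."""
    BRONZE.mkdir(parents=True, exist_ok=True)
    (BRONZE / f"{source}.json").write_text(json.dumps(records))

def cleanse(source):
    """Bronze -> silver: drop malformed rows and normalize types."""
    SILVER.mkdir(parents=True, exist_ok=True)
    raw = json.loads((BRONZE / f"{source}.json").read_text())
    clean = [{"id": r["id"], "amount": float(r["amount"])}
             for r in raw if "id" in r and "amount" in r]
    (SILVER / f"{source}.json").write_text(json.dumps(clean))

def aggregate(source):
    """Silver -> gold: apply business logic, store analysis-ready output."""
    GOLD.mkdir(parents=True, exist_ok=True)
    clean = json.loads((SILVER / f"{source}.json").read_text())
    total = sum(r["amount"] for r in clean)
    (GOLD / f"{source}_summary.json").write_text(json.dumps({"total": total}))

# Raw feed lands as-is; malformed rows are only filtered downstream.
ingest([{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "4.5"},
        {"bad": True}], "orders")
cleanse("orders")
aggregate("orders")
print(json.loads((GOLD / "orders_summary.json").read_text()))  # {'total': 15.0}
```

Note how the bronze write applies no transformations at all: the malformed row is kept in the raw zone and only filtered on the way to silver, which de-couples the source from downstream logic.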
Data Lakehouse Architectural Principles
- De-couple compute and storage: The first principle is to de-couple compute and storage. Storage is cheap and persistent; compute is expensive and ephemeral. De-coupling the two gives the flexibility to spin up compute services on demand and scale them.
- Have purpose-driven storage layers: Data manifests in multiple shapes and forms. The way the data is stored should be flexible enough to cater to the different forms and intents of the data, including having different storage layers (relational, graph, document-based, and blob) based on the kind of data and how it needs to be served.
- Modular architecture: Taking a cue from Service-Oriented Architecture (SOA), this principle ensures that the data stays at the core. Bringing the required services to the data is the key: services like data ingestion, data processing, data cataloging, and analytics are brought to the data rather than the data being piped to the services.
- Focus on functionality rather than technology: This principle embodies flexibility. Functionality changes slowly; technology changes faster. Focusing on the task a component fulfills makes it easy to replace the technical piece as the technology evolves.
- Active cataloging: Active cataloging is one of the most essential principles. Cataloging is the key to preventing the Data Lake from becoming a data swamp, and clear governance principles around cataloging go a long way in ensuring that the data in the lake is well documented. As a rule of thumb, catalog the following:
- Any source that needs to be ingested into the lake.
- Any data that is stored in the lake or the serving layer.
- The lineage of how data is transformed from source to the serving layer.
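The three cataloging rules above can be sketched as a tiny in-memory catalog. The entry fields, zone names, and dataset names below are hypothetical, chosen only to illustrate how registering each source, each stored dataset, and its upstream links yields lineage from the serving layer back to the source.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    zone: str  # e.g. "source", "bronze", "silver", "gold", "serving"
    # Lineage: names of the datasets this entry is derived from.
    upstream: list = field(default_factory=list)

class Catalog:
    def __init__(self):
        self.entries = {}

    def register(self, name, zone, upstream=()):
        self.entries[name] = CatalogEntry(name, zone, list(upstream))

    def lineage(self, name):
        """Walk upstream links back to the original sources."""
        chain = [name]
        for up in self.entries[name].upstream:
            chain.extend(self.lineage(up))
        return chain

catalog = Catalog()
catalog.register("crm.orders", "source")                                # rule 1
catalog.register("bronze.orders", "bronze", upstream=["crm.orders"])    # rule 2
catalog.register("silver.orders", "silver", upstream=["bronze.orders"])
catalog.register("gold.orders_summary", "gold", upstream=["silver.orders"])
print(catalog.lineage("gold.orders_summary"))                           # rule 3
# ['gold.orders_summary', 'silver.orders', 'bronze.orders', 'crm.orders']
```

A production lakehouse would use a managed catalog service for this, but the essential contract is the same: every dataset is registered, and every registration records where the data came from.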
Data is complex and evolving. The business changes rapidly, requirements evolve, and the architecture needs to be flexible enough to cater to these changes. These five architectural principles make the pattern:
disciplined at the core yet flexible at the edges.
In the next part of this series, I will discuss how Microsoft Azure services bring the Data Lakehouse architecture to fruition.