Making Data Lakehouse Real on Azure

Data Lakehouse Conceptual Architecture

  1. Data Lake Stores: The components that store the data effectively.
  2. Serving Data Stores: The components that serve the data effectively.
  3. Data processing services: The components that process the data.
  4. Data cataloging and curation services: The components that prevent the data lake from becoming a swamp.
  5. Data security services: The components that secure the data in the Lakehouse.
  6. Analytics services: The components that help to transform data into insights.

Data Lakehouse on Azure

1. Data ingestion services

Azure Data Factory (ADF) is the preferred Azure service for ingesting data. It is a fully managed, serverless data integration solution for ingesting, preparing, and transforming data at scale. One can use this service to ingest data from multiple data sources into the cloud. ADF supports 90+ data sources out of the box through well-integrated connectors. ADF is a platform-as-a-service offering, implying that it has built-in high availability (HA), can scale on demand, and abstracts the infrastructure layer from the user.

2. Data Lake Stores

Azure Data Lake Storage (ADLS) is the preferred service to be used as the Data Lake store. ADLS has enterprise-grade features, including durability (16 9s), mechanisms for protection across data access, encryption, and network-level control, and, in theory, limitless scale. ADLS is also optimized to store big data, meaning that it can store any data (structured and unstructured), and it offers integration with visualization tools for standard analytics. ADLS can also store data in Delta Lake format. Delta Lake is an open-format storage layer that delivers reliability, security, and performance on the data lake for both streaming and batch operations. Support for ACID transactions and schema enforcement is another salient feature of Delta Lake. With support for ACID transactions, Delta Lake provides the reliability that traditional data lakes lack.
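To make schema enforcement and all-or-nothing writes more concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Delta Lake API: the ToyTable class, the SchemaMismatchError exception, and the column names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


class SchemaMismatchError(ValueError):
    """Raised when a batch does not match the table schema."""


@dataclass
class ToyTable:
    # Hypothetical in-memory stand-in for a table; not the Delta Lake API.
    schema: dict                      # column name -> expected Python type
    rows: list = field(default_factory=list)

    def append(self, batch):
        """Validate the whole batch first, then commit it in one step."""
        for row in batch:
            if set(row) != set(self.schema):
                raise SchemaMismatchError(f"unexpected columns: {sorted(row)}")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise SchemaMismatchError(f"{column!r} must be {expected.__name__}")
        # Reached only if every row passed, so the append is all-or-nothing.
        self.rows = self.rows + [dict(row) for row in batch]


table = ToyTable(schema={"order_id": int, "amount": float})
table.append([{"order_id": 1, "amount": 9.5}])

try:
    # The second row has a string amount, so the whole batch is rejected.
    table.append([{"order_id": 2, "amount": 3.0}, {"order_id": 3, "amount": "n/a"}])
except SchemaMismatchError as err:
    rejected = str(err)
```

After the failed append the table still holds only the first row. That is the behavior the ACID and schema-enforcement guarantees are meant to provide: a bad batch never leaves the table half-written.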

3. Serving Data Stores

Serving data stores can serve multiple purposes. The technologies employed to serve the data depend on how the data needs to be served, the audience to which it is served, and the non-functional requirements that must be met for the users.

4. Data processing services

The data processing services entail two types of processing. The first type is batch processing, which processes data in discrete batches. The second type is stream processing, which processes data as it arrives, without first landing it on disk, i.e., in real time. There are five key processing engines in Azure that help to process data.
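The difference between the two processing styles can be sketched in a few lines of plain Python. This is an illustration of the concept only and does not represent any Azure processing engine; the function names and the sample readings are hypothetical.

```python
from typing import Iterable, Iterator


def batch_total(readings: list) -> float:
    """Batch style: the full data set is available before processing starts."""
    return sum(readings)


def streaming_totals(readings: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated total as each event arrives."""
    total = 0.0
    for value in readings:
        total += value
        yield total


events = [2.0, 3.0, 5.0]
batch_result = batch_total(events)                      # one answer at the end
stream_results = list(streaming_totals(iter(events)))   # one answer per event
```

The batch function produces a single result once all the data has landed, while the streaming generator yields an up-to-date result after every event without needing the full data set first.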

5. Analytics services

Analytics services entail a group of services that are used to turn data into insights. It is a broad category that includes business intelligence, machine learning/artificial intelligence, and ad-hoc data analysis using good old SQL. Depending on the use case and their technical skills, the targeted stakeholders can use different types of analytical services. These analyses can be sourced either from the data lake or from the serving data stores. The goal is to get the data to the right stakeholder at the right time.

6. Data cataloging services

One cannot overemphasize the importance of data cataloging. The key to ensuring that the Data Lakehouse doesn't turn into a swamp is to have non-negotiable rules on data cataloging. Four fundamental tenets for data cataloging that one could employ are:

  1. Any data source that publishes data into the Data Lakehouse needs to be cataloged.
  2. All artifacts that store data need to have technical and business metadata cataloged. This may include various attributes, including sensitivity and data classifications.
  3. Data transformation lineage should be cataloged and should depict data transformation from the source to the downstream.
  4. The cataloging information should be easily searchable and accessible to the right stakeholders.

Azure Purview is the governance tool on Azure that enables active cataloging of data. It is a software-as-a-service (SaaS) offering with rich data governance features, including automated data discovery, visual mapping of data assets, semantic search, and tracking of data location and movement across the data landscape.
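The cataloging tenets above can be modeled as a small data structure. The sketch below is a hypothetical in-memory catalog written in plain Python. It is not the Azure Purview API, and the CatalogEntry and Catalog classes, their field names, and the sample assets are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    # Hypothetical record layout; field names are illustrative only.
    name: str
    source: str                       # tenet 1: the publishing data source
    technical_metadata: dict          # tenet 2: e.g. format, columns
    business_metadata: dict           # tenet 2: e.g. owner, description
    sensitivity: str = "internal"     # tenet 2: classification attribute
    upstream: list = field(default_factory=list)  # tenet 3: lineage links


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def lineage(self, name):
        """Tenet 3: walk upstream links back to the original sources."""
        chain = []
        for parent in self._entries[name].upstream:
            chain.extend(self.lineage(parent))
            chain.append(parent)
        return chain

    def search(self, term):
        """Tenet 4: case-insensitive search over names and descriptions."""
        term = term.lower()
        return sorted(
            e.name for e in self._entries.values()
            if term in e.name.lower()
            or term in str(e.business_metadata.get("description", "")).lower()
        )


catalog = Catalog()
catalog.register(CatalogEntry("raw_orders", "erp", {"format": "csv"},
                              {"description": "Orders as received"}))
catalog.register(CatalogEntry("clean_orders", "lakehouse", {"format": "delta"},
                              {"description": "Validated orders"},
                              sensitivity="confidential", upstream=["raw_orders"]))
catalog.register(CatalogEntry("sales_report", "lakehouse", {"format": "delta"},
                              {"description": "Monthly sales"},
                              upstream=["clean_orders"]))
```

Calling catalog.lineage("sales_report") traces the report back through clean_orders to raw_orders, and catalog.search("orders") finds the two order data sets. A real governance tool adds automated discovery and access control on top of this kind of record.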

7. Data security services

In this era of sophisticated cybercrimes, data security should be a default and not an afterthought. Data in the Data Lakehouse should be as secure as it can be. A proper security governance framework needs to be established to ensure that the data is stored safely and compliantly, that access is provided to the right stakeholders using appropriate authentication and authorization strategies, and that advanced cybersecurity principles are applied to keep the data protected.
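As one illustration of an authorization strategy, the sketch below shows a deny-by-default, role-based access check in plain Python. It does not represent any Azure security service; the role names, zone names, and functions are hypothetical and chosen only to show the idea.

```python
# Hypothetical role-to-permission mapping; role and zone names are illustrative.
ROLE_PERMISSIONS = {
    "data_engineer": {("raw", "read"), ("raw", "write"),
                      ("curated", "read"), ("curated", "write")},
    "analyst": {("curated", "read")},
}


def is_authorized(roles, zone, action):
    """Return True if any of the caller's roles grants the action on the zone."""
    return any((zone, action) in ROLE_PERMISSIONS.get(role, set()) for role in roles)


def read_zone(user_roles, zone):
    # Deny by default: access is granted only on an explicit match.
    if not is_authorized(user_roles, zone, "read"):
        raise PermissionError(f"read access to {zone!r} denied")
    return f"contents of {zone}"
```

An analyst can read the curated zone but not the raw zone, and a caller with an unknown role is denied everywhere. In practice the mapping would live in an identity and access management service rather than in application code.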

Key Takeaways

The following are the key takeaways from this two-blog series:

1. The Data Lakehouse paradigm is an evolving pattern.

2. Organizations adopting this pattern must be disciplined at the core and flexible at the edges.

3. Cloud computing provides the scalable and cost-effective services that can bring the Data Lakehouse pattern to fruition.


