Making Data Lakehouse Real on Azure

Pradeep Menon
7 min read · Mar 15, 2021


In the article Making Data Lakehouse real…yet effective, the concept of a Data Lakehouse was introduced. Its architectural paradigm was discussed, and the components that weave the architecture together were explained.

In this follow-up article, the Data Lakehouse architecture is brought to fruition using Microsoft Azure services.

This article maps the various Azure services onto the different components of the Data Lakehouse architecture.

Data Lakehouse Conceptual Architecture

As a recap: in the precursor article, the conceptual Data Lakehouse architecture was introduced. The seven components that form this conceptual architecture are as follows:

  1. Data ingestion services: The components to get the data in.
  2. Data Lake Stores: The components that store the data effectively.
  3. Serving Data Stores: The components that serve the data effectively.
  4. Data processing services: The components that process the data.
  5. Data cataloging and curation services: The components that prevent the data lakes from becoming a swamp.
  6. Data security services: The components that secure the data in the Lakehouse.
  7. Analytics services: The components that help to transform data into insights.

The architecture depicted above is an evolving one. Its components satisfy specific functionality and are guided by key architectural principles that make this architectural pattern effective.

These components, when realized for deployment, can manifest in many products. Let us zoom into the Azure cloud services that can turn this architecture into reality.

Data Lakehouse on Azure

Cloud services innovate rapidly, and Microsoft’s Azure services are no exception. The figure above maps the current and most apt Azure services onto the Data Lakehouse architecture. Let us investigate each of the components.

1. Data ingestion services

Azure Data Factory (ADF) is the preferred Azure service for ingesting data. It is a fully managed, serverless data integration solution for consuming, preparing, and transforming all your data at scale. One can use ADF to ingest data from multiple data sources into the cloud; it supports 90+ data sources out of the box through well-integrated connectors. ADF is a platform-as-a-service offering, which means it has built-in high availability, can scale on demand, and abstracts the infrastructure layer from the user.
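
As a rough illustration, the sketch below defines an ingestion pipeline with a single copy activity using the azure-mgmt-datafactory Python SDK. This is a minimal sketch, not a prescribed pattern: the subscription, resource group, factory, pipeline, and dataset names are placeholders, and the referenced datasets are assumed to have been created beforehand.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, CopyActivity, DatasetReference, BlobSource, BlobSink,
)

# Placeholder identifiers; the datasets referenced below are assumed to exist already.
credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# A single copy activity that moves data from a source dataset into the raw zone.
copy_activity = CopyActivity(
    name="CopySalesToLake",
    inputs=[DatasetReference(reference_name="SourceSalesDataset")],
    outputs=[DatasetReference(reference_name="RawZoneSalesDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

adf_client.pipelines.create_or_update(
    "my-resource-group",
    "my-data-factory",
    "IngestSalesPipeline",
    PipelineResource(activities=[copy_activity]),
)
```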

2. Data Lake Stores

Azure Data Lake Storage (ADLS) is the preferred service to use as the Data Lake store. ADLS has enterprise-grade features, including high durability (16 nines), mechanisms for protecting data access, encryption, and network-level controls, and it provides, in theory, limitless scale. ADLS is also optimized for big data, meaning it can store any data (structured and unstructured), and it integrates with visualization tools for standard analytics. ADLS can also store data in the Delta Lake format. Delta Lake is an open-format storage layer that delivers reliability, security, and performance on the data lake, for both streaming and batch operations. Support for ACID transactions and schema enforcement is another salient feature of Delta Lake. With ACID transactions, Delta Lake provides the reliability that traditional data lakes lack.
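
To make the ACID and schema-enforcement points concrete, here is a minimal PySpark sketch of writing to a Delta table on ADLS. It assumes a Spark environment with the Delta Lake libraries available (for example, a Databricks cluster); the storage account, container, and path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder ADLS path; on Azure Databricks the abfss:// scheme works once storage access is configured.
lake_path = "abfss://raw@<storage-account>.dfs.core.windows.net/sales_delta"

# Each write below is an atomic, ACID-compliant transaction on the Delta table.
orders = spark.createDataFrame(
    [(1, "2021-03-01", 250.0), (2, "2021-03-02", 120.5)],
    ["order_id", "order_date", "amount"],
)
orders.write.format("delta").mode("append").save(lake_path)

# Schema enforcement: appending a frame with an unexpected extra column is rejected
# instead of silently corrupting the table.
bad_orders = spark.createDataFrame(
    [(3, "2021-03-03", 99.0, "web")],
    ["order_id", "order_date", "amount", "channel"],
)
try:
    bad_orders.write.format("delta").mode("append").save(lake_path)
except Exception as err:
    print(f"Write rejected by schema enforcement: {err}")
```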

3. Serving Data Stores

Serving data stores can serve multiple purposes. The technologies employed to serve the data depend on how the data needs to be served, the audience it is served to, and the underlying non-functional requirements that must be satisfied.

Azure Synapse, a massively parallel processing (MPP), SQL-based engine, is apt for Data Warehouse-like serving layers. Azure Synapse has a broad range of capabilities, from acting as a Data Warehouse to running big data processing using Spark, with BI and AI capabilities out of the box. However, its key function remains that of a Data Warehouse, optimized for reporting and SQL-based queries.
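
As a simple illustration of the serving role, the sketch below queries a Synapse dedicated SQL pool from Python over ODBC. The server, database, credentials, and the fact_sales table are placeholders for whatever exists in your environment.

```python
import pyodbc

# Placeholder connection details for a Synapse dedicated SQL pool.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=tcp:<workspace>.sql.azuresynapse.net,1433;"
    "Database=servingdw;Uid=<sql-user>;Pwd=<password>;Encrypt=yes;"
)

cursor = conn.cursor()
cursor.execute(
    "SELECT region, SUM(amount) AS revenue "
    "FROM fact_sales GROUP BY region ORDER BY revenue DESC"
)
for region, revenue in cursor.fetchall():
    print(region, revenue)
```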

If the data needs to be stored and served in real time through an API-based interface, then Azure Cosmos DB is an apt candidate for serving such data.
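
A minimal sketch with the azure-cosmos Python SDK, upserting a pre-computed serving view and reading it back by key. The account URL, key, database, container, and partition key are placeholders.

```python
from azure.cosmos import CosmosClient

# Placeholder endpoint and key; the database and container are assumed to exist.
client = CosmosClient(
    url="https://<cosmos-account>.documents.azure.com:443/",
    credential="<account-key>",
)
container = client.get_database_client("serving").get_container_client("customer360")

# Upsert a serving-layer document keyed for fast point reads by the consuming API.
container.upsert_item({
    "id": "cust-1001",
    "customerId": "cust-1001",   # assumed partition key
    "segment": "gold",
    "lifetimeValue": 4312.77,
})

profile = container.read_item(item="cust-1001", partition_key="cust-1001")
print(profile["segment"])
```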

4. Data processing services

The data processing services entail two types of processing. The first is batch processing, which processes the data in discrete batches. The second is stream processing, which processes the data without landing it on disk, i.e., in real time. There are five key processing engines in Azure that help to process data.

Databricks is a company co-founded by the creators of Spark. Microsoft collaborated closely with them to create Azure Databricks. Azure Databricks is a powerful cloud-based data processing engine that delivers Spark’s strengths, a unified developer experience, and the ability to process data at scale. Azure Databricks, as I like to call it, is Spark on steroids: the enhancements made to the Databricks Runtime have made it up to 10x faster than open-source Apache Spark™. It leverages Azure Data Lake Storage to process both streaming and batch data at scale with almost zero maintenance. Azure Databricks is scalable and reliable, running millions of server hours per day and processing exabytes of data every month across over 30 Azure regions. Azure Databricks can also process streaming data by making use of Apache Spark’s Structured Streaming architecture.
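
As an illustration of Structured Streaming on Azure Databricks, the sketch below reads JSON events from a landing folder in ADLS, aggregates them into five-minute windows, and streams the result into a Delta table. The paths and the event schema are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.getOrCreate()

# Placeholder landing path and event schema.
events = (
    spark.readStream.format("json")
    .schema("event_id STRING, event_time TIMESTAMP, amount DOUBLE")
    .load("abfss://landing@<storage-account>.dfs.core.windows.net/events/")
)

# Count events per five-minute tumbling window.
windowed_counts = (
    events.groupBy(window("event_time", "5 minutes"))
    .agg(count("*").alias("event_count"))
)

# Stream the aggregates into a Delta table; the checkpoint enables fault-tolerant recovery.
query = (
    windowed_counts.writeStream.format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "abfss://curated@<storage-account>.dfs.core.windows.net/_checkpoints/event_counts")
    .start("abfss://curated@<storage-account>.dfs.core.windows.net/event_counts")
)
```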

When it comes to the Hadoop ecosystem, there are a lot of services to choose from: Kafka for messaging, Storm for stream processing, and Hive for data warehousing are a few examples of this rich open-source ecosystem. These products are meant for big data processing at scale, and managing them across multiple clusters can be a maintenance nightmare. Azure HDInsight (HDI) runs these popular open-source frameworks on the cloud. The service is customizable and delivers enterprise-grade support for open-source analytics. HDI can process massive amounts of data with ease, combining the benefits of the broad open-source project ecosystem with the global scale of Azure.

As mentioned earlier, Azure Data Factory (ADF) is the preferred Azure service for ingesting data. ADF can also be used to perform data transformation: data flow activities are visually designed data transformations in ADF, allowing data engineers to develop transformation logic without writing code.

Azure Synapse was introduced as the serving layer. Azure Synapse can also use its MPP capabilities to transform data using Synapse SQL.

If the data needs to be processed in real time, then Azure Stream Analytics can also be used. Azure Stream Analytics is a real-time analytics service designed for mission-critical workloads. It can be configured as an end-to-end serverless streaming pipeline with just a few clicks. It supports SQL and has built-in machine learning capabilities for more advanced scenarios.

5. Analytics services

Analytics services entail a group of services used to turn data into insights. It is a broad category that includes business intelligence, machine learning/artificial intelligence, and ad-hoc data analysis using good old SQL. Depending on the use case, the targeted stakeholders, and their technical skills, different types of analytics services can be used. These analyses can be sourced either from the data lake or from the serving data stores. The goal is to get the data to the right stakeholder in optimal time.

If a data scientist wants to experiment with a machine learning model rapidly, it can be done using the Azure Machine Learning (Azure ML) service. If the data scientist is comfortable with Spark or is an expert in Python, they can use Azure Databricks or the Azure ML SDK to serve that need.
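
For example, with the v1 Azure ML Python SDK, a data scientist could submit a training script to a workspace compute cluster along the following lines. This is a sketch under assumptions: the experiment name, script folder, training script, and compute cluster name are placeholders, and the workspace config file is assumed to be present.

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig

# Assumes a config.json for the workspace is in the working directory
# and a compute cluster named "cpu-cluster" already exists.
ws = Workspace.from_config()
experiment = Experiment(workspace=ws, name="churn-model-experiments")

# Package a local training script and point it at the remote compute target.
run_config = ScriptRunConfig(
    source_directory="./train",   # folder containing train.py
    script="train.py",
    compute_target="cpu-cluster",
)

run = experiment.submit(run_config)
run.wait_for_completion(show_output=True)
```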

A data analyst who is comfortable using SQL can spin up Synapse’s on-demand (serverless) SQL capability and perform SQL-based analysis.

A business user who wants to dabble in visualization and self-service BI can use Power BI to serve the reporting requirements.

6. Data cataloging services

One cannot emphasize the importance of data cataloging enough. The key to ensuring that the Data Lakehouse doesn’t turn into a swamp is active cataloging, backed by non-negotiable rules. Five fundamental tenets for data cataloging that one could employ are:

  1. Any data that is ingested or landed back in the Data Lakehouse needs to be cataloged. The scope of cataloging includes data lakes, serving layers, transformation pipelines, and reports.
  2. Any data source that publishes data into the Data Lakehouse needs to be cataloged.
  3. All artifacts that store data need to have technical and business metadata cataloged. It may include various attributes, including sensitivity and data classifications.
  4. Data transformation lineage should be cataloged and should depict data transformation from the source to the downstream.
  5. The cataloging information should be easily searchable and accessible to the right stakeholders.

Azure Purview is the governance tool on Azure that enables active cataloging of data. It is a software-as-a-service (SaaS) offering with rich data governance features, including automated data discovery, visual mapping of data assets, semantic search, and tracking of data location and movement across the data landscape.

7. Data security services

In this era of sophisticated cybercrime, data security should be a default, not an afterthought. Data in the Data Lakehouse should be as secure as it can be. A proper security governance framework needs to be established to ensure that the data is stored safely and in a compliant manner, that access is granted to the right stakeholders using appropriate authentication and authorization strategies, and that advanced cybersecurity principles are applied to keep the data protected.

Azure has a suite of services that ensures cloud data is well protected and that access to it is controlled. For instance, data transiting into Azure through ExpressRoute is encrypted end to end. Data stored in ADLS is secured using 256-bit AES encryption, and the encryption key can be a Microsoft-managed key or a customer-managed key (Bring Your Own Key, BYOK). Azure Synapse can secure data at multiple levels, with features ranging from network-level security to row- and column-level security. Proper identity management is achieved using Azure Active Directory (Azure AD), Microsoft’s enterprise cloud-based identity and access management (IAM) solution.
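
To illustrate the identity side, the sketch below uses an Azure AD identity (via DefaultAzureCredential) rather than an account key to read a file from ADLS. The storage account, filesystem, and file path are placeholders, and the identity is assumed to hold the required RBAC roles or ACL permissions.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Authenticate with Azure AD instead of shared account keys.
credential = DefaultAzureCredential()
service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",
    credential=credential,
)

# Read a file from the "curated" filesystem; the data is encrypted at rest (AES-256)
# and the request is authorized against the caller's RBAC roles and ACLs.
file_client = service.get_file_system_client("curated").get_file_client(
    "sales/2021/03/orders.parquet"
)
content = file_client.download_file().readall()
print(f"Downloaded {len(content)} bytes")
```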

Key Takeaways

The following are the key takeaways from this two-part blog series:

1. The Data Lakehouse paradigm is an evolving pattern.

2. Organizations adopting this pattern must be disciplined at the core and flexible at the edges.

3. Cloud computing provides the scalable and cost-effective services that can bring the Data Lakehouse pattern to fruition.
