Part 1 — Architecting a Hybrid Data Mesh on a Hyper-scale Cloud Platform: Realizing the Domain Nodes
The previous blog in this series, A Three-Step Framework for Implementing a Hybrid Data Mesh, focused on creating a framework for architecting a hybrid data mesh.
The next three-part blog series will focus on realizing the architecture of the hybrid mesh in a hyper-scale cloud. The best way to showcase how one can realize the hybrid data mesh is through examples and scenarios.
I will choose Microsoft Azure as the hyper-scale cloud platform and architect a hybrid data mesh using the services provided by Azure.
Each of the blogs will focus on one critical element of implementing a hybrid data mesh:
- Architecture patterns for realizing domain nodes, and the Azure services one can use to implement them.
- Architecture patterns for realizing data cataloging, and the Azure services one can use to implement them.
- Architecture patterns for realizing data sharing, and the Azure services one can use to implement them.
So, let’s get going.
Logical Architecture of the Hybrid Data Mesh
Recall that the first step in creating a hybrid data mesh is to define the domain. A domain is not a technical construct; its definition is dictated by functional requirements.
A domain is any logical grouping of organizational units that aims to fulfill a functional context subjected to organizational constraints.
- The functional context implies the task that the domain is assigned to perform. The functional context is the raison d’être for the domain.
- The organizational constraints are business constraints imposed on the domain, such as regulations, people and skills, and operational dependencies.
Typical examples of domains are:
- A department, like marketing or sales, that focuses on a specific function within a business.
- A product group that focuses on creating a specific product or service.
- A subsidiary of a parent company.
The following diagram shows a logical architecture for a hybrid data mesh. It has two networks. One network follows the hub-spoke pattern, and the other implements a data mesh.
As I mentioned, this blog will focus on realizing the domain nodes on a hyper-scaler like Azure.
Architecture Patterns for Domain Nodes
We briefly touched on the idea of the domain node in the previous blog. Each domain requires technical capabilities that need to be addressed, and a domain node fulfills these capabilities. For example, to fulfill the technical capability of a decision-support system, a node can have components like an Operational Data Store (ODS), a data warehouse, a data lake, or a data lakehouse, along with peripheral components like data ingestion, data processing, machine learning, etc. The following figure depicts the potential components of a domain node. With this understanding, one can see how to realize the domain nodes.
Each domain (including the hub) has different requirements in a Hybrid Data Mesh architecture. A specific set of components can meet each domain’s requirements. It is not a one-size-fits-all strategy; the domain nodes are tailored to each domain’s requirements.
For better clarity, let us define the requirements for each domain. Of course, requirements vary from domain to domain; this example shows one way they could be implemented. The following table depicts the typical requirements for each domain (including the hub).
One can use multiple services to realize these requirements on a hyper-scaler like Azure. The following diagram depicts how Azure services can realize these requirements in a hub-spoke pattern.
Let us drill down into it:
Hub: As mentioned in the preceding table, the hub has decision-support requirements over large consolidated datasets. It must support self-service BI with high report performance. The potential components that can fulfill this requirement are:
- Analytical Data Store: The hub’s analytical data store requirement can be fulfilled using a data warehouse. A data warehouse is a central repository for storing and managing large amounts of data, typically from various sources. It is designed for analytical and reporting purposes, allowing users to gain insights into their data and make better data-driven decisions. In Azure, the data warehouse component can be realized using Azure Synapse Analytics (formerly SQL Data Warehouse), a cloud-based data integration and analytics platform that combines the power of a data warehouse with the flexibility of big data. It allows one to easily access, analyze, and visualize data from various sources, including structured and unstructured data. With Azure Synapse, one can quickly create a data warehouse that handles terabytes of data and use familiar tools like Azure SQL, Azure Stream Analytics, and Power BI to analyze and visualize it (a minimal query sketch follows this list).
- Visualization: The visualization component can be realized using Power BI, Microsoft’s cloud-based business analytics service for visualizing, analyzing, and sharing data. It includes a range of tools and features for creating interactive dashboards, reports, and charts, and it connects easily to a wide range of data sources, including Excel, SQL Server, and cloud-based services like Azure Synapse, Salesforce, and Google Analytics. One of Power BI’s key strengths is sharing data and insights with others: one can publish dashboards and reports to the web or share them with specific individuals or groups within the organization, and built-in collaboration features make it easy to work with others on data projects. Overall, Power BI is a powerful tool that can help organizations of all sizes gain insights from their data and make better data-driven decisions.
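To make the hub’s analytical pattern concrete, here is a minimal sketch of querying a Synapse dedicated SQL pool from Python with pyodbc. The server, database, credentials, and table names are hypothetical placeholders, not part of any reference architecture.

```python
# A minimal sketch: run a consolidation query against a Synapse
# dedicated SQL pool. All names below are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:contoso-synapse.sql.azuresynapse.net,1433;"
    "Database=ConsolidatedDW;"
    "Uid=dw_reader;Pwd=<password>;"
    "Encrypt=yes;"
)

# A typical hub-level question: revenue consolidated across domains.
cursor = conn.cursor()
cursor.execute("""
    SELECT Region, SUM(Revenue) AS TotalRevenue
    FROM dbo.FactSales
    GROUP BY Region
    ORDER BY TotalRevenue DESC
""")
for region, total in cursor.fetchall():
    print(f"{region}: {total}")
conn.close()
```

The same query could equally be issued from Power BI in Import or DirectQuery mode; the point is that the hub exposes one consolidated, queryable store.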
Domain 1: Domain 1’s requirements are straightforward. The data storage requirements are large, and it needs a data store that can hold both structured and unstructured data to fulfill basic reporting requirements. A lower cost at the expense of performance suffices in this domain. The Azure services that can realize these requirements are:
- Raw Data Storage: Domain 1’s data storage requirements can be fulfilled by creating a data lake. A data lake is a central repository that allows you to store all your structured and unstructured data at any scale. Data lakes are designed to handle large amounts of data, including data of different types and from various sources. Domain 1 can use Azure Data Lake Storage (ADLS) to build the data lake. ADLS is a cloud-based file storage service that allows you to store large amounts of data, including structured and unstructured data. It is designed to handle big data workloads and can store and process data from various sources, including Hadoop, Azure Stream Analytics, and Azure Machine Learning (a minimal upload sketch follows this list).
- Visualization: As explained for the hub, the visualization component can be realized using Power BI.
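As an illustration, here is a minimal sketch of landing a raw file in ADLS Gen2 using the azure-storage-file-datalake SDK. The storage account, container, and file paths are hypothetical, and authentication is assumed to be configured for DefaultAzureCredential.

```python
# A minimal sketch: upload a raw file into an ADLS Gen2 data lake.
# Account, container, and path names are hypothetical placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://domain1lake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)

# Containers (file systems) commonly separate raw and curated zones.
fs = service.get_file_system_client("raw")
file_client = fs.get_file_client("sales/2023/01/orders.csv")

# Land the local file in the raw zone, overwriting any existing copy.
with open("orders.csv", "rb") as data:
    file_client.upload_data(data, overwrite=True)
```

Power BI (or any other downstream consumer) can then read from the curated zone once the data is processed.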
Domain 2: Domain 2’s requirements entail creating operational reports with simple visualization capabilities.
- Operational Reporting: The requirement for operational reports can be fulfilled using an Operational Data Store (ODS). An ODS is a database designed to support the day-to-day operations of an organization. It stores real-time data extracted from various operational systems, such as transactional systems, and is typically used for operational reporting, real-time monitoring, and decision-making. An ODS allows organizations to make informed decisions quickly and effectively by providing up-to-date data to users and systems. Domain 2 may use Azure services like Azure SQL Database, Azure Database for PostgreSQL, or Azure Database for MySQL to support the ODS requirements (a minimal report sketch follows this list). Azure SQL Database is a fully managed relational database service based on the Microsoft SQL Server database engine. It allows you to easily create and manage SQL databases in the Azure cloud and provides built-in backup, security, and monitoring capabilities. Azure Database for PostgreSQL is a fully managed relational database service based on the open-source PostgreSQL database engine. It provides a highly available and scalable service and can be used for various workloads.
- Visualization: As explained for the hub, the visualization component can be realized using Power BI.
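To sketch Domain 2’s pattern, the following hypothetical operational report runs against an ODS hosted on Azure SQL Database. The server, table, and column names are assumptions for illustration only.

```python
# A minimal sketch: an hourly operational report against an ODS on
# Azure SQL Database. All names are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:domain2-ods.database.windows.net,1433;"
    "Database=OperationsODS;"
    "Uid=ops_reader;Pwd=<password>;"
    "Encrypt=yes;"
)

# An operational question: order counts by status in the last hour.
cursor = conn.cursor()
cursor.execute("""
    SELECT Status, COUNT(*) AS Orders
    FROM dbo.OrderEvents
    WHERE ReceivedAt >= DATEADD(hour, -1, SYSUTCDATETIME())
    GROUP BY Status
""")
for status, orders in cursor.fetchall():
    print(f"{status}: {orders}")
conn.close()
```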
Domain 3: Domain 3, in this example, has complex analytical requirements. It has big datasets that require both raw and processed data to be stored. In addition, it demands fast and scalable processing to support its business needs and requires machine learning capabilities. Similar to the hub, a comprehensive set of components is needed to realize these requirements.
- Big Data Storage: Like Domain 1, Domain 3 can fulfill big data storage by building a data lake. Domain 3 can use Azure Data Lake Storage (ADLS) to build the Data Lake.
- Big Data Processing: A high-performing Apache Spark engine can fulfill the requirement for a fast and scalable big data processing engine. Apache Spark is an open-source, distributed computing system for large-scale data processing. It is a powerful tool for working with big data and can process data in various formats, including structured, semi-structured, and unstructured data. The requirement for a powerful Apache Spark engine can be fulfilled using Azure Databricks, a fully managed platform for data engineering, machine learning, and analytics built on Apache Spark. It provides a collaborative, web-based environment for data scientists and engineers to work with big data, and it includes a wide range of features such as SQL, Python, R, and Scala notebooks, a cluster manager, an interactive workspace, and an API for integrating with other Azure services such as Azure Data Factory and Azure Synapse (a minimal processing sketch follows this list).
- Machine Learning: The machine learning requirement for Domain 3 can be fulfilled with Azure Databricks. Besides big data processing, Azure Databricks also includes data processing, model training and deployment, and data visualization capabilities, making it a one-stop shop for data processing and analysis needs. It is a powerful platform that can help organizations of all sizes build and deploy machine learning models and use machine learning to gain insights from their data (a minimal training sketch follows this list). Alternatively, like the hub, the machine learning requirements can be fulfilled using Azure ML.
- Visualization: Like the other domains and the hub, the requirement for the visualization component can be fulfilled using Power BI.
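To illustrate Domain 3’s raw-to-processed flow, here is a minimal PySpark sketch as it might run in an Azure Databricks notebook (where the `spark` session is predefined). The ADLS paths, zones, and column names are hypothetical, and the cluster is assumed to already hold credentials for the lake.

```python
# A minimal sketch: read raw events from the lake, aggregate them,
# and write the processed result back. All paths are hypothetical.
raw_path = "abfss://raw@domain3lake.dfs.core.windows.net/clickstream/"
processed_path = "abfss://processed@domain3lake.dfs.core.windows.net/daily_clicks/"

# Read semi-structured JSON events from the raw zone.
events = spark.read.json(raw_path)

# Aggregate to a daily, per-page click count.
daily = (
    events.groupBy("event_date", "page_id")
    .count()
    .withColumnRenamed("count", "clicks")
)

# Persist the processed result as Delta for downstream consumers.
daily.write.format("delta").mode("overwrite").save(processed_path)
```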
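For the machine learning requirement, a minimal sketch of model training with MLflow tracking (which Databricks hosts as a managed service) might look like the following. The feature table, column names, and model choice are hypothetical assumptions, not part of the reference architecture.

```python
# A minimal sketch: train a classifier on a (hypothetical) feature
# table and track it with MLflow on Databricks.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Pull features prepared by a Spark job like the one sketched above.
pdf = spark.table("domain3.churn_features").toPandas()
X = pdf.drop(columns=["churned"])
y = pdf["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    mlflow.sklearn.log_model(model, "model")  # logged with the run
```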
Now that we have seen how domain nodes can be realized in the hub-spoke pattern, a similar architecture can be realized using a data mesh pattern. The following diagram shows how domain nodes can be realized using Azure services.
A crucial point to note here is that in a data mesh, there is no hub. Instead, domains communicate with others in a governed manner.
Conclusion and Next Part
In this first part of the blog series, we established the three key components that need to be implemented to realize the hybrid mesh: realizing the domain nodes, realizing the data catalog, and realizing data sharing.
The key emphasis is the following:
Each domain (including the hub) has different requirements in a Hybrid Data Mesh architecture. A specific set of components can meet each domain’s requirements. It is not a one-size-fits-all strategy; instead, the domain nodes are tailored to each domain’s requirements.
Finally, we walked through examples of each domain’s requirements and the key Azure services the domains can use to realize them.
The next part of this blog series will focus on realizing data cataloging in a hybrid data mesh.