A Three-Step Framework for Implementing a Hybrid Data Mesh
The last blog of this series, The Data Mesh and the Hub-Spoke: A Macro Pattern for Scaling Analytics, focused on the conceptual implementation of the hybrid mesh. It began by establishing the need for these macro patterns, then discussed the fundamental concepts of the domain and the governance-flexibility spectrum, and followed with the conceptual architectures of the hub-spoke and data mesh patterns. It concluded with a framework for placing a domain within a hub-spoke or a data mesh pattern to form the hybrid mesh. This blog, the second part of the hybrid mesh series, focuses on a three-step process for implementing a hybrid mesh. First, let us begin with a recap.
A Recap
The previous blog of this series, The Data Mesh and the Hub-Spoke: A Macro Pattern for Scaling Analytics, introduced the concept of a hybrid mesh. The following diagram revisits the conceptual architecture of a hybrid mesh.
The hybrid mesh fuses the concepts of both the hub-spoke and the classic data mesh patterns. It is a more pragmatic approach for implementing the concept, as organizations are not simplistic entities.
Large organizations evolve organically and are complex; hence, a hybrid approach works best.
Let us now focus on the steps an organization may take to implement a hybrid data mesh.
The Three-step Framework
The hybrid mesh implementation is a complex endeavor. For a fruitful implementation of this concept, there needs to be a confluence of technical excellence and organizational discipline. The three steps discussed below provide a framework organizations can consider for implementing the hybrid mesh.
Each step in the framework strives to answer a series of questions that provide better clarity on the step’s objective. Let us have a look at this framework in detail.
Step 1: Define Domain
The step of defining the domain strives to answer the following questions:
- What are the parameters that define the domain? e.g., departmental, product-based, or geography-based
- Where is the domain placed in the governance-flexibility spectrum?
In the previous blog of this series, an organizational domain was defined. Let us recap that definition.
A domain is any logical grouping of organizational units that aims to fulfill a functional context subjected to organizational constraints.
- The functional context implies the task that the domain is assigned to perform. The functional context is the raison d’être for the domain.
- The organizational constraints are business constraints imposed on the domain, like regulations, people and skills, and operational dependencies.
Typical examples of domains are:
- A department, like marketing or sales, that focuses on a specific function within a business.
- A product group that focuses on creating a specific product or service.
- A subsidiary of a parent company.
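To make the definition concrete, the following minimal sketch shows one way a domain definition could be captured as a simple record. The field names (grouping, functional_context, constraints) and the marketing example are purely illustrative assumptions, not part of any prescribed model.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Domain:
    """A lightweight record of a domain definition (illustrative only)."""
    name: str                      # e.g., "Marketing" or "EMEA Subsidiary"
    grouping: str                  # departmental, product-based, or geography-based
    functional_context: str        # the task the domain is assigned to perform
    constraints: List[str] = field(default_factory=list)  # regulations, skills, dependencies

# Example: a marketing department defined as a domain (hypothetical values)
marketing = Domain(
    name="Marketing",
    grouping="departmental",
    functional_context="Plan and measure campaigns across channels",
    constraints=["GDPR", "shared analytics skill pool"],
)
```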
Once the domain is defined, the next step is to determine the functionality of the domain node.
Step 2: Determine Domain Node
The step of determining the domain node strives to answer the following questions:
- What are the capabilities required for the domain node?
- Which component fulfills the decision support capabilities for the domain?
The previous blog briefly touched upon the idea of the domain node. Each domain requires technical capabilities that need to be addressed.
A domain node fulfills the technical capabilities of a domain.
As an example, to fulfill the technical capability of a decision support system, a node can have components like an Operational Data Store (ODS), a Data Warehouse, a Data Lake, or a Data Lakehouse, along with peripheral components like data ingestion, data processing, machine learning, etc. The following figure depicts the potential components of a domain node.
The flexible components are those technical components that can be implemented based on the needs of the domain. The flexible components are tailored to the domain’s requirements. For example, a sophisticated domain can have a Data Lakehouse to fulfill its decision support requirement. In addition, it can be armed with sophisticated AI/ML components that extract maximum value from the underlying data. Another example can include a less sophisticated domain focusing only on reporting systems to cater to its decision support requirements.
On the other hand, the must-have components, as the name suggests, are required to fulfill the essence of a hybrid data mesh. These three components ensure three key aspects of a hybrid data mesh:
- Data Security: It ensures that the data stored in the domain node is secure.
- Data Catalog: It ensures that the data in the domain node is well cataloged and curated for meaningful discoveries.
- Data Sharing: A robust data sharing mechanism is employed in the domain node to share data between the domains securely.
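As a rough illustration, the split between must-have and flexible components could be captured in a simple configuration. The component names and descriptions below are assumptions made for the sake of the example, not product recommendations.

```python
# Illustrative sketch only: component names and descriptions are placeholders.
domain_node = {
    "must_have": {
        "data_security": "encryption at rest and in transit, role-based access",
        "data_catalog": "metadata registry for discovery and curation",
        "data_sharing": "governed mechanism to share data with other domains",
    },
    "flexible": {
        "decision_support": "data lakehouse",   # could also be an ODS, warehouse, or lake
        "ingestion": "batch and streaming pipelines",
        "ml": "model training and serving",     # only if the domain needs it
    },
}
```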
Once the domain node is defined and its components are well established, the next step is establishing the key roles and responsibilities that ensure governance in the hybrid mesh.
Step 3: Establish Governance Framework
The step of establishing the governance framework strives to answer the following questions:
- Who are the key stakeholders required for the domain? e.g., domain owner, data steward, etc.
- What are the skill sets required to manage the domain?
- Which data will be cataloged?
- Who will get access to which data?
- How will the access control be implemented?
Data governance is a significant topic. A holistic data governance framework encompasses the governance objectives, policies, and components that materialize data governance. In the context of hybrid data mesh, the governance framework has three key aspects.
- Roles and Responsibilities
- Data Cataloging
- Data Sharing
Let us discuss each one of them in some depth.
Roles and Responsibilities:
Establishing roles and responsibilities for a hybrid data mesh is an arduous task. The traditional technical roles, like data engineers, data scientists, developers, and project managers, are a given for any technical data implementation. However, successful implementation of a hybrid data mesh demands creating roles that ensure proper governance. The five key roles that make it happen are:
- Executive Sponsor: This role has the authority and budget and is accountable for establishing data governance. Typically, this role is a CXO-level role tasked with the overall ownership of data.
- Data Governance Lead: This role has the overall accountability and responsibility for implementing the data governance program. Data governance needs a program-level focus if it is to yield the right benefits.
- Data Owners: This role comes with the authority and budget for overseeing the quality and protection of data within a domain. The role also decides who has the right to access and maintain that data and how it is used.
- Data Steward: This role oversees the definition and usage of data within a domain. This role is typically an expert in a specific data domain and works with other data stewards across the enterprise. In addition, the role ensures that the data quality is maintained.
- Data Publishing Manager: This role is responsible for quality-checking data assets and publishing them for internal and external data sharing.
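As a lightweight illustration of how these roles might be recorded against a domain, consider the sketch below. The scope and budget attributes are assumptions introduced only for the example.

```python
# Hedged sketch: role names follow the list above; scope and budget flags are assumptions.
governance_roles = {
    "executive_sponsor":       {"scope": "enterprise", "holds_budget": True},
    "data_governance_lead":    {"scope": "enterprise", "holds_budget": False},
    "data_owner":              {"scope": "domain",     "holds_budget": True},
    "data_steward":            {"scope": "domain",     "holds_budget": False},
    "data_publishing_manager": {"scope": "domain",     "holds_budget": False},
}
```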
The next aspect of data governance is data cataloging.
Data Cataloging:
One of the pivotal components of a hybrid mesh implementation is its data cataloging service.
Data cataloging is the process of organizing an inventory of available data so that it can be easily identified and used.
This service ensures that all the source data, the data in the hub and spoke domain nodes, and the outputs extracted from domain nodes are appropriately cataloged. Think of a data cataloging service as the Facebook of data: a place to get visual information on the domain’s contents. One can get information about the data, the relationships between the data, and the lineage of transformations the data has gone through. Some of the elements that one can consider for cataloging are depicted in the following diagram:
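The sketch below illustrates what a single catalog entry might capture. The exact fields depend on the cataloging service chosen; the ones shown here (owner, schema, lineage, tags) and the sample values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """Illustrative shape of one catalog entry; field names are assumptions."""
    dataset: str                       # name of the data asset
    description: str                   # what the data represents
    owner: str                         # data owner accountable for the asset
    domain: str                        # domain node that publishes it
    schema: List[str]                  # column or attribute names
    lineage: List[str] = field(default_factory=list)  # upstream transformations
    tags: List[str] = field(default_factory=list)     # classification, sensitivity

entry = CatalogEntry(
    dataset="campaign_performance",
    description="Daily campaign spend and conversions",
    owner="marketing-data-owner",
    domain="Marketing",
    schema=["campaign_id", "date", "spend", "conversions"],
    lineage=["ingest_ad_platform", "aggregate_daily"],
    tags=["internal", "no-pii"],
)
```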
The next aspect of data governance is data sharing.
Data Sharing:
Data sharing between the domains needs to be structured, governed, and secure. Recall the discussion on the governance-flexibility spectrum in the previous blog. A refresher diagram can be found below:
The degree of relative domain independence determines how independent a domain is compared to other domains.
Five parameters determine the relative domain independence:
- The functional context of the domain within the organizational ecosystem.
- The people and skills that determine the smooth execution of the domain.
- The external or internal regulations that govern the domain.
- The operational independence of the domain concerning other domains.
- The technical capabilities possessed by the domain for implementing technology.
The placement of the domain in this spectrum determines whether a domain is a candidate for hub-spoke architecture or a data mesh architecture.
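One hedged, purely illustrative way to reason about this placement is to score each of the five parameters and compare the average against a threshold. The 1-to-5 scale, the equal weighting, and the 3.5 threshold below are assumptions, not a prescribed formula.

```python
# Illustrative only: score each parameter 1 (low independence) to 5 (high independence).
def relative_domain_independence(scores: dict) -> str:
    expected = {"functional_context", "people_and_skills", "regulations",
                "operational_independence", "technical_capability"}
    assert set(scores) == expected, "score all five parameters"
    average = sum(scores.values()) / len(scores)
    # Higher independence suggests a data mesh placement; lower suggests hub-spoke.
    return "data mesh candidate" if average >= 3.5 else "hub-spoke candidate"

print(relative_domain_independence({
    "functional_context": 4, "people_and_skills": 5, "regulations": 3,
    "operational_independence": 4, "technical_capability": 5,
}))  # -> data mesh candidate
```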
In a hybrid mesh, data sharing can occur in two flavors:
- Data sharing between the hub domain and the spoke domains and vice versa
- Data sharing between Data Mesh domains
Let us investigate each of these scenarios in detail.
Data Sharing Between Hub-Spoke Domains:
The first scenario is the data sharing between a hub domain and a spoke domain.
In this scenario, the spoke domain is dependent on the hub domain for key aspects of data. The diagram below depicts the data sharing workflow between the hub and the spoke.
Let us elaborate on the steps:
- Firstly, the hub data publishers, who have data ownership, publish the metadata of the hub domain node into its data catalog.
- The hub domain node steward reviews the published catalog to ensure it aligns with its governance framework.
- The steward then approves or rejects the published catalog contents. If approved, the catalog is updated with the metadata.
- When a spoke domain data requestor requires data from the hub node, the data requestor browses the hub data catalog to identify the data of interest.
- Once the data of interest is identified, the data requestor requests the data from the hub through the data share service.
- The request for data access is routed to the data publisher. The data publisher reviews the request and approves or rejects the request for data access.
- If the request is approved, the data publisher shares the data with the data requestor through the data share service that enables data sharing between the hub and the spoke nodes. The terms of data usage are also clarified.
- Finally, the data requestor reviews the terms of data usage. Upon accepting the terms, the data requestor can start consuming the data.
- The data publisher constantly monitors the data usage patterns through the data share service.
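The following sketch models this workflow in code. The DataShareService class and its methods are illustrative assumptions, not the API of any particular data share service; a real implementation would sit on top of the platform’s catalog and sharing capabilities.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataShareService:
    """Hedged sketch of the hub-to-spoke sharing workflow described above."""
    catalog: Dict[str, dict] = field(default_factory=dict)          # approved metadata
    pending_catalog: Dict[str, dict] = field(default_factory=dict)  # awaiting steward review
    requests: List[dict] = field(default_factory=list)
    usage_log: List[dict] = field(default_factory=list)

    # Step 1: the hub data publisher publishes metadata for steward review.
    def publish_metadata(self, dataset: str, metadata: dict) -> None:
        self.pending_catalog[dataset] = metadata

    # Steps 2-3: the hub steward approves or rejects the published entry.
    def steward_review(self, dataset: str, approved: bool) -> None:
        entry = self.pending_catalog.pop(dataset)
        if approved:
            self.catalog[dataset] = entry

    # Steps 4-5: a spoke requestor browses the catalog and requests access.
    def request_access(self, requestor: str, dataset: str) -> dict:
        assert dataset in self.catalog, "dataset must be cataloged first"
        request = {"requestor": requestor, "dataset": dataset, "status": "pending"}
        self.requests.append(request)
        return request

    # Steps 6-7: the publisher approves or rejects and attaches the terms of usage.
    def publisher_decision(self, request: dict, approved: bool, terms: str = "") -> None:
        request["status"] = "approved" if approved else "rejected"
        request["terms"] = terms

    # Steps 8-9: the requestor accepts the terms, consumes the data, and usage is logged.
    def consume(self, request: dict) -> None:
        assert request["status"] == "approved", "access not granted"
        self.usage_log.append({"requestor": request["requestor"],
                               "dataset": request["dataset"]})
```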
The workflow of a reverse scenario, i.e., a hub domain requesting data from a spoke domain, is depicted in the figure below:
Now that the workflow of data sharing between the hub and spoke is clarified, let us investigate a scenario where data needs to be shared between two independent domains.
Data Sharing Between Data Mesh Domains:
Data sharing between data mesh domains is slightly different, as each domain has independent control over what data it catalogs and shares. The diagram below depicts the workflow of data sharing between the domains of a data mesh.
- Firstly, the data publishers, who have data ownership, publish the metadata of the domain node into the data catalog.
- The enterprise data mesh steward reviews the published catalog to ensure that it aligns with the organization’s governance framework.
- The steward then approves or rejects the published catalog contents. If approved, the catalog is updated with the metadata.
- When a data requestor from one node requires data from another node, the data requestor browses the data mesh catalog to identify the data of interest.
- Once the data of interest is identified, the data requestor requests the data from the publishing node through the data share service.
- The request for data access is routed to the data publisher. The data publisher reviews the request and approves or rejects the request for data access.
- If the request is approved, the data publisher shares the data with the data requestor through the data share service that enables data sharing between the nodes. As in the hub-spoke architecture, the terms of data usage are also clarified.
- Finally, the data requestor reviews the terms of data usage. Upon accepting the terms, the data requestor can start consuming the data.
- The data publisher constantly monitors the data usage pattern through the data share service.
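Because the flow mirrors the hub-spoke case, the same illustrative DataShareService sketch from the previous section can be reused, with either peer domain acting as the publisher. The node, dataset, and requestor names below are assumptions.

```python
# Reusing the illustrative DataShareService sketch: in a mesh, any peer can publish.
sales_node = DataShareService()
sales_node.publish_metadata("orders_daily", {"owner": "sales-data-owner"})
sales_node.steward_review("orders_daily", approved=True)
req = sales_node.request_access(requestor="finance-analyst", dataset="orders_daily")
sales_node.publisher_decision(req, approved=True, terms="internal use only")
sales_node.consume(req)
```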
Conclusion
A hybrid data mesh is a macro architecture pattern for harnessing data across multiple domains. The first part of this blog series focused on the conceptual underpinnings of a hybrid data mesh. This blog delved into its logical constructs: the three-step implementation framework, the components of a domain node, the data cataloging strategy, key roles and responsibilities, and the data sharing workflows. The next part of this series will focus on the technical implementation of this concept on a cloud computing platform like Microsoft Azure.