A Framework for Modern Data Architecture
This is the age of data. Every organization is becoming a “data company,” and one that does not treat its data as a strategic asset has already missed the bus.
Data has evolved over the past decade, and the pace of change has been exponential. But have data architecture practices kept up?
McKinsey recently published an article on building a modern data architecture to drive innovation. It describes the foundational shifts behind such an architecture.
As a practitioner in this field, I found McKinsey’s view interesting. The goal of this article is twofold:
- To provide a practitioner’s perspective on McKinsey’s article.
- To provide a technology-oriented view of how to bring this architecture to fruition.
The Foundational Shifts
The article elaborates on the six foundational shifts that enable modern data architecture.
Shift #1: From on-premise to cloud-based platforms
- McKinsey’s article points out the rise of cloud computing as one of the shifts. It touts cloud computing as a game-changer for bringing the modern data architecture pattern to fruition. The cost structure of a cloud-based solution enables pay-as-you-go pricing, which allows data to be processed at scale at a fraction of the cost of equivalent on-premise storage and compute. Cloud components offered as platform-as-a-service (PaaS) provide these capabilities with minimal maintenance overhead, high scalability, and built-in high availability.
- As a practitioner and a cloud computing professional, I can’t agree more. I see this foundational shift having an immediate impact on the customers. The effect is not just on cost but also on rapid innovation.
Cloud enables organizational agility.
Shift #2: From batch to real-time processing
- McKinsey’s article points out the change in processing patterns as one of the shifts. Over the years, the cost of stream processing has come down significantly. Messaging technologies like Kafka capture events at scale and enable real-time analytics. This capability fundamentally changes how data is processed, and the article endorses real-time processing.
- As a practitioner, I find this view too simplistic. Real-time processing is an advantage for streaming data. However, a significant corpus of an organization’s data comes from traditional ERP, CRM, and other transactional (OLTP) systems, and it is typically extracted in batches. These datasets require a batch processing engine as well. The key is converging data at rest and data in motion to extract insight seamlessly.
Organizations will pivot from batch to real-time data ingestion and processing.
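One way to picture the convergence of data at rest and data in motion is a single aggregation that accepts either source. The following is a minimal Python sketch, not a production design. The record shape, the `apply_records` helper, and the `event_stream` stand-in for a message consumer are all hypothetical names introduced for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record shape: (customer_id, amount).
Record = Tuple[str, float]


def apply_records(totals: Dict[str, float], records: Iterable[Record]) -> Dict[str, float]:
    """Fold records into per-customer totals.

    The same logic serves a batch (a list) and a stream (a generator).
    """
    for customer_id, amount in records:
        totals[customer_id] += amount
    return totals


def event_stream() -> Iterator[Record]:
    """Stand-in for a message consumer, for example a Kafka topic reader."""
    yield ("c1", 5.0)
    yield ("c3", 7.5)
    yield ("c1", 2.5)


# Data at rest: a nightly extract from a hypothetical ERP or CRM system.
batch_extract = [("c1", 100.0), ("c2", 40.0)]

totals: Dict[str, float] = defaultdict(float)
apply_records(totals, batch_extract)   # batch load
apply_records(totals, event_stream())  # incremental updates from the stream
combined = dict(totals)
```

Because both paths share one function, the business logic is written once and the choice between batch and stream becomes a question of how the records arrive.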
Shift #3: From pre-integrated commercial solutions to modular, best-of-breed platforms
- McKinsey’s article points out the evolution of commercial solutions as one of the shifts. The world of commercial software has changed. From monolithic license-based packages, it has evolved into modular offerings that each address specific functionality. This shift is especially true in the world of cloud computing. The services in a cloud computing platform focus on addressing a particular functionality at a cost calibrated to usage.
- As a practitioner, I couldn’t agree more with this point of view. Cloud computing has transformed the universe of software solutions. Its modular, to-the-point service offerings give customers the flexibility to focus on functionality rather than technology.
Cloud enables modular components focused on functionalities rather than technologies.
Shift #4: From point-to-point to decoupled data access
- McKinsey’s article points out the changes in data access patterns as one of the shifts. Organizational data needs to be agile, and it needs to be accessible to both internal and external stakeholders. McKinsey proposes decoupling data and exposing it through APIs. The article also suggests a data platform to “buffer” transactions outside the core systems.
- As a practitioner, I would take this shift with a pinch of salt. Exposing data through an API is a useful method, but data also needs to be served optimally. Integrating large volumes through an API can be a painstaking process. Both point-to-point and decoupled data access need to coexist, depending on the data volume and the serving need.
Co-existence of point-to-point and decoupled data access is required to get the right data to the right stakeholder at the right time.
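The coexistence of the two access styles can be sketched as a simple routing decision. This is an illustrative Python sketch under stated assumptions: the `Request` type, the `choose_access_path` and `paginate` helpers, and the 10,000-row threshold are hypothetical and would need tuning for a real workload.

```python
from dataclasses import dataclass
from typing import Iterator, List

# Hypothetical cut-off between API serving and bulk transfer.
BULK_THRESHOLD_ROWS = 10_000


@dataclass(frozen=True)
class Request:
    consumer: str
    estimated_rows: int


def choose_access_path(request: Request) -> str:
    """Route small requests to the decoupled API and large ones to a bulk path."""
    if request.estimated_rows > BULK_THRESHOLD_ROWS:
        return "bulk"  # point-to-point style transfer, e.g. a file export
    return "api"       # decoupled access through a paginated API


def paginate(rows: List[dict], page_size: int) -> Iterator[List[dict]]:
    """Serve rows in fixed-size pages, as an API endpoint typically would."""
    for start in range(0, len(rows), page_size):
        yield rows[start:start + page_size]


dashboard = Request(consumer="dashboard", estimated_rows=200)
training_job = Request(consumer="ml-training", estimated_rows=5_000_000)
paths = [choose_access_path(dashboard), choose_access_path(training_job)]
pages = list(paginate([{"id": i} for i in range(5)], page_size=2))
```

In practice the decision would also weigh latency needs and security, but the point stands: the volume and the serving need decide the path, not a single mandated pattern.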
Shift #5: From an enterprise warehouse to domain-based architecture
- McKinsey’s article points out the evolution of the EDW into domain-based repositories as one of the shifts. Gone are the days when an enterprise had one large enterprise data warehouse (EDW) that served the entire organization’s needs. Data has many avatars, and it needs to be malleable enough to be shaped to an organization’s various needs. A single EDW won’t be able to do justice to these needs. The domain-based architecture uses on-demand infrastructure, data virtualization, and data cataloging to create domain-specific repositories. These repositories are tailored to the needs of the business units.
- As a practitioner, I more or less agree with this shift. Domain-based architecture is the way to go. It provides the much-needed flexibility to shape the data into the form a business unit requires. Data cataloging and curation are must-haves to prevent data lakes from becoming data swamps. However, data virtualization has its disadvantages: it increases the cost of data access and decreases its performance. There has to be a balance between data virtualization and smart data duplication.
A single enterprise data warehouse impedes the agility required to transform data into insights.
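To make the cataloging point concrete, here is a minimal Python sketch of a domain-aware catalog. It is an assumption-laden illustration rather than any product’s API: `CatalogEntry`, `DataCatalog`, and the `serving_mode` field are hypothetical names, and the dataset names are invented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    domain: str        # owning business unit, e.g. "sales"
    owner: str         # accountable person or team
    serving_mode: str  # "virtualized" (query in place) or "duplicated" (local copy)


class DataCatalog:
    """Hypothetical in-memory catalog; a real one would persist its entries."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Refuse ownerless or duplicate datasets: this is the curation step
        # that helps keep a lake from turning into a swamp.
        if not entry.owner:
            raise ValueError(f"dataset {entry.name!r} has no owner")
        if entry.name in self._entries:
            raise ValueError(f"dataset {entry.name!r} is already registered")
        self._entries[entry.name] = entry

    def by_domain(self, domain: str) -> List[CatalogEntry]:
        """Return the datasets of one domain, sorted by name."""
        matches = [e for e in self._entries.values() if e.domain == domain]
        return sorted(matches, key=lambda e: e.name)


catalog = DataCatalog()
catalog.register(CatalogEntry("orders", "sales", "sales-data-team", "duplicated"))
catalog.register(CatalogEntry("leads", "sales", "sales-data-team", "virtualized"))
catalog.register(CatalogEntry("ledger", "finance", "finance-data-team", "duplicated"))
sales_names = [entry.name for entry in catalog.by_domain("sales")]
```

Recording the serving mode per dataset keeps the virtualization-versus-duplication trade-off an explicit, reviewable decision.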
Shift #6: From rigid data models toward flexible, extensible data schemas
- McKinsey’s article points out that predefined data models from software vendors will not be relevant in the modern data architecture pattern. Organizations will pivot towards flexible and extensible data schemas. They will do so by using denormalized data models with fewer physical tables, thus reducing data complexity.
- As a practitioner, I have seen many projects fail because of this very schema-less approach. Data models have their place if they are used in the right manner. Logical Data Models (LDMs) are often confused with Physical Data Models (PDMs). An organization should have its data blueprint encoded in the form of LDMs. The physical manifestation of these models (PDMs) can be flexible and suited to the underlying technology.
Data models used in the right context serve as the blueprint for the modern data architecture.
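The distinction between a logical and a physical model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Order` and `OrderLine` entities and the two mapping functions are hypothetical, chosen to show one logical blueprint rendered into two physical shapes.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Tuple


# Logical model: the business blueprint, independent of any storage engine.
@dataclass(frozen=True)
class OrderLine:
    sku: str
    quantity: int


@dataclass(frozen=True)
class Order:
    order_id: str
    customer_id: str
    lines: List[OrderLine]


def to_relational_rows(order: Order) -> Tuple[Dict, List[Dict]]:
    """Physical model A: normalized header and line rows for a relational store."""
    header = {"order_id": order.order_id, "customer_id": order.customer_id}
    lines = [
        {"order_id": order.order_id, "line_no": n, "sku": line.sku, "quantity": line.quantity}
        for n, line in enumerate(order.lines, start=1)
    ]
    return header, lines


def to_document(order: Order) -> Dict:
    """Physical model B: one denormalized document for a document store."""
    return asdict(order)


order = Order("o-1", "c-9", [OrderLine("sku-a", 2), OrderLine("sku-b", 1)])
header, line_rows = to_relational_rows(order)
document = to_document(order)
```

The logical entities stay stable while each physical form is free to follow the strengths of its underlying technology.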
Modern Data Architecture on Microsoft Azure
Having discussed these shifts, let us see which components in a cloud platform can enable these shifts. Microsoft’s Azure cloud platform will be used to exemplify the features. However, these capabilities can be mapped to other cloud platforms as well.
- Cloud-enabled data platform: Microsoft’s Azure is a comprehensive cloud computing platform with services focusing on Data and AI. It can very well fulfill the need for a cloud-enabled data platform.
- Real-time streaming: Azure provides specific services that are focused on stream ingestion and processing. Azure Event Hubs is a fully managed, real-time data ingestion service that is simple, trusted, and scalable. Azure Stream Analytics, Azure Databricks, and Azure HDInsight provide various options for processing stream data at scale.
- Modular, best-of-breed platforms: Cloud enables modular components focused on functionalities rather than technologies. Azure’s data services are platform-as-a-service components. The need for hardware is abstracted away, and the services are configurable to suit user requirements. These features make the components modular. Azure services can be woven together to form a scalable end-to-end solution.
- De-coupled data access: Data needs to be democratized responsibly. Azure API Management provides a hybrid, multi-cloud management platform for APIs across all environments. NoSQL database services such as Azure Cosmos DB provide built-in APIs to expose data safely and at scale.
- Domain-based Architecture: Multiple landing layers will be the norm in modern data architecture. Azure provides a plethora of flexible options fulfilling the storage needs of the organization. Azure Data Lake Store provides a massively scalable and secure data lake for your high-performance analytics workloads. Azure Synapse Analytics provides a limitless analytics service that acts as an apt analytics store for organizations. The recently launched service Azure Purview delivers a unified data governance service that catalogs data and creates a domain-based schema.
- Flexible, extensible schema: Azure Data Lake Store can be used to store both structured and unstructured data. It provides the required schema flexibility. Cosmos DB provides APIs that can store and access data in multiple formats (key-value, column-family, documents, and graph).
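The mapping above can also be kept as a small lookup table, which is handy when reviewing an architecture against the shifts. The Python sketch below only restates the services named in this article. The `CAPABILITY_SERVICES` table and the two helper functions are hypothetical names, and the table is not an exhaustive list of Azure services.

```python
from typing import Dict, List

# Capabilities and services as named in this article; not an exhaustive list.
CAPABILITY_SERVICES: Dict[str, List[str]] = {
    "Real-time streaming": [
        "Azure Event Hubs",
        "Azure Stream Analytics",
        "Azure Databricks",
        "Azure HDInsight",
    ],
    "De-coupled data access": ["Azure API Management", "Azure Cosmos DB"],
    "Domain-based architecture": [
        "Azure Data Lake Store",
        "Azure Synapse Analytics",
        "Azure Purview",
    ],
    "Flexible, extensible schema": ["Azure Data Lake Store", "Azure Cosmos DB"],
}


def services_for(capability: str) -> List[str]:
    """Return the services listed for a capability, or an empty list."""
    return CAPABILITY_SERVICES.get(capability, [])


def capabilities_for(service: str) -> List[str]:
    """Reverse lookup: which capabilities does a given service support?"""
    return [cap for cap, services in CAPABILITY_SERVICES.items() if service in services]


cosmos_capabilities = capabilities_for("Azure Cosmos DB")
streaming_services = services_for("Real-time streaming")
```

The reverse lookup shows the modular point in miniature: one service, such as Cosmos DB, can support more than one shift.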
In conclusion, there are three key takeaways:
- Data Architecture needs to evolve to keep up with the changing business demands and data landscape.
- The framework for modern data architecture entails six foundational shifts that focus on bringing more agility, transparency, and data democratization.
- Microsoft Azure provides a holistic platform to enable these shifts and lay the foundation of modern data architecture.