A Data Mesh is a collection of small databases at the edge, managed by domain experts as a set of data products. Data Mesh is a reaction to the inefficiencies of the data lake paradigm. It represents a shift away from centralized data storage and hyper-specialized technology experts towards cross-functional business domain experts managing the data within their area of expertise.
In a Data Mesh, analytical data is distributed throughout the organization, owned by the business experts who can best manage the data.
The Data Mesh includes these fundamental principles:
- Distributed Storage. The data is stored by those who have expert domain knowledge of the data.
- Central Data Catalog. Each data store is published into a central metadata repository. Consumers can explore the data catalog, and via self-service, consume data at its source.
- As a product. The data is exposed as APIs in an interoperable way, following organizational standards. Each API has KPIs for uptime, quality, discoverability, and usage.
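As a rough illustration of how these principles fit together, the sketch below models a data product descriptor and an in-memory stand-in for the central catalog. Every name here (`DataProduct`, `DataCatalog`, the fields, and the example endpoints) is hypothetical and chosen for illustration; a real catalog would be a shared service, not a Python dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes to the catalog."""
    name: str
    domain: str            # owning business domain, e.g. "sales"
    endpoint: str          # where consumers read the data at its source
    uptime_target: float   # KPI: fraction of time the API should be available
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    """In-memory stand-in for a central metadata repository."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        # Only metadata is stored centrally; the data stays with the domain team.
        self._products[product.name] = product

    def search(self, tag: str) -> list[DataProduct]:
        # Self-service discovery: consumers look up products by tag.
        return sorted(
            (p for p in self._products.values() if tag in p.tags),
            key=lambda p: p.name,
        )


catalog = DataCatalog()
catalog.publish(DataProduct(
    "orders", "sales", "https://sales.example.internal/api/orders",
    0.999, {"revenue", "orders"},
))
catalog.publish(DataProduct(
    "shipments", "logistics", "https://logistics.example.internal/api/shipments",
    0.99, {"orders"},
))

# A consumer explores the catalog, then reads each product at its source.
for product in catalog.search("orders"):
    print(product.domain, product.endpoint)
```

The catalog holds only descriptors, so each domain team remains free to choose its own storage technology behind the published endpoint.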
Unlike a Data Lake, the data is curated by the domain experts who can correctly identify and curate the data. Like a Data Lake, a central metadata catalog identifies data sources and allows other teams to consume the needed content.
In a Data Lake, teams are aligned around technology or location. By comparison, in a Data Mesh, teams are aligned around business domains and the storage technology is an implementation concern.
When moving from a Data Warehouse or Data Lake methodology to a Data Mesh methodology, the old Data Warehouse or Data Lake becomes one of the nodes in the mesh. Over time, the data should be returned to its source and published directly into the mesh as data products.
Benefits and Drawbacks
Compared to a Data Lake, a Data Mesh has notable benefits and some drawbacks. The first two items below are benefits; the last two are drawbacks.
- Faster response time. In a Data Lake, teams often compete for priority in the data team’s backlog. By comparison, in a Data Mesh, the team can set their own priorities and publish the data products as soon as they’re available.
- Domain experts care for data. In a Data Lake, the data team is aligned based on technology, meaning data experts care for data outside their business understanding. By comparison, in a Data Mesh, the business experts who understand the domain model can care for the data products they manage. Consumers are often able to directly understand the data products or quickly clarify the concern with the business experts.
- Lack of central control. In a Data Mesh, each team must dedicate resources to caring for the data products they publish. This may lead to competing priorities, duplicated effort as teams learn the same technologies, and a lack of focus on the KPIs necessary for consistent organizational data.
- Easy to lose track of data. Because there is no central control, it’s possible to return to the Data Warehouse days, yielding siloed data stuck in disparate systems, or even losing track of new systems completely.
Comparison to Microservices
Much like microservices and domain-driven design (DDD), a Data Mesh keeps data in small data stores. In a microservice strategy, these are small services. In DDD, these are bounded contexts. In Data Mesh, these are small databases. These smaller systems offer these operational efficiencies:
- Easier to maintain. Small data stores have less content, fewer schemas, and fewer moving pieces. This simpler architecture is easier for staff to reason about and maintain.
- Easier to upgrade and deploy. With large, monolithic Data Lake systems, upgrading any piece requires upgrading all similar pieces, and deploying changes requires deploying new content throughout the system. This often leads to communication challenges and scheduling conflicts, as different business needs have different priorities and schedules, different levels of quality tolerance, and different stakeholders and motivations. By comparison, when there are many small data stores in the Data Mesh, each store can be updated independently, yielding operational efficiency and development speed.
- Easier to scale. In a monolithic system, increasing the capacity of one piece of the system requires scaling the entire system. When the system is split into smaller pieces, one can scale up only the pieces under load, leaving the others at their existing capacity. This leads to more efficient hardware use.
When we separate data into smaller data stores, we gain operational efficiency, speeding business use of data.
Data Mesh is the next generation of data storage and sharing, and a reaction to the downsides of a Data Lake.
Here are the major evolutionary stages of data storage and usage:
1. A single Transactional database stored all business data. Transactional databases allowed businesses to create and edit data as needed. However, reporting was difficult because analytics often required different views on data. Querying across large amounts of data usually meant locking the table for the duration of the query, leading to performance degradation both for other analytics and business editing of the data.
2. Analytical data is separated from Transactional data. An Extract-Transform-Load or ETL process would periodically refresh the analytical database with changes from the transactional database. The ETL process could transform the data from forms ideal for business editing to forms ideal for analysis, including data enrichment from outside sources, denormalization to speed reporting, and data cleansing to remove or adjust invalid values. The separation of transactional from analytical data stores allowed the business to continue using the transactional data while analysis was performed on data in bulk.
3. Data Warehouse formalizes the analytical database. The data warehouse is specifically for analytical use. ETL or CDC (change data capture) processes fill the data warehouse with cleaned, denormalized data ready for analysis. Often organizations would have a data warehouse for each business domain or team, leading to pockets of siloed data. This data was often not easily discoverable or consumed by other teams.
4. Data Lake centralizes data into one large, discoverable pool. The data lake is filled via ETL, CDC, and/or event streams into a central repository of data. The data may be stored in different forms or storage systems in the lake. A central team of hyper-specialized data experts facilitates onboarding new data sources and transforming data into user-specific views. The data team is also responsible for handling ingest, availability, and data quality issues for all data sources in the organization. Though highly specialized in data technologies, the data team often knows little of the business domain and thus can’t help with identifying or deciphering data.
5. Data Mesh is a collection of distributed data sources managed as products throughout the organization. In a data mesh, business experts manage the analytical data and the ETL, CDC, and/or event-stream processes that populate it, and they make the data products discoverable by publishing their APIs to a central data catalog. In a data mesh, the teams that build the transactional data systems are also responsible for the construction, maintenance, content, availability, quality, governance, and discoverability of the analytical data.
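The ETL step introduced in stage 2, and reused in every later stage, can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module with two in-memory databases; the table names, the sample rows, and the rule that a negative amount is invalid are all assumptions made for the example, not part of any particular product.

```python
import sqlite3

# Transactional store: normalized tables suited to business editing.
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, -5.0), (12, 1, 40.0);
""")

# Analytical store: one denormalized table suited to reporting.
olap = sqlite3.connect(":memory:")
olap.execute(
    "CREATE TABLE order_facts (order_id INTEGER, customer_name TEXT, amount REAL)"
)


def extract(conn):
    # Extract: read orders joined to customers from the transactional database.
    return conn.execute(
        "SELECT o.id, c.name, o.amount "
        "FROM orders o JOIN customers c ON c.id = o.customer_id "
        "ORDER BY o.id"
    ).fetchall()


def transform(rows):
    # Transform: cleanse by dropping rows with invalid (negative) amounts.
    # The join in extract() already denormalized the customer name.
    return [row for row in rows if row[2] >= 0]


def load(conn, rows):
    # Load: fully refresh the analytical table with the cleaned rows.
    conn.execute("DELETE FROM order_facts")
    conn.executemany("INSERT INTO order_facts VALUES (?, ?, ?)", rows)
    conn.commit()


load(olap, transform(extract(oltp)))

# Reporting runs against the analytical copy, not the transactional database.
totals = olap.execute(
    "SELECT customer_name, SUM(amount) FROM order_facts "
    "GROUP BY customer_name ORDER BY customer_name"
).fetchall()
print(totals)  # [('Ada', 65.0)]
```

A full refresh keeps the sketch short. A CDC or event-stream pipeline would instead apply only the changes since the last run.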
Data Mesh is a new paradigm for storing data at the edge, managed by domain experts who craft and maintain the data products.
Did we take a step backwards from Data Lake to Data Warehouse? No. In the new Data Mesh paradigm, we treat these data APIs as products. Much like in a grocery store, we can peruse the catalog of items and compare similar products. The grocery store didn’t bake all the varieties of bread; rather, experts in baking created each product. The grocery store is then able to collect and distribute these products.
In a Data Mesh, much like a Data Lake, we publish content into a central metadata repository so consumers of these data products can compare and easily consume the necessary data. Unlike a Data Lake, the data is stored in its natural habitat, allowing domain experts to curate, care for, and document the data products.
The move towards Data Lake was meant to give visibility into the organization’s data, but the centralization of data yielded operational inefficiencies and competing priorities. By contrast, a Data Mesh preserves the discoverability of data but lets domain experts publish their own assets without going through a central team.
Securing your Data Mesh with Cyral
On its own, a Data Mesh offers great benefits: a data catalog offering easy-to-discover data products, and distributed, team-designed storage so teams can publish their data products on their own schedule. However, with this distributed surface area, it’s often difficult to secure these data stores in a consistent and universal way throughout the organization.
Many data products may be secured only by shared service accounts, or data access may not be logged and audited at all. This is where Cyral can help.