For a data engineer creating analytics from transactional systems such as ERP (enterprise resource planning) and CRM (customer relationship management), the main challenge lies in bridging the gap between raw operational data and business insight. ERP and CRM systems are designed and built to fulfill a wide range of business processes and functions. This generalization makes their data models complex and cryptic, and interpreting them requires domain expertise.
To make matters worse, a common setup within large organizations is to have multiple instances of these systems, with underlying processes in charge of transmitting data between them, which can lead to duplication, inconsistencies, and opacity.
The disconnect between operational teams immersed in day-to-day functions and those extracting business value from the data generated in operational processes remains a major friction point.
Imagine being a data engineer or analyst tasked with identifying the best-selling products within your company. Your first step might be to locate the orders. You start investigating the database objects and find a couple of views, but there are inconsistencies between them, so you don't know which one to use. Furthermore, it is very difficult to identify the owners; one of them even left the company recently. Since you don't want to start your development with uncertainty, you decide to go directly to the raw operational data. Does this sound familiar?
I used to connect to views in transactional databases or APIs offered by operational systems to request the raw data.
To prevent my extracts from impacting operational performance, I queried this data periodically and stored it in a persistent staging area (PSA) within my data warehouse. This allowed me to run complex queries and data pipelines on these snapshots without consuming operational system resources, but it could lead to unnecessary data duplication if I was unaware that other teams were performing the same extraction.
Once the raw operational data was available, I had to face the next challenge: deciphering all the cryptic objects and properties and dealing with the maze of dozens of relationships between them (e.g., the general material data table MARA in SAP, documented at https://leanx.eu/en/sap/table/mara.html).
Although standard objects within ERP or CRM systems are well documented, I needed to deal with numerous custom objects and properties that require domain expertise, since these objects cannot be found in standard data models. Most of the time I found myself launching "trial and error" queries in an attempt to align keys between operational objects, interpreting the meaning of properties according to their values, and verifying my assumptions against screenshots of the operational UI.
A Data Mesh implementation improved my experience in these aspects:
- Knowledge: I was able to quickly identify the owners of the exposed data. Keeping the owner close to the domain that generated the data is key to accelerating further analytical development.
- Discoverability: A shared data platform provides a catalog of operational data sets in the form of source-aligned data products, which helped me understand the state and nature of the exposed data.
- Accessibility: I could easily request access to these data products. Because the data is stored on the shared data platform and not on the operational systems, I did not have to align with operational teams on available time windows to run my own data extraction without impacting operational performance.
According to the Data Mesh taxonomy, data products built on top of operational sources are called source-aligned data products:
"Source domain data sets accurately represent raw data at the time of creation and are not tuned or modeled for a particular consumer."
Zhamak Dehghani
Source-aligned data products are intended to represent operational sources within a shared data platform in a one-to-one relationship with operational entities and should not contain any business logic that would alter any of their properties.
Ownership
In a Data Mesh implementation, these data products must be owned strictly by the business domain that generates the raw data. The owner is responsible for the quality, reliability, and accessibility of their data, and the data is treated as a product that can be used by the same team and by other data teams in other parts of the organization.
This ownership ensures that domain knowledge stays close to the exposed data. This is critical to enabling rapid development of analytical data products, as any clarifications needed by other data teams can be handled quickly and efficiently.
Implementation
Following this approach, the Sales domain is responsible for publishing a data product 'sales_orders' and making it available in a shared data catalog.
The data pipeline responsible for maintaining the data product could be defined like this:
Data extraction
The first step in creating source-aligned data products is to extract the data we want to expose from operational sources. There are plenty of data integration tools that offer a user interface to simplify ingestion. Data teams can create a job there to extract raw data from operational sources using JDBC or API connections. To avoid wasting computational work, and whenever possible, only raw data updated since the last extraction should be incrementally added to the data product.
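As a minimal sketch of that incremental logic, an extraction query might filter on a last-modified timestamp. The table and column names (sales_documents, last_changed_at) and the bind parameter are assumptions for illustration, not actual ERP objects:

-- Illustrative incremental extraction: pull only the rows changed since the last successful run.
-- sales_documents and last_changed_at are hypothetical names; :last_successful_extraction_ts
-- is a bind parameter supplied by the extraction job.
select *
from sales_documents
where last_changed_at > :last_successful_extraction_ts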
Data cleaning
Now that we have obtained the desired data, the next step involves some curation, so that consumers do not have to deal with inconsistencies existing in the actual sources. Although no business logic should be implemented when creating source-aligned data products, basic cleansing and standardization are allowed.
-- Example of property standardisation in a SQL query used to extract data.
-- The source table name below is illustrative.
select
  case
    when lower(SalesDocumentCategory) = 'invoice' then 'Invoice'
    when lower(SalesDocumentCategory) = 'invoicing' then 'Invoice'
    else SalesDocumentCategory
  end as SALES_DOCUMENT_CATEGORY
from raw_sales_documents
Data update
Once the extracted operational data is ready for consumption, the internal data set of the data product is incrementally updated with the latest snapshot.
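As a sketch, this incremental update could be a merge on the business key. The data set and column names (sales_orders, stg_sales_orders_snapshot, sales_document_id) are assumptions, and merge syntax varies slightly across warehouses:

-- Illustrative incremental update of the data product's internal data set from the latest snapshot.
merge into sales_orders as target
using stg_sales_orders_snapshot as source
  on target.sales_document_id = source.sales_document_id
when matched then update set
  sales_document_category = source.sales_document_category,
  last_changed_at = source.last_changed_at
when not matched then insert
  (sales_document_id, sales_document_category, last_changed_at)
  values (source.sales_document_id, source.sales_document_category, source.last_changed_at)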
One of the requirements for a data product is to be interoperable. This means that we need to expose global identifiers so that our data product can be used universally in other domains.
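For example, a locally meaningful key could be exposed as a global identifier by qualifying it with its source system. The 'erp-emea' prefix and the concatenation convention below are assumptions:

-- Illustrative global identifier, added so other domains can join on it unambiguously.
-- 'erp-emea' is a hypothetical source-system prefix.
select
  concat('erp-emea:', sales_document_id) as global_sales_document_id,
  sales_document_category,
  last_changed_at
from sales_orders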
Metadata update
Data products must be understandable. Producers must incorporate meaningful metadata for the contained entities and properties. This metadata should cover these aspects for each property (a sketch follows the list):
- Business Description: What each property represents for the business. For example, “Commercial category for the sales order”.
- Source system: Establish a mapping with the original property in the operational domain. For example, “Original source: ERP | BIC/MARACAT property of the MARA-MTART table”.
- Data characteristics: Specific characteristics of data, such as enumerations and options. For example, “It is an enumeration with these options: Invoice, Payment, Claim”.
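As an illustration, and assuming a warehouse that supports column comments, this metadata could be attached directly to the data product's table; the content format here is just a convention:

-- Illustrative property metadata attached as a column comment (syntax varies by warehouse).
comment on column sales_orders.sales_document_category is
  'Business description: Commercial category for the sales order. Source system: ERP. Characteristics: enumeration (Invoice, Payment, Claim).'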
Data products must also be discoverable. Producers must publish them to a shared data catalog and indicate how the data will be consumed by defining the output port assets that serve as the interfaces through which the data is exposed.
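One common way to materialize an output port on a warehouse-based platform is a view over the internal data set. The view name, columns, and grant below are assumptions, and grant syntax varies by warehouse:

-- Illustrative output port exposed as a SQL view on the shared platform.
create view sales_orders_output_port as
select
  concat('erp-emea:', sales_document_id) as global_sales_document_id,
  sales_document_category,
  last_changed_at
from sales_orders;

-- Consumers are granted read access to the output port, not to the internal data set.
grant select on sales_orders_output_port to role data_consumers;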
And data products must be observable. Producers must implement a set of monitors that can be displayed within the catalog. When potential consumers discover a data product in the catalog, they can quickly understand the status of the data it contains.
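For instance, a basic freshness monitor could be a scheduled query over the data product whose result is surfaced in the catalog; the 24-hour threshold and column names are assumptions:

-- Illustrative freshness monitor: flag the data product as stale when no rows
-- have changed in the last 24 hours (interval syntax varies by warehouse).
select
  max(last_changed_at) as latest_record,
  case
    when max(last_changed_at) < current_timestamp - interval '24' hour then 'STALE'
    else 'FRESH'
  end as freshness_status
from sales_orders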
Now, again, imagine being a data engineer tasked with identifying the best-selling products within your company. But this time, imagine you have access to a data catalog that delivers data products that represent the truth of every domain that shapes the business. Simply enter “orders” into the data product catalog and find the entry posted by the sales data team. And, at a glance, you can evaluate the quality and timeliness of the data and read a detailed description of its content.
This enhanced experience eliminates the uncertainties of traditional discovery, allowing you to start working with data immediately. But what's more, you will know who is responsible for the data in case you need more information. And whenever there is a problem with your sales order data product, you will receive a notification so you can take action in advance.
We have identified several benefits of enabling operational data through source-aligned data products, especially when owned by data producers:
- Accessibility of curated operational data: In large organizations, source-aligned data products represent a bridge between the operational and analytical planes.
- Reduced collisions with operational work: Access to operational systems is isolated within source-aligned data product pipelines.
- Source of truth: A common data catalog with curated operational business objects reduces duplication and inconsistencies across the organization.
- Clear data ownership: Source-aligned data products should be owned by the domain that generates the operational data, so that domain knowledge stays close to the exposed data.
In my own experience, this approach works exceptionally well in scenarios where large organizations struggle with data inconsistencies across different domains and friction when creating their own analytics on operational data. Data Mesh encourages each domain to build the 'source of truth' for the core entities they generate and make them available in a shared catalog that allows other teams to access them and create consistent metrics across the organization. This enables data analytics teams to accelerate their work to generate analytics that drive real business value.
Data Mesh, Zhamak Dehghani (O'Reilly): https://www.oreilly.com/library/view/data-mesh/9781492092384/
Thanks to my Thoughtworks colleagues Arne (twice!), Pablo, Ayush and Samvardhan for taking the time to review early versions of this article.