Metadata Day Round-Up: 5 ways to empower data producers (Warning: Data Contracts ahead!)

How time flies! In the previous edition of the Metadata Day Round-Up series, I briefly touched upon the implementation side of data governance with a smattering of insights and ideas from the Metadata Day expert panel held in the summer of 2022. Today, let’s get down to brass tacks and answer this important question: How do we help data producers make data governance happen?
The data community has been really enamored with the ideas of data mesh, as proposed by Zhamak Dehghani, and more recently data contracts, as proposed by Chad Sanderson; the principles behind these ideas were part of what we discussed as well. Here are 5 practical tips for giving your data producers the superpowers they need to make modern data governance a reality for your organization.
1. Shift Left, but prepare right
The essence of Shift Left is simple: move the source of truth for metadata to live as close as possible to the source of the data definitions themselves.
We need to go beyond schema at a physical level (columns and types) to a schema-plus-semantics approach to know what the data means, not just what its structure is — and the best place to add that information is on the “left”, where the data is produced.
This idea is something Josh and I have worked very closely on, and are deeply passionate about. Check out our detailed post: Shifting left on governance: DataHub and schema annotations.
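To make that concrete, here is a minimal sketch of what schema-plus-semantics could look like, using Avro-style custom field attributes (Avro tolerates extra attributes in field definitions; the `semantic_type` and `owner` conventions here are illustrative assumptions, not a DataHub or Avro standard):

```python
# An illustrative Avro-style schema in which each field carries a semantic
# annotation alongside its physical type. "semantic_type" and "owner" are
# hypothetical conventions, not part of the Avro specification.
user_event_schema = {
    "type": "record",
    "name": "UserEvent",
    "fields": [
        {
            "name": "email",
            "type": "string",
            # Semantics: what the field means, declared where it is produced.
            "semantic_type": "user_provided_email",
        },
        {
            "name": "event_ts",
            "type": "long",
            "semantic_type": "event_timestamp_millis",
        },
    ],
    # Ownership lives with the schema, not in a separate spreadsheet.
    "owner": "team-identity@example.com",
}
```

Because the annotations live in the same artifact as the schema, they get versioned, reviewed, and deployed exactly like the data definitions themselves.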
While on the topic, something Nishant said really resonated: shift left, but prepare right.
This means that while you go far left to identify the data, its ownership, and so on, you should design your downstream systems and tooling to give the data producer some value in return for all their work upstream.
As an example, this could mean offering tools that deliver certain guarantees or benefits once metadata is provided by data producers, say by handling extraction, preservation, encryption, anonymization, dissemination, and so on. This way, engineers will see that their accountability and involvement end with correctly identifying data and its risk, leaving them more time to do the work they’re actually supposed to do.
2. Separate semantics from policy
So what should these data contracts contain? What information can we rely on data producers to provide reliably? Schemas and “schema-attached” business metadata like classification tags are not enough. Your contract specification language should also cover quality-related metadata like SLAs and data distributions, data-management metadata like retention and entitlements, and higher-level business context.
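As a hedged illustration of that scope, a contract specification might look something like the sketch below; every key and threshold is a hypothetical example of shape, not a reference to any particular contract standard:

```python
# Hypothetical data contract covering schema, semantics, quality,
# and data-management metadata. The keys are illustrative, not a standard.
orders_contract = {
    "dataset": "analytics.orders",
    "schema": {"order_id": "string", "amount_usd": "double", "buyer_email": "string"},
    "semantics": {"buyer_email": "user_provided_email"},
    "quality": {
        "freshness_sla_hours": 6,  # data must land within 6 hours
        "max_null_fraction": {"order_id": 0.0, "amount_usd": 0.01},
    },
    "management": {
        "retention_days": 365,
        "entitlements": ["analytics-readers"],
    },
    "business_context": "One row per confirmed order; source of truth for revenue reporting.",
}
```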
While determining “what” the specification should contain, the panel felt very strongly that we need to separate the semantics of the annotations from the policies that they activate. This means that we need to focus on:
- Standardized vocabularies
- Mappings
- Ownership, etc., in the schema
If you can cleanly define the semantics of data and separate them from policies, engineers can annotate their data with policy-relevant information without being experts in the policy. They only need to know enough, in terms of the vocabulary and its meaning, to annotate at the right place and at the right time.
A very simple example of this: ask the producer to label whether a particular field contains user-provided email addresses, but spare them from determining whether that constitutes highly confidential information or personally identifiable information. Let the centrally defined mapping layer handle that.
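Here is a minimal sketch of that separation, with a hypothetical, centrally owned mapping from semantic labels to policy classifications:

```python
# Central, policy-team-owned mapping from semantic labels to policy
# classifications. Producers only ever use the left-hand vocabulary.
SEMANTIC_TO_POLICY = {
    "user_provided_email": {"classification": "PII", "confidentiality": "high"},
    "event_timestamp_millis": {"classification": "NON_SENSITIVE", "confidentiality": "low"},
}

def classify(semantic_type: str) -> dict:
    """Resolve a producer-supplied semantic label to its policy treatment."""
    # Unknown labels default to the most restrictive treatment.
    return SEMANTIC_TO_POLICY.get(
        semantic_type, {"classification": "UNKNOWN", "confidentiality": "high"}
    )

# The producer only said "this field holds user-provided emails";
# the policy layer decides everything else.
print(classify("user_provided_email"))
# {'classification': 'PII', 'confidentiality': 'high'}
```

Defaulting unknown labels to the most restrictive treatment keeps the failure mode safe when the producer vocabulary evolves faster than the policy mapping.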
An important caveat that Josh called out here was that “as important as it is to make data governance practices easier for engineers, you also need a specialized team that’s thinking deeply about relating data policies to regulations — whether it is ontologists, schema design experts, etc.”
3. Automate, Automate, Automate
While giving data producers the tools to supply metadata at the source is great for sustainable governance, there is often a bootstrapping problem. How do you get started on an initiative that requires you to change business processes? What should the contract even look like?
Automation is essential here, not only for auto-creating or suggesting the first version of the “data contract specification” for the data producer, but also for continuously validating that the reality of the data in production is actually aligned with what the producer has declared. In other words, contract monitoring is essential both for creating the first specification and for continuing to assert that the contract remains a valid specification.
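A toy sketch of the validation half of that loop, reusing the hypothetical contract shape from the earlier sketch (the observed statistics would come from profiling your production data):

```python
def validate_contract(contract: dict, observed: dict) -> list[str]:
    """Compare observed production statistics against the declared contract."""
    violations = []
    # Freshness: did the latest data land within the declared SLA?
    sla_hours = contract["quality"]["freshness_sla_hours"]
    if observed["hours_since_last_load"] > sla_hours:
        violations.append(
            f"freshness: {observed['hours_since_last_load']}h since last load > {sla_hours}h SLA"
        )
    # Null budget: is each guarded column within its declared null fraction?
    for column, budget in contract["quality"]["max_null_fraction"].items():
        actual = observed["null_fraction"].get(column, 1.0)
        if actual > budget:
            violations.append(f"nulls in {column}: {actual:.2%} observed > {budget:.2%} allowed")
    return violations

contract = {"quality": {"freshness_sla_hours": 6, "max_null_fraction": {"order_id": 0.0}}}
observed = {"hours_since_last_load": 9, "null_fraction": {"order_id": 0.03}}
print(validate_contract(contract, observed))
# ['freshness: 9h since last load > 6h SLA', 'nulls in order_id: 3.00% observed > 0.00% allowed']
```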
This interaction between automation and humans will typically involve a “process automation” tool like JIRA to bridge the two worlds. Amol made a very strong case for process automation as a powerful but as-yet underexplored aspect of automation in data governance.
Use Process Automation Tools to bridge the gap
I have personal experience with this approach at LinkedIn, where we extracted metadata from DataHub and processed it in bulk to determine which data assets needed human intervention to get back to a healthy state (e.g., ownership coverage too low, classification tags missing, declared retention not matching observed retention). Assets failing these tests triggered JIRA tickets to drive the remediation process. Finally, a global monitoring application tracked the progress of these tickets and raised alerts up the management chain if issues persisted beyond their SLA.
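As a sketch of the remediation plumbing, assuming the open-source `jira` Python client and a hypothetical governance project key (the health checks themselves would come from your bulk metadata scan, e.g. over DataHub):

```python
from jira import JIRA  # pip install jira

def file_remediation_tickets(failing_assets: list[dict], client: JIRA) -> None:
    """Open one JIRA ticket per unhealthy asset to drive remediation."""
    for asset in failing_assets:
        client.create_issue(
            project="GOV",  # hypothetical governance project key
            summary=f"Governance check failed: {asset['urn']}",
            description="Failed checks: " + ", ".join(asset["failed_checks"]),
            issuetype={"name": "Task"},
        )

# Hypothetical output of a bulk metadata scan.
failing = [{
    "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,orders,PROD)",
    "failed_checks": ["missing classification tags", "no valid owner"],
}]
client = JIRA(server="https://example.atlassian.net", basic_auth=("bot@example.com", "API_TOKEN"))
file_remediation_tickets(failing, client)
```

An SLA monitor can then poll these tickets and escalate up the management chain, mirroring the global monitoring application described above.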
We’re applying a similar approach, not just in batch but in real time, through the “Governance Tests” feature of the Acryl DataHub offering, and I’m really energized to see the diversity of use cases where our customers are applying it.
Automated Governance Tests: Acryl DataHub
4. Getting funding for automation: Link governance automation to business improvements
Even if you are convinced that automation is the way to unlock scalable governance at your organization, you will still need to convince the people who control the purse strings (a.k.a. budget holders) that this is an area worth investing in. For far too long, the ask has been, as Nishant says, to ‘make room for data governance’; but automation at scale and governance done right have tangible business benefits. We discussed a few anecdotes that can serve as strong business cases for data governance automation.
Nishant shares an example of data governance leading to business improvements
In this section of the metadata panel above, Nishant shared his experience at Uber, where the privacy team added an element of automation to infer schemas within unstructured sections of their event streams (e.g., inferring which fields inside JSON blobs hold email addresses). Engineers could decide whether to keep accessing these blobs as unstructured data or to use the inferred schema available to them. Over time, besides helping correlate the data to future downstream policies via the AI model, this approach also provided business benefits like:
- Lower data encryption and storage costs since more data was deleted
- Fewer instances of queries timing out due to large data sizes
So it ended up being a win-win: automation made data management easier and more accurate, and it gave engineers time back to be more productive, resulting in a net business benefit!
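Uber’s actual models aren’t public, but a deliberately naive sketch of that kind of inference, using a regex scan over sampled JSON blobs (the threshold and field names are assumptions), might look like this:

```python
import json
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def infer_email_fields(json_blobs: list[str], threshold: float = 0.8) -> set[str]:
    """Flag top-level keys whose values look like email addresses in most samples."""
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    for blob in json_blobs:
        for key, value in json.loads(blob).items():
            totals[key] = totals.get(key, 0) + 1
            if isinstance(value, str) and EMAIL_RE.match(value):
                hits[key] = hits.get(key, 0) + 1
    # Keep only keys that look like emails in at least `threshold` of samples.
    return {k for k, n in totals.items() if hits.get(k, 0) / n >= threshold}

samples = ['{"contact": "a@example.com", "note": "hi"}',
           '{"contact": "b@example.org", "note": "x@y"}']
print(infer_email_fields(samples))  # {'contact'}
```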
5. An Unconventional Idea: Make business users part of data production
Traditionally, business users have been looked at as consumers of data. However, one of the tenets of the data mesh proposal is that the business takes greater ownership of data products.
So what role can business users play in the creation of a data product?
Teresa has had some great success involving business users early in the creation of data products. Watch the section of the video below to hear what she shared about this idea and the domains where it has worked well.
Involving Business Users in creation of data products
In a nutshell, since business users are best placed to describe the value of the data and to create the definitions that describe it, they can also provide the definitions of quality for those domains.
So, interestingly, the production of data itself should involve both the “technical” persona and the “business” persona in creating a complete requirements specification for the data product.
Here’s what this could look like in action:
- Understand each cohort’s needs — who needs the data and why?
- Catalog these identified use cases and business glossary terms into a business or logical layer.
- Connect the business terms, the people, organizations, and the use cases to the technical layer that holds the physical data asset(s) which serve the business use-case.
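As a minimal sketch of that last step, here is what linking a business-defined glossary term to a physical asset could look like in a hypothetical in-memory catalog:

```python
from dataclasses import dataclass, field

@dataclass
class GlossaryTerm:
    """Business-layer definition, owned by the business persona."""
    name: str
    definition: str
    steward: str  # the business user accountable for the term
    use_cases: list[str] = field(default_factory=list)

@dataclass
class PhysicalAsset:
    """Technical-layer asset, owned by the technical persona."""
    urn: str
    linked_terms: list[GlossaryTerm] = field(default_factory=list)

churn = GlossaryTerm(
    name="Customer Churn",
    definition="A customer with no purchase in the trailing 90 days.",
    steward="maria@example.com",
    use_cases=["quarterly retention report"],
)
table = PhysicalAsset(urn="urn:li:dataset:(snowflake,marts.churn_scores,PROD)")
table.linked_terms.append(churn)  # business meaning now travels with the asset
```

The business persona owns the definition and the steward field, the technical persona owns the physical asset, and the link between the two is what turns both layers into a single data product specification.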
If you’ve gotten this far, you now know 5 things you can do to make your data producers an integral part of your data governance strategy.
In the next few posts in the series, we’ll explore decentralization, knowledge graphs, and the specifics of looking at governance as code.
Stay tuned!