Starting a Data Governance Journey
This post, part of our DataHub Blog Community Contributor Program, is written by Venkata Krishnan, a DataHub Community Member.
Data Governance used to be a fancy topic roughly a decade ago (circa 2011/12), and when I first heard about it, my first question was:
Why do we need Data Governance at all?
Behind that question was a voice in my head asking: Are we really being strategic by funding a research project here?
I found my answers in these examples:
- In a retail business, a bad address or incorrect demographic information can cause parcels to be returned or to make the rounds, and resending them adds a few more transits. This is a typical Master Data Management/Data Quality issue that can result in a sizeable monetary loss.
- Imagine that you need to communicate with a customer about two different products or lines of business (LoB). If we maintain the customer’s demographic details in two different places or systems, we could end up sending two different mailers that could, and should, have been consolidated into one. Say the cost of sending a mailer is $1 and there are 100,000 customers in the system: we could be spending $200,000 instead of $100,000.
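The mailer arithmetic can be written out as a small sketch. The function name and its parameters are illustrative assumptions; only the $1 cost, the 100,000 customers, and the two copies come from the example.

```python
def mailer_cost(customers: int, cost_per_mailer: float, copies_per_customer: int) -> float:
    """Total spend when every customer receives `copies_per_customer` mailers."""
    return customers * cost_per_mailer * copies_per_customer


customers = 100_000
cost_per_mailer = 1.00  # dollars, as in the example

# Two systems each hold the customer record, so each customer gets two mailers.
siloed = mailer_cost(customers, cost_per_mailer, copies_per_customer=2)
# One consolidated customer record means one mailer per customer.
consolidated = mailer_cost(customers, cost_per_mailer, copies_per_customer=1)
avoidable = siloed - consolidated

print(f"Siloed: ${siloed:,.0f}, consolidated: ${consolidated:,.0f}, avoidable: ${avoidable:,.0f}")
# prints "Siloed: $200,000, consolidated: $100,000, avoidable: $100,000"
```

The avoidable amount grows linearly with every additional system that keeps its own copy of the customer record.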
As is clear in the given scenarios, the simple starting point of a Data Governance journey is knowing where customer (master) data is, how many systems consume it now, and how many can potentially consume it in the future.
But an equally important question needs to be answered: How do we provide an authoring mechanism with a straightforward workflow (with a feedback loop) to ensure that the customer record is accurate? This also requires business knowledge of different customer touchpoints and how communications and interactions could be optimized.
The WHY of Data Governance
Why has the importance of Data Governance exploded?
The earliest mention of Data Governance on Wikipedia dates back to 2006. If you’re curious, compare the article as it stood then with the 2022 version to put into perspective how far we have come.
In my view, the top reasons for this explosion are cloud computing and the SaaS business model, in addition to the operational and monetary reasons that always existed. While cloud computing has commoditized computing and storage, SaaS solutions have changed how we think about business. Additionally, Artificial Intelligence and Data Science have matured since cloud computing came into the picture. These developments have led to an explosion in the amount of data businesses generate.
And yet, even today, modern data stacks/platforms/systems work with silos of information, often relying on humans with business and operations knowledge and good data stewardship to arrive at optimizations.
Can we leave Data Governance to individuals? Can they decide how to use data most optimally?
It is not humanly possible (or even advisable) for one person to master 10 LoBs in depth. Too much could be lost in translation and in the transition from the past to the future. Governance requires data and business literacy and a huge dose of teamwork, especially in large businesses. So a democratic approach to arriving at “data and business literacy”, backed by a solid framework, becomes critical, especially for rapidly growing companies that often merge with or acquire other companies.
Now that the “why” of Data Governance is clear, let us get into the “what” and the “how”.

What does a Good Data Governance Journey Look Like?
Google “Data Governance use-cases” in 2022, and the following top 5 use-cases come up:
- Data discovery and data literacy provisions.
- Collaborative analytics or building new data products.
- Data privacy compliance.
- A centralized repository of all standardized business terms.
- Centralized data access management.
P.S: I have deliberately changed the order to bring ‘Data Discovery’ to the top. Sorry, Google Search!
When one of my project partners asked me, “What would be a good first step for a Data Governance initiative?”, I was initially puzzled, even though I have been a data professional for most of my career. But soon enough, a simple common-sense question brought the answer to me:
“Without knowing what we are going to govern, what can we govern?”
The simple answer: Assets in the context of Data and Business (in line with “data and business literacy”).
Active metadata catalogs and discovery tools are essential in a modern data stack to manage all the data assets in near-real time.
As business and the underlying data structures keep changing dynamically and rapidly, metadata changes at source must be accommodated, without adversely affecting the data pipelines, analytical systems, and other BI systems downstream.
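As a minimal sketch of what accommodating metadata changes at source can involve, the snippet below compares two snapshots of a table’s schema and flags the changes that could hurt downstream consumers. The snapshot format, the function names, and the rule that only removed or retyped columns count as breaking are all assumptions made for illustration; this is not how any particular catalog implements it.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> data type


def diff_schema(old: Schema, new: Schema) -> Dict[str, List[str]]:
    """Compare two schema snapshots and report what changed at the source."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(col for col in shared if old[col] != new[col]),
    }


def is_breaking(changes: Dict[str, List[str]]) -> bool:
    # Assumption: new columns are harmless to existing pipelines,
    # while removed or retyped columns may break them.
    return bool(changes["removed"] or changes["retyped"])


# Hypothetical snapshots of a customer table on two consecutive days.
yesterday = {"customer_id": "int", "email": "string", "zip": "int"}
today = {"customer_id": "int", "email": "string", "zip": "string", "country": "string"}

changes = diff_schema(yesterday, today)
print(changes)               # {'added': ['country'], 'removed': [], 'retyped': ['zip']}
print(is_breaking(changes))  # True
```

A check like this could run whenever the catalog ingests fresh metadata, so that owners of downstream pipelines and dashboards hear about a breaking change before it reaches them.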
Do check out the O’Reilly reference linked in the References section for validation of this approach.
What Should be Part of a Good Data Governance Solution?
Data Governance is a set of policies and processes. The following are the key ingredients to Data Governance success over time:
- An active data catalog
– Inventory of data assets
– Data lineage, ideally with support for column lineage
- Good data lifecycle management
– Data collection/ingestion frameworks
– Data storage
– Data retention policies
- Data pipeline orchestration
- Data security
– PII Data Management
– Data Access Controls (RBAC)
- Data democracy for public data
- Data quality
– MDM
– Quality Metrics
- Data monitoring and alerting
– Availability of data at the right SLAs
– Health
– Anomaly detection
- Using Data Science as a tool for better governance
- Business intelligence
It all begins with identifying the right data catalog/ discovery tools.
How do we Crack this Data Governance Puzzle?
The Ideal Proactive Approach
This is rare, but it is the ideal case. If most of the data a company deals with are generated internally, we can be extremely paranoid about how we instrument the application(s) to collect the data we need to govern. Instead of creating a huge haystack of data from which to search needle-sized data, we could arrive at the metrics we need in advance and work backward from there.
Data governance is not about generating and managing huge volumes of data without a purpose. In practice, however, this level of maturity is rare when a company starts up, as the founders and engineering leaders may be business or product experts, not data experts. Also, arriving at a good Data Governance strategy and the corresponding solution takes a considerable investment of time and money, something businesses may not be able to afford in the early days.
To make things harder, for external sources of data, it is not always possible to plan, given that business needs keep changing and evolving.
A More Pragmatic Approach
Guidelines for an ideal solution:
Identify a good Data Catalog that integrates with existing and future/planned data systems. Ensure that it:
- Is data platform agnostic
- Is on-prem/cloud agnostic
- Is scalable
- Is cost-effective, and offers democratic access to thousands of users
- Enables a data culture across the company
- Supports expansions of business, M&A, etc.
- Supports all or most of the popular open-source data tools available
- Has a buzzing and helpful community
- Offers regional context and support (for global companies)
Key recipients:
There are multiple user personas in a company who could benefit from such a solution. These typically include (but are not restricted to):
- CXOs
- Product Managers
- Data Analysts
- Data Scientists
- Data Engineers/ Engineers
- Data Architects
- Data Stewards
Or in effect, anyone looking to get value out of data.
Finding the Right Solution
A granular, scientific approach to identifying all the features outlined above in a catalog, discovery, and observability tool is one part of solving the problem. However, finding one tool that solves all your unique problems is almost impossible!
Trust me: a tool just enables you to do the job, but a Data Governance journey is a tricky balance of people, policies, and processes, and realizing this is critical to any such journey.
My recommendation: Classify the use cases as must-haves (critical), ideal candidates (high), good to have (medium), and anything else (low), based on how much weight each carries for a successful Data Governance strategy. It is critical to have a clear requirements document that describes all the use cases in enough detail to show what everyone eventually wants the solution to be.
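One way to make this weighting concrete is sketched below. The use-case names come from the list earlier in this post; the weights and the cut-offs between critical, high, medium, and low are purely hypothetical and would need to be agreed on by your own stakeholders.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str
    weight: float  # usefulness score between 0 and 1 (hypothetical values below)


def priority(weight: float) -> str:
    """Map a usefulness weight to a priority bucket (cut-offs are assumptions)."""
    if weight >= 0.8:
        return "critical"
    if weight >= 0.6:
        return "high"
    if weight >= 0.3:
        return "medium"
    return "low"


use_cases = [
    UseCase("Collaborative analytics", 0.2),
    UseCase("Data discovery and data literacy", 0.9),
    UseCase("Centralized data access management", 0.4),
    UseCase("Data privacy compliance", 0.85),
    UseCase("Repository of standardized business terms", 0.6),
]

# Highest-weighted use cases first, each labelled with its bucket.
ranked = sorted(use_cases, key=lambda uc: uc.weight, reverse=True)
for uc in ranked:
    print(f"{priority(uc.weight):<8} {uc.weight:.2f}  {uc.name}")
```

The ranked output can seed the requirements document: critical items define the first phase, and everything else waits until those are delivered.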
The Timing For a Good Data Governance Initiative
Be it a Data Platform, a Data Governance initiative, Data Products, or a Data Mesh implementation, everything hangs on business priorities and cost. Where do you want to spend time and money in the given context of business and operational needs? Is the proposed solution even good in the given business context?
A clear, risk-free approach of dividing the problem into chunks and solving one at a time is crucial to a good Data Governance initiative in the long run.
Data Governance in the New World
If you have read this far, I’m sure you have heard terms like Data Mesh, Data Contracts, Data Fabric, Data Virtualization, etc.
What do all of these have to do with Data Governance?
Thankfully, Data Governance cuts across all businesses and domains and does not change with the underlying Data Architecture or Data Platforms. It is not a commodity service or product, but a discipline (policies and processes) that the organization carries in the long term.
While it can be a challenge to carve out the time, money, people, and resources for data governance, the sooner you do it, the better it is for the business and everyone involved.
Team Structure
Whatever the design, avoiding a single point of failure is crucial, and the members of such a team can be:
- Domain expert(s) of the LoB(s)
- Engineer(s): software developers and data engineers
- Product Manager, specifically for any Data Products
- Data Analyst(s)
- Documentation Experts
- Program/ Project Manager
- Data Steward(s) who can handle/ understand multiple LoBs/ Domains
Open Source Tools
There are several open-source tools for Data Cataloging/Data Discovery. The most popular ones are DataHub, Amundsen, and Atlas, in that order (in terms of rich features and GitHub stars and forks). I felt DataHub has great community support, but that is, of course, my personal opinion at this point.
Please reach out to me at venkat@resolv360.com if you are interested in learning more.
References
https://www.oreilly.com/library/view/data-governance-the/9781492063483/ch01.html