Data Products: From Concept to Implementation

The argument for treating data as a product has already been fought and won: The industry agrees. Analysts have written the frameworks, conference talks have made the case, and most data leaders will tell you they’re bought in. And yet, most organizations still can’t point to a functioning data product in their stack.

Basically: The concept landed. The execution didn’t.

What happened is something we’ve watched play out across the DataHub community for years now: Data products got caught in the crossfire between data mesh theology, competing vendor definitions, and a gap between the people who design data products on whiteboards and the people who have to make them work in production. The result is a concept that everyone endorses, and almost nobody operationalizes.

What’s been missing isn’t another framework. It’s a model that connects to how people already work.

What do we mean when we say “data product”?

Let’s get the fundamentals out of the way so we know we’re sitting at the same table:

Definition: Data product

A data product applies product thinking to data assets. It has ownership. It’s discoverable. It’s documented. It’s governed. It has clear boundaries and serves a defined set of consumers. These are broadly accepted characteristics, and most modern definitions, from Gartner’s to Zhamak Dehghani’s original formulation within the data mesh framework, converge on these fundamentals.

Where it gets interesting is what happens next. Because the real question was never what a data product is. It’s what one looks like inside your actual infrastructure. That’s where most organizations get stuck.

At DataHub, we’ve arrived at a deliberately simple framing: A data product is a boundary drawn around existing assets (tables, pipelines, dashboards, topics, views) that makes their relationship, ownership, and purpose explicit. The assets already exist in your stack. The data product doesn’t create new infrastructure. It makes what’s already there legible and governable.

This distinction matters more than it might seem. It’s the difference between asking teams to architect something new from scratch and asking them to formalize what they’ve already built. One of those is a multi-quarter initiative. The other is something a team can do this sprint.

What data product initiatives repeatedly do wrong

We’ve spent years building alongside the DataHub open-source community—thousands of practitioners across industries working with real data stacks, real organizational constraints, and real pressure to show results. Across those conversations, the same question keeps surfacing: Why do data product initiatives stall between strategy and implementation?

The patterns of failure we see aren’t theoretical. They come up in community conversations, in Slack threads, and in the design feedback that shaped how we built data products in DataHub.

Three failure modes show up over and over:

1. Data products stay conceptual

Teams define data products in architecture decks and design documents, but they never connect to real assets in a real catalog. The definition exists in one world; the infrastructure exists in another. Without a connection between the two, data products remain aspirational—something the organization says it has, not something anyone can actually discover, consume, or depend on.

2. Governance gets bolted on after the fact

A team builds a data product: Defines the scope, assigns ownership on paper, ships some documentation. Six months later, when downstream teams start depending on the output, questions about quality, compliance, and stewardship arise. But the scaffolding isn’t there to answer them. Governance was treated as a follow-on initiative rather than something embedded in the data product from day one.

3. The business-technical divide never closes

Business stakeholders define what data products should represent. Engineers build the underlying infrastructure. But neither side participates meaningfully in the other’s process. Over time, the business definition drifts from the technical implementation. The data product that exists in the catalog doesn’t match what the business thinks it governs, and nobody has a single source of truth to reconcile the gap.

If any of these sound familiar, you’re not alone. They came up repeatedly when we designed the data product entity in DataHub, and they directly shaped what we built.

The DataHub approach: Drawing boundaries, not building new infrastructure

The failure modes above share a common root: They all treat data products as something you build on top of your existing stack rather than something you define within it. That’s the gap DataHub’s approach is designed to close.

Start with what already exists

Rather than asking teams to architect a new entity from scratch, DataHub’s model lets you draw a boundary around assets that already live in your stack. A revenue data product might include the pipelines your team runs, the tables they produce, and the dashboard that gets exported. The data product makes that grouping explicit, along with its ownership, documentation, and governance.

Think of it as the difference between building a new house and drawing a property line around one that’s already standing. The structure is there. The data product gives it an address, an owner, and a set of rules.

Shaped by practitioners, not a product roadmap

When we set out to model data products in DataHub, we didn’t start with a spec. We started with a community design process: An open channel where practitioners shared how they thought about data products, what they needed, and how they expected to interact with them. People brought use cases, visualizations, and academic references. The input was diverse and sometimes contradictory, which was exactly the point.

What emerged was a model grounded in how people actually work, not how a vendor thinks they should. That matters because the practitioners who use DataHub are the same people who’ve been burned by over-abstracted frameworks that look clean on a slide and fall apart in production.

Managed as code, accessible to everyone

A defining feature of DataHub’s approach is the YAML-based spec that lets teams define and manage data products as code. Developers can define data products in YAML, check them into Git, and sync definitions with DataHub. Business users can collaborate on and refine those definitions without needing to live inside a developer toolchain.

This isn’t shift-left as a buzzword. It’s a practical mechanism for closing the gap between the people who define data products and the people who build them, ensuring the definition and the implementation stay in sync rather than drifting apart in separate workflows.
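
To make that concrete, here’s a sketch of what a definition under this spec looks like. The field names follow DataHub’s data product YAML spec, but the URNs, owner, and asset list below are hypothetical, modeled on the revenue example above:

```yaml
# Hypothetical revenue data product: a boundary drawn around
# existing assets (a pipeline, a table, a dashboard).
id: revenue
domain: urn:li:domain:finance
display_name: Revenue
description: |
  Revenue pipelines, the tables they produce, and the dashboard
  exported to finance stakeholders.
assets:
  - urn:li:dataFlow:(airflow,revenue_rollup,PROD)
  - urn:li:dataset:(urn:li:dataPlatform:snowflake,finance.revenue_daily,PROD)
  - urn:li:dashboard:(looker,revenue_overview)
owners:
  - id: urn:li:corpuser:jdoe
    type: BUSINESS_OWNER
```

Checked into Git, a file like this can be synced with the DataHub CLI (e.g. `datahub dataproduct upsert -f revenue.yaml`), so changes to the definition flow through the same review process as any other code.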

From boundary to governance

Here’s what happens once you’ve drawn a boundary around a set of assets, assigned ownership, and documented what a data product contains: You’ve built the scaffolding for governance without launching a separate governance initiative.

Data products provide a unit at which you can actually govern. Quality standards, compliance requirements, access controls, and stewardship attach to the data product rather than to individual tables scattered across your warehouse. When a downstream team depends on your output, the governance framework is already in place because it was embedded in the data product definition from the start.

This is where data products stop being an organizational nicety and start being infrastructure. They’re the enabling layer for everything that needs to happen at enterprise scale—data quality, access control, compliance, discoverability. The governance doesn’t come later. It’s there from the moment the boundary is drawn.

As data products scale across an organization and enterprise data flows between teams that depend on each other’s outputs, the questions that arise (Who owns this? What are the quality expectations? Who’s accountable when something breaks?) already have answers. That’s not a small thing. In most organizations, those questions are what stall data product adoption in the first place.

Miro: What this looks like in practice

Miro’s data engineering team faced their own version of the problems described above, and their experience shines a light on what changes when data products are implemented rather than just discussed.

Miro had adopted Airflow as their central metadata hub for SLA validation, but the approach created significant friction. Data contracts lived in engineering-owned repositories and referenced internal task names that analytics users couldn’t interpret. Airflow alerts focused on pipeline statuses without providing business context. And because Airflow couldn’t see into downstream tools like Looker, the team had incomplete visibility into data product health. 

The gap between technical infrastructure and business understanding was wide, and it was growing.

When Miro implemented DataHub Cloud as their metadata management platform, the structural shift was concrete: They moved data product and contract definitions into their dbt repository, which meant analysts already familiar with the repo could contribute directly to product creation and quality standards by authoring YAML files aligned with the DataHub definition. Data products became fully discoverable in the UI, complete with contract details and readable SLAs.

“These initiatives not only build trust in our data but also empower stakeholders to make data-driven decisions with confidence, driving long-term business success in the dynamic data landscape.” 

– Ronald Angel, Data Products Manager, Miro

But what makes Miro’s story instructive isn’t just the outcome; it’s the mechanism. They didn’t build something new from scratch. They drew boundaries around existing assets, made ownership and quality expectations explicit, and gave both technical and business users a shared surface to collaborate on. That’s the model working as intended.

Data products aren’t the destination

Data products are not the end state. They’re what make the end state possible.

When data is bounded, owned, documented, and governed at the product level, everything downstream moves faster, whether that’s self-service analytics, cross-team collaboration, or regulatory compliance. This is especially true for AI. 

Organizations investing in AI initiatives are quickly discovering that models are only as reliable as the data they consume. Without data products (and the clear ownership, documented lineage, and embedded quality standards they provide), AI systems are building on ungoverned, undocumented foundations. Data products help make data more AI-ready, not through a separate “AI readiness” initiative, but as a natural consequence of managing data the way it should have been managed all along.

We built DataHub’s data product model because practitioners told us they needed it and because we believe the gap between “data as a product” as a philosophy and data products as a functioning part of your stack shouldn’t take years to close. The tools exist. The model is proven. What’s left is implementation.

Get started with Data Products in DataHub Cloud → 

Explore the Data Products documentation → 

Join the DataHub Community on Slack →

FAQs

What is a data product?

A data product is a managed, reusable collection of data assets, such as tables, pipelines, dashboards, and views, that is owned, documented, discoverable, and governed. It applies product thinking to data: Treating data assets with the same rigor applied to software products, including clear ownership, defined consumers (whether analysts, data scientists, or business users), quality standards, and lifecycle management. In DataHub, a data product is defined as a boundary around existing assets that makes their relationship, ownership, and purpose explicit.

What’s the difference between a data asset and a data product?

A data asset is an individual resource: A table, a dashboard, a pipeline, a dataset. It might be raw data or a transformed output. 

A data product is a curated grouping of data assets that belong together, with added context: Ownership, documentation, governance policies, and quality expectations. 

The distinction is similar to the difference between individual components and a packaged product. A data product makes a set of assets discoverable and consumable as a coherent unit, rather than leaving data consumers to navigate individual assets on their own.

Why do data products matter for AI?

Data products are increasingly recognized as a key enabler for AI readiness. AI and machine learning models are only as reliable as the data they consume. Without clear ownership, documented lineage, quality standards, and governance, organizations risk feeding ungoverned data into production AI systems, and data teams end up fielding quality issues they have no framework to resolve.

Data products provide the structure that helps make data more AI-ready: Bounded, trusted, discoverable, and governed. Rather than launching a separate “AI data readiness” initiative, organizations that have implemented data products already have the scaffolding in place to support reliable AI workflows.

How do data products relate to data mesh?

Data products are one of the four core principles of data mesh, the decentralized data architecture framework introduced by Zhamak Dehghani. However, data products are not exclusive to data mesh. Organizations can implement data products regardless of whether they’ve adopted a full data mesh architecture. The core principle that data should be managed with product thinking, clear ownership, and defined boundaries (rather than left in data silos organized by technology) applies whether your architecture is centralized, federated, or somewhere in between.

What are some examples of data products?

A data product can take many forms depending on the organization and domain. Common examples include: 

  • A “Customer 360” data product that bundles customer records from CRM, support, and product usage into a single governed, discoverable entity
  • A revenue data product that includes the pipelines a finance team runs, the tables they produce, and the dashboards used for reporting
  • A marketing attribution data product that combines event data, campaign metadata, and conversion metrics with documented ownership and quality SLAs

In each case, the data product draws a boundary around assets that belong together, makes them consumable as a unit, and ties them to measurable business value.

What’s the difference between “data as a product” and a data product?

“Data as a product” is a mindset: The principle that data should be managed with the same rigor as a product, including understanding consumer needs, iterating based on feedback, and maintaining quality standards. 

A “data product” is the concrete implementation of that mindset: A specific, defined entity in your data stack with ownership, documentation, governance, and clear boundaries—accountable to both data producers and their downstream consumers.

One is the philosophy; the other is what it looks like when you operationalize it.

What are the key characteristics of a data product?

Widely accepted characteristics include: 

  • Discoverable: Consumers can find it without institutional knowledge
  • Owned: A specific team or individual is accountable for its quality and reliability
  • Documented: Its purpose, contents, and usage are clearly described
  • Governed: It has access controls that restrict usage to authorized users, along with quality standards and compliance policies
  • Interoperable: It can be consumed across tools and platforms
  • Self-describing: Metadata and schema information are embedded, not stored separately
  • Addressable: It has a unique identifier that enables programmatic data access

How do you create a data product in DataHub?

DataHub supports creating data products through the UI or via a YAML-based spec. Using the UI, you can define a data product within a domain, assign assets to it, and attach ownership, documentation, tags, and glossary terms. Using the YAML spec, teams can define data products as code, check definitions into Git, and sync them with DataHub—enabling version control and collaboration between technical and business users. For a full walkthrough, see the DataHub Data Products documentation.