Context Preparation vs. Data Preparation: Why Agentic AI Needs Both
TL;DR
- Data preparation makes raw data usable for human analysts. Context preparation makes prepared data usable for AI agents. The two are sequential disciplines, not competing ones, and production-grade agentic AI requires both.
- Data preparation quietly relied on a human analyst to supply meaning, ownership, and trust signals at query time. Agents inherit the data but not the judgment, which is why clean data alone produces hallucinated and inconsistent agent behavior.
- Context preparation activates institutional knowledge (lineage, business glossary terms, ownership, documentation, runbooks, and curated queries) into a queryable context graph that agents can reason against.
- 66% of organizations report AI models generating biased or misleading insights due to low context maturity, and 83% agree agentic AI cannot reach production value without a context platform (2026 State of Context Management Report).
Most teams shipping agentic AI into production are running into the same wall: The data is clean. The pipelines are healthy. The warehouse is in good shape. And the agents still hallucinate, contradict each other, and return inconsistent answers to questions that should have one right answer.
It is tempting to read this as a data quality problem. It is not. It is a context problem, and the discipline that solves it (context preparation) is distinct from the one most data teams have spent the last decade mastering (data preparation).
The two are sequential, not competing. But conflating them is one of the most common reasons agentic AI fails to reach production value, and the fix starts with understanding why context management is a separate job from data management in the first place.
Quick definition: What is context preparation?
Data preparation cleans and structures raw data so human analysts can query it. Context preparation activates institutional knowledge (lineage, business glossary, ownership, documentation, runbooks, and curated queries) into a queryable context graph so AI agents can reason against it. Both are required for agentic AI. They are sequential disciplines, not competing ones.
What data preparation was built to do
Data preparation is a mature, well-understood discipline. The job is to take raw source data and run it through a data pipeline that makes it analytically usable. That means standardizing formats, removing outliers, correcting errors, enriching records, and combining datasets so the output is something a human analyst can query without wrestling the source.
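To make the contrast concrete, here is a minimal sketch of what a data preparation step does, using pandas; the records and thresholds are invented for illustration:

```python
import pandas as pd

# A hypothetical raw export: mixed date formats, inconsistent casing,
# a duplicate row, and an obviously bad record.
raw = pd.DataFrame({
    "signup_date": ["2024-01-05", "01/05/2024", "2024-01-05", "2024-01-07"],
    "plan": ["Pro", "pro ", "Pro", "PRO"],
    "monthly_spend": [49.0, 990000.0, 49.0, 99.0],
})

prepared = (
    raw.assign(
        # Standardize formats so values compare cleanly (format="mixed" needs pandas >= 2.0).
        signup_date=lambda df: pd.to_datetime(df["signup_date"], format="mixed"),
        plan=lambda df: df["plan"].str.strip().str.lower(),
    )
    .query("monthly_spend < 10000")  # crude outlier rule, for illustration only
    .drop_duplicates()
)
```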
The output of data preparation is clean, structured data. The consumer of that output, historically, has been a person.
That detail matters more than it sounds. Data preparation was built for the human-analyst era, where a data analyst sat between the warehouse and the business question. They wrote the SQL. They interpreted the result. And they brought a layer of judgment to the work that data preparation never had to encode, because the analyst supplied it at query time.
The assumption data preparation quietly relied on: A competent human
The implicit dependency in every data preparation pipeline is that a competent human will be on the other end of the query. The analyst knew:
- Which table was authoritative and which was a stale copy
- That “active user” meant one thing in the mobile product and something different in the web product
- That last week’s engagement drop was a monitoring incident, not a real behavior change
- Which dashboard the CFO actually trusted
None of that knowledge lived in the data warehouse. None of it was the job of the data preparation pipeline. It lived in the analyst’s head, in Slack threads, in runbooks, in institutional memory, and (when teams were disciplined about it) in the data catalog.
Clean data plus a competent human equaled useful answers. The arrangement worked for as long as the consumer was human.
What changed when the consumer became an AI agent
Agents inherit the data. They do not inherit the analyst’s judgment.
When an AI agent retrieves a clean, well-formatted revenue figure from a perfectly prepared table, it still does not know:
- Which definition of revenue applies to the question it is trying to answer
- Whether the table is the authoritative source or a derivative copy that was supposed to be deprecated last quarter
- Whether a golden query already exists for this class of question, written by a senior analyst who has thought about the edge cases
- That the metric was redefined three months ago
The agent has no institutional context unless someone has built infrastructure to give it some. This is the inflection point. The data is fine. The consumer changed.
(See our deeper treatment of context-aware AI agents for how this plays out in practice.)
What context preparation actually is
Context preparation is the discipline of activating institutional knowledge for machine consumption. It is a sibling discipline to data preparation, with its own inputs, its own outputs, and its own owners.
The inputs are not raw data. They are the artifacts that capture the domain knowledge humans have always relied on to make sense of data: lineage graphs that show where a column came from, business glossary terms that resolve what “customer” means in a given context, ownership records that identify who is accountable for a dataset, documentation that captures known issues and decisions, runbooks that encode standard operating procedures, and curated queries that represent the institutional answer to a recurring question.
The output is a queryable context graph. Not a cleaner table. Not a better pipeline. A structured layer of meaning that machines can retrieve from, the same way analysts used to retrieve from their own memory.
This is where the framing matters: Context preparation is not a feature of context engineering, and it is not a synonym for data preparation done more carefully. It is the upstream discipline that makes both possible. Context engineering is how prepared context gets assembled at runtime for a specific agent task. Context preparation is the work of building the assembled-from layer in the first place.
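The division of labor is easier to see in code. Here is a toy sketch, with invented names and a plain dictionary standing in for a real context platform:

```python
# Context PREPARATION: build the assembled-from layer. This is continuous
# platform work, not something that runs per agent request.
def prepare_context(artifacts: list[dict]) -> dict[str, dict]:
    """Index institutional artifacts (glossary terms, lineage notes,
    runbooks, ownership records) into one retrievable structure."""
    return {a["topic"]: a for a in artifacts}

# Context ENGINEERING: assemble task-specific context at runtime,
# against the layer that preparation already built.
def engineer_context(graph: dict[str, dict], task_topics: list[str]) -> str:
    """Pull only the nodes relevant to this task into the prompt window."""
    return "\n".join(graph[t]["body"] for t in task_topics if t in graph)

graph = prepare_context([
    {"topic": "revenue", "body": "Glossary: revenue = recognized revenue, net of refunds."},
    {"topic": "fct_revenue", "body": "Lineage: derived from billing_export (authoritative)."},
])
prompt_context = engineer_context(graph, ["revenue", "fct_revenue"])
```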
| Dimension | Data preparation | Context preparation |
| --- | --- | --- |
| Primary consumer | Human analyst | AI agent |
| Input | Raw source data | Institutional knowledge: lineage, glossary, ownership, docs, runbooks |
| Output | Clean, structured tables | Queryable context graph |
| What supplies meaning | The analyst, at query time | The graph, at retrieval time |
| What failure looks like | Bad reports, slow analysis | Hallucinated answers, inconsistent agent behavior |
| What it assumes about the consumer | Carries domain judgment | Has none |
The last row is the one to sit with. Data preparation was built around the assumption that the consumer would supply the meaning. Context preparation exists because that assumption no longer holds.
Why most enterprises have done one and not the other
There is a maturity gap, and it is wider than most teams realize.
Data preparation has had decades of tooling investment, established patterns, dedicated teams of data engineers and data scientists, and clear ownership. Context preparation, as a discipline distinct from data cataloging or governance, is new enough that most organizations have not staffed for it, budgeted for it, or even named it.
The institutional knowledge exists. It is in the catalog. It is in the runbooks. It is in the Confluence pages no one has updated since the last reorg. What is missing is the activation layer that turns those artifacts into something a machine can consume.
The 2026 State of Context Management Report puts numbers on the gap:
- 66% of respondents report AI models generating biased or misleading insights because their data infrastructure is not mature enough to provide sufficient context
- 57% say they find it challenging to identify authoritative data sources
- 83% agree that agentic AI cannot reach production value without a context platform
A telltale sign you have a context problem
If your agents are returning inconsistent answers and your first instinct is to invest more in data quality, the data is probably not the problem. Inconsistent agent behavior is almost always a context problem in disguise: missing definitions, missing ownership signals, missing trust indicators, missing curated answers. The fix is upstream of the warehouse.
The reframe matters. Teams that interpret these failures as data quality issues will keep investing in cleaner pipelines and shipping the same broken agents. Teams that recognize the pattern as a context preparation gap can start building the layer that actually addresses it.
What context preparation looks like in practice
Context preparation is a discipline, not a product. But it requires infrastructure, and the infrastructure has identifiable components. Each one corresponds to a specific failure mode that data preparation alone cannot address.
A unified context graph as the foundation
The foundational component is a graph that connects technical metadata (schemas, lineage, ownership, quality metrics, usage patterns) with unstructured organizational knowledge (runbooks, FAQs, policies, business definitions, decision logs).
Without this connection, context retrieval is fragmented: the agent might find the schema but miss the runbook that explains how to interpret it, or find the business definition but miss the lineage that shows the column was derived from a deprecated source.
DataHub's unified context graph treats both kinds of context as first-class nodes in the same structure. Pinterest arrived at the same architectural conclusion independently and built their text-to-SQL agent stack on top of DataHub's context graph. Their internal write-up of the semantic backbone of enterprise data analytics agents is one of the clearest external descriptions of why this layer matters in production.
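To make the idea tangible, here is a toy version using networkx as a stand-in for a real context graph platform; every name and attribute below is invented:

```python
import networkx as nx  # toy stand-in for a real context graph platform

g = nx.DiGraph()
# Technical metadata and unstructured knowledge as nodes in one structure.
g.add_node("fct_engagement", kind="dataset", schema="user_id, event_ts, event_type")
g.add_node("raw_events_v1", kind="dataset", status="deprecated")
g.add_node("engagement_runbook", kind="runbook",
           text="Dips during the monthly maintenance window are expected; do not alert.")
g.add_edge("raw_events_v1", "fct_engagement", kind="lineage")        # derived-from
g.add_edge("engagement_runbook", "fct_engagement", kind="documents")

def context_for(dataset: str) -> list[str]:
    """One traversal returns schema, lineage warnings, and docs together."""
    facts = [f"schema: {g.nodes[dataset]['schema']}"]
    for src, _, edge_kind in g.in_edges(dataset, data="kind"):
        node = g.nodes[src]
        if edge_kind == "lineage" and node.get("status") == "deprecated":
            facts.append(f"warning: derived from deprecated source {src}")
        if edge_kind == "documents":
            facts.append(f"runbook: {node['text']}")
    return facts

print(context_for("fct_engagement"))
```

The point of the traversal is that the schema, the deprecation warning, and the runbook arrive in one retrieval; split them across separate stores and the agent gets whichever fragment it happened to find.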
A business glossary that resolves meaning
The most common point of failure for an agent is the meaning of a term. Two teams say “customer” and mean different things. Three dashboards report “revenue” and calculate it three ways. A business glossary is the layer that resolves these collisions, mapping business concepts to specific datasets, columns, and definitions.
Without a glossary, the agent has no way to know which definition applies to the current task. With one, the agent retrieves not just the data but the canonical definition that governs how the data should be interpreted.
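A toy sketch of the resolution step, with invented terms and bindings (a real glossary would live in the context platform, not a dict):

```python
# Toy glossary: the same business term resolves differently by domain.
GLOSSARY = {
    ("revenue", "finance"): {
        "definition": "Recognized revenue, net of refunds.",
        "dataset": "fct_revenue_recognized",
        "column": "amount_usd",
    },
    ("revenue", "sales"): {
        "definition": "Booked contract value at signing.",
        "dataset": "fct_bookings",
        "column": "contract_value_usd",
    },
}

def resolve(term: str, domain: str) -> dict:
    """Return the canonical definition and data binding for a term in a domain."""
    try:
        return GLOSSARY[(term.lower(), domain.lower())]
    except KeyError:
        raise LookupError(f"No canonical definition for {term!r} in {domain!r}") from None

# The agent retrieves the definition alongside the data it governs.
binding = resolve("revenue", "finance")
print(binding["dataset"], binding["column"])  # fct_revenue_recognized amount_usd
```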
Semantic search so agents find context by meaning
Keyword search assumes the agent already knows the right term. Often it does not. Semantic search lets the agent retrieve context by intent rather than by exact match, making unstructured documentation useful at runtime rather than just searchable in a UI.
DataHub chunks and embeds text documents during ingestion, so semantic search runs alongside keyword search across the same context graph. An agent looking for the right runbook does not need to guess the title.
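The retrieval pattern itself is simple. Here is a toy illustration using the open-source sentence-transformers library; the model choice and runbook snippets are invented for the example, not a description of DataHub's internals:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical runbook snippets, chunked at ingestion time.
chunks = [
    "Runbook: engagement metrics dip during the monthly maintenance window; do not alert.",
    "Runbook: fct_revenue_recognized refreshes at 06:00 UTC from the billing export.",
    "Policy: customer PII columns must be masked outside the trust boundary.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)  # embed once, at ingestion

# The agent asks by intent, not by exact keyword.
query = "why did usage drop last week?"
query_vec = model.encode(query, normalize_embeddings=True)
scores = util.cos_sim(query_vec, chunk_vecs)[0]
print(chunks[int(scores.argmax())])  # surfaces the maintenance-window runbook
```

Note that the query shares no keywords with the chunk it retrieves; that is the whole value of searching by meaning.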
An MCP server so agents can actually consume the graph
A context graph that agents cannot reach is not context preparation. It is just a better catalog. The DataHub MCP Server, built on the Model Context Protocol (MCP), exposes the context graph to MCP-compatible AI tools (Claude, Cursor, Windsurf, and others). It supports both read and mutation tools, so agents can not only retrieve context but also enrich it: tagging assets, updating descriptions, assigning ownership, and managing glossary terms.
This last point is what closes the loop. Context preparation is not a one-time pipeline you run and walk away from. It is an ongoing discipline, and the agents consuming the context can also contribute to keeping it current.
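For a sense of what consumption looks like from the agent side, here is a sketch using the MCP Python SDK. The endpoint URL and tool names below are placeholders, not DataHub's actual tool catalog; discover the real one with list_tools or check the DataHub MCP Server docs:

```python
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Placeholder endpoint; point this at your MCP server deployment.
MCP_URL = "http://localhost:8000/mcp"

async def main() -> None:
    async with streamablehttp_client(MCP_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tool catalog the server actually exposes.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Read path: retrieve context about an entity
            # ("search" is a hypothetical tool name for illustration).
            result = await session.call_tool("search", {"query": "fct_revenue"})
            print(result.content)

            # Write path: enrich the graph with what the agent learned
            # ("update_description" and the urn are likewise hypothetical).
            await session.call_tool("update_description", {
                "urn": "urn:li:dataset:(...)",
                "description": "Daily recognized revenue; see the finance runbook.",
            })

asyncio.run(main())
```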
Wrap up
The agents that fail in production are rarely failing because of dirty data. They are failing because no one prepared the context: no glossary to resolve meaning, no graph to provide trust signals, no infrastructure to activate the institutional knowledge that human analysts have always relied on.
Data preparation is necessary. It is not sufficient. The work of making data usable for machines is a different job, and it requires its own platform.
DataHub is the context platform built for this discipline. The unified context graph, business glossary, semantic search, and MCP server are not adjacent features bolted onto a data catalog. They are the infrastructure that context preparation requires, in one place, designed from the ground up to serve both data teams and the agents they are now building for.
If your agents are returning inconsistent answers and your data quality scores are healthy, you have a context preparation gap. See how DataHub closes it or read the 2026 State of Context Management Report to see how the gap is showing up across enterprises today.