Context Preparation vs. Data Preparation: Why Agentic AI Needs Both
TL;DR
- Data preparation makes raw data usable for human analysts. Context preparation makes prepared data usable for AI agents. The two are sequential disciplines, not competing ones, and production-grade agentic AI requires both.
- Data preparation quietly relied on a human analyst to supply meaning, ownership, and trust signals at query time. Agents inherit the data but not the judgment, which is why clean data alone produces hallucinated and inconsistent agent behavior.
- Context preparation activates institutional knowledge (lineage, business glossary terms, ownership, documentation, runbooks, and curated queries) into a queryable context graph that agents can reason against.
- 66% of organizations report AI models generating biased or misleading insights due to low context maturity, and 83% agree agentic AI cannot reach production value without a context platform (2026 State of Context Management Report).
Most teams shipping agentic AI into production are running into the same wall: The data is clean. The pipelines are healthy. The warehouse is in good shape. And the agents still hallucinate, contradict each other, and return inconsistent answers to questions that should have one right answer.
It is tempting to read this as a data quality problem. It is not. It is a context problem, and the discipline that solves it (context preparation) is distinct from the one most data teams have spent the last decade mastering (data preparation).
The two are sequential, not competing. But conflating them is one of the most common reasons agentic AI fails to reach production value, and the fix starts with understanding why context management is a separate job from data management in the first place.
Quick definition: What is context preparation?
Data preparation cleans and structures raw data so human analysts can query it. Context preparation activates institutional knowledge (lineage, business glossary, ownership, documentation, runbooks, and curated queries) into a queryable context graph so AI agents can reason against it. Both are required for agentic AI. They are sequential disciplines, not competing ones.
What data preparation was built to do
Data preparation is a mature, well-understood discipline. The job is to take raw source data and run it through a data pipeline that makes it analytically usable. That means standardizing formats, removing outliers, correcting errors, enriching records, and combining datasets so the output is something a human analyst can query without wrestling the source.
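To make the contrast concrete, here is a minimal sketch of what a data preparation step does, using pandas; the records and thresholds are invented for illustration:

```python
import pandas as pd

# A hypothetical raw export: mixed date formats, inconsistent casing,
# a duplicate row, and an obviously bad record.
raw = pd.DataFrame({
    "signup_date": ["2024-01-05", "01/05/2024", "2024-01-05", "2024-01-07"],
    "plan": ["Pro", "pro ", "Pro", "PRO"],
    "monthly_spend": [49.0, 990000.0, 49.0, 99.0],
})

prepared = (
    raw.assign(
        # Standardize formats so values compare cleanly (format="mixed" needs pandas >= 2.0).
        signup_date=lambda df: pd.to_datetime(df["signup_date"], format="mixed"),
        plan=lambda df: df["plan"].str.strip().str.lower(),
    )
    .query("monthly_spend < 10000")  # crude outlier rule, for illustration only
    .drop_duplicates()
)
```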
The output of data preparation is clean, structured data. The consumer of that output, historically, has been a person.
That detail matters more than it sounds. Data preparation was built for the human-analyst era, where a data analyst sat between the warehouse and the business question. They wrote the SQL. They interpreted the result. And they brought a layer of judgment to the work that data preparation never had to encode, because the analyst supplied it at query time.
The assumption data preparation quietly relied on: A competent human
The implicit dependency in every data preparation pipeline is that a competent human will be on the other end of the query. The analyst knew:
- Which table was authoritative and which was a stale copy
- That “active user” meant one thing in the mobile product and something different in the web product
- That last week’s engagement drop was a monitoring incident, not a real behavior change
- Which dashboard the CFO actually trusted
None of that knowledge lived in the data warehouse. None of it was the job of the data preparation pipeline. It lived in the analyst’s head, in Slack threads, in runbooks, in institutional memory, and (when teams were disciplined about it) in the data catalog.
Clean data plus a competent human equaled useful answers. The arrangement worked for as long as the consumer was human.
What changed when the consumer became an AI agent
Agents inherit the data. They do not inherit the analyst’s judgment.
When an AI agent retrieves a clean, well-formatted revenue figure from a perfectly prepared table, it still does not know:
- Which definition of revenue applies to the question it is trying to answer
- Whether the table is the authoritative source or a derivative copy that was supposed to be deprecated last quarter
- Whether a golden query already exists for this class of question, written by a senior analyst who has thought about the edge cases
- That the metric was redefined three months ago
The agent has no institutional context unless someone has built infrastructure to give it some. This is the inflection point. The data is fine. The consumer changed.
(See our deeper treatment of context-aware AI agents for how this plays out in practice.)
What context preparation actually is
Context preparation is the discipline of activating institutional knowledge for machine consumption. It is a sibling discipline to data preparation, with its own inputs, its own outputs, and its own owners.
The inputs are not raw data. They are the artifacts that capture the domain knowledge humans have always relied on to make sense of data: lineage graphs that show where a column came from, business glossary terms that resolve what “customer” means in a given context, ownership records that identify who is accountable for a dataset, documentation that captures known issues and decisions, runbooks that encode standard operating procedures, and curated queries that represent the institutional answer to a recurring question.
The output is a queryable context graph. Not a cleaner table. Not a better pipeline. A structured layer of meaning that machines can retrieve from, the same way analysts used to retrieve from their own memory.
This is where the framing matters: Context preparation is not a feature of context engineering, and it is not a synonym for data preparation done more carefully. It is the upstream discipline that makes both possible. Context engineering is how prepared context gets assembled at runtime for a specific agent task. Context preparation is the work of building the assembled-from layer in the first place.
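The division of labor is easier to see in code. Here is a toy sketch, with invented names and a plain dictionary standing in for a real context platform:

```python
# Context PREPARATION: build the assembled-from layer. This is continuous
# platform work, not something that runs per agent request.
def prepare_context(artifacts: list[dict]) -> dict[str, dict]:
    """Index institutional artifacts (glossary terms, lineage notes,
    runbooks, ownership records) into one retrievable structure."""
    return {a["topic"]: a for a in artifacts}

# Context ENGINEERING: assemble task-specific context at runtime,
# against the layer that preparation already built.
def engineer_context(graph: dict[str, dict], task_topics: list[str]) -> str:
    """Pull only the nodes relevant to this task into the prompt window."""
    return "\n".join(graph[t]["body"] for t in task_topics if t in graph)

graph = prepare_context([
    {"topic": "revenue", "body": "Glossary: revenue = recognized revenue, net of refunds."},
    {"topic": "fct_revenue", "body": "Lineage: derived from billing_export (authoritative)."},
])
prompt_context = engineer_context(graph, ["revenue", "fct_revenue"])
```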
| Dimension | Data preparation | Context preparation |
| --- | --- | --- |
| Primary consumer | Human analyst | AI agent |
| Input | Raw source data | Institutional knowledge: lineage, glossary, ownership, docs, runbooks |
| Output | Clean, structured tables | Queryable context graph |
| What supplies meaning | The analyst, at query time | The graph, at retrieval time |
| What failure looks like | Bad reports, slow analysis | Hallucinated answers, inconsistent agent behavior |
| What it assumes about the consumer | Carries domain judgment | Has none |
The last row is the one to sit with. Data preparation was built around the assumption that the consumer would supply the meaning. Context preparation exists because that assumption no longer holds.
Why most enterprises have done one and not the other
There is a maturity gap, and it is wider than most teams realize.
Data preparation has had decades of tooling investment, established patterns, dedicated teams of data engineers and data scientists, and clear ownership. Context preparation, as a discipline distinct from data cataloging or governance, is new enough that most organizations have not staffed for it, budgeted for it, or even named it.
The institutional knowledge exists. It is in the catalog. It is in the runbooks. It is in the Confluence pages no one has updated since the last reorg. What is missing is the activation layer that turns those artifacts into something a machine can consume.
The 2026 State of Context Management Report puts numbers on the gap:
- 66% of respondents report AI models generating biased or misleading insights because their data infrastructure is not mature enough to provide sufficient context
- 57% say they find it challenging to identify authoritative data sources
- 83% agree that agentic AI cannot reach production value without a context platform
A telltale sign you have a context problem
If your agents are returning inconsistent answers and your first instinct is to invest more in data quality, the data is probably not the problem. Inconsistent agent behavior is almost always a context problem in disguise: missing definitions, missing ownership signals, missing trust indicators, missing curated answers. The fix is upstream of the warehouse.
The reframe matters. Teams that interpret these failures as data quality issues will keep investing in cleaner pipelines and shipping the same broken agents. Teams that recognize the pattern as a context preparation gap can start building the layer that actually addresses it.
What context preparation looks like in practice
Context preparation is a discipline, not a product. But it requires infrastructure, and the infrastructure has identifiable components. Each one corresponds to a specific failure mode that data preparation alone cannot address.
A unified context graph as the foundation
The foundational component is a graph that connects technical metadata (schemas, lineage, ownership, quality metrics, usage patterns) with unstructured organizational knowledge (runbooks, FAQs, policies, business definitions, decision logs).
Without this connection, context retrieval is fragmented: the agent might find the schema but miss the runbook that explains how to interpret it, or find the business definition but miss the lineage that shows the column was derived from a deprecated source.
DataHub's unified context graph treats both kinds of context as first-class nodes in the same structure. Pinterest arrived at the same architectural conclusion independently and built their text-to-SQL agent stack on top of DataHub's context graph. Their internal write-up of the semantic backbone of enterprise data analytics agents is one of the clearest external descriptions of why this layer matters in production.
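To make the idea tangible, here is a toy version using networkx as a stand-in for a real context graph platform; every name and attribute below is invented:

```python
import networkx as nx  # toy stand-in for a real context graph platform

g = nx.DiGraph()
# Technical metadata and unstructured knowledge as nodes in one structure.
g.add_node("fct_engagement", kind="dataset", schema="user_id, event_ts, event_type")
g.add_node("raw_events_v1", kind="dataset", status="deprecated")
g.add_node("engagement_runbook", kind="runbook",
           text="Dips during the monthly maintenance window are expected; do not alert.")
g.add_edge("raw_events_v1", "fct_engagement", kind="lineage")        # derived-from
g.add_edge("engagement_runbook", "fct_engagement", kind="documents")

def context_for(dataset: str) -> list[str]:
    """One traversal returns schema, lineage warnings, and docs together."""
    facts = [f"schema: {g.nodes[dataset]['schema']}"]
    for src, _, edge_kind in g.in_edges(dataset, data="kind"):
        node = g.nodes[src]
        if edge_kind == "lineage" and node.get("status") == "deprecated":
            facts.append(f"warning: derived from deprecated source {src}")
        if edge_kind == "documents":
            facts.append(f"runbook: {node['text']}")
    return facts

print(context_for("fct_engagement"))
```

The point of the traversal is that the schema, the deprecation warning, and the runbook arrive in one retrieval; split them across separate stores and the agent gets whichever fragment it happened to find.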
A business glossary that resolves meaning
The most common point of failure for an agent is the meaning of a term. Two teams say “customer” and mean different things. Three dashboards report “revenue” and calculate it three ways. A business glossary is the layer that resolves these collisions, mapping business concepts to specific datasets, columns, and definitions.
Without a glossary, the agent has no way to know which definition applies to the current task. With one, the agent retrieves not just the data but the canonical definition that governs how the data should be interpreted.
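A toy sketch of the resolution step, with invented terms and bindings (a real glossary would live in the context platform, not a dict):

```python
# Toy glossary: the same business term resolves differently by domain.
GLOSSARY = {
    ("revenue", "finance"): {
        "definition": "Recognized revenue, net of refunds.",
        "dataset": "fct_revenue_recognized",
        "column": "amount_usd",
    },
    ("revenue", "sales"): {
        "definition": "Booked contract value at signing.",
        "dataset": "fct_bookings",
        "column": "contract_value_usd",
    },
}

def resolve(term: str, domain: str) -> dict:
    """Return the canonical definition and data binding for a term in a domain."""
    try:
        return GLOSSARY[(term.lower(), domain.lower())]
    except KeyError:
        raise LookupError(f"No canonical definition for {term!r} in {domain!r}") from None

# The agent retrieves the definition alongside the data it governs.
binding = resolve("revenue", "finance")
print(binding["dataset"], binding["column"])  # fct_revenue_recognized amount_usd
```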
Semantic search so agents find context by meaning
Keyword search assumes the agent already knows the right term. Often it does not. Semantic search lets the agent retrieve context by intent rather than by exact match, making unstructured documentation useful at runtime rather than just searchable in a UI.
DataHub chunks and embeds text documents during ingestion, so semantic search runs alongside keyword search across the same context graph. An agent looking for the right runbook does not need to guess the title.
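The retrieval pattern itself is simple. Here is a toy illustration using the open-source sentence-transformers library; the model choice and runbook snippets are invented for the example, not a description of DataHub's internals:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical runbook snippets, chunked at ingestion time.
chunks = [
    "Runbook: engagement metrics dip during the monthly maintenance window; do not alert.",
    "Runbook: fct_revenue_recognized refreshes at 06:00 UTC from the billing export.",
    "Policy: customer PII columns must be masked outside the trust boundary.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks, normalize_embeddings=True)  # embed once, at ingestion

# The agent asks by intent, not by exact keyword.
query = "why did usage drop last week?"
query_vec = model.encode(query, normalize_embeddings=True)
scores = util.cos_sim(query_vec, chunk_vecs)[0]
print(chunks[int(scores.argmax())])  # surfaces the maintenance-window runbook
```

Note that the query shares no keywords with the chunk it retrieves; that is the whole value of searching by meaning.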
An MCP server so agents can actually consume the graph
A context graph that agents cannot reach is not context preparation. It is just a better catalog. The DataHub MCP Server, built on the Model Context Protocol (MCP), exposes the context graph to MCP-compatible AI tools (Claude, Cursor, Windsurf, and others). It supports both read and mutation tools, so agents can not only retrieve context but also enrich it: tagging assets, updating descriptions, assigning ownership, and managing glossary terms.
This last point is what closes the loop. Context preparation is not a one-time pipeline you run and walk away from. It is an ongoing discipline, and the agents consuming the context can also contribute to keeping it current.
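For a sense of what consumption looks like from the agent side, here is a sketch using the MCP Python SDK. The endpoint URL and tool names below are placeholders, not DataHub's actual tool catalog; discover the real one with list_tools or check the DataHub MCP Server docs:

```python
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

# Placeholder endpoint; point this at your MCP server deployment.
MCP_URL = "http://localhost:8000/mcp"

async def main() -> None:
    async with streamablehttp_client(MCP_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tool catalog the server actually exposes.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Read path: retrieve context about an entity
            # ("search" is a hypothetical tool name for illustration).
            result = await session.call_tool("search", {"query": "fct_revenue"})
            print(result.content)

            # Write path: enrich the graph with what the agent learned
            # ("update_description" and the urn are likewise hypothetical).
            await session.call_tool("update_description", {
                "urn": "urn:li:dataset:(...)",
                "description": "Daily recognized revenue; see the finance runbook.",
            })

asyncio.run(main())
```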
Wrap up
The agents that fail in production are rarely failing because of dirty data. They are failing because no one prepared the context: no glossary to resolve meaning, no graph to provide trust signals, no infrastructure to activate the institutional knowledge that human analysts have always relied on.
Data preparation is necessary. It is not sufficient. The work of making data usable for machines is a different job, and it requires its own platform.
DataHub is the context platform built for this discipline. The unified context graph, business glossary, semantic search, and MCP server are not adjacent features bolted onto a data catalog. They are the infrastructure that context preparation requires, in one place, designed from the ground up to serve both data teams and the agents they are now building for.
If your agents are returning inconsistent answers and your data quality scores are healthy, you have a context preparation gap. See how DataHub closes it or read the 2026 State of Context Management Report to see how the gap is showing up across enterprises today.