Part 2: How DataHub MCP Closes the Context Gap

In Part 1: What Is an MCP Server?, we covered what MCP is, how the protocol works, and why it’s emerging as the standard connectivity layer for AI agents. The takeaway: MCP gives agents a universal way to reach your data systems.

But reaching data and understanding it are two different problems.

Teams that move MCP-connected agents from prototype to production hit this wall fast. This is the context gap—and closing it is what separates an impressive demo from a system you’d actually put in front of your team.

DataHub closes it by giving MCP-connected agents access to the metadata they need to be useful in production: lineage, ownership, quality signals, business definitions, and usage patterns, all through a single MCP server.

What is the context gap, and why does it matter?

You’re a data engineer. Your MCP-connected agent is up and running—it can query your warehouse, pull schema information, and return results in seconds. It feels like a breakthrough.

Then someone asks it: “What’s the best table for customer revenue analysis?”

The agent returns four tables. One is a test schema a colleague abandoned last quarter. One is a backup from January. One was deprecated six months ago. And one is the certified, production-grade table your analytics team actually uses. The agent has no idea which is which.

This is the context gap in action: Your agent has connectivity without comprehension. It can reach your data, but it can’t tell good data from bad, current from stale, authoritative from experimental. 

In practice, that gap shows up in three ways that matter:

  1. Trust and governance: Without trust signals like certifications, quality, and freshness, an agent can’t distinguish a battle-tested production table from an abandoned test schema. Numbers from an uncertified dataset can end up in a board report before anyone catches the mistake.
  2. Impact and dependency awareness: An agent without lineage context will answer “What happens if we drop this column?” with a shrug, not a list of the 12 dashboards that will break. By the time you find out, it’s a production incident.
  3. Organizational knowledge: Every data team has institutional context that lives in people’s heads—naming conventions, standard filters, known quirks in specific tables. None of it comes through a raw database connection. Without it, agents generate queries that look correct but quietly return wrong results.

The connection works. The context is what’s missing.

How DataHub bridges the context gap

DataHub serves as a unified metadata foundation that exposes comprehensive context through a single MCP server. Instead of connecting agents directly to raw data systems, DataHub gives them access to the full picture—schemas, lineage, ownership, quality metrics, business definitions, usage patterns, and documentation—all queryable through natural language.

The architecture is built on an event-driven metadata graph that captures changes in real time. That speed matters. One of the most dangerous failure modes in AI-assisted data work is stale metadata. An agent that confidently reports “no downstream dependencies” based on last week’s lineage snapshot can cause exactly the production incident you were trying to prevent.

The result: When an agent asks “What’s the most reliable customer revenue table?”, it doesn’t get a list of tables with “revenue” in the name. It gets results ranked by usage, annotated with ownership and quality scores, filtered by certification status, and enriched with documentation that explains what each table actually measures.

The underlying principle is simple: The value of an MCP server is directly proportional to the richness of the context it exposes. Connected to a raw database, you get schema information. Connected to a metadata platform like DataHub, you get schema plus lineage, ownership, quality, usage patterns, and business definitions—all through the same protocol. The context behind that protocol is what determines whether the agent’s responses are useful or dangerous.
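That principle is visible at the wire level. The sketch below builds the JSON-RPC 2.0 `tools/call` envelope that MCP clients send; the tool names and arguments here are illustrative (only `search` is a real DataHub tool), but the point is that the request shape is identical whether the server behind it exposes a raw schema or a full metadata graph:

```python
import json

def mcp_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 `tools/call` request, the standard MCP
    envelope, regardless of what context the server exposes."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })

# Same envelope, different context behind it (tool names illustrative):
raw_db = mcp_tool_call(1, "describe_table", {"table": "customers"})
datahub = mcp_tool_call(2, "search", {"query": "customer revenue"})
```

Everything that makes the second call more useful lives server-side, in the metadata the tool can draw on.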

Case study: From hours to minutes at Block

Block, the financial services company behind Square and Cash App, manages over 50 data platforms under strict financial compliance requirements. Before integrating their open source AI agent Goose with DataHub’s MCP server, engineers spent hours on what should be simple questions—searching internal documentation, checking Slack channels, tracing dependencies across platforms, and hunting for stakeholder information during incidents.

With conversational access to schema, lineage, ownership, and documentation through DataHub’s MCP server, those same questions get context-rich answers in seconds. No context switching between catalogs, lineage viewers, Slack threads, and internal wikis. Just natural language queries within the engineer’s existing workflow.

“Something that might have taken hours, or days, or even weeks turns into just a few simple, short conversation messages.”

Sam Osborn, Senior Software Engineer, Block


DataHub MCP in production: Use cases

These three use cases show what MCP in production actually changes in day-to-day data work:

1. Data discovery for new hires

A new data analyst joins the team and asks: “Where’s all the customer data? I need to build a report but I don’t know where to start.”

They don’t know which platforms host customer data, what tables exist, which ones are actually used versus abandoned, or what teams own what. Without context, this person loses days clicking through hundreds of tables with no sense of what’s important.

With DataHub’s MCP server, they search the data landscape filtered by domain and sorted by usage. In two queries instead of 200 clicks, they see the five tables that 80% of analysts actually use, skip deprecated datasets automatically, and understand who owns what. Onboarding drops from days to hours.
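Under the hood, those two queries map to calls to the server’s search tool. The argument names below are assumptions for illustration (the real schema is in the DataHub MCP Server documentation); what matters is that domain scoping, deprecation filtering, and usage-based sorting ride along with the keyword query:

```python
# Hypothetical arguments for DataHub's `search` tool; field names are
# illustrative, not the server's actual schema.
search_args = {
    "query": "customer",
    "filters": {
        "domain": "Customer Data",   # scope to one business domain
        "deprecated": False,         # skip abandoned datasets automatically
    },
    "sort_by": "usage",              # surface the tables analysts actually use
}
```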

2. Breaking change impact analysis

Engineering proposes dropping the phone_number column from the users table—they believe nobody uses it anymore. The table has over 200 downstream dependencies. Manually checking each one is impossible.

DataHub’s lineage tools show exactly which downstream assets depend on that specific column—filtered to just critical BI dashboards, with visibility one, two, three, or more hops downstream. The team sees what breaks before deploying the change, identifies which dashboard owners need notification, and confirms whether the column truly has zero active consumers. It’s the difference between a safe deprecation and a 2 AM incident.
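A call to the get_lineage tool for this scenario might look like the sketch below. The URN is a made-up example in DataHub’s URN format, and the argument names are assumptions; the capabilities they stand for (column-level lineage, hop control, entity-type filtering) are the ones described above:

```python
# Illustrative `get_lineage` arguments for the column-drop scenario.
# The URN and field names are examples, not a real instance.
impact_args = {
    "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,prod.public.users,PROD)",
    "column": "phone_number",         # column-level, not just table-level
    "direction": "downstream",        # who consumes this column?
    "max_hops": 3,                    # direct and transitive dependencies
    "entity_types": ["dashboard"],    # narrow the view to critical BI assets
}
```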

3. Root cause analysis for data quality incidents

A product manager reports that the revenue dashboard is showing $2.3 million this month. Finance says it should be $23 million. Something in the pipeline introduced a 10x error, but the data flows through five or more intermediate transformations across different tools. Debugging manually means inspecting ETL logs, dbt models, and BI queries in separate systems—a process that takes hours and often misses the root cause.

DataHub’s MCP server returns the exact transformation path from source table to broken dashboard, including the intermediate SQL at each hop. The team traces the full chain and identifies the bug: a missing zero in a currency conversion step three hops upstream. Hours of investigation across multiple tools, done in minutes.
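Expressed as a call to get_lineage_paths_between, the same investigation might look like this sketch. The URNs are invented examples and the flag name is an assumption, but the shape matches the tool’s purpose: the exact path between two assets, plus the intermediate SQL at each hop:

```python
# Illustrative `get_lineage_paths_between` arguments; the URNs and the
# `include_queries` flag are assumptions for the sketch.
path_args = {
    "source_urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,finance.raw_payments,PROD)",
    "target_urn": "urn:li:dashboard:(looker,revenue_overview)",
    "include_queries": True,   # return the intermediate SQL at each hop
}
```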

Ready to see it in action?

The context gap is the difference between an AI agent that impresses in a demo and one you’d actually trust in production. DataHub closes that gap—giving MCP-connected agents the lineage, ownership, quality signals, and business definitions they need to deliver answers your team can act on.

DataHub Cloud customers are already running this in production. If you’re ready to see what MCP-connected agents look like with real metadata context behind them, book a demo and we’ll walk you through it with your data.

Book a Demo →

Explore DataHub Cloud

Take a self-guided product tour to see DataHub Cloud in action.

Join the DataHub open source community 

Join our 14,000+ community members to collaborate with the data practitioners who are shaping the future of data and AI.

FAQs

What is a hosted MCP server?

A hosted (or managed) MCP server runs as a remote service rather than on your local machine. Instead of installing and maintaining server software locally, you connect your AI tools to a URL endpoint managed by the provider.

Hosted MCP servers offer lower operational overhead—no local dependencies, no version management, no infrastructure maintenance—and are typically better suited for team-wide deployment since everyone connects to the same endpoint.

DataHub Cloud’s managed MCP server is an example: It exposes your full metadata graph through a single hosted endpoint that any MCP-compatible tool can connect to over HTTPS. The trade-off is that hosted servers require network connectivity and may have different latency characteristics than local servers running via stdio.
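For clients that support remote MCP servers directly, connecting is typically a small config entry. This is a hedged sketch with placeholder URL and token, not your actual endpoint (that comes from your DataHub Cloud instance); note that some clients require a stdio bridge for remote servers instead of a url entry:

```json
{
  "mcpServers": {
    "datahub": {
      "url": "https://<your-instance>/mcp",
      "headers": {
        "Authorization": "Bearer <personal-access-token>"
      }
    }
  }
}
```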

What tools does the DataHub MCP server provide?

The DataHub MCP server provides six core tools that give AI agents structured access to your metadata graph:

  1. search: Structured keyword search with boolean logic, filters, and usage-based sorting
  2. get_lineage: Upstream and downstream lineage for datasets, columns, and dashboards with hop control
  3. get_dataset_queries: Real SQL queries that reference a dataset, showing joins, filters, and aggregation patterns
  4. get_entities: Batch metadata retrieval for multiple entities by URN
  5. list_schema_fields: Schema field exploration with keyword filtering and pagination
  6. get_lineage_paths_between: Exact transformation paths between two assets, including intermediate SQL

Beyond these read operations, the MCP server also supports mutation tools for tagging, ownership assignment, description updates, and glossary term management—enabling agents to not just read your metadata but enrich it as part of governed workflows.

How do I deploy the DataHub MCP server?

DataHub offers two paths:

  • DataHub Cloud (managed): Available on v0.3.12+, the managed server provides a production-ready endpoint with no infrastructure to deploy or maintain. Connect it to Claude Desktop, Cursor, or any MCP-compatible tool using your hosted server URL. Authentication and updates are handled automatically.
  • DataHub Core (self-hosted): For open source users, the self-hosted server runs locally using uvx mcp-server-datahub@latest. You’ll need your DataHub instance’s GMS endpoint URL and a personal access token. Add it to your AI tool’s MCP settings and you’re up and running.

Both paths expose identical capabilities. The choice comes down to whether you want managed infrastructure or prefer to run your own. Check out our DataHub MCP Server documentation to explore deployment options.
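For the self-hosted path, a typical MCP settings entry might look like the sketch below. The environment variable names follow the DataHub CLI convention (DATAHUB_GMS_URL, DATAHUB_GMS_TOKEN), and the host and token are placeholders; verify both against the DataHub MCP Server documentation before relying on them:

```json
{
  "mcpServers": {
    "datahub": {
      "command": "uvx",
      "args": ["mcp-server-datahub@latest"],
      "env": {
        "DATAHUB_GMS_URL": "https://your-datahub-host:8080",
        "DATAHUB_GMS_TOKEN": "<personal-access-token>"
      }
    }
  }
}
```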

Recommended Next Reads