Introducing DataHub Analytics Agent

The Open-Source Talk-to-Data Agent Built on a Context Platform

TL;DR: What is the DataHub Analytics Agent?

The DataHub Analytics Agent is an open-source reference implementation for builders who want to ship a context-powered conversational analytics experience. For end users (analysts, PMs, executives), it’s a web application where you ask a question and get a SQL-backed answer with auto-generated visualizations. For the data engineers and platform builders deploying it, it’s a documented, working example of how to build an AI agent that reads from and writes to a context platform.

Every talk-to-data tool faces the same failure mode.

A user asks a reasonable question. The agent generates SQL that is syntactically correct and semantically wrong. It picks the wrong table. It reads “revenue” as gross instead of net. It queries a dataset that hasn’t been refreshed in three weeks. The SQL runs, the number looks plausible, and the answer is wrong.

This is not an LLM problem. It is a context problem. The agent does not know what the data means, where it came from, who owns it, whether it is fresh, or how the organization defines its terms. Without that context, even the best model is guessing.

Today we’re releasing the DataHub Analytics Agent: the first open-source talk-to-data agent built specifically on a context platform.

Why most Text-to-SQL agents fail in production

Most open-source text-to-SQL agents try to solve the context problem by building their own context from scratch. You define a semantic layer, populate a vector store with DDL, or manually model business terms inside the agent itself. This works in demos. It breaks in production because the context becomes a separate artifact that someone has to build and maintain for the agent alone. It starts incomplete and quickly goes stale.

The DataHub Analytics Agent takes a different approach. Instead of building its own semantic model, it connects to DataHub via a Model Context Protocol (MCP) server and the Agent Context Kit to read the metadata your organization has already assembled: glossary terms curated by your data team, lineage computed automatically across 100+ connectors, quality signals monitored in production, and usage patterns observed from real query behavior.

And as conversations happen, the agent writes back to the context graph: surfacing metadata gaps, proposing glossary terms, and suggesting documentation. The context layer gets richer because the agent is using it.

What does the DataHub Analytics Agent do?

The DataHub Analytics Agent is a web application you can deploy alongside your DataHub instance. Users ask a question in plain English and receive a SQL-backed answer with auto-generated visualizations. Conversation history is preserved across sessions, and the UI supports multiple users out of the box.

Under the hood, every question goes through a context-enrichment pipeline that reads from DataHub via MCP and the Agent Context Kit before generating SQL. When the agent finds undocumented tables or undefined terms, it writes fixes back to DataHub rather than silently working around them.
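In rough Python, that per-question flow looks something like the sketch below. The function and field names here are illustrative stand-ins, not the repository's actual API:

```python
# Illustrative sketch of the per-question flow: read context from DataHub,
# generate SQL grounded in it, then write metadata gaps back for review.
# All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Context:
    schema: dict                               # column -> description
    glossary: dict                             # term -> canonical metric
    freshness_ok: bool = True
    gaps: list = field(default_factory=list)   # undocumented assets found

def fetch_context(question: str) -> Context:
    """Stand-in for MCP reads (search, schema, glossary, lineage, quality)."""
    return Context(
        schema={"orders.net_revenue": "Net revenue after refunds (USD)"},
        glossary={"revenue": "net_revenue"},
        gaps=["orders.promo_code has no description"],
    )

def generate_sql(question: str, ctx: Context) -> str:
    """Stand-in for the LLM call, grounded in the fetched context."""
    metric = ctx.glossary.get("revenue", "revenue")
    return f"SELECT SUM({metric}) FROM orders"

def write_back_gaps(ctx: Context) -> list:
    """Stand-in for MCP writes: propose fixes for a steward to review."""
    return [{"proposal": gap, "status": "pending_review"} for gap in ctx.gaps]

def answer(question: str):
    ctx = fetch_context(question)
    sql = generate_sql(question, ctx)
    proposals = write_back_gaps(ctx)
    return sql, proposals
```

The key property is that SQL generation never sees a bare schema: the context fetch happens first, and the write-back happens on every question, not as a separate batch job.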

Who is this built for?

The DataHub Analytics Agent is for builders who want to give their organization a context-powered analytics experience that anyone can use and trust:

  • The data engineer who wants to build a conversational interface for their team
  • The platform builder evaluating how DataHub context can power production AI agents
  • The tinkerer who reads the code to understand the patterns

The DataHub Analytics Agent is a reference implementation. The goal is to show the DataHub community and DataHub Cloud customers how to build a world-class talk-to-data experience on top of a context platform. The codebase is a tutorial as much as it is an application: every MCP call is commented and every context-enrichment step is explained.

This is not a BI tool. It does not compete with Tableau, Looker, or Hex. Those tools are good at what they do. What they don’t do is show you how to build your own conversational data experience on top of a rich context graph.

Why a reference app and not a product?

The context layer (how an agent reads and writes context) is a general problem with a general solution. The experience layer (how humans interact with it) should be specific to each organization. We’re shipping an opinionated UI that shows what becomes possible when an agent deeply integrates with a context layer. The UI is designed to be replaced. The context patterns underneath are designed to be kept.

How does it work?

The agent runs a context-enrichment pipeline on every query, reading from DataHub before generating SQL, and writing back when it finds gaps. Here is what that looks like in both directions:

Reading context: Making answers trustworthy

Most text-to-SQL tools send the LLM a schema and hope for the best. The DataHub Analytics Agent sends the schema plus everything DataHub already knows about it. In practice, these are the types of context the agent reads:

  • Schema grounding: Column names, types, and descriptions anchor every SQL generation step.
  • Glossary resolution: Ambiguous terms like “revenue” resolve to the canonical metric definition your data team has curated.
  • Lineage context: The agent traces where the data comes from before trusting an answer. If a table is two hops from a raw source with no documented transformations, it surfaces that.
  • Tags and ownership: The agent knows who owns each dataset, how it is classified, and what domain it belongs to, so it routes questions to the right data and surfaces the right caveats.
  • Search and discovery: When a question is ambiguous, the agent finds the right table by preferring well-documented datasets with active owners over undocumented alternatives.
  • Quality signals: Freshness, SLA status, and anomaly flags. The agent warns before querying stale data rather than silently returning a number that was accurate last Tuesday.
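To make this concrete, here is a minimal sketch of how those read-side signals might be folded into the SQL-generation prompt. The function signature and field names are invented for illustration; the actual prompt assembly lives in the repository:

```python
# Hypothetical prompt assembly: schema, glossary, lineage, ownership, and
# freshness all land in the prompt, so the model can caveat its own answer.
def build_prompt(question, schema, glossary, lineage_depth, owner, fresh):
    lines = [
        "You are a SQL analyst. Use ONLY the context below.",
        "## Schema",
        *[f"- {col}: {desc}" for col, desc in schema.items()],
        "## Glossary",
        *[f"- {term}: {defn}" for term, defn in glossary.items()],
        f"## Owner: {owner}",
    ]
    if lineage_depth > 1:
        lines.append(f"Note: this table is {lineage_depth} hops from a raw "
                     "source; flag undocumented transformations.")
    if not fresh:
        lines.append("WARNING: dataset failed its freshness SLA; "
                     "caveat the answer.")
    lines.append(f"## Question\n{question}")
    return "\n".join(lines)
```

Because the caveats travel inside the prompt, a stale dataset or a long lineage chain shows up in the answer itself rather than being silently dropped.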

The difference is measurable: recent research shows that enriching LLM prompts with column descriptions, value distributions, and foreign key relationships improves SQL accuracy by 10+ percentage points on real-world benchmarks.

Writing context: Improving your context graph over time

The agent does not treat DataHub as read-only. When the agent encounters undocumented tables, undefined business terms, or missing tags, it writes back. Every conversation is an opportunity to close metadata debt:

  • Documentation drafts: The agent suggests column and table descriptions based on conversation context and query patterns, ready for a data steward to review and accept.
  • Glossary proposals: When the agent encounters an undefined business term, it proposes a new glossary entry for your data team to approve.
  • Tag recommendations: The agent suggests classifications and tags based on how data is being queried and by whom.
  • Gap detection: The agent surfaces metadata debt that would otherwise sit unnoticed. For example, “This table has no description. Want me to draft one?”
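A minimal sketch of the write-back side, assuming a hypothetical `MetadataProposal` shape (real writes go through the DataHub MCP tools, and nothing lands without steward approval):

```python
# Sketch of gap detection: turn missing metadata into review-gated proposals
# rather than silent edits. The MetadataProposal shape is invented here.
from dataclasses import dataclass

@dataclass
class MetadataProposal:
    kind: str        # "documentation" | "glossary_term" | "tag"
    target: str      # dataset or column URN
    payload: str     # drafted description, definition, or tag
    status: str = "pending_review"   # a steward accepts before it lands

def detect_gaps(dataset: dict) -> list:
    proposals = []
    if not dataset.get("description"):
        proposals.append(MetadataProposal(
            "documentation", dataset["urn"],
            f"Draft: table {dataset['name']} appears to hold {dataset['hint']}."))
    for term in dataset.get("unresolved_terms", []):
        proposals.append(MetadataProposal(
            "glossary_term", dataset["urn"],
            f"Proposed definition for '{term}'"))
    return proposals
```

The `pending_review` default is the important design choice: the agent drafts, humans decide, and the context graph only ever improves through an approval step.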

Architecture

The DataHub Analytics Agent is a standalone application in its own repository. It communicates with DataHub exclusively through MCP, skills, and the Agent Context Kit, the same way any external application would. No shared codebase, no internal shortcuts.

The agent has three connection points:

  • DataHub for context (via MCP / Agent Context Kit)
  • A warehouse for data (via standard drivers)
  • An LLM for generation (via API)

The agent connects to Snowflake, BigQuery, Databricks, Redshift, and Postgres. For the LLM, you bring your own: OpenAI, Anthropic, Gemini, or any open-weight model.

Configuration lives in a single YAML file or environment variables. No additional infrastructure required.
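As a sketch, that configuration might look like the fragment below. The keys and values here are illustrative only; check the repository for the actual schema:

```yaml
# Illustrative configuration sketch -- consult the repo for real keys.
datahub:
  url: https://datahub.example.com
  token: ${DATAHUB_TOKEN}        # or supplied via environment variable

warehouse:
  type: snowflake                # snowflake | bigquery | databricks | redshift | postgres
  dsn: ${WAREHOUSE_DSN}

llm:
  provider: anthropic            # openai | anthropic | gemini | open-weight
  api_key: ${LLM_API_KEY}
```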

The agent works with DataHub OSS and DataHub Cloud, connecting to any DataHub instance via MCP. When the instance is DataHub Cloud, richer context flows through the same interface, including usage popularity signals, quality alerts, and column profiling.

How does this relate to DataHub Skills?

DataHub Skills bring DataHub context into the coding tools your data team already uses, like Claude Code, Cursor, Copilot, Codex, and Gemini CLI. They are workflow recipes for developers: search the catalog, trace lineage, enrich metadata, review connectors.

The Analytics Agent brings DataHub context to everyone else. It is a deployed web application where an analyst, a PM, or an executive can ask a question about the business and get a SQL-backed answer grounded in the same context graph. Same MCP tools underneath. Different audience on top.

Get started with the DataHub Analytics Agent

The agent is open source under Apache 2.0 at https://github.com/datahub-project/analytics-agent. Deploy it, learn from it, build on it, and contribute your improvements back to the community.

Questions, ideas, and contributions are welcome on GitHub and in the DataHub Community Slack.
