AI Agents Meet Your Data Stack: The MCP Connection

How DataHub’s MCP Server Transforms AI-Powered Data Operations: A Technical Deep Dive with Block


Your AI agents know everything—except your data

Your AI agents can write code, analyze trends, and generate insights—but ask them about your org’s data stack and they go silent. Who owns the customer_events table? What breaks if you rename that column? Which dashboards will go dark if this pipeline fails?

Without metadata context, even cutting-edge LLMs become useless for data work. At Block, managing 50+ data platforms under strict financial compliance made the problem even more painful. 

The solution? DataHub’s Model Context Protocol (MCP) server working in harmony with Block’s open source AI agent, Goose.

In this post, you’ll learn how Block’s engineering team paired their open source AI agent with DataHub’s MCP server to streamline real-time incident response using metadata context. 

We’ll also explore additional use cases made possible by the MCP server—including conversational data discovery, impact analysis, and developer tooling—and uncover how DataHub is evolving to support secure, AI-native operations across teams. 

Model Context Protocol (MCP): AI just got a direct line to all your data

What is MCP?

Think of MCP as the USB-C for AI agents. Instead of writing custom connectors for every internal tool—your catalog, your lineage system, your warehouse—you have a standard plug-and-play protocol.

Before MCP: Developers built one-off connectors for every combination of tool and agent. Each agent needed its own custom connectors, making integration slow and hard to scale.

After MCP: Build once, use everywhere. With MCP, you create a single connector that any compliant agent can use out-of-the-box.
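To make the plug-and-play idea concrete, here’s a minimal sketch of an MCP client handshake using the official `mcp` Python SDK. The server command is a placeholder; any MCP-compliant server responds to the same handshake.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Placeholder: swap in any MCP-compliant server command.
server = StdioServerParameters(command="my-mcp-server", args=[])

async def main() -> None:
    # The handshake is identical for every MCP server:
    # connect, initialize, then discover the tools it exposes.
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```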

“MCP gives LLMs and agents the context they need to take real action.”
— Harshal Sheth, Founding Engineer, DataHub

What does DataHub’s MCP server do?

DataHub’s MCP server acts as a standardized bridge that gives AI agents direct, governed access to organizational metadata. It enables:

  • Entity search: GraphQL queries with complex filters across all entity types
  • Metadata retrieval: Full context including schema, ownership, and documentation
  • Lineage traversal: Upstream/downstream navigation with configurable depth
  • Query association: SQL queries mapped to specific datasets
  • Real-time access: Live metadata (not cached snapshots)

The result: AI agents can answer metadata questions instantly—no custom code required.
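In practice, an agent exercises those capabilities as MCP tool calls. A hedged sketch follows; the tool names and argument shapes are illustrative, and `list_tools()` reveals the server’s actual interface.

```python
from mcp import ClientSession

async def explore(session: ClientSession) -> None:
    # Illustrative tool names and arguments, not the server's exact API.
    results = await session.call_tool(
        "search", arguments={"query": "customer_events"}
    )
    lineage = await session.call_tool(
        "get_lineage",
        arguments={"urn": "urn:li:dataset:(...)", "direction": "DOWNSTREAM"},
    )
    # Responses carry live metadata, not cached snapshots.
    print(results.content, lineage.content)
```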

[Diagram: DataHub MCP architecture. Applications like Slack and Claude connect to an LLM, which retrieves metadata context from the DataHub MCP server to support intelligent responses.]

Block’s challenge: Scaling governance across 50+ platforms

Block’s data environment is vast, distributed, and highly regulated. With over 50 platforms in play—each with different ownership models, access controls, and compliance requirements—their teams need a way to centralize visibility without slowing down operations. The stakes are high: even small schema changes have the potential to impact customer-facing systems.

“Being a banking and financial services company, we’ve got a lot of data and a lot of data platforms to manage. Applying good data governance at our scale is complex.”
— Sam Osborn, Senior Software Engineer, Block

The problem

  • 50+ data platforms across the organization
  • Complex regulatory requirements for financial services
  • Distributed ownership across multiple engineering teams
  • High stakes when schema changes can impact customer transactions

Block’s solution: AI Agent Goose + DataHub’s MCP server

Block combined their open source AI agent, Goose, with DataHub’s MCP server to create an automated metadata intelligence system. But they didn’t stop there. By integrating other internal MCP servers, like their service registry and incident management system, Block turned DataHub into the anchor of a broader, AI-ready metadata control plane.

Goose supports the MCP standard via custom extensions, allowing it to communicate directly and securely with any compliant server. By integrating multiple sources of metadata context, Block created a unified interface for operational intelligence.

Each server handles a different slice of context:

  • DataHub MCP server: schema, lineage, ownership, documentation
  • Registry MCP server: service ownership and escalation paths
  • Incident MCP server: live incidents, alerts, and response status

Instead of hopping between tools, engineers can now interact with multiple systems through a single interface, all while using natural language. For example (see the session sketch below this list):

  • call_get_lineage — trace downstream impact via DataHub
  • call_get_dataset — retrieve a dataset’s schema, ownership, and documentation
  • search_datahub_entities — explore assets
  • call_get_registry_app — integrate service catalogs
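Under the hood, each slice of context lives on its own MCP server, so the agent simply holds one session per server and routes each tool call accordingly. A minimal sketch, assuming placeholder server commands and an illustrative tool call:

```python
import asyncio
from contextlib import AsyncExitStack

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Placeholder commands: one MCP server per slice of context.
SERVERS = {
    "datahub": StdioServerParameters(command="datahub-mcp-server", args=[]),
    "registry": StdioServerParameters(command="registry-mcp-server", args=[]),
    "incidents": StdioServerParameters(command="incident-mcp-server", args=[]),
}

async def main() -> None:
    async with AsyncExitStack() as stack:
        sessions = {}
        for name, params in SERVERS.items():
            read, write = await stack.enter_async_context(stdio_client(params))
            session = await stack.enter_async_context(ClientSession(read, write))
            await session.initialize()
            sessions[name] = session
        # Route a lineage question to DataHub's server (illustrative call).
        lineage = await sessions["datahub"].call_tool(
            "call_get_lineage", arguments={"name": "all_hammers"}
        )
        print(lineage.content)

asyncio.run(main())
```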

Four real-world use cases

1. Real-time incident response

Scenario: An engineer is investigating an issue with a Snowflake table named all_hammers (a fictional dataset used for demonstration purposes).

The goal: Verify the table, assess impact, and notify the right stakeholders—fast.

Before:

  • Search internal docs for the table owner 
  • Check multiple Slack channels 
  • Find upstream dependencies manually 
  • Identify stakeholder contact info

After:

  • Search for all_hammers using DataHub’s metadata graph
  • Confirm it’s the right asset by inspecting tags and schema
  • Use lineage tools to trace upstream dependencies
  • Identify data owners with ease
  • Automatically pull communication info from service catalogs

“Something that might have taken hours, or days, or even weeks turns into just a few simple, short conversation messages.”
— Sam Osborn, Senior Software Engineer, Block
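Scripted end to end, that conversation reduces to a short chain of tool calls. A hedged sketch reusing the tool names from earlier; the argument shapes are assumptions, and in Block’s setup these tools actually span several MCP servers.

```python
from mcp import ClientSession

async def triage(session: ClientSession, table: str) -> dict:
    # 1. Find the asset in DataHub's metadata graph.
    hits = await session.call_tool(
        "search_datahub_entities", arguments={"query": table}
    )
    # 2. Confirm it's the right asset via tags and schema.
    dataset = await session.call_tool(
        "call_get_dataset", arguments={"name": table}
    )
    # 3. Trace upstream dependencies.
    lineage = await session.call_tool(
        "call_get_lineage", arguments={"name": table, "direction": "UPSTREAM"}
    )
    # 4. Resolve owners and contact info from the service registry.
    owners = await session.call_tool(
        "call_get_registry_app", arguments={"name": table}
    )
    return {"asset": hits, "detail": dataset, "lineage": lineage, "owners": owners}
```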

2. Self-service data discovery

Let’s say you’re running a fictional pet adoption agency. 

Before: Teams navigated confusing data catalogs, parsed raw schemas, and hunted for scattered documentation.

After: Ask a question in natural language (e.g., “What data do we have on pets?”) and receive:

  • A curated list of core datasets
  • Adoption KPIs like return rate and foster failure rate
  • Dashboard references and data quality signals
  • Usage tips and active incident reports

You can even follow up with queries like “Tell me more about pet profiles” to get row counts, schema details, and any current issues, all in a conversational thread.
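Behind the conversation, the agent’s first step maps to an ordinary metadata search. Here’s roughly what that looks like against DataHub’s GraphQL API; the host and token are placeholders, and you should verify field names against your DataHub version.

```python
import requests

DATAHUB_URL = "https://your-datahub-host/api/graphql"  # placeholder
TOKEN = "..."  # a DataHub personal access token

SEARCH = """
query search($input: SearchAcrossEntitiesInput!) {
  searchAcrossEntities(input: $input) {
    searchResults { entity { urn type } }
  }
}
"""

resp = requests.post(
    DATAHUB_URL,
    json={"query": SEARCH,
          "variables": {"input": {"query": "pets", "start": 0, "count": 10}}},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
for result in resp.json()["data"]["searchAcrossEntities"]["searchResults"]:
    print(result["entity"]["type"], result["entity"]["urn"])
```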

3. Schema impact analysis in seconds

Before: Schema changes triggered days of lineage tracing, stakeholder outreach, and meetings.

After: From your AI chat interface:

  • Ask “What’s the impact of changing the color column in pet_profiles?”
  • Receive a lineage map with both direct and multi-hop dependencies
  • Get a risk summary based on the downstream query volume
  • Ask “Who needs to know?” and see a breakdown of technical owners, data stewards, and business stakeholders

Every relevant contact is surfaced. No more Slack sleuthing or email guesswork.
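The lineage map itself comes from a query like the one below, which walks multi-hop downstream dependencies through DataHub’s GraphQL API. The URN, host, and token are placeholders; verify field names against your DataHub version.

```python
import requests

DATAHUB_URL = "https://your-datahub-host/api/graphql"  # placeholder
TOKEN = "..."

IMPACT = """
query impact($input: SearchAcrossLineageInput!) {
  searchAcrossLineage(input: $input) {
    total
    searchResults { degree entity { urn type } }
  }
}
"""

variables = {"input": {
    # Placeholder URN for the fictional pet_profiles table.
    "urn": "urn:li:dataset:(urn:li:dataPlatform:snowflake,pet_profiles,PROD)",
    "direction": "DOWNSTREAM",
    "query": "*",
    "start": 0,
    "count": 50,
}}

resp = requests.post(
    DATAHUB_URL,
    json={"query": IMPACT, "variables": variables},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()["data"]["searchAcrossLineage"]
print(f"{data['total']} downstream assets")
for r in data["searchResults"]:
    print(r["degree"], r["entity"]["urn"])  # degree = hops from the source
```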

4. In-IDE developer support

Before: Developers had to switch between their IDE, internal docs, Slack, and ticketing systems just to answer a single question. Constant context switching killed momentum and slowed down impact analysis.

After: Within your IDE:

  • Make schema edits to a dbt model
  • Ask “What breaks if I make this change?” directly in your Git diff thread
  • Instantly see affected queries, usage frequency, and potential blast radius

“I can ask ‘check DataHub – what will break if I make this change’ and know exactly what will be impacted without leaving the IDE.”
— Harshal Sheth, Founding Engineer, DataHub

What’s on the roadmap: Building toward a more powerful MCP server

AI-optimized SDKs

We’ve already begun refactoring our Python SDKs to better support AI—and we’re continuing to expand that work.

Here’s what’s in motion:

  • Simplified endpoints: No more deep knowledge of metadata models required
  • Higher-level operations: Queries like “get lineage” or “find datasets by tag” work out-of-the-box
  • Human- and agent-friendly: Whether you’re scripting or chatting with an LLM, the interface makes sense to humans and AI agents alike
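To give a feel for the direction, here’s what a higher-level call could look like. Every name in this sketch is hypothetical; it illustrates the intended ergonomics, not a released API.

```python
# Hypothetical sketch of the AI-optimized SDK's ergonomics; none of
# these names describe a shipped interface.
from datahub_ai import DataHubClient  # hypothetical import

client = DataHubClient(server="https://your-datahub-host", token="...")

# High-level operations instead of raw metadata-model plumbing:
pii_tables = client.find_datasets(tag="pii", platform="snowflake")
impact = client.get_lineage(pii_tables[0].urn, direction="downstream", depth=2)
```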

Pydantic-powered queries for precision

All inputs and outputs in the MCP server will be defined as Pydantic models. That means agents get strong validation and consistency, while developers get access to a rich DSL for fine-tuned metadata exploration.

You can:

  • Combine filters with logical operators
  • Query by tag, platform, ownership, and more
  • Return only the data needed to answer the question

The result: Fewer hallucinations and better precision.
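A minimal sketch of what a Pydantic-defined query could look like. The model names are illustrative, not DataHub’s published schema; the point is that malformed agent output fails validation before it ever reaches the server.

```python
from typing import Literal

from pydantic import BaseModel

# Illustrative models: the real MCP server will publish its own schema.
class Filter(BaseModel):
    field: Literal["tag", "platform", "owner"]
    value: str

class SearchInput(BaseModel):
    query: str
    operator: Literal["AND", "OR"] = "AND"
    filters: list[Filter] = []
    limit: int = 10

payload = SearchInput(
    query="pets",
    filters=[
        Filter(field="platform", value="snowflake"),
        Filter(field="tag", value="pii"),
    ],
)
print(payload.model_dump_json())
```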

Moving beyond read-only

Currently, DataHub’s MCP server is read-only: it retrieves metadata but never writes or modifies anything. However, we’re planning write-oriented capabilities that will enable agents to:

  • Suggest and write documentation directly into DataHub
  • Propose new tags, glossary terms, or annotations based on conversations

For security, these write capabilities will be disabled by default and require explicit configuration to enable.

Enhanced setup and security

We’re also working on improvements to make setup even easier:

  • HTTP streaming support to simplify the installation process
  • Expanded authentication options, including OAuth integration
  • Better security guardrails for AI-powered metadata operations

A couple of parting tips

Here are some final pro tips before you dive in. 

Sam’s pro tip: Manage context windows

Summarize regularly, then start a fresh session.

Session bloat hurts performance. Some agents, like Goose, automate context resets. Other tools may not.

Harshal’s pro tip: Focus your tools

Don’t enable too many MCP servers at once.

Providing the agent with exactly the right tools for the job in each session reduces confusion.

Try it yourself!

Both DataHub’s MCP server and Block’s Goose agent are open source. For full setup instructions with code snippets, check out the projects’ GitHub pages.

Join the conversation

Join our 13,000+ community members on Slack to collaborate, learn, and stay up to date on the latest best practices and troubleshooting advice about DataHub Core.

Join our user research panel

By joining the panel, you will be part of our go-to group for feedback on new and existing features, and your contribution will help shape the future of DataHub! Register here.

About the authors

Sam Osborn is a Senior Software Engineer at Block, where he has been building financial technology solutions since 2022. He previously spent nearly seven years at Tableau, progressing from Technical Support Engineer to Senior Software Engineer, contributing to data visualization and analytics platforms. Sam holds a Bachelor of Arts in Political Science and Government from The University of Texas at Austin and studied Computer Science at Universidade Estadual de Campinas in Brazil.

Harshal Sheth is a Founding Engineer at DataHub, where he helps build data discovery and governance solutions. A 2022 Yale Computer Science graduate, he has gained diverse technical experience through internships at Google, Citadel Securities, Instabase, and 8VC. At Google, he developed tracing tools for Fuchsia OS, while at Citadel Securities, he worked on low-latency technologies. His background spans distributed systems, venture capital evaluation, and scalable infrastructure development.

This collaboration between DataHub and Block advances open standards for AI-data integration. Both projects are open source, enabling organizations worldwide to benefit from AI-powered data discovery.
