How Pinterest Powers Its #1 AI Agent With DataHub

INDUSTRY

Social Media / Consumer Internet

SIZE

4,500+ employees

DATA STACK

Snowflake, Apache Airflow, Hive, AWS

SOLUTION

DataHub Core (OSS)

USE CASE

Context management, data governance

Key Results

#1 agent at Pinterest: The Analytics Agent has 10x the usage of the next most-used internal agent
Minutes, not hours: Analysts go from question to trusted, validated answer without hours of manual table exploration
~70% reduction in manual documentation effort: AI-generated docs, join-based lineage propagation, and vector similarity search eliminated the majority of manual documentation work

See what DataHub Cloud can do for your team

Meet With Us

“The most effective systems carry forward what an organization has already learned and make that knowledge more usable for others. That’s the opportunity for DataHub — and for all of us.”

Keqiang Li

Software Engineer, Pinterest

The Topline

Challenge
With 400,000 ungoverned tables and institutional knowledge siloed, Pinterest’s analysts could not trust their data context, making accurate, consistent AI-powered analytics impossible.

Solution
Pinterest deployed DataHub as a centralized context platform, combining enforced data governance, AI-generated documentation, and intent-indexed query history into a single context layer that powers its Analytics Agent.

Impact
The Analytics Agent became Pinterest’s most-used internal agent: questions that once required hours of exploration now return trusted answers in minutes, with every analyst’s institutional knowledge made accessible to everyone.

Challenge

Pinterest is one of the world’s largest visual discovery platforms. Its product experience and advertising business depend on understanding what hundreds of millions of people find interesting, across feeds that update in real time. That scale generates an enormous amount of data. For years, the data platform grew at the same pace as the product itself, without the governance infrastructure to match.

The result: 400,000+ tables, 500 new tables created daily, no lifecycle policy, and institutional knowledge that existed in people’s heads but nowhere analysts could reliably reach it. Every day at Pinterest, analysts were spending time they did not have, reverse-engineering information that already existed. Which table was current? Who owned it? Whether it was trustworthy? The data was there. The context to use it confidently was not.

“Analysts spent their time reverse-engineering trust, going through code, tracing lineage, and dropping Slack messages hoping someone would respond. The knowledge existed, but it was buried, tribal, and impossible to verify.”

This is a context problem. When an analyst cannot determine which of hundreds of candidate tables is trustworthy, which join is valid, or what a column actually measures, they cannot get a reliable answer.

The same problem is catastrophically worse for an AI agent. A system that has no reliable signal of trustworthiness cannot distinguish a production Tier 1 table from a deprecated staging table on semantic similarity alone. Relevant and trustworthy are not the same thing, and without a mechanism to separate them, an agent’s answers are only as good as its context.

Solution

Pinterest rebuilt its context infrastructure around a single principle: PinCat, Pinterest’s name for its DataHub implementation, would serve as the centralized context platform for every analyst and AI agent at the company.

That required solving four problems in sequence.

1. Making context trustworthy: governance as the foundation

Before any AI system could use Pinterest’s metadata, the metadata itself needed to be reliable. Pinterest’s engineering team operationalized governance through DataHub with tier-based metadata requirements.

The tier model creates three levels of context quality:

Tier 1 tables are cross-team, production assets with full documentation, quality checks, and data steward review: the highest-trust context in the warehouse.
Tier 2 tables are team-owned, with owner, description, and domain tag required.
Tier 3 tables are temporary or personal, with minimal documentation requirements but an aggressive retention policy with automatic cleanup.

Pinterest phased this rollout gradually, starting with five tables, then 10, and eventually scaling across the entire company. The result: 400,000 tables reduced to 100,000 curated, AI-ready assets. They were left with a warehouse that an agent can navigate with confidence because every asset in it carries a trust signal.

“DataHub came into the picture and became the system that made governance operational and actionable. Every table creation gets gated through the DataHub UI, and if you do not provide a tier, owner, retention, and domain, you don't get a table.”

2. Making context complete: AI documentation at scale

The governance foundation created trustworthy assets. But 100,000 governed tables with sparse documentation is still a weak context layer. The Pinterest team tackled documentation on three fronts:

DataHub’s AI Documentation feature drafts table and column descriptions from code, glossary terms, lineage relationships, and queries. Human review is retained for Tier 1 tables, keeping expert judgment on the assets that matter most.
Join-based lineage propagates semantics to joined columns automatically: when a documented column joins an undocumented one, semantics flow across.
Where joins are sparse, vector similarity search fills remaining gaps.

Pinterest estimates this approach eliminated 70% of the manual documentation work previously required to keep the warehouse legible. Every governed asset now carries both a trust tier and a semantic description, the starting point for everything that follows.

3. Making context searchable: intent-indexed query history

With 100,000 governed, documented assets in DataHub, Pinterest had a reliable schema dimension of context. But the team identified a second, underutilized dimension to build semantic infrastructure: the expertise of their analysts.

Every SQL query an analyst has ever written encodes something a schema description cannot: the business intent behind the data, the joins that actually work at Pinterest, and the metric definitions specific to how the company measures things. As Software Engineer Keqiang Li puts it: “We treat this history of queries as a library of real solutions written by real analysts.”

Pinterest built the semantic infrastructure in two parallel paths that combine at retrieval time.

Analytical intent: Every historical SQL query is enriched with governed context from DataHub, including glossary terms, table descriptions, metric definitions, and domain expertise. An LLM then translates the SQL into natural-language analytical intent, using the context to determine questions the user is trying to answer. Those intent representations are then embedded as vectors. The result is an index searchable by meaning: when an analyst asks a question, the agent retrieves queries that answered similar analytical questions before.
Structural and statistical patterns: Running in parallel, the same query history is parsed for the patterns that capture how analytics is actually done at Pinterest: which tables are joined together, which filters are standard, how often a query is asked. These patterns are extracted as structured data and used at retrieval time to surface relevant tables and the specific ways those tables have been successfully used.

At retrieval, both paths combine. Results are ranked by DataHub governance signals, including tier, freshness, documentation quality, and usage frequency, so the most reliable and relevant answer surfaces first.

4. Centralizing context: building the Analytics Agent

The Analytics Agent is where three phases of context work become visible to the analyst: trusted answers in minutes, without the hours of lineage tracing, Slack messages, and dead ends that came before.

Every query an analyst writes enriches the index, making discovery faster for the next analyst who asks a variant of the same question. The more the agent is used, the better its context gets.

For more details, check out the Pinterest Engineering team’s blog post.

Impact

Pinterest’s Analytics Agent became the company’s most-used internal AI tool. Key outcomes included:

#1 agent at Pinterest, with 10x the usage of the next. The Analytics Agent was adopted by 40% of Pinterest’s analyst population within two months of launch, with a target to reach 50% by year-end, and runs at 10 times the usage of the next most-used internal agent.
Question to trusted answer in minutes, not hours. Before, answering a question in an unfamiliar domain meant hours of table exploration, lineage tracing, and Slack messages. The agent returns trusted, governance-grounded answers by drawing on years of institutional query knowledge encoded in DataHub.
Cross-domain knowledge, accessible to every analyst. Patterns developed by one team are now retrievable by any analyst across the organisation. Institutional knowledge that previously lived in individual heads or unanswered Slack threads surfaces through a natural-language question.
Consistent results from governed sources. Every answer the agent returns is grounded in DataHub-governed tables and conventions. Governance-aware ranking ensures that deprecated, undocumented, or low-quality assets never surface above production-ready Tier 1 tables, regardless of semantic similarity.
~70% reduction in manual documentation effort. AI-generated docs, join-based lineage propagation, and vector similarity search together eliminated approximately 70% of the manual work previously required to document the warehouse.
400,000 tables reduced to 100,000 governed, AI-ready assets. Pinterest’s governance program, enforced entirely through DataHub, reduced warehouse footprint by 75% by building the discipline to only keep what is owned, documented, and purposeful.

What’s next

Pinterest is extending the same governance-and-context model to additional asset types. Superset charts are being brought into the agent’s context layer in version two, with governance tiers applied to dashboards using the same principles as tables. Documentation currently fragmented across Google Docs, Confluence, and internal wikis is being consolidated into DataHub as the single, centralized context platform.

The intended outcome is the same principle at greater scale: every asset an analyst or agent might need, governed, documented, and semantically indexed in one place — so that the next question any analyst asks is answered faster, and more reliably, than the last.

Pinterest: before and after DataHub

Business Challenge	Before DataHub	With DataHub
Finding a trustworthy table in a 400K-table warehouse	Manual lineage tracing, Slack messages, tracking down owners	Governance-aware search ranked by tier, freshness, and documentation quality
Maintaining data ownership at scale	~30% of top tables owned by ex-employees; no team-level accountability	Project identifier + domain tag required at creation; ownership survives personnel changes
Keeping documentation current across 100K+ tables	Sparse; manual effort could not keep pace with 500 new tables per day	AI-generated descriptions, join-based lineage propagation, vector similarity search; ~70% manual effort eliminated
Getting from analytical question to trusted SQL	Hours of domain exploration, multiple Slack threads, custom filter guesswork	Minutes, via natural-language query grounded in DataHub-governed context and intent-indexed query history
Sharing analytical patterns across domain teams	Institutional knowledge siloed in individual teams; no way to discover what other teams had already solved	Every analyst query enriches a shared index; patterns from any team are retrievable by any analyst
Ensuring AI agents surface trusted, not just relevant, results	No mechanism; semantic similarity alone surfaces deprecated tables over production ones	Governance-aware ranking weights tier, freshness, and documentation quality above similarity alone

Start your own success story with DataHub

Meet with us

See how DataHub Cloud can support enterprise needs and accelerate your journey toward context-rich, AI-ready data. Request a custom demo.

Join our open source community

Explore the project, contribute ideas, and connect with thousands of practitioners in the DataHub Slack community.

Pinterest Powers its #1 AI Agent with DataHub Context