INDUSTRY
SIZE
DATA STACK
SOLUTION
USE CASE
GOALS
- Centralize metadata management
- Enable self-serve data discovery
- Extend metadata management to support custom data ecosystem
The Topline
Challenge
Data consumers couldn’t discover relevant tables or understand column meanings; data producers lacked tools to govern and enrich metadata
Solution
Adopted DataHub as a metadata platform, extending it with custom Thrift ingestion and “Data Elements” typing system
Impact
Streamlined data discovery and scalable metadata management across a bespoke data ecosystem
Note: This story was originally published November 2022.
Challenge
Pinterest’s data ecosystem had grown to over 250,000 Hive tables and 100,000 Thrift structs and enums, outpacing the capabilities of its existing metadata workflows.
The existing infrastructure relied on a Thrift schema repository that was compiled daily into JAR files for their Presto and Spark engines. While functional, this approach was rigid, inconvenient, and lacked support for metadata enrichment.
This created pain on both ends of the data lifecycle:
- Data consumers struggled to identify the right tables or understand what columns represented
- Data producers had no clear, centralized location to add or manage metadata
“For the data consumers, people didn’t know which table should be used and what the column means. And for the data producers, they needed a governance tool. They needed to add metadata, but they didn’t know where to add it.”
– Zhong Xu, Software Engineer, Pinterest.
Solution
Pinterest selected DataHub for its extensible metadata model, modern UI, granular access control, and strong support for custom metadata.
The team extended the platform in two key areas:
- Custom Thrift ingestion: Rather than ingesting from generated artifacts, Pinterest built a custom integration to ingest raw Thrift files directly, preserving critical metadata
- “Data Elements” typing system: To better represent logical business entities across datasets, the team developed a custom “Data Elements” typing system. Thanks to DataHub’s extensible architecture, integrating this custom system was as simple as adding a new field in the editable schema field info
Thanks to DataHub’s extensibility, we could easily build our Thrift model… and extend the DataHub model to support our Data Elements.
ZHONG XU
Software Engineer, Pinterest
Impact
By extending DataHub to fit their unique environment, Pinterest delivered a more powerful, user-friendly, and scalable data discovery experience.
Key outcomes included:
- Accelerated custom integration using DataHub’s extensible architecture
- Accelerated use of data by streamlining data discovery through a centralized, intuitive UI
- Empowered data producers with the ability to manage metadata at the source
Start your own success story with DataHub
Meet with us
See how DataHub Cloud can support enterprise needs and accelerate your journey toward context-rich, AI-ready data. Request a custom demo.
Join our open source community
Explore the project, contribute ideas, and connect with thousands of practitioners in the DataHub Slack community.