GOALS
- Consolidate data discovery tools, replacing a 9-year-old legacy catalog and self-built lineage tool
- Make data discovery reliable
- Provide data lineage visibility
- Establish data governance foundation
- Reduce reliance on tribal knowledge with a scalable solution
The Topline
Challenge
Legacy self-built data catalog couldn’t scale with Etsy’s growth, making it difficult for users to find trusted datasets across their complex data landscape
Solution
Implemented DataHub to replace their 9-year-old legacy catalog, prioritizing BigQuery and MySQL integration with data lineage capabilities
Impact
Successfully migrated 600+ users to DataHub, consolidated data discovery tools, and established a foundation for improved data governance
Note: This story was originally published September 2022.
Challenge
Etsy faced a critical data discovery challenge as their two-sided marketplace scaled to over 7.7 million active sellers, 93 million active buyers, and 100+ million unique listings. Their nine-year-old self-built data catalog, along with a separate lineage tool built in 2020, had entered “maintenance mode” and could no longer support the organization’s data discovery needs.
The core issues, identified through 30+ user interviews across Engineering, Product Management, and Analytics teams, centered on data discovery and trust.
“It was hard to find the right datasets, and it was unclear which datasets were reliable, where they came from, and how they were being used. Much of this information came from tribal knowledge, which at this stage of growth was no longer sustainable for us.”
— Vishal Shah, Senior Data Engineer, Etsy
With production data stored across hundreds of MySQL shards, a data warehouse in BigQuery, and multiple other data sources, users struggled to navigate the complex data landscape effectively. The existing tools couldn’t provide the comprehensive view needed for informed decision-making.
Solution
Etsy’s Data Discovery team, formed in April 2021 after completing their Vertica-to-BigQuery migration, took a methodical approach to solving their data catalog challenge. Rather than immediately building or buying a solution, they invested a full month in user research to understand the problem space thoroughly.
The team then conducted an extensive vendor evaluation, investigating 30 different tools over a month and building proofs of concept for select solutions. DataHub emerged as the winner due to three key factors:
- Flexibility in metadata modeling
- Strong integrations with Etsy’s existing data sources
- Active and growing open-source community
For implementation, Etsy leveraged existing organizational expertise in GKE, Kafka, and Buildkite while setting up additional infrastructure, including a Cloud SQL database and managed Elasticsearch. They prioritized BigQuery and MySQL ingestion to achieve parity with their legacy Schemer catalog, and implemented an MVP for data lineage covering the MySQL → BigQuery → Looker pipeline.
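To make the lineage idea concrete, here is a minimal sketch of how a MySQL → BigQuery → Looker chain could be represented and traversed. It is not Etsy's implementation and does not call the DataHub SDK. It only builds identifier strings in DataHub's dataset URN format and walks an in-memory map. The table, project, and explore names, and the helpers `dataset_urn`, `upstream_chain`, and `platform_of`, are hypothetical and chosen for illustration.

```python
from collections import deque


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a string in DataHub's dataset URN format (hypothetical helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical datasets, one per stage of the pipeline.
MYSQL_ORDERS = dataset_urn("mysql", "shop_db.orders")
BQ_ORDERS = dataset_urn("bigquery", "example-project.warehouse.orders")
LOOKER_ORDERS = dataset_urn("looker", "sales.explore.orders")

# Each dataset maps to the datasets it is derived from (its upstreams).
UPSTREAMS = {
    BQ_ORDERS: [MYSQL_ORDERS],
    LOOKER_ORDERS: [BQ_ORDERS],
}


def upstream_chain(urn: str, upstreams: dict[str, list[str]]) -> list[str]:
    """Return every upstream of `urn`, nearest first, via breadth-first search."""
    seen: list[str] = []
    queue = deque(upstreams.get(urn, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # guard against cycles and shared ancestors
        seen.append(current)
        queue.extend(upstreams.get(current, []))
    return seen


def platform_of(urn: str) -> str:
    """Extract the platform name from a dataset URN string."""
    return urn.split("urn:li:dataPlatform:")[1].split(",")[0]


if __name__ == "__main__":
    for upstream in upstream_chain(LOOKER_ORDERS, UPSTREAMS):
        print(platform_of(upstream), upstream)
```

Starting from the Looker node, the traversal visits the BigQuery table and then the MySQL source, which is the kind of question ("where did this data come from?") a lineage view in a catalog is meant to answer.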
“We found success in DataHub for its flexibility in metadata modeling, integrations with many of our existing data sources at Etsy, and an active and growing community.”
VISHAL SHAH
Senior Data Engineer, Etsy
Impact
Etsy successfully launched DataHub in April 2022, delivering significant improvements in data discovery and user engagement.
Key outcomes included:
- Scaled data coverage across 11,000+ datasets spanning 5 data platforms, with numbers growing daily
- Eliminated legacy system dependencies with minimal business disruption
- Accelerated data use by streamlining data discovery in a unified platform
- Enhanced data lineage visibility for critical business workflows
- Established governance foundation with ownership management embedded in dataset creation workflows