GOALS
- Centralize metadata across a vast data ecosystem
- Improve data lineage and dependency tracing
- Monitor data quality in real time
- Streamline operations
The Topline
Challenge
With 100+ games generating massive data volumes, Zynga struggled with data silos, unclear dependencies, and fragmented metadata across its complex tech stack.
Solution
Implemented DataHub as its centralized data catalog, customizing it to support gaming-specific needs like A/B testing experiments.
Impact
Enabled 300+ data practitioners to work more efficiently and confidently, with full visibility into lineage, data quality, and dependencies across 100,000+ assets.
Note: This story was originally published September 2023.
Challenge
As Zynga’s portfolio expanded to over 100 games, so did the complexity of its data operations. The company processes more than 35 billion records daily, ingests around 66 terabytes of data, and maintains 1.5 petabytes of queryable data. This ecosystem supports over 300 data analysts, engineers, and PMs running 100,000+ queries and generating 4,000 reports each day.
But without centralized metadata, teams struggled to answer fundamental questions:
“We needed a tool to help us know our data better and answer questions, such as, ‘Where does the data from this report come from?’ ‘What experiment will be affected if I change my dataset?’ ‘When was this dataset last updated?’ ‘What are the dependencies between datasets and jobs?’”
— Felipe Gusmao, Data Engineer, Zynga
Despite having a robust tech stack including Redshift, Airflow, Databricks, Tableau, and Kubernetes, Zynga identified the need for a unified solution to address data discovery, reliability, and operational efficiency at scale.
Solution
Zynga selected DataHub as the backbone of its metadata strategy, integrating it across its entire tech stack. The implementation spanned 100,000+ ingested assets from Redshift, Airflow, Tableau, Databricks, and more.
Key implementation details:
- Custom code: Modified ingestion pipelines for tools like Redshift and Airflow to accommodate differences in SQL syntax and workflow structures
- New entities: Introduced an “Experiments” entity to capture metadata for A/B testing and experiments, a critical aspect of game development
- Deployment architecture: Used managed AWS services for Kafka, MySQL, cache, and search, with DataHub’s frontend, GMS, and Schema Registry deployed via Kubernetes using Helm charts
- Monitoring: Integrated logs with Splunk and monitoring through Datadog for system reliability
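To make the “Experiments” entity more concrete, here is a minimal Python sketch of what such a metadata record might hold. It is not Zynga's or DataHub's actual model: the `Experiment` class, its fields, the `urn:li:experiment:` prefix, and the sample game and table names are all hypothetical. Only the dataset URN layout follows DataHub's documented convention.

```python
from dataclasses import dataclass, field


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in DataHub's documented layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class Experiment:
    """Hypothetical A/B test record; field names are illustrative only."""
    name: str
    game: str
    variants: list[str]
    input_datasets: list[str] = field(default_factory=list)

    @property
    def urn(self) -> str:
        # "experiment" is an assumed custom entity type, not a built-in one.
        return f"urn:li:experiment:{self.game}.{self.name}"


def experiments_reading(dataset: str, experiments: list[Experiment]) -> list[str]:
    """Return URNs of experiments that read directly from the given dataset."""
    return [e.urn for e in experiments if dataset in e.input_datasets]


# Made-up sample data: one experiment reads the events table, one does not.
events = dataset_urn("redshift", "analytics.public.game_events")
catalog = [
    Experiment("new_tutorial", "example_game", ["control", "short"], [events]),
    Experiment("store_layout", "example_game", ["control", "grid"]),
]
print(experiments_reading(events, catalog))
```

Linking each experiment to the datasets it reads is what would let a catalog answer a question like “What experiment will be affected if I change my dataset?” with a simple lookup.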
“We looked around for a data catalog tool, and DataHub was a clear winner. We created a small POC to evaluate its capabilities and saw a huge potential for it to be much more than a data catalog.”
Felipe Gusmao, Data Engineer, Zynga
Impact
With DataHub, Zynga established a robust, reliable metadata foundation and made “fully data-driven” a reality for its global data teams.
Key outcomes included:
- Centralized metadata management across 100,000+ ingested assets, establishing a single source of truth for their massive data ecosystem
- Enhanced impact assessment with comprehensive data lineage, enabling teams to trace dependencies between datasets, jobs, and reports and answer questions like, “What experiment will be impacted by changing this dataset?”
- Reduced data incidents with real-time data quality monitoring, using assertion-based validation that tracks dataset health across 35 billion daily records and flags issues proactively
- Streamlined operational troubleshooting with visibility into Airflow DAG status, allowing teams to pinpoint causes of issues like delayed dashboard refreshes
- Optimized infrastructure costs by identifying and deprecating unused datasets across a vast data landscape
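The impact-assessment outcome comes down to walking a lineage graph downstream from the asset being changed. The Python sketch below illustrates that idea with a plain breadth-first traversal. It does not use DataHub's API, and the node names, the `DOWNSTREAM` mapping, and both helper functions are assumptions made up for the example.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the nodes listed under it.
DOWNSTREAM = {
    "dataset:raw_events": ["job:sessionize"],
    "job:sessionize": ["dataset:sessions"],
    "dataset:sessions": ["experiment:new_tutorial", "report:daily_retention"],
    "dataset:purchases": ["report:revenue"],
}


def impacted(start: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every node downstream of `start`."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:  # guard against revisiting shared nodes
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def impacted_experiments(start: str, edges: dict[str, list[str]]) -> list[str]:
    """Keep only the downstream nodes that are experiments."""
    return [n for n in impacted(start, edges) if n.startswith("experiment:")]


print(impacted("dataset:raw_events", DOWNSTREAM))
print(impacted_experiments("dataset:raw_events", DOWNSTREAM))
```

In this toy graph, changing `dataset:raw_events` reaches one job, one derived dataset, one experiment, and one report, while `dataset:purchases` touches only the revenue report.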