GOALS
- Centralize metadata management for fast-evolving data and AI environments
- Streamline ML data governance at scale
The Topline
Challenge
Existing tools couldn’t support the complexity or pace of Apple’s evolving AI and ML landscape
Solution
Implemented DataHub with custom platforms, entity types, connectors, and aspects to support metadata management for data and AI assets
Impact
Established scalable metadata management processes and strengthened governance across the ML lifecycle
Note: This story was originally published October 2024.
Challenge
As Apple’s machine learning landscape rapidly expanded, the team faced growing complexity in managing metadata across AI assets, spanning custom models, diverse data stores, and advanced pipelines.
“The ML landscape is now coming in with its diverse applications evolving at an unprecedented pace… We are rapidly evolving our tooling and capabilities to support metadata management for our AI assets.”
— Ravi Sharma, Senior Engineering Manager, Apple
Rigid platforms couldn’t model Apple’s unique asset types or adapt to frequent changes. And without a unified metadata foundation, governance efforts struggled to scale.
“You have a lot of heterogeneous data sources that have different characteristics, and the entities in them may have different shapes. So this really brings a need to define custom data sources and custom data entity shapes.”
— Deepak Chandramouli, Senior Engineer, Apple
Solution
Apple implemented DataHub as a foundation for managing metadata across its evolving ML lifecycle. The team extended DataHub with custom components to reflect the complexity of their stack while prioritizing data governance.
Key features included:
- Custom platforms and entity types to reflect ML-specific assets and enrich the catalog with metadata
- Custom-built connectors using DataHub’s Python SDK to hydrate the catalog with data and ML entities
- Lineage integration that brings events from different compute engines and training frameworks into DataHub
- Sibling entities to rationalize metadata and identify logical links. For example, ML datasets are symlinked to their underlying tables and objects
- Custom aspects to overlay common types of metadata, such as policies, workflow status, and access status, across heterogeneous entities. For example, access control and privacy policies are pulled from the policy stores and shown as aspects within the catalog
- Hybrid metadata integration strategy blending push, pull, and event processing mechanisms
- Glossary management automation framework for self-serve management of business ontologies to reduce friction in augmenting ML inventory with business metadata
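As a rough illustration of the custom platform, sibling, and custom aspect ideas listed above, the sketch below models them with plain Python objects rather than the DataHub SDK. The URN layout follows DataHub's documented dataset URN format. Everything else is an assumption made for illustration: the `CatalogEntry` class, the platform names (`ml-feature-store`, `hive`), the dataset names, and the `accessPolicy` aspect are hypothetical and are not taken from Apple's deployment.

```python
from dataclasses import dataclass, field


def make_platform_urn(platform: str) -> str:
    """Build a data platform URN in DataHub's documented format."""
    return f"urn:li:dataPlatform:{platform}"


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in DataHub's documented format."""
    return f"urn:li:dataset:({make_platform_urn(platform)},{name},{env})"


@dataclass
class CatalogEntry:
    """Toy stand-in for a catalog entity: a URN plus named aspects."""
    urn: str
    aspects: dict = field(default_factory=dict)


def link_siblings(first: CatalogEntry, second: CatalogEntry) -> None:
    """Record a two-way sibling link so each entry points at the other."""
    first.aspects.setdefault("siblings", []).append(second.urn)
    second.aspects.setdefault("siblings", []).append(first.urn)


def overlay_aspect(entries, aspect_name: str, value: dict) -> None:
    """Attach the same kind of aspect to entries of different shapes."""
    for entry in entries:
        entry.aspects[aspect_name] = dict(value)  # copy so entries do not share state


# Hypothetical assets: an ML dataset on a custom platform and its backing table.
ml_dataset = CatalogEntry(make_dataset_urn("ml-feature-store", "churn.training_v1"))
table = CatalogEntry(make_dataset_urn("hive", "analytics.churn_features"))

link_siblings(ml_dataset, table)
overlay_aspect([ml_dataset, table], "accessPolicy", {"status": "restricted"})
```

In a real deployment the entries would be emitted to DataHub through its SDK or ingestion framework. The dictionaries here only show how one overlay can sit on top of entities from different platforms.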
“DataHub’s integration API patterns offer a balance in push, pull, and event processing mechanisms, which I think are all required to have a hybrid integration strategy.”
— Deepak Chandramouli, Senior Engineer, Apple
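The three mechanisms named in the quote can be contrasted with a small in-memory sketch. It does not use DataHub's API: `ToyCatalog`, the `urn:toy:...` identifiers, and the event shape are all hypothetical, chosen only to show how push, pull, and event processing can feed the same catalog.

```python
from collections import defaultdict


class ToyCatalog:
    """In-memory stand-in for a metadata catalog (not the DataHub API)."""

    def __init__(self):
        self.aspects = defaultdict(dict)  # urn -> {aspect_name: value}
        self.sources = {}                 # urn -> mechanism that last wrote it

    def upsert(self, urn, aspect_name, value, mechanism):
        self.aspects[urn][aspect_name] = value
        self.sources[urn] = mechanism


def push(catalog, urn, properties):
    """Push: the producing system emits metadata when it changes."""
    catalog.upsert(urn, "properties", properties, "push")


def pull(catalog, list_assets):
    """Pull: a scheduled crawler asks a source system for its current assets."""
    for urn, properties in list_assets():
        catalog.upsert(urn, "properties", properties, "pull")


def on_event(catalog, event):
    """Event processing: react to a lineage event emitted by a job run."""
    catalog.upsert(event["output"], "upstreams", list(event["inputs"]), "event")


# Hypothetical assets and values, for illustration only.
catalog = ToyCatalog()
push(catalog, "urn:toy:model:churn", {"owner": "ml-team"})
pull(catalog, lambda: [("urn:toy:table:features", {"rows": 1000})])
on_event(
    catalog,
    {"output": "urn:toy:model:churn", "inputs": ["urn:toy:table:features"]},
)
```

Push suits systems that know when their own metadata changes, pull suits sources that can only be crawled, and event processing suits lineage that exists only while a job is running.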
“DataHub is an important component of our data infrastructure. We have leveraged several open source features of DataHub in the context of metadata management for the ML lifecycle.”
— Ravi Sharma, Senior Engineering Manager, Apple
Impact
With DataHub, Apple established a robust, extensible architecture that supports metadata management at the speed of AI innovation.
Key outcomes included:
- Unified metadata management across a quickly evolving data and ML landscape
- Future-proofed metadata management for AI with extensible architecture
- Strengthened data governance with enriched technical metadata and self-service workflows