Apple Scales ML Metadata with DataHub

INDUSTRY

Technology / Consumer Electronics

SIZE

150,000+ employees

DATA STACK

Kafka, Iceberg, Superset, Druid, Delta Lake, GitOps, Helm, Kubernetes, Spark

SOLUTION

DataHub Core (OSS)

USE CASE

Metadata Management, Scale AI Initiatives, Governance

GOALS

Centralize metadata management for fast-evolving data and AI environments
Streamline ML data governance at scale

The Topline

Challenge
Existing tools couldn’t support the complexity or pace of Apple’s evolving AI and ML landscape

Solution
Implemented DataHub with custom platforms, entities, connectors, and aspects to support metadata management for data and AI assets

Impact
Established scalable metadata management processes and strengthened governance across the ML lifecycle

Note: This story was originally published October 2024.

Challenge

As Apple’s machine learning landscape rapidly expanded, the team faced growing complexity in managing metadata across AI assets, spanning custom models, diverse data stores, and advanced pipelines.

“The ML landscape is now coming in with its diverse applications evolving at an unprecedented pace… We are rapidly evolving our tooling and capabilities to support metadata management for our AI assets.”

— Ravi Sharma, Apple

Rigid platforms couldn’t model Apple’s unique asset types or adapt to frequent changes. And without a unified metadata foundation, governance efforts struggled to scale.

“You have a lot of heterogeneous data sources that have different characteristics, and the entities in them may have different shapes. So this really brings a need to define custom data sources and custom data entity shapes.”

— Deepak Chandramouli, Apple

Solution

Apple implemented DataHub as a foundation for managing metadata across its evolving ML lifecycle. The team extended DataHub with custom components to reflect the complexity of their stack while prioritizing data governance.

Key features included:

Custom platforms and entity types to reflect ML-specific assets and enrich the catalog with metadata
Custom-built connectors using DataHub’s Python SDK to hydrate the catalog with data and ML entities
Lineage events from different compute engines and training frameworks, integrated into DataHub
Sibling entities to rationalize metadata and identify logical links, with ML datasets symlinked to their underlying tables and objects
Custom aspects to overlay common types of metadata, such as policies, workflow status, and access status, across heterogeneous entities; for example, pulling access control and privacy policies from the policy stores and surfacing them within the catalog
Hybrid metadata integration strategy blending push, pull, and event processing mechanisms
Glossary management automation framework for self-serve management of business ontologies to reduce friction in augmenting ML inventory with business metadata
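
To make the list above concrete, here is a minimal sketch of what a custom connector might emit: a URN for an asset on a custom platform, a sibling link to its backing table, and a custom aspect overlay. It uses only the Python standard library rather than DataHub's SDK, so it illustrates the shape of the metadata and is not Apple's implementation. The platform name `mlstore`, the `accessStatus` aspect and its fields, and the choice of which sibling is primary are all hypothetical; the envelope keys are modeled loosely on DataHub's metadata change proposal and should be checked against the SDK before use.

```python
import json
from dataclasses import dataclass


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN following DataHub's URN layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class AspectProposal:
    """One aspect written against one entity (simplified stand-in for a change proposal)."""
    entity_urn: str
    aspect_name: str
    aspect: dict
    entity_type: str = "dataset"
    change_type: str = "UPSERT"

    def to_json(self) -> str:
        # Key names are an assumption modeled on DataHub's proposal envelope.
        return json.dumps(
            {
                "entityType": self.entity_type,
                "entityUrn": self.entity_urn,
                "changeType": self.change_type,
                "aspectName": self.aspect_name,
                "aspect": self.aspect,
            },
            sort_keys=True,
        )


def sibling_proposals(ml_dataset_urn: str, table_urn: str) -> list:
    """Link an ML dataset and its backing table as siblings, in both directions."""
    return [
        # Assumption: the physical table is treated as the primary sibling.
        AspectProposal(ml_dataset_urn, "siblings", {"siblings": [table_urn], "primary": False}),
        AspectProposal(table_urn, "siblings", {"siblings": [ml_dataset_urn], "primary": True}),
    ]


def access_status_proposal(urn: str, policy_ids: set, status: str) -> AspectProposal:
    """Hypothetical custom aspect overlaying policy-store data on any entity."""
    return AspectProposal(urn, "accessStatus", {"policyIds": sorted(policy_ids), "status": status})
```

A connector built along these lines would generate the proposals for each asset it discovers and hand them to whatever emitter the deployment uses. Because the overlay is keyed only by URN, the same aspect can be attached to tables, ML datasets, or models alike.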

“DataHub’s integration API patterns offer a balance in push, pull, and event processing mechanisms, which I think are all required to have a hybrid integration strategy.”

— Deepak Chandramouli, Apple
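
The hybrid strategy described in the quote can be pictured as three ingestion paths feeding one catalog. The sketch below is a toy, in-memory illustration under stated assumptions: `InMemoryCatalog`, `push`, `pull`, and `process_events` are hypothetical names, the event dictionary shape is invented for the example, and the lineage payload is a simplification rather than DataHub's actual schema.

```python
from collections import deque
from typing import Callable, Iterable, Tuple


class InMemoryCatalog:
    """Toy catalog that records each aspect and which path delivered it."""

    def __init__(self) -> None:
        self.aspects = {}  # (urn, aspect_name) -> {"payload": ..., "via": ...}

    def upsert(self, urn: str, aspect_name: str, payload: dict, via: str) -> None:
        self.aspects[(urn, aspect_name)] = {"payload": payload, "via": via}


def push(catalog: InMemoryCatalog, urn: str, aspect_name: str, payload: dict) -> None:
    """Push: the producing system writes metadata the moment it changes."""
    catalog.upsert(urn, aspect_name, payload, via="push")


def pull(
    catalog: InMemoryCatalog,
    list_assets: Callable[[], Iterable[Tuple[str, str, dict]]],
) -> int:
    """Pull: a scheduled job crawls a source and records what it finds."""
    count = 0
    for urn, aspect_name, payload in list_assets():
        catalog.upsert(urn, aspect_name, payload, via="pull")
        count += 1
    return count


def process_events(catalog: InMemoryCatalog, queue: deque) -> int:
    """Event processing: drain lineage events emitted by compute engines."""
    handled = 0
    while queue:
        event = queue.popleft()
        # Simplified lineage payload: output asset plus its list of upstream inputs.
        catalog.upsert(
            event["output"], "upstreamLineage", {"upstreams": event["inputs"]}, via="event"
        )
        handled += 1
    return handled
```

Each path suits a different kind of source: push for systems that know when they change, pull for stores that can only be crawled, and event processing for lineage that exists only while a job is running.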

“DataHub is an important component of our data infrastructure. We have leveraged several open source features of DataHub in the context of metadata management for the ML lifecycle.”

RAVI SHARMA

Apple

Impact

With DataHub, Apple established a robust, extensible architecture that supports metadata management at the speed of AI innovation.

Key outcomes included:

Unified metadata management across a quickly evolving data and ML landscape
Future-proofed metadata management for AI with extensible architecture
Strengthened data governance with enriched technical metadata and self-service workflows
