Apple’s Machine Learning Data Gets Tuned Up

“DataHub is an important component of our data infrastructure. We have leveraged several open source features of DataHub in the context of metadata management for the ML lifecycle.”
CUSTOMER
Apple
INDUSTRY
Technology / Consumer Electronics
SIZE
150,000+ employees
SOLUTION
DataHub Core (OSS)
USE CASE
Metadata Management, Scale AI Initiatives, Governance
DATA STACK
Kafka, Iceberg, Superset, Druid, Delta Lake, GitOps, Helm, Kubernetes, Spark
GOALS
Centralize metadata management for fast-evolving data and AI environments, Streamline ML data governance at scale
The Topline
  • Challenge: Existing tools couldn’t support the complexity or pace of Apple’s evolving AI and ML landscape
  • Solution: Implemented DataHub with custom platforms, entities, connectors, and aspects to support metadata management for data and AI assets
  • Impact: Established scalable metadata management processes and strengthened governance across the ML lifecycle

Note: This story was originally published October 2024.

Challenge

As Apple’s machine learning landscape rapidly expanded, the team faced growing complexity in managing metadata across AI assets, spanning custom models, diverse data stores, and advanced pipelines.

“The ML landscape is now coming in with its diverse applications evolving at an unprecedented pace… We are rapidly evolving our tooling and capabilities to support metadata management for our AI assets.”

Ravi Sharma, Apple

Rigid platforms couldn’t model Apple’s unique asset types or adapt to frequent changes. Without a unified metadata foundation, governance efforts struggled to scale.

“You have a lot of heterogeneous data sources that have different characteristics, and the entities in them may have different shapes. So this really brings a need to define custom data sources and custom data entity shapes.”

Deepak Chandramouli, Apple

Solution

Apple implemented DataHub as a foundation for managing metadata across its evolving ML lifecycle. The team extended DataHub with custom components to reflect the complexity of their stack while prioritizing data governance.

Key features included:

  • Custom platforms and entity types to reflect ML-specific assets and enrich the catalog with metadata
  • Custom-built connectors using DataHub’s Python SDK to hydrate the catalog with data and ML entities
  • Lineage events from different compute engines and training frameworks, integrated into DataHub
  • Sibling entities to rationalize metadata and identify logical links. ML datasets are symlinked to underlying tables and objects
  • Custom aspects that overlay common types of metadata, such as policies, workflow status, and access status, across heterogeneous entities. For example, access control and privacy policies are pulled from the policy stores and shown as aspects within the catalog
  • Hybrid metadata integration strategy blending push, pull, and event processing mechanisms
  • Glossary management automation framework for self-serve management of business ontologies to reduce friction in augmenting ML inventory with business metadata
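
To make the sibling and custom-aspect ideas above more concrete, here is a minimal sketch in plain Python. It is not Apple's implementation and does not call the DataHub SDK: the Catalog class, the "mlDatasetStore" platform name, the "accessStatus" aspect, and the example dataset names are all hypothetical. Only the dataset URN layout follows DataHub's documented convention, and the siblings payload is loosely modeled on DataHub's sibling concept.

```python
from dataclasses import dataclass, field


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN following DataHub's dataset URN layout."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class Catalog:
    """Toy in-memory catalog (hypothetical): urn -> {aspect name -> payload}."""

    aspects: dict = field(default_factory=dict)

    def upsert_aspect(self, urn: str, aspect_name: str, payload: dict) -> None:
        # Aspects are stored per entity, so the same aspect name can be
        # overlaid on entities of very different shapes.
        self.aspects.setdefault(urn, {})[aspect_name] = payload

    def link_siblings(self, primary_urn: str, secondary_urn: str) -> None:
        # Record the link on both sides so either entity resolves to the other.
        # This toy version overwrites any earlier sibling link on each entity.
        self.upsert_aspect(
            primary_urn, "siblings", {"siblings": [secondary_urn], "primary": True}
        )
        self.upsert_aspect(
            secondary_urn, "siblings", {"siblings": [primary_urn], "primary": False}
        )

    def view(self, urn: str, aspect_name: str):
        # Returns None when the entity or the aspect is unknown.
        return self.aspects.get(urn, {}).get(aspect_name)


catalog = Catalog()

# Hypothetical custom platform for ML datasets, linked to an underlying table.
ml_dataset = make_dataset_urn("mlDatasetStore", "training.images_v1")
table = make_dataset_urn("iceberg", "warehouse.images")
catalog.link_siblings(primary_urn=table, secondary_urn=ml_dataset)

# Overlay the same custom aspect on both heterogeneous entities.
for urn in (ml_dataset, table):
    catalog.upsert_aspect(
        urn, "accessStatus", {"policy": "restricted", "source": "policy-store"}
    )

print(catalog.view(ml_dataset, "siblings"))
print(catalog.view(table, "accessStatus"))
```

In a real deployment the aspect payloads would presumably be defined in DataHub's metadata model and written through its SDK or APIs; the dictionary here only stands in for that storage.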

“DataHub’s integration API patterns offer a balance in push, pull, and event processing mechanisms, which I think are all required to have a hybrid integration strategy.”

Deepak Chandramouli, Apple
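
As a rough illustration of what a hybrid push, pull, and event strategy can look like, the sketch below routes all three paths into one store. It uses only the Python standard library. MetadataStore, the record layout, and the example URNs and values are assumptions made for illustration, not DataHub's API or Apple's design.

```python
from collections import deque


class MetadataStore:
    """Toy metadata store (hypothetical) fed by three integration paths."""

    def __init__(self):
        self.records = {}  # (urn, aspect) -> value
        self.log = []      # which path produced each write, in order

    def _write(self, record, via):
        self.records[(record["urn"], record["aspect"])] = record["value"]
        self.log.append(via)

    def push(self, record):
        """Push: a producer sends metadata as soon as it changes."""
        self._write(record, "push")

    def pull(self, extract):
        """Pull: the catalog calls an extractor that scans a source system."""
        count = 0
        for record in extract():
            self._write(record, "pull")
            count += 1
        return count

    def drain_events(self, queue):
        """Events: consume queued change events, oldest first."""
        count = 0
        while queue:
            self._write(queue.popleft(), "event")
            count += 1
        return count


def scan_policy_store():
    # Hypothetical extractor standing in for a scheduled crawl of a policy store.
    yield {"urn": "urn:li:dataset:a", "aspect": "accessStatus", "value": {"policy": "restricted"}}
    yield {"urn": "urn:li:dataset:b", "aspect": "accessStatus", "value": {"policy": "open"}}


store = MetadataStore()

# Push: a training job reports its workflow status directly.
store.push({"urn": "urn:li:dataset:a", "aspect": "workflowStatus", "value": {"state": "running"}})

# Pull: a scheduled scan of the policy store.
pulled = store.pull(scan_policy_store)

# Events: a lineage event arriving on a queue (a deque stands in for a stream such as Kafka).
events = deque([
    {"urn": "urn:li:dataset:b", "aspect": "lineage", "value": {"upstreams": ["urn:li:dataset:a"]}},
])
processed = store.drain_events(events)

print(store.log)
```

All three paths end in the same write, which is the point of the design: the catalog does not need to care how a piece of metadata arrived, only which entity and aspect it describes.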

Impact

With DataHub, Apple established a robust, extensible architecture that supports metadata management at the speed of AI innovation. 

Key outcomes included:

  • Unified metadata management across a quickly evolving data and ML landscape
  • Future-proofed metadata management for AI with extensible architecture
  • Strengthened data governance with enriched technical metadata and self-service workflows