GOALS
- Real-time data governance automation
- Automate data lifecycle management
- Optimize infrastructure costs
- Improve auditability
The Topline
Challenge
Manual data governance processes couldn’t scale with 42+ teams and a rapidly growing data ecosystem
Solution
Implemented the DataHub Actions Framework for real-time, event-driven governance at scale
Impact
Reduced costs through automated deprecation, improved compliance with real-time data governance automations, and enhanced auditability
Note: This story was originally published November 2023.
Challenge
Checkout.com provides internet payment services for global brands like Financial Times, GE, Sony, and Patreon. But as the company grew and onboarded 42+ internal teams, its data governance complexity grew with it.
“The more teams we onboard, the more teams require a lot of data ingestion. And because we have a lot of data being ingested, it’s very quick for datasets to become stale and unused.”
— John Claro, Data Engineer II, Checkout.com
The team faced two critical governance challenges:
- Data sprawl and rising costs: Stale, unused datasets were cluttering the warehouse and driving up storage costs.
- PII compliance gaps: Teams lacked an efficient way to identify and mask sensitive data across systems.
Solution
Checkout.com’s data platform team deployed the DataHub Actions Framework to automate data governance at scale. The framework listens for metadata changes in real time and triggers targeted workflows in response.
Automated deprecation
A custom detection model tracks dataset usage during Snowflake ingestion:
- 30 days of inactivity → flagged as “unused”
- 60 days → marked “stale”
- 90 days → queued for decommissioning
Data owners are notified about deprecated assets via Looker dashboards and provided with documentation to decommission unused datasets safely.
Verified datasets, those with full documentation and clear ownership, are protected from deprecation. Users can also un-deprecate a dataset if it becomes relevant again.
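The tiering rules above can be illustrated with a short sketch. This is not Checkout.com's detection model, only a minimal illustration of the thresholds described in the story. The `Dataset` class, the `verified` flag, the status strings, and the choice to treat each threshold as inclusive are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Thresholds taken from the tiers described above (days of inactivity).
UNUSED_AFTER_DAYS = 30
STALE_AFTER_DAYS = 60
DECOMMISSION_AFTER_DAYS = 90


@dataclass
class Dataset:
    """Hypothetical record for a dataset; not a DataHub type."""
    name: str
    last_used: date
    verified: bool = False  # assumed flag: fully documented with a clear owner


def deprecation_status(dataset: Dataset, today: date) -> str:
    """Return the lifecycle tier for a dataset based on days since last use."""
    # Verified datasets are protected from deprecation regardless of usage.
    if dataset.verified:
        return "protected"
    idle_days = (today - dataset.last_used).days
    # Check the longest threshold first so each dataset gets its highest tier.
    if idle_days >= DECOMMISSION_AFTER_DAYS:
        return "queued_for_decommissioning"
    if idle_days >= STALE_AFTER_DAYS:
        return "stale"
    if idle_days >= UNUSED_AFTER_DAYS:
        return "unused"
    return "active"


if __name__ == "__main__":
    today = date(2023, 11, 1)
    for idle in (7, 30, 60, 90):
        ds = Dataset(name=f"table_{idle}", last_used=today - timedelta(days=idle))
        print(ds.name, deprecation_status(ds, today))
```

In this sketch, un-deprecating a dataset would amount to recording fresh usage (resetting `last_used`) or marking it verified. How the real system handles that is not described in the story.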
Real-time PII masking
As soon as a column is tagged with the glossary term “PII”, DataHub sends an event to a Kafka topic. The system parses this event, propagates the masking to Snowflake, and notifies the user to confirm the masking, closing the loop.
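The parsing step in that flow might look roughly like the sketch below, which turns an already-decoded event into a Snowflake masking statement. It makes several assumptions: the event is a plain dictionary whose field names (`category`, `operation`, `modifier`, `entityUrn`) are modeled loosely on DataHub's entity change events, the glossary term URN is `urn:li:glossaryTerm:PII`, and `pii_mask` is a hypothetical masking policy that already exists in Snowflake. Kafka consumption, execution against Snowflake, and the user notification are left out.

```python
import re
from typing import Optional

# Hypothetical policy name; a real deployment would define its own masking policy.
MASKING_POLICY = "pii_mask"

# Matches a schemaField URN that wraps a Snowflake dataset URN (assumed URN layout).
FIELD_URN = re.compile(
    r"^urn:li:schemaField:\(urn:li:dataset:\(urn:li:dataPlatform:snowflake,"
    r"(?P<table>[^,]+),(?P<env>[^)]+)\),(?P<column>[^)]+)\)$"
)


def masking_statement(
    event: dict, pii_term_urn: str = "urn:li:glossaryTerm:PII"
) -> Optional[str]:
    """Return an ALTER statement if the event adds the PII term to a column."""
    # Ignore anything that is not "glossary term added".
    if event.get("category") != "GLOSSARY_TERM" or event.get("operation") != "ADD":
        return None
    # Ignore glossary terms other than the PII term.
    if event.get("modifier") != pii_term_urn:
        return None
    match = FIELD_URN.match(event.get("entityUrn", ""))
    if match is None:
        return None  # not a Snowflake column
    # Identifiers are used unquoted here; real code should validate and quote them.
    return (
        f"ALTER TABLE {match['table']} MODIFY COLUMN {match['column']} "
        f"SET MASKING POLICY {MASKING_POLICY}"
    )
```

Keeping the parsing in a pure function like this makes the event handling easy to test in isolation, separate from the Kafka and Snowflake connections.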
Every metadata change is logged in DataHub’s metastore and replicated to Snowflake, so teams can easily track who made changes, when they happened, and whether data masking was successfully applied.
Additional governance and operational efficiency capabilities include:
- Automatic domain inheritance for container objects to streamline metadata management across large databases and schemas
- Temporary PII access with auto-expiration, which data owners can grant to select users
- Real-time measurement and reporting of data quality rules
“The most important bit of why we use DataHub’s Actions Framework is to allow us to make real-time changes as soon as an event happens in DataHub.”
JOHN CLARO
Data Engineer II, Checkout.com
Impact
By automating routine workflows with the DataHub Actions Framework, Checkout.com unlocked meaningful gains in efficiency, compliance, and cost control.
Key outcomes included:
- Higher operational efficiency across 42+ teams by eliminating manual workflows with event-triggered actions
- Real-time data governance with automated PII masking and auto-expiration of temporary PII access
- Reduced data storage costs by automatically decommissioning unused datasets
- Built-in audit readiness with every metadata change logged in DataHub’s metastore and replicated to Snowflake