Skip to content
DataHub
Get a Demo
Product Overview

Product Overview

AI-powered discovery, governance, and observability unify across your data estate to deliver data quality, compliance, and AI readiness.

Learn more

Platform

  • Discovery
  • Observability
  • Governance
  • Lineage
  • AI
  • Context Management New

Explore

  • The ROI of DataHub Cloud
  • DataHub Cloud vs Core
  • Integrations
  • Product Demos
Join the Community

Join the Community

Get help, share ideas, and connect with the DataHub community on Slack.

Learn more

Engage

  • Join the Community
  • Town Halls
  • Docs
  • Champions

Connect

  • Slack
  • Youtube
  • Office Hours
Pinterest Powers its #1 AI Agent with DataHub Context

Pinterest Powers its #1 AI Agent with DataHub Context

Modern data discovery goes beyond search. Learn how DataHub connects your data estate end-to-end.

Learn more
Resources
  • Blog
  • Guides
  • Events
  • Customer Stories
  • Webinars

Support

  • Docs
  • Get Support
  • Live Group Demo
Context Management for Enterprise AI

Context Management for Enterprise AI

The complete resource hub for context management: foundational concepts, architecture guides, implementation patterns, and comparisons.

Learn More

Hubs

  • Context Management
  • Data Lineage Coming Soon
Careers

Careers

Data is powering AI. But without context, even the best models fall short. Join us.

Learn more

Company

  • About us
  • Careers
  • News
Partners
  • AWS
  • Google Cloud
  • Snowflake
  • Databricks
DataHub
  • Platform

    • Discovery
    • Observability
    • Governance
    • Lineage
    • AI
    • Context Management New

    Explore

    • The ROI of DataHub Cloud
    • DataHub Cloud vs Core
    • Integrations
    • Product Demos
    Product Overview

    Product Overview

    AI-powered discovery, governance, and observability unify across your data estate to deliver data quality, compliance, and AI readiness.

    Learn more
  • Engage

    • Join the Community
    • Town Halls
    • Docs
    • Champions

    Connect

    • Slack
    • Youtube
    • Office Hours
    Join the Community

    Join the Community

    Get help, share ideas, and connect with the DataHub community on Slack.

    Learn more
  • Resources
    • Blog
    • Guides
    • Events
    • Customer Stories
    • Webinars

    Support

    • Docs
    • Get Support
    • Live Group Demo
    Pinterest Powers its #1 AI Agent with DataHub Context

    Pinterest Powers its #1 AI Agent with DataHub Context

    Modern data discovery goes beyond search. Learn how DataHub connects your data estate end-to-end.

    Learn more
  • Hubs

    • Context Management
    • Data Lineage Coming Soon
    Context Management for Enterprise AI

    Context Management for Enterprise AI

    The complete resource hub for context management: foundational concepts, architecture guides, implementation patterns, and comparisons.

    Learn More
  • Company

    • About us
    • Careers
    • News
    Partners
    • AWS
    • Google Cloud
    • Snowflake
    • Databricks
    Careers

    Careers

    Data is powering AI. But without context, even the best models fall short. Join us.

    Learn more
Get a Demo

Understanding DataHub’s Ingestion Transformers: A Flexible Approach to Metadata Customization

By: DataHub

09.13.24

Contents

    How You Can Simplify Metadata Management with DataHub’s Ingestion Transformers

    Imagine this: Your company is rapidly scaling, and predictably, your data team is inundated with datasets from various sources — everything from customer analytics to financial reports. Each dataset needs proper categorization, ownership assignment, and tagging to ensure that the right teams can find and use the data efficiently.

    However, doing all this manually is becoming increasingly unsustainable as the volume of data grows. You’re struggling to maintain consistency, and important datasets are getting lost in the noise.

    This is the problem DataHub’s ingestion transformers are designed to solve. They offer a flexible way to customize and modify metadata without needing to alter the ingestion framework code.

    Read on to understand what DataHub’s ingestion transformers are, how they work, and practical ways to use them in your data pipeline to streamline and enhance your metadata management.

    What Are Ingestion Transformers in DataHub?

    Ingestion transformers allow you to modify metadata as it moves through the data ingestion pipeline in DataHub. Typically, data flows from sources to sinks, such as a DataHub instance. However, before metadata reaches the sink, adjustments might be needed — such as adding tags, ownership, setting custom properties, or defining specific paths. Ingestion transformers make these changes straightforward by allowing you to modify metadata without touching the underlying code.

    How Do DataHub Ingestion Transformers Work?

    Setting up an ingestion transformer in DataHub is simple. You define transformers in the recipe.yaml file, where you specify your data sources and sinks. By naming the transformer and configuring it based on your needs, you can easily integrate it into your ingestion pipeline.

    source: 
    ...

    sink:
    ..

    transformers:
    - type: "simple_add_dataset_tags"
    config:
    tag_urns:
    - "urn:li:tag:NeedsDocumentation"
    - "urn:li:tag:Legacy"

    Built-in Transformers

    DataHub comes with a variety of built-in transformers, enabling you to add glossary terms, tags, domains, ownerships, and more. They are designed to let you add glossary terms, tags, domains, ownership, properties, statuses, and browse paths using a simple add method or based on a pattern.

    GlossaryTerm

    • Simple Add Dataset glossaryTerms
    • Pattern Add Dataset glossaryTerms
    • Pattern Add Dataset Schema Field glossaryTerms

    Tag

    • Pattern Add Dataset Schema Field globalTags
    • Simple Add Dataset globalTags
    • Pattern Add Dataset globalTags
    • Add Dataset globalTags

    Domain

    • Simple Add Dataset domains
    • Pattern Add Dataset domains

    Ownership

    • Simple Add Dataset ownership
    • Pattern Add Dataset ownership
    • Simple Remove Dataset ownership

    Properties

    • Simple Add Dataset datasetProperties
    • Add Dataset datasetProperties

    ETC

    • Mark Dataset Status
    • Set Dataset browsePath

    If these built-in options aren’t enough, you can create custom transformers to suit your unique requirements, offering flexibility and control over your metadata transformations.

    Applications of DataHub Ingestion Transformers: Example Use Cases

    Here are some common scenarios that illustrate how and where you can leverage DataHub’s ingestion transformers effectively.

    1. To Customize Dataset Browse Paths

    Imagine you’re managing a recurring ingestion pipeline that pulls data from a MySQL database. To improve data discovery and understanding, you might want to organize datasets into directories like “Production DB” or “Marketing DB.”

    In this situation, theset_dataset_browse_path transformer allows you to customize the browse path, enabling better organization of your datasets.

    You can either use predefined variables like platform and dataset parts or define static paths for a more tailored organization.

    Suppose you have a table named sales.orders. You want to set the dataset browse path in a structured format, such as /PROD/mysql/sales/order, where PROD represents the environment, mysql is the platform, and sales/order reflects the dataset parts. The following code snippet demonstrates how to achieve this:

    transformers:
    - type: "set_dataset_browse_path"
    config:
    path_templates:
    - /ENV/PLATFORM/DATASET_PARTS

    In this example, the set_dataset_browse_path transformer allows you to define dynamic paths based on environment variables like ENV, PLATFORM, and DATASET_PARTS.

    Alternatively, you can set a static path, as shown below. This configuration sets the path to /mysql/marketing_db/sales/order, where marketing_db serves as a static part of the path:

     path_templates:
    - /PLATFORM/marketing_db/DATASET_PARTS

    This flexibility makes it easier to manage and discover data across different teams and departments.

    2. To Add Tags to Datasets

    Tagging datasets is crucial for data organization and discovery. One of the simplest methods is using the set_add_dataset_tags transformer, which lets you add tags to all dataset entities for a specific ingestion run. Configuration is straightforward — just define the tag URN in the config section, and you’re set.

    transformers: 
    - type: "simple_add_dataset_tags"
    config:
    tag_urns:
    - "<tag_urn_1>"
    - "<tag_urn_2>"

    For more complex tagging scenarios, the pattern_add_dataset_tags transformer comes in handy. This allows you to match a regex pattern to dataset URNs and assign tags accordingly.

    transformers:
    - type: "pattern_add_dataset_tags"
    config:
    tag_pattern:
    rules:
    ".*example1.*": ["urn:li:tag:NeedsDoc", "urn:li:tag:Legacy"]
    ".*example2.*": ["urn:li:tag:NeedsDoc"]

    For the example above, you could add a “NeedsDoc” tag to datasets with names containing “example1” or “example2,” and a “Legacy” tag to datasets containing “example1.” This pattern-based approach offers granular control over tagging, ensuring that your datasets are labeled consistently and accurately.

    This approach allows for more granular control over how tags are assigned to your datasets, based on specific patterns or naming conventions.

    3. To Create Customized Transformers

    Custom transformers offer a powerful way to meet your specific data management needs and handle more complex metadata logic. You have two options: calling a custom function within an existing transformer or building a transformer from scratch.

    Let’s say you need to implement more advanced logic for assigning tags — something beyond the simple or pattern-based methods. With the add_dataset_tags transformer, you can call your custom function under the config section. This allows you to define your own logic, such as tagging datasets that belong to a specific department and have been updated within the last 30 days.

    // recipe.yaml 
    transformers:
    - type: "add_dataset_tags"
    config:
    get_tags_to_add: "<your_module>.<your_function>"
    # custom fucntion
    def custom_tags(entity_urn: str) -> List[TagAssociationClass]:
    tag_strings = []

    # Add custom logic here
    tag_strings.append('custom1')
    tag_strings.append('custom2')

    tag_strings = [builder.make_tag_urn(tag=n) for n in tag_strings]
    tags = [TagAssociationClass(tag=tag) for tag in tag_strings]

    return tags

    These are just a few examples of how you can make the most of DataHub’s ingestion transformers. Whether you’re adding domains, owners, custom properties, marking dataset statuses, or creating more complex logic, DataHub’s transformers make it simple to adjust metadata and ensure that your data is well-organized, discoverable, and usable.

    Over to You

    DataHub ingestion transformers are designed to make metadata management powerful and customizable. Whether you’re new to DataHub or an experienced user, these ingestion transformers offer a flexible way to tailor your metadata without needing deep technical knowledge of the ingestion framework.

    For more help with DataHub ingestion or to explore additional options, visit the DataHub documentation site for detailed guidance and support.

    Curious to see DataHub in action?

    DataHub transforms enterprise metadata management with AI-powered discovery, intelligent observability, and automated governance.

    Meet with us

    See how DataHub Cloud can support enterprise needs and accelerate your journey toward context-rich, AI-ready data.

    Book a Demo DataHub Cloud

    Join our open source community

    Explore the project, contribute ideas, and connect with thousands of practitioners.

    Join the Slack community slack

    Recommended next reads

    View All Blogs
    Netflix Reimagines Discovery and Governance at Scale
    CUSTOMER STORY03.20.26

    Netflix Reimagines Discovery and Governance at Scale

    With DataHub, Netflix empowers teams to define and manage metadata through self-serve workflows, improving flexibility and governance.

    Introducing DataHub Cloud v0.3.17
    PRODUCT UPDATES03.24.26

    Introducing DataHub Cloud v0.3.17

    DataHub Cloud v0.3.17 brings native Microsoft Fabric connectors for cross-platform lineage, Ask DataHub Plugins for multi-tool context, and smarter data quality monitoring.

    The State of Context Management in 2026
    CONTEXT MANAGEMENT03.09.26

    The State of Context Management in 2026

    Survey data from 250 IT and data leaders exposes the gap between AI confidence and the context management infrastructure production-scale agentic AI demands.

    Product

    • Product Overview
    • Discovery
    • Observability
    • Governance
    • Lineage
    • AI Data Management
    • Context Management
    • The ROI of DataHub Cloud
    • Product Demos

    Community

    • Join the Community
    • Docs
    • Champions
    • Town Halls
    • Office Hours
    • Slack
    • Youtube

    Resources

    • Customer Stories
    • Blog
    • Guides
    • Articles
    • Webinars
    • Get Support

    Company

    • About Us
    • Leadership
    • News
    • Careers

    © 2026 Acryl Data, Inc.

    Privacy Policy Terms of Service Security