Making Data Relevant Again

By: Swaroop Jagadish

01.29.24

    Picture This

    You are a risk analyst with a financial services organization. You’re part of a team that creates products and services for small- and medium-sized business (SMB) customers.

    But … your team’s efforts are constrained by poor-quality data. You struggle to identify and integrate the data you need to analyze SMB market segmentation and to model product viability and financial risk. The data isn’t where it should be, isn’t what it should be, isn’t in the expected format, and is riddled with anomalies.

    This is a “fictional” scenario, but it’s also not-so-fictional, because many organizations can identify with it. Increasingly, decision-makers and stakeholders just don’t trust their data and analytics—usually because what they’re seeing is out-of-date, incomplete, inconsistent, and sometimes flat-out wrong.

    Relevant Data Is Difficult to Find


    There are four common causes of this. First, data is siloed across a myriad of cloud applications, services, and resources—Databricks, Snowflake, and BigQuery, for example, or cloud storage resources like S3 or Google Cloud Storage. There’s also data in the on-premises environment, which is often siloed in legacy systems. Thanks to all of this siloing, it’s difficult to ask simple questions like “Where is my customer data?”—or complex analytic questions like “Who are my customers?”—because the relevant data lives in so many different places and is structured according to different schemas.

    In the best-case scenario, you know where your data is, you know what it is, you know where it came from, you have some idea of what format it’s in, and you also know what’s wrong with it.

    Sadly, the reverse is usually the case. The best-case scenario almost never happens.

    Relevant Data Lacks Essential Context


    In fact, for many organizations, knowledge of all five of these things is too frequently a luxury.

    They can’t pinpoint, exactly, the applications or resources in which all of their relevant customer data is siloed. And even if they can, this data isn’t always understandable—because it lacks context.

    This is the second reason people don’t trust their data or analytics.

    What is it? Where did it come from? What was done to it? By whom? For what purpose? Is it first-party data, created by an internal producer, or is it external, generated by a third party? Is it in any way sensitive? For example, are the values recorded in the `ID` column `device_ID`s or `customer_ID`s? The former could be sensitive, the latter probably aren’t. Without context, you just don’t know.

    Too often, people who want to discover and work with data … just don’t know.

    Relevant Data Is Nondescript


    The third reason has to do with timeliness. When you’re dealing with siloed datasets, even when you know something about the history of a specific dataset, you don’t always know just how “historical” it is.

    In other words, how fresh is it? Is it the product of a routine batch process, or is it a one-off thing—i.e., created at a specific time for a specific purpose? If it’s the product of a batch process, when was it last updated? How frequently is it supposed to be updated? If it hasn’t been updated, why not? Timely data isn’t just critical for the accuracy and reliability of operational analytics and executive dashboards: a lack of timely data (because of data outages, or delays in preparing and integrating data) can also lead to missed opportunities, paralyzed decision-making, poor customer service, and other bad outcomes.
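The freshness questions above can be expressed as a simple check against a dataset's metadata. What follows is only a minimal sketch: `DatasetInfo`, `is_stale`, the field names, and the sample datasets are all hypothetical and are not taken from any DataHub or Acryl API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class DatasetInfo:
    """Hypothetical metadata record for one dataset."""
    name: str
    last_updated: datetime
    # Expected refresh interval; None marks a one-off dataset.
    expected_interval: Optional[timedelta]


def is_stale(info: DatasetInfo, now: datetime) -> bool:
    """Return True when a batch dataset has missed its expected refresh."""
    if info.expected_interval is None:
        # One-off datasets have no schedule, so they cannot be "late".
        return False
    return now - info.last_updated > info.expected_interval


# Illustrative values only.
now = datetime(2024, 1, 29, 12, 0, tzinfo=timezone.utc)
daily = DatasetInfo("smb_accounts", now - timedelta(hours=30), timedelta(hours=24))
one_off = DatasetInfo("q3_market_study", now - timedelta(days=90), None)

print(is_stale(daily, now))    # True: 30 hours old against a 24-hour schedule
print(is_stale(one_off, now))  # False: no schedule to miss
```

The point of the sketch is that "how fresh is it?" only becomes answerable once the last update time and the expected cadence are recorded somewhere alongside the data.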

    Do any of these sound familiar?

    The fourth reason has to do with semantics. Let’s say that you are dealing with high-quality datasets. Can you meaningfully compare attributes, entities, or definitions across them? For example, does `last_purchase_date` in Dataset A mean the same thing as `RecentPurchase` in Dataset B? With a little investigative effort, you can usually answer this question. But shouldn’t the answer be obvious?

    For data practitioners, it usually isn’t.
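One way to make the answer obvious is to record which shared business term each column stands for. The sketch below assumes, purely for illustration, that someone has already documented `last_purchase_date` and `RecentPurchase` as the same term; the `GLOSSARY` mapping, the term names, the `FirstPurchase` column, and `same_meaning` are all hypothetical.

```python
from typing import Dict, Optional, Tuple

Column = Tuple[str, str]  # (dataset, column)

# Hypothetical glossary: each documented column points at a shared business term.
GLOSSARY: Dict[Column, str] = {
    ("dataset_a", "last_purchase_date"): "most_recent_purchase_date",
    ("dataset_b", "RecentPurchase"): "most_recent_purchase_date",
    ("dataset_b", "FirstPurchase"): "first_purchase_date",
}


def same_meaning(a: Column, b: Column) -> Optional[bool]:
    """Compare two columns by glossary term; None means 'not documented'."""
    term_a, term_b = GLOSSARY.get(a), GLOSSARY.get(b)
    if term_a is None or term_b is None:
        return None
    return term_a == term_b


print(same_meaning(("dataset_a", "last_purchase_date"), ("dataset_b", "RecentPurchase")))
```

Returning `None` for undocumented columns is a deliberate choice in this sketch: it keeps "we don't know" distinct from "these differ", which is the situation the paragraph above describes.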

    Relevant Data and the Challenge of AI


    The issues I’ve raised have particular salience for organizations pivoting to take advantage of AI.


    Machine intelligence requires access to high-quality, context-rich data. Full stop. In undergraduate computer science classes, we still teach students the truism “Garbage in, garbage out” because we want to emphasize the importance of quality input data. But in ML, “Garbage in, garbage out” isn’t just a truism, it’s also an Iron Law. Train a model on poor-quality data, and you’ll get erroneous, biased, and/or totally unpredictable outputs. Feed a production model low-quality data and you’ll get inconsistent, imprecise, inaccurate results. In AI, everything depends on the quality of input data.

    But there’s one other critical factor to consider…

    Relevant Metadata for AI Transparency, Reproducibility, and Performance


    In the AI Era, having high-quality data and metadata is more important than ever.

    For one thing, companies need to collect and track the metadata created during LLM fine-tuning and prompt construction to demonstrate that their models are transparent, reproducible, ethical, and compliant with regulations.

    For another, having access to high-quality metadata is essential for improving the accuracy and precision of AI models—including LLMs. Metadata helps identify data biases and gaps—are certain demographic groups over-represented in the dataset? Is the data overwhelmingly derived from one source?—which is critical for training models that are accurate, precise, fair, and aligned with human values. Metadata also provides a baseline for creating new datasets for traditional ML model retraining, and helps accelerate data preprocessing and feature engineering. Finally, metadata allows for experiment tracking and version control, improving the reproducibility and reliability of ML models.
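The question "Is the data overwhelmingly derived from one source?" can be checked directly once each record carries source metadata. The following is a rough sketch under that assumption; the `overrepresented` helper, the 0.5 threshold, and the source names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def overrepresented(values: Iterable[str], threshold: float = 0.5) -> Dict[str, float]:
    """Return each value whose share of the records exceeds the threshold."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {v: n / total for v, n in counts.items() if n / total > threshold}


# Illustrative source labels, one per training record.
sources = ["vendor_x"] * 8 + ["internal_crm"] * 2
print(overrepresented(sources))  # {'vendor_x': 0.8}
```

The same function could be pointed at a demographic attribute instead of a source label; in either case the check depends on that metadata having been captured in the first place.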

    Lack of high-quality data is the main reason organizations struggle to create and operationalize LLM prototypes that perform accurately and reliably in production. And lack of quality metadata is one of the main reasons organizations struggle to maintain and improve the performance of production LLMs.

    When Relevant Data Isn’t Relevant, or: The Status Quo


    One last thing to keep in mind is that data work is very much a team sport, requiring not only coordination and collaboration but also an irreducible element of trust.

    Teammates must trust not only one another, but also the ground on which they’re playing.

    Like a soccer pitch, data is the ground truth that defines the boundaries of the game and provides an orienting context for all players. Its condition has a material impact on how the game is played.

    Or whether it can be played at all.

    The problem is that people too often don’t want to play … because they don’t trust their data, either because it’s of poor quality or because it just isn’t relevant for them. How could it be? In order to discover relevant data, consumers have to learn to use third-party tools that force them to switch outside of their preferred workflows.

    Technical and logistical barriers—using unfamiliar tools, learning unfamiliar languages, or jumping through process-based hoops to gain access to useful data—absolutely discourage business analysts and other consumers from tracking down relevant data that they could use to enrich their analyses.

    But the thing that people forget is that “relevance” is also a function of comfort, familiarity, and value.

    By this standard, we fail to make data relevant for consumers whenever we silo it in a system or fail to describe what it is, when it was created, where it came from, what it’s for, and why it’s important. We fail to make data relevant for consumers whenever we erect process-related hurdles or hoops to accessing and using it. Unfortunately, by these criteria, most data, in most organizations, is irrelevant.

    It’s irrelevant because it’s too much trouble for people to access it. The value just isn’t there for them.

    Acryl Cloud: Making Data Relevant Again


    This is why Acryl Data exists. In Acryl Cloud, we deliver a reliable, trustworthy metadata platform—built on top of the thriving open-source DataHub project—that data practitioners can use to easily search for and discover data, as well as quickly answer questions like:


    • What kind of data is this?
    • Where did it come from?
    • When was it created?
    • By whom or what?
    • For what purpose?
    • What other data does it relate to or depend on?
    • What other data can it be compared to?

    And we’re marrying these capabilities with best-in-class automated features—like metadata-driven governance, with the ability to continuously monitor and enforce policy-based governance standards.

    Intrigued? Transform your data operations with Acryl Cloud, the best-in-class metadata platform that makes your data relevant again: discoverable, accessible, usable, understandable—and governable.

    © 2026 Acryl Data, Inc.
