5 Features to Look Out for in a Modern Data Catalog

Gartner predicts that by 2025, 80% of organizations will fail to scale their digital businesses if they don’t take a modern approach to data and analytics governance.

It’s not hard to imagine why.

We just have to think about the large volumes of data organizations are producing, the growing number of data tools they’re using, and the disparate sets of users involved at every step of the data journey.

Data discoverability, data sharing, and data governance are as much a challenge as they are a business priority.

And that’s where a data catalog can help.

Benefits of a Data Catalog: How exactly does a data catalog help?

By organizing metadata (the technical details around data assets) into well-defined, searchable assets, data catalogs enable data discovery and data sharing, helping data users, at the very least,

  • Centrally access the organization’s data
  • Know what data is available and where
  • Find the data they need and any information about it
  • Evaluate the quality of that data
  • Know where data is coming from and where it’s going

This translates into improved data context and analysis, higher data efficiency and data quality, and a foundation for data governance and regulatory compliance — and, ultimately, improved business efficiency.

While this sounds great in principle, most typical data catalogs fall short because they

  • rely solely on manual enrichment of metadata
  • create silos of metadata and end up with stale metadata
  • fail to act on changing metadata
  • cater only to technical users

But the modern data catalog can — and should — do so much more.

Data Catalog Trends: 5 capabilities that forward-looking businesses need

Here’s our take on the five critical capabilities that make a data catalog the best solution to align platforms, processes, and people — so companies get the most out of their data.

1. Shift Left

Shift Left simply refers to the practice of declaring and emitting metadata right where the data is generated. A big part of effective metadata management is going beyond the manual enrichment of metadata — often the result of treating metadata as an afterthought.

By doing so, companies meet developers and teams where they are — instead of retrofitting new processes or workflows. Even better, it helps teams understand the downstream implications of any change.

Developers can enrich their data with ownership, PII status, domains, tags, and more in code — right where it is created. For instance, when developers annotate their schemas directly, those annotations live alongside the schemas — ensuring that technical schemas stay aligned with the business context. If you’re thinking, “isn’t this similar to Data Contracts?”, the answer is yes: Data Contracts are an example of the shift-left principle applied to the collection of schema, semantic, and quality metadata for a dataset.

The beauty of Shift Left is that it can be tailored to teams’ tools and development patterns. All you need is a data catalog that will surface all this metadata with all the associated context.
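As a sketch of what shift-left annotation can look like in practice, the snippet below declares ownership and PII metadata next to a schema using plain Python dataclass field metadata. The `catalog_meta` attribute, the annotation keys, and the team names are illustrative conventions invented for this example, not a specific catalog's API.

```python
from dataclasses import dataclass, field, fields

# "Shift left": metadata is declared in code, right next to the schema
# it describes, so schema and business context can never drift apart.
@dataclass
class UserEvent:
    user_id: str = field(metadata={"pii": True, "tags": ["identifier"]})
    event_type: str = field(metadata={"pii": False})
    payload: str = field(metadata={"pii": False})

# Table-level annotations live alongside the class definition too.
# (The attribute name and keys are illustrative conventions.)
UserEvent.catalog_meta = {"owner": "growth-team", "domain": "product-analytics"}

def extract_metadata(schema_cls):
    """Collect the in-code annotations so an ingestion job can emit
    them to the catalog together with the technical schema."""
    return {
        "table": schema_cls.catalog_meta,
        "columns": {f.name: dict(f.metadata) for f in fields(schema_cls)},
    }

meta = extract_metadata(UserEvent)
pii_columns = [name for name, m in meta["columns"].items() if m.get("pii")]
```

An ingestion job can then ship `meta` to the catalog on every deploy, so the catalog always reflects what the code declares.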

How DataHub surfaces metadata added at source in its UI


2. Granular Lineage and Impact Analysis

We spoke about how one of the key roles of a data catalog is to show where data comes from and where it goes. What users need is a way to trace lineage across multiple platforms, datasets, pipelines, charts, and dashboards — across production, transformation, and consumption.

Lineage in most catalogs shows users that a dependency exists, but what users really need to know is exactly how. This is where column-level lineage comes into play, enabling

  • proactive impact analysis
  • reactive data debugging

Column-level lineage has been one of DataHub’s most requested features, and for good reason.


From using it to better manage sensitive/PII data to simplifying essential operations like schema changes and refactoring, column-level lineage has become one of DataHub’s most-loved and most-used capabilities.
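To make impact analysis concrete, here is a minimal sketch of a column-level lineage walk over a hypothetical graph; the table and column names are invented, and a real catalog would serve these edges from its metadata graph rather than a hard-coded dict.

```python
from collections import deque

# Hypothetical column-level lineage: each (table, column) node maps to
# the downstream columns derived from it. All names are made up.
lineage = {
    ("raw.orders", "amount"): [("staging.orders", "amount_usd")],
    ("staging.orders", "amount_usd"): [
        ("marts.revenue", "daily_revenue"),
        ("dashboards.kpis", "revenue_chart"),
    ],
}

def impacted(node):
    """Proactive impact analysis: breadth-first walk over downstream
    column edges to find everything affected by a change to `node`."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for downstream in lineage.get(current, []):
            if downstream not in seen:
                seen.add(downstream)
                queue.append(downstream)
    return seen

# A planned change to raw.orders.amount surfaces every affected
# downstream column and chart before the change ships.
affected = impacted(("raw.orders", "amount"))
```

The same graph, walked in the reverse direction, powers reactive debugging: given a broken chart, trace back to the upstream column that changed.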

3. Active Metadata and Streaming

A data catalog’s active metadata approach ensures that all the metadata collected and indexed is

  • Live
  • Active, and
  • Injected into the operational plane

This means that changes happening across all your tools, such as Airflow, dbt, Snowflake, GitHub, etc., should reflect within the data catalog and be stored in a metadata graph.

A data catalog built with streaming capabilities enables you to act on metadata changes in real time.

This means that your data catalog should help you ensure data availability, completeness, and correctness — on an ongoing basis — and with additional capabilities like pipeline-breaking, reporting, verification, etc.

The following examples are just a few of the many scenarios that are possible once you have a real-time metadata platform:

  1. Pipeline observability and SLA tracking
  2. Circuit-breaking pipelines based on data quality
  3. Instant Slack notifications on any change in metadata
  4. Syncing tags across DataHub and data warehouses like Snowflake
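The scenarios above share one pattern: subscribe to metadata change events and react. The sketch below wires toy handlers to such events; the event shapes, handler registry, and pipeline names are all hypothetical stand-ins for a real platform's event stream.

```python
# Minimal event-handler registry for streamed metadata changes.
# Event shapes and names are illustrative, not a real platform's API.
handlers = []

def on_change(event_type):
    """Decorator registering a handler for one metadata event type."""
    def register(fn):
        handlers.append((event_type, fn))
        return fn
    return register

actions = []  # side effects recorded here for demonstration

@on_change("assertion_run")
def circuit_break(event):
    # Circuit-breaking: a failed data-quality assertion pauses the
    # downstream pipeline instead of letting bad data propagate.
    if event["status"] == "FAILED":
        actions.append(f"pause pipeline {event['pipeline']}")

@on_change("tag_added")
def sync_tag(event):
    # Mirror new tags into the warehouse (e.g., to drive masking policies).
    actions.append(f"sync tag {event['tag']} to warehouse")

def dispatch(event):
    """Route one metadata change event to every matching handler."""
    for event_type, fn in handlers:
        if event["type"] == event_type:
            fn(event)

dispatch({"type": "assertion_run", "status": "FAILED", "pipeline": "orders_etl"})
dispatch({"type": "tag_added", "tag": "pii"})
```

Slack notifications, SLA tracking, and reporting are just additional handlers registered against the same stream.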

4. Developer Friendliness (API-first philosophy)

Developer friendliness, also known as an API-first philosophy, is a highly sought-after feature in modern data catalogs. However, not all APIs are created equal. To truly support developer friendliness, data catalogs must provide APIs that are not only robust and well-documented, but also offer advanced features that go beyond basic CRUD operations.

One such feature is support for strong types, which allows for type checking and validation and reduces the likelihood of errors. Another important feature is support for change subscriptions, which lets developers subscribe to updates in the data catalog and receive notifications in real time. Additionally, support for analytics in the API surface area enables developers to access and analyze metadata in meaningful ways.

An SDK (Software Development Kit) that is easy to use and well-documented is also crucial for developer friendliness. SDKs allow developers to easily integrate the data catalog into their systems and workflows, and provide a streamlined experience for common tasks.

Finally, a delightful CLI (Command Line Interface) experience is also an important aspect of developer friendliness. A well-designed CLI provides a simple and intuitive way for developers to interact with the data catalog, allowing them to quickly and easily access and manipulate data without the need for a graphical user interface.

Overall, a data catalog that prioritizes developer friendliness by providing advanced and well-documented APIs, an SDK, and a delightful CLI experience will be highly valued among developers and enable greater innovation and automation in data management.
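To show what a strongly typed SDK surface can feel like, here is a hedged sketch in plain Python: invalid identifiers fail at construction time rather than as cryptic HTTP errors. The `DatasetUrn` and `TagProposal` classes and the simplified URN format are invented for illustration and deliberately simpler than any real catalog's identifiers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetUrn:
    """A typed identifier: malformed URNs are rejected up front."""
    platform: str
    name: str

    def __post_init__(self):
        if not self.platform or not self.name:
            raise ValueError("platform and name are required")

    def __str__(self):
        # Simplified, illustrative URN format.
        return f"urn:li:dataset:({self.platform},{self.name})"

@dataclass
class TagProposal:
    """A typed request object: field names are checked, not guessed."""
    target: DatasetUrn
    tag: str

def propose_tag(proposal: TagProposal) -> str:
    """Stand-in for an SDK call; a real client would send this to the
    catalog's REST or GraphQL endpoint."""
    return f"ADD {proposal.tag} -> {proposal.target}"

result = propose_tag(TagProposal(DatasetUrn("snowflake", "prod.orders"), "pii"))
```

The same typed objects can back both the SDK and the CLI, so a typo like a missing platform is caught before any network call is made.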

5. Business User Inclusion (Business-facing Views)

The biggest problem with traditional data catalogs is that they are built and designed to cater mostly to technical users.

Something that came up repeatedly at DataHub’s Metadata Day expert panel in 2022 was the pressing need to make business users active participants in the data ecosystem.

Even Gartner, in its top data trends for 2022, stresses the need for

  • shifting the focus from IT to business and
  • enabling “business users or business technologists to collaboratively craft business-driven data and analytics capabilities.”

This can only happen when a data catalog is designed for both business and technical users.

At DataHub, we refer to this as Metadata 360, which combines technical and logical metadata for a cohesive story based on a holistic view of your data stack. This means that business users can see not just what’s happening across systems but also understand how data is produced, transformed, stored, and used — along with the operational metrics that go with these stages.

We’ve built DataHub to ensure that business metadata (traditionally represented using glossary terms and taxonomies) connects to the physical metadata (tables and columns and operational metadata) emitted by the tools in your data stack. Every entity on DataHub can be seamlessly mapped to the business. Take, for example, our Business Glossary feature that ensures that data elements are logically classified in the business context so teams understand how different terms relate to one another.

When you view a glossary term on DataHub, you don’t just see the code associated with it but also a plain-language definition of the metric and even the users associated with that asset.

The “Return Rate” glossary term as it appears in DataHub

You can compile your company-specific vocabulary such as classifications (data sensitivity levels) or acronyms (such as KPIs), etc., so you have a single source of truth for your business language. More importantly, your glossary terms now live in the same location as your company’s data assets — allowing your users to associate terms with datasets, classify datasets, view business dependencies, etc.
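As a sketch of the idea, the snippet below models a tiny glossary whose terms link plain-language definitions and owners to physical columns, so lookups work in both directions; the terms, owners, and datasets are all illustrative.

```python
# A toy business glossary linked to physical assets. Each term carries
# a plain-language definition, owners, and the columns it classifies.
glossary = {
    "Return Rate": {
        "definition": "Share of delivered orders that are returned.",
        "owners": ["finance-team"],
        "assets": [("marts.orders", "return_flag")],
    },
    "PII.Email": {
        "definition": "Personally identifiable email address.",
        "owners": ["privacy-office"],
        "assets": [("raw.users", "email")],
    },
}

def assets_for_term(term):
    """Business-facing lookup: which physical columns carry this term?"""
    return glossary[term]["assets"]

def terms_for_asset(asset):
    """Technical-facing lookup: which business terms classify this column?"""
    return [t for t, info in glossary.items() if asset in info["assets"]]
```

A business user starts from "Return Rate" and lands on the column that computes it; an engineer starts from a column and immediately sees which business definitions depend on it.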

Wrapping up

We believe that organizations need a developer-friendly and truly dynamic data catalog to tackle the scale and diversity of the modern data stack.

We’re building DataHub to provide organizations with the most reliable and trusted enterprise data graph they need to maximize the value they derive from data.

Want to learn more about Data Governance with DataHub? Join our Slack community.
