Column-level Lineage Comes to DataHub

Column-level Lineage Comes to DataHub

Data Lineage

With DataHub, we’re committed to helping our users to discover, trust and act on the data in their organizations. Upstream and downstream lineage, i.e., understanding where a data product comes from and how it’s used, is critical to making this happen.

And that’s why we couldn’t be more excited to have made column-level lineage, one of DataHub’s most requested features, happen!

About lineage in DataHub

When we were building DataHub’s Lineage feature, we wanted to provide end-to-end visibility of the production, transformation, and consumption of an organization’s data — irrespective of the platforms it is being curated through. To this end, lineage in DataHub is designed to trace lineage across multiple platforms, datasets, pipelines, charts, and dashboards.

Once we launched Lineage, the next obvious step was to take things further to enable the visualization of end-to-end lineage for columns.

Why we built column-lineage

Column-level lineage is powerful in its potential to enable

  • proactive impact analysis and
  • reactive data debugging.

Here’s how. It not only lets you know if a dependency exists, but it also helps understand exactly how. This means that you can understand how a column is calculated so you can answer questions like:

  • Which root input columns are used to construct this column?
  • Does this column read from any sensitive data?
  • What approach was used to come up with this aggregation?

It also means you can understand how a column is being used so you can answer questions like:

  • Can I safely deprecate this field?
  • Which dashboards are visualizing this column?

Regulatory compliance demands were another reason that made this feature a priority. Several DataHub users deal with sensitive data and need to have complete visibility into the columns with PII and how they link to destination tables in downstream dashboards.

Column-level lineage helps them connect the dots between columns with PII and user-facing dashboards so they can take precautions to ensure the sensitivity of this data.

Building column-level lineage in DataHub

Visualizing lineage metadata is undoubtedly a challenge. Show too little, and it fails to serve its purpose. Show too much, and it can become clunky and hard to visualize — and use.

Our key focus while building column-level lineage was ensuring that it was clean and easy to understand. The way to do this was to allow users to view as much or as little as they need.

DataHub Controls that let you view just what you need

DataHub Controls that let you view just what you need

The column-level lineage experience in DataHub

Here’s what you get with column-level lineage in DataHub:

  • APIs for emitting column-level lineage
  • Automatic column lineage extraction from Snowflake and Looker
  • Column-level lineage visualization in the Lineage Explorer
  • Impact Analysis of a single column

Using column-level lineage in DataHub

1. Viewing column-level lineage

Toggle the Show Columns control to switch between table-level and column-level lineage — in one click — without switching tabs, or losing context.

DataHub Controls that let you view just what you need

DataHub Controls that let you view just what you need

2. Impact Analysis at the column level

Just click on a table’s schema and select the column whose impact you want to analyze. Right-click the menu as shown below, to see its lineage.

Datahub Still

The Lineage Explorer shows you exactly what you need to know.
For instance, you can see the assets that directly consume the ‘email’ column by setting the degree of separation to 1, 2, or 3+ using the filter shown below.

Datahub Still Frame

To see further up/down, all you need to do is set your filters to higher/farther degrees of dependencies.

Datahub Still

DataHub can also show you multiple paths corresponding to the different situations in which a column connects to another.

Datahub Still

What’s next for column-level lineage?

  • Viewing transformation logic used to derive a column
  • Automatic column lineage extraction for other SQL sources starting with BigQuery and Redshift (Q4 2022)
  • Support for Spark, Tableau (Q1 2023)

Want to see column-level lineage in action?

Watch Chris Collins’s walkthrough from our September Town Hall here!

Recommended Next Reads