Extracting Column-Level Lineage from SQL

Some column-level lineage with dbt and Postgres.

Data people really care about data lineage, particularly from SQL.

We looked at a bunch of open-source automated SQL lineage tools and found that many shared the same fundamental problem: they were unaware of the underlying table schemas, and hence couldn’t generate accurate column-level lineage.

A metadata platform and data catalog like DataHub already has APIs for retrieving the schema for any tables in your data stack. So, we built a SQL lineage parser that’s schema-aware and can take advantage of DataHub’s APIs to generate accurate column-level lineage from SQL queries across a wide array of dialects.

In our tests, it works significantly better than other open-source, Python-based lineage tools.¹

What is data lineage?

Data lineage captures data dependencies between database tables, orchestrators, streaming tools, BI tools, and more. It enables you to trace the flow of data through your data ecosystem, and helps you answer questions like “Is this report built on good data?”, “What will break if I change this table?”, or “Am I managing PII correctly?”.

A big piece of the puzzle is SQL queries, which implicitly capture and document the lineage of the data they query and manipulate. Whenever you write a query like this:

CREATE VIEW my_view AS
SELECT
  a.id,
  a.name,
  b.email
FROM table_a a
JOIN table_b b
  ON a.id = b.id

You’ve implicitly created lineage from table_a and table_b into my_view:

  • table_a.id → my_view.id
  • table_a.name → my_view.name
  • table_b.email → my_view.email

That may not look too bad, but when you’ve got thousands of queries and each is hundreds of lines long, manually figuring out all of that lineage is a nightmare.

As with all things in software, the solution is… more software.²

The dream of automated lineage

A lot of folks have gone through the same thought process, come to the same conclusion that data lineage is important, and tried to build a tool that automatically extracts lineage from SQL queries.

It’s a hard problem. SQL is a remarkably complex language, and that complexity is compounded by the fact that every database has its own dialect. Even the problem of parsing SQL into an Abstract Syntax Tree (AST) is non-trivial, and that’s before you even start thinking about lineage.

Despite the challenges, we’ve made some good progress: existing tools are quite good at extracting table-level lineage from SQL. Unfortunately, most open-source lineage tools fall flat when it comes to extracting column-level lineage.

There’s a simple reason for this: they’re naive to the underlying database schema.³

Schema-aware parsing

Let’s revisit that query from earlier, but with a small twist:

CREATE VIEW my_view AS
SELECT
  table_a.id,
  name,
  email
FROM table_a
JOIN table_b
  ON table_a.id = table_b.id

In this query, we don’t explicitly specify that name comes from table_a and email comes from table_b. Given that the query is functionally equivalent, the lineage should also be the same. However, without knowledge of the schemas of table_a and table_b, it’s impossible to generate the correct lineage: a schema-naive parser can’t possibly know where the columns actually came from.

The solution is to make the parser schema-aware. That is, it needs to know the underlying schemas of the tables involved in the query.
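
Concretely, the extra information the parser needs is just a mapping from each table to its columns. For the query above, a mapping like the one sketched below (a hypothetical illustration, not the parser’s internal representation) is enough to disambiguate name and email:

# Hypothetical schema mapping: table name -> {column name -> type}.
schemas = {
    "table_a": {"id": "INT", "name": "STRING"},
    "table_b": {"id": "INT", "email": "STRING"},
}

# With this in hand, a resolver can see that "name" only exists in table_a
# and "email" only exists in table_b, so the lineage is unambiguous.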

The immediate question, then, is “where do we get the schemas from?”⁴ If you’re already using a data catalog like DataHub, you’re in luck: it already has the schemas of every table in your data ecosystem, and provides a fast and scalable API for retrieving them.

Finally, we need to put it all together. Here’s a high-level overview of the process:

A flowchart showing the column-level lineage generation process.

Let’s walk through the main steps.

SQL Parsing

For SQL parsing, we use a fork of SQLGlot. We surveyed a lot of SQL parsers and found that SQLGlot was best suited for our needs. It’s pure Python, supports 20 different SQL dialects, and has nice APIs for traversing the AST.
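
To give a feel for what that looks like, here’s a minimal sketch that parses a query into an AST and lists the tables it references, using the upstream sqlglot package rather than our fork:

import sqlglot
from sqlglot import exp

# Parse a query into an AST, using the BigQuery dialect as an example.
expression = sqlglot.parse_one(
    "SELECT a.id, a.name, b.email FROM table_a a JOIN table_b b ON a.id = b.id",
    read="bigquery",
)

# Walk the AST and collect every table referenced by the query.
for table in expression.find_all(exp.Table):
    print(table.name)  # table_a, table_b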

Table qualification

This step is all about turning SELECT column FROM my_table into SELECT column FROM my_db.my_schema.my_table. This requires some information about the context in which the query was executed, which is why we take the default DB and schema as arguments. It also partially depends on the SQL dialect, because some use a two-tier hierarchy (db.table) whereas others use a three-tier one (db.schema.table).
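
As a rough sketch, SQLGlot’s qualify_tables optimizer rule does the core of this step. Here we assume the query ran with my_db and my_schema as the default database and schema:

import sqlglot
from sqlglot.optimizer.qualify_tables import qualify_tables

expression = sqlglot.parse_one("SELECT col FROM my_table", read="snowflake")

# Fill in the default catalog and database for tables that aren't fully qualified.
qualified = qualify_tables(expression, catalog="my_db", db="my_schema")
print(qualified.sql(dialect="snowflake"))
# Roughly: SELECT col FROM my_db.my_schema.my_table AS my_table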

Fetch table schemas

Once we’ve got our candidate table names, we fetch the underlying schemas for those tables from the DataHub backend. Depending on the expected usage of the parser, we either prefetch all relevant schemas or fetch them on demand. There’s also some caching and memory offloading to ensure reasonable performance without blowing up memory usage.
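
The exact fetching logic depends on how the parser is being used, but the on-demand path looks roughly like the sketch below; fetch_schema_from_datahub is a hypothetical stand-in for the DataHub API call, and the cache size is illustrative:

from functools import lru_cache
from typing import Dict, Optional

def fetch_schema_from_datahub(table_urn: str) -> Optional[Dict[str, str]]:
    # Hypothetical stand-in for a call to the DataHub backend; the real parser
    # goes through DataHub's schema APIs and can also prefetch in bulk.
    ...

@lru_cache(maxsize=1024)  # illustrative cache size; keeps repeated lookups cheap
def get_table_schema(table_urn: str) -> Optional[Dict[str, str]]:
    # Returns a mapping of column name -> type, or None if the table is unknown.
    return fetch_schema_from_datahub(table_urn)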

Column qualification

This step is similar to table qualification, but for columns. It turns SELECT column FROM my_table into SELECT my_table.column FROM my_db.my_schema.my_table. It turns out to be significantly more complicated than that, because of CTEs, subqueries, and other advanced SQL constructs. For example, consider this query:

CREATE VIEW `my-proj-2`.dataset.my_view AS
WITH cte1 AS (
  SELECT *
  FROM dataset.table1
  WHERE col1 = 'value1'
), cte2 AS (
  SELECT col3, col4 AS join_key
  FROM dataset.table2
  WHERE col3 = 'value2'
)
SELECT col5, cte1.*, col3
FROM dataset.table3
JOIN cte1 ON table3.col5 = cte1.col2
JOIN cte2 USING (join_key)

In this case, we need to expand the * into the full list of columns from dataset.table1, then figure out if col3 is from dataset.table2 or dataset.table3. Next, we need to compute the “schema” for cte1 and cte2 so that we can expand cte1.* and resolve col5 and col3 to their respective tables. We also need to know that JOIN ... USING is syntactic sugar for a JOIN ... ON. Throw some nested CTEs or a correlated subquery into the mix, and the bookkeeping gets complex quickly. The SQLGlot library has a number of utilities that make this easier.
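
Most of the heavy lifting for this step builds on SQLGlot’s column qualification rule. Here’s a simplified sketch, assuming we’ve already fetched the relevant schemas into a plain dict; the exact output may vary a bit by SQLGlot version:

import sqlglot
from sqlglot.optimizer.qualify_columns import qualify_columns

# Schemas fetched from DataHub (simplified; types are illustrative).
schema = {
    "table_a": {"id": "INT", "name": "STRING"},
    "table_b": {"id": "INT", "email": "STRING"},
}

expression = sqlglot.parse_one(
    "SELECT table_a.id, name, email FROM table_a JOIN table_b ON table_a.id = table_b.id",
    read="bigquery",
)

# Attach each column to the table it actually comes from, using the schemas.
qualified = qualify_columns(expression, schema=schema)
print(qualified.sql(dialect="bigquery"))
# Roughly: SELECT table_a.id AS id, table_a.name AS name, table_b.email AS email FROM ...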

Column lineage generation

Once we’ve got a column-qualified AST, we can generate column-level lineage. We decided to focus mainly on the lineage of the underlying data, and not on other clauses that might impact the data produced.

SELECT
  item_id,
  CASE
    WHEN price > 1000 THEN 'high'
    WHEN price > 100 THEN 'medium'
    ELSE 'low'
  END AS price_category
FROM items
WHERE in_stock = true

In this query, it’s pretty clear that item_id in the output has lineage from items.item_id. However, it does not have lineage to in_stock because that column doesn’t actually show up in the output. We made an exception for things like case statements – the actual price data doesn’t show up in the output either but is used critically in the computation and hence should be considered part of the lineage. This does mean that queries like SELECT COUNT(*) FROM table don’t have any lineage, which we think is the right behavior.
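
Our lineage generation runs over the column-qualified AST described above. As a rough illustration of the idea, SQLGlot also ships a lineage helper that traces a single output column back to its source columns given a schema; our parser has its own logic on top, but the shape of the problem is the same:

from sqlglot.lineage import lineage

schema = {"items": {"item_id": "INT", "price": "FLOAT", "in_stock": "BOOLEAN"}}

sql = """
SELECT
  item_id,
  CASE
    WHEN price > 1000 THEN 'high'
    WHEN price > 100 THEN 'medium'
    ELSE 'low'
  END AS price_category
FROM items
WHERE in_stock = true
"""

# Trace price_category back through the query; walking the lineage graph ends
# at the upstream table column (items.price), while in_stock is not included.
node = lineage("price_category", sql, schema=schema, dialect="bigquery")
for n in node.walk():
    print(n.name)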

Additionally, we generate lineage not just for SELECT statements, but also for INSERT, UPDATE, MERGE, and CREATE TABLE ... AS SELECT (CTAS) statements.

We also had a number of special cases to handle things like table and column name casing, quoted vs unquoted identifiers, BigQuery sharded tables, column aliases, union statements, and structs. It isn’t perfect, and there are likely more edge cases we haven’t discovered yet, but it’s a pretty good start.

Performance

How does this stack up against other open-source Python-based SQL lineage tools? For our tests, we compared against the SQL parsers used by other open-source metadata platforms: sqllineage (used by OpenMetadata) and openlineage-sql⁵ (used by OpenLineage).

Our comparison focused on lineage coverage — the percent of queries for which some table/column lineage was generated. Correctness was verified using a separate corpus of unit tests. Speed and memory usage were reasonable for all of the parsers tested and hence were not explicitly measured.

We tested on a corpus of ~7000 BigQuery SELECT statements and ~2000 CREATE TABLE ... AS SELECT (CTAS) statements.⁶

With lineage, the accuracy needs to be quite high in order for it to be useful. You can’t confidently answer questions like “What will I break by tweaking this column?” if you’re missing even a small fraction of the lineage.

We’re pretty happy with the results, although there’s always room for improvement. In particular, we don’t handle things like json_extract, struct fields, or UNNEST-based joins.

Using the DataHub SQL parser

If you’re a user of DataHub 0.12.0+, you’re already covered: we use the new SQL lineage parser to generate column-level lineage for BigQuery, Snowflake, Redshift, dbt, Looker, Tableau, PowerBI, Airflow, and a few other sources.

If you’re using a different database system for which we don’t support column-level lineage out of the box, but you do have a database query log available, we have a SQL queries connector that generates column-level lineage and detailed table usage statistics from the query log.

If none of these suit your needs, you can use the new DataHubGraph.parse_sql_lineage() method in our SDK.
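
A minimal sketch of what that looks like is below. The connection details are placeholders, and the exact keyword arguments and result fields may differ by version, so treat this as an outline and check the SDK docs:

from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

# Connect to your DataHub instance (server URL and token are placeholders).
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080", token=None))

# Parse a query and get table- and column-level lineage back. The default_db
# and default_schema arguments reflect how unqualified table names should be
# resolved; the exact parameter names here are assumptions.
result = graph.parse_sql_lineage(
    "CREATE VIEW my_view AS SELECT a.id, a.name, b.email "
    "FROM table_a a JOIN table_b b ON a.id = b.id",
    platform="snowflake",
    default_db="my_db",
    default_schema="my_schema",
)
print(result.in_tables, result.out_tables)
print(result.column_lineage)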

In keeping with our open-source ethos, the DataHub SQL lineage parser is also fully open-source. And it goes both ways — we expect to come across more SQL dialects and edge cases, and hope our community can continue helping us identify and resolve those issues.

Conclusion

The new SQL lineage parser improves our ability to extract column-level lineage across the data stack, which helps us understand and operate on the complexities of the data ecosystems we work with.

If you’re interested in learning more about how SQL parsing and lineage fits into the broader world of data catalogs, join the DataHub Slack community.

[1] In our parlance, pure SQL tokenizers or parsers like python-sqlparse or sqlparser-rs are not lineage tools. However, if a tool has facilities for extracting lineage from queries, then we do consider it a lineage tool.

[2] Naturally, you build more software with more developers.

[3] Not all open-source lineage tools are schema-naive. On the Java side, Apache Calcite accepts metadata information to augment its query plan analysis, but it’s not trivial to use and is missing critical SQL dialects like Snowflake. ZetaSQL has similar capabilities, although actual lineage implementations using it have many limitations. Uber’s queryparser project also accepts catalog metadata information, but it’s written in Haskell and only supports a few SQL dialects. We were unable to find one that both works across many dialects and is easy to use for data practitioners.

[4] I suspect this is the reason why most existing tools try to avoid schema-awareness. One exception is sqllineage, which tries to return results that reflect the ambiguity that arises during the schema-naive parse. Unfortunately, the result feels cumbersome to use and doesn’t work super well, but it’s an interesting approach nonetheless. Most other tools take the easy route of only supporting table-level lineage.

[5] The openlineage-sql package is technically written in Rust, but it exposes Python bindings.

[6] Note that this isn’t a fully fair comparison, since the DataHub one had access to the underlying schemas whereas the other parsers don’t accept that information.
