Extracting Column-Level Lineage from SQL

Some column-level lineage with dbt and Postgres.

Data people really care about data lineage, particularly from SQL.

We looked at a bunch of open-source automated SQL lineage tools and found that many shared the same fundamental problem: they were unaware of the underlying table schemas, and hence couldn’t generate accurate column-level lineage.

A metadata platform and data catalog like DataHub already has APIs for retrieving the schema for any tables in your data stack. So, we built a SQL lineage parser that’s schema-aware and can take advantage of DataHub’s APIs to generate accurate column-level lineage from SQL queries across a wide array of dialects.

In our tests, it works significantly better than other open-source, Python-based lineage tools.¹

What is data lineage?

Data lineage captures data dependencies between database tables, orchestrators, streaming tools, BI tools, and more. It enables you to trace the flow of data through your data ecosystem, and helps you answer questions like “Is this report built on good data?”, “What will break if I change this table?”, or “Am I managing PII correctly?”.

A big piece of the puzzle is SQL queries, which implicitly capture and document the lineage of the data they query and manipulate. Whenever you write a query like this:

CREATE VIEW my_view AS
SELECT
  a.id,
  a.name,
  b.email
FROM table_a a
JOIN table_b b
  ON a.id = b.id

You’ve implicitly created lineage from table_a and table_b into my_view:

  • table_a.id → my_view.id
  • table_a.name → my_view.name
  • table_b.email → my_view.email

That may not look too bad, but when you’ve got thousands of queries and each is hundreds of lines long, manually figuring out all of that lineage is a nightmare.

As with all things in software, the solution is… more software.²

The dream of automated lineage

A lot of folks have gone through the same thought process, come to the same conclusion that data lineage is important, and tried to build a tool that automatically extracts lineage from SQL queries.

It’s a hard problem. SQL is a remarkably complex language, and that complexity is compounded by the fact that every database has its own dialect. Even the problem of parsing SQL into an Abstract Syntax Tree (AST) is non-trivial, and that’s before you even start thinking about lineage.

Despite the challenges, we’ve made some good progress: existing tools are quite good at extracting table-level lineage from SQL. Unfortunately, most open-source lineage tools fall flat when it comes to extracting column-level lineage.

There’s a simple reason for this: they’re naive to the underlying database schema.³

Schema-aware parsing

Let’s revisit that query from earlier, but with a small twist:

CREATE VIEW my_view AS
SELECT
  table_a.id,
  name,
  email
FROM table_a
JOIN table_b
  ON table_a.id = table_b.id

In this query, we don’t explicitly specify that name comes from table_a and email comes from table_b. Given that the query is functionally equivalent, the lineage should also be the same. However, without knowledge of the schemas of table_a and table_b, it’s impossible to generate the correct lineage: a schema-naive parser can’t possibly know where the columns actually came from.

The solution is to make the parser schema-aware. That is, it needs to know the underlying schemas of the tables involved in the query.
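
Concretely, the extra information the parser needs is just a mapping from each table to its columns. For the query above, a mapping like the one sketched below (a hypothetical illustration, not the parser’s internal representation) is enough to disambiguate name and email:

# Hypothetical schema mapping: table name -> {column name -> type}.
schemas = {
    "table_a": {"id": "INT", "name": "STRING"},
    "table_b": {"id": "INT", "email": "STRING"},
}

# With this in hand, a resolver can see that "name" only exists in table_a
# and "email" only exists in table_b, so the lineage is unambiguous.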

The immediate question, then, is “where do we get the schemas from?”⁴ If you’re already using a data catalog like DataHub, you’re in luck: it already has the schemas of every table in your data ecosystem, and provides a fast and scalable API for retrieving them.

Finally, we need to put it all together. Here’s a high-level overview of the process:

A flowchart showing the column-level lineage generation process.

Let’s walk through the main steps.

SQL Parsing

For SQL parsing, we use a fork of SQLGlot. We surveyed a lot of SQL parsers and found that SQLGlot was best suited for our needs. It’s pure Python, supports 20 different SQL dialects, and has nice APIs for traversing the AST.
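
To give a feel for what that looks like, here’s a minimal sketch that parses a query into an AST and lists the tables it references, using the upstream sqlglot package rather than our fork:

import sqlglot
from sqlglot import exp

# Parse a query into an AST, using the BigQuery dialect as an example.
expression = sqlglot.parse_one(
    "SELECT a.id, a.name, b.email FROM table_a a JOIN table_b b ON a.id = b.id",
    read="bigquery",
)

# Walk the AST and collect every table referenced by the query.
for table in expression.find_all(exp.Table):
    print(table.name)  # table_a, table_b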

Table qualification

This step is all about turning SELECT column FROM my_table into SELECT column FROM my_db.my_schema.my_table. This requires some information about the context in which the query was executed, which is why we take the default DB and schema as arguments. It also partially depends on the SQL dialect, because some use a two-tier hierarchy (db.table) whereas others use a three-tier one (db.schema.table).
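
As a rough sketch, SQLGlot’s qualify_tables optimizer rule does the core of this step. Here we assume the query ran with my_db and my_schema as the default database and schema:

import sqlglot
from sqlglot.optimizer.qualify_tables import qualify_tables

expression = sqlglot.parse_one("SELECT col FROM my_table", read="snowflake")

# Fill in the default catalog and database for tables that aren't fully qualified.
qualified = qualify_tables(expression, catalog="my_db", db="my_schema")
print(qualified.sql(dialect="snowflake"))
# Roughly: SELECT col FROM my_db.my_schema.my_table AS my_table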

Fetch table schemas

Once we’ve got our candidate table names, we fetch the underlying schemas for those tables from the DataHub backend. Depending on the expected usage of the parser, we either prefetch all relevant schemas or fetch them on demand. There’s also some caching and memory offloading to ensure reasonable performance without blowing up memory usage.
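
The exact fetching logic depends on how the parser is being used, but the on-demand path looks roughly like the sketch below; fetch_schema_from_datahub is a hypothetical stand-in for the DataHub API call, and the cache size is illustrative:

from functools import lru_cache
from typing import Dict, Optional

def fetch_schema_from_datahub(table_urn: str) -> Optional[Dict[str, str]]:
    # Hypothetical stand-in for a call to the DataHub backend; the real parser
    # goes through DataHub's schema APIs and can also prefetch in bulk.
    ...

@lru_cache(maxsize=1024)  # illustrative cache size; keeps repeated lookups cheap
def get_table_schema(table_urn: str) -> Optional[Dict[str, str]]:
    # Returns a mapping of column name -> type, or None if the table is unknown.
    return fetch_schema_from_datahub(table_urn)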

Column qualification

This step is similar to table qualification, but for columns. It turns SELECT column FROM my_table into SELECT my_table.column FROM my_db.my_schema.my_table. It turns out to be significantly more complicated than that, because of CTEs, subqueries, and other advanced SQL constructs. For example, consider this query:

CREATE VIEW `my-proj-2`.dataset.my_view AS
WITH cte1 AS (
  SELECT *
  FROM dataset.table1
  WHERE col1 = 'value1'
), cte2 AS (
  SELECT col3, col4 AS join_key
  FROM dataset.table2
  WHERE col3 = 'value2'
)
SELECT col5, cte1.*, col3
FROM dataset.table3
JOIN cte1 ON table3.col5 = cte1.col2
JOIN cte2 USING (join_key)

In this case, we need to expand the * into the full list of columns from dataset.table1, then figure out if col3 is from dataset.table2 or dataset.table3. Next, we need to compute the “schema” for cte1 and cte2 so that we can expand cte1.* and resolve col5 and col3 to their respective tables. We also need to know that JOIN ... USING is syntactic sugar for a JOIN ... ON. Throw some nested CTEs or a correlated subquery into the mix, and the bookkeeping gets complex quickly. The SQLGlot library has a number of utilities that make this easier.
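
Most of the heavy lifting for this step builds on SQLGlot’s column qualification rule. Here’s a simplified sketch, assuming we’ve already fetched the relevant schemas into a plain dict; the exact output may vary a bit by SQLGlot version:

import sqlglot
from sqlglot.optimizer.qualify_columns import qualify_columns

# Schemas fetched from DataHub (simplified; types are illustrative).
schema = {
    "table_a": {"id": "INT", "name": "STRING"},
    "table_b": {"id": "INT", "email": "STRING"},
}

expression = sqlglot.parse_one(
    "SELECT table_a.id, name, email FROM table_a JOIN table_b ON table_a.id = table_b.id",
    read="bigquery",
)

# Attach each column to the table it actually comes from, using the schemas.
qualified = qualify_columns(expression, schema=schema)
print(qualified.sql(dialect="bigquery"))
# Roughly: SELECT table_a.id AS id, table_a.name AS name, table_b.email AS email FROM ...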

Column lineage generation

Once we’ve got a column-qualified AST, we can generate column-level lineage. We decided to focus mainly on the lineage of the underlying data, and not on other clauses that might impact the data produced.

SELECT
  item_id,
  CASE
    WHEN price > 1000 THEN 'high'
    WHEN price > 100 THEN 'medium'
    ELSE 'low'
  END AS price_category
FROM items
WHERE in_stock = true

In this query, it’s pretty clear that item_id in the output has lineage from items.item_id. However, it does not have lineage to in_stock because that column doesn’t actually show up in the output. We made an exception for things like case statements – the actual price data doesn’t show up in the output either but is used critically in the computation and hence should be considered part of the lineage. This does mean that queries like SELECT COUNT(*) FROM table don’t have any lineage, which we think is the right behavior.
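
Our lineage generation runs over the column-qualified AST described above. As a rough illustration of the idea, SQLGlot also ships a lineage helper that traces a single output column back to its source columns given a schema; our parser has its own logic on top, but the shape of the problem is the same:

from sqlglot.lineage import lineage

schema = {"items": {"item_id": "INT", "price": "FLOAT", "in_stock": "BOOLEAN"}}

sql = """
SELECT
  item_id,
  CASE
    WHEN price > 1000 THEN 'high'
    WHEN price > 100 THEN 'medium'
    ELSE 'low'
  END AS price_category
FROM items
WHERE in_stock = true
"""

# Trace price_category back through the query; walking the lineage graph ends
# at the upstream table column (items.price), while in_stock is not included.
node = lineage("price_category", sql, schema=schema, dialect="bigquery")
for n in node.walk():
    print(n.name)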

Additionally, we generate lineage not just for SELECT statements, but also for INSERT, UPDATE, MERGE, and CREATE TABLE ... AS SELECT (CTAS) statements.

We also had a number of special cases to handle things like table and column name casing, quoted vs unquoted identifiers, BigQuery sharded tables, column aliases, union statements, and structs. It isn’t perfect, and there are likely more edge cases we haven’t discovered yet, but it’s a pretty good start.

Performance

How does this stack up against other open-source Python-based SQL lineage tools? For our tests, we compared against the SQL parsers used by other open-source metadata platforms: sqllineage (used by OpenMetadata) and openlineage-sql⁵ (used by OpenLineage).

Our comparison focused on lineage coverage — the percent of queries for which some table/column lineage was generated. Correctness was verified using a separate corpus of unit tests. Speed and memory usage were reasonable for all of the parsers tested and hence were not explicitly measured.

We tested on a corpus of ~7000 BigQuery SELECT statements and ~2000 CREATE TABLE ... AS SELECT (CTAS) statements.⁶

With lineage, the accuracy needs to be quite high in order for it to be useful. You can’t confidently answer questions like “What will I break by tweaking this column?” if you’re missing even a small fraction of the lineage.

We’re pretty happy with the results, although there’s always room for improvement. In particular, we don’t handle things like json_extract, struct fields, or UNNEST-based joins.

Using the DataHub SQL parser

If you’re a user of DataHub 0.12.0+, you’re already covered: we use the new SQL lineage parser to generate column-level lineage for BigQuery, Snowflake, Redshift, dbt, Looker, Tableau, PowerBI, Airflow, and a few other sources.

If you’re using a different database system for which we don’t support column-level lineage out of the box, but you do have a database query log available, we have a SQL queries connector that generates column-level lineage and detailed table usage statistics from the query log.

If none of these suit your needs, you can use the new DataHubGraph.parse_sql_lineage() method in our SDK.
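
A minimal sketch of what that looks like is below. The connection details are placeholders, and the exact keyword arguments and result fields may differ by version, so treat this as an outline and check the SDK docs:

from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

# Connect to your DataHub instance (server URL and token are placeholders).
graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080", token=None))

# Parse a query and get table- and column-level lineage back. The default_db
# and default_schema arguments reflect how unqualified table names should be
# resolved; the exact parameter names here are assumptions.
result = graph.parse_sql_lineage(
    "CREATE VIEW my_view AS SELECT a.id, a.name, b.email "
    "FROM table_a a JOIN table_b b ON a.id = b.id",
    platform="snowflake",
    default_db="my_db",
    default_schema="my_schema",
)
print(result.in_tables, result.out_tables)
print(result.column_lineage)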

In keeping with our open-source ethos, the DataHub SQL lineage parser is also fully open-source. And it goes both ways — we expect to come across more SQL dialects and edge cases, and hope our community can continue helping us identify and resolve those issues.

Conclusion

The new SQL lineage parser improves our ability to extract column-level lineage across the data stack, which helps us understand and operate on the complexities of the data ecosystems we work with.

If you’re interested in learning more about how SQL parsing and lineage fits into the broader world of data catalogs, join the DataHub Slack community.

[1] In our parlance, pure SQL tokenizers or parsers like python-sqlparse or sqlparser-rs are not lineage tools. However, if a tool has facilities for extracting lineage from queries, then we do consider it a lineage tool.

[2] Naturally, you build more software with more developers.

[3] Not all open-source lineage tools are schema-naive. On the Java side, Apache Calcite accepts metadata information to augment its query plan analysis, but it’s not trivial to use and is missing critical SQL dialects like Snowflake. ZetaSQL has similar capabilities, although actual lineage implementations using it have many limitations. Uber’s queryparser project also accepts catalog metadata information, but it’s written in Haskell and only supports a few SQL dialects. We were unable to find one that both works across many dialects and is easy to use for data practitioners.

[4] I suspect this is the reason why most existing tools try to avoid schema-awareness. One exception is sqllineage, which tries to return results that reflect the ambiguity that arises during the schema-naive parse. Unfortunately, the result feels cumbersome to use and doesn’t work super well, but it’s an interesting approach nonetheless. Most other tools take the easy route of only supporting table-level lineage.

[5] The openlineage-sql package is technically written in Rust, but it exposes Python bindings.

[6] Note that this isn’t a fully fair comparison, since the DataHub one had access to the underlying schemas whereas the other parsers don’t accept that information.
