Acryl Data introduces lineage support and automated propagation of governance information for Snowflake in DataHub
Introduction
DataHub is the leading open-source Metadata Platform for the Modern Data Stack. Acryl Data is driving the open-source project in collaboration with LinkedIn and the broader open-source community. The vibrant DataHub open-source community surfaces key use cases across data discovery, data observability, and data governance. As you would expect, Snowflake Data Cloud is very popular in our community and is an integral part of the modern data stack. There are two key themes we have heard repeatedly from the open-source community and our customers:
- The need to understand end-to-end lineage for derived datasets in Snowflake.
- The need to effectively govern data as it flows through multiple systems and reaches the Snowflake Platform.
Problem Statement
In a typical enterprise, the data stack spans a large and diverse set of tools and platforms. The task of finding data efficiently and understanding the end-to-end lineage of a derived dataset ends up taking 50% of the time in analysis workflows. Stitching together lineage information across multiple tools, often spread across different cloud providers, is a major challenge.
In addition, classifying datasets with the right terms from a standardized governance taxonomy allows policy-driven handling of the data (access control, pseudonymization, etc.). Source datasets are often tagged using manual or automated classification systems, but derived datasets are generated at a rapid rate, and correctly classifying them becomes a losing battle without the right automation. The problem is exacerbated as data travels across multiple platforms, each with its own conventions for recording governance data.
As an example, an enterprise may ingest data from external sources that ends up in AWS Glue tables. Automated quality checks and data classifiers may be run against these tables to apply glossary terms from a standardized governance taxonomy. Depending on the classification, the sensitivity level of a dataset can vary from “safe to use” to “highly confidential”. After the datasets are loaded into the Snowflake Data Cloud for further analysis, multiple derived tables may be generated. It is important that the governance information propagates automatically from the source data (AWS Glue tables) to the derived datasets in Snowflake so that the right policies are applied.
Solution
Lineage support for derived tables
DataHub’s approach to metadata management is to integrate into the operational fabric and collect the most reliable metadata at the source. We prefer a push-based approach wherever possible to maintain the freshness of metadata. Given the large number of out-of-the-box connectors, enterprises are able to quickly visualize end-to-end lineage across various platforms. An example is shown below.

visualized end-to-end lineage
Until now, it was not possible to capture lineage edges within the Snowflake platform. Derived tables that were created using create-table-as or other forms of copying or transformation were left out of the lineage picture. With the introduction of the access history functionality, we are now able to complete the picture by capturing this information.

access history functionality
In addition to table-level lineage, Snowflake's access history functionality also provides information about the sets of upstream and downstream columns participating in the lineage. This information is available in the dataset’s custom properties on DataHub, as shown below.

Snowflake access history functionality
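To make the idea concrete, here is a minimal sketch of how table-level edges and column sets could be read out of a single access-history row. It assumes the row exposes `BASE_OBJECTS_ACCESSED` and `OBJECTS_MODIFIED` as JSON arrays of objects with `objectName` and `columns` fields, as in Snowflake's `ACCESS_HISTORY` view. The sample row and the function names are made up for illustration, and this is not how the DataHub connector is necessarily implemented.

```python
import json
from typing import Dict, List, Set, Tuple


def table_edges(row: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Return (upstream, downstream) table pairs for one access-history row."""
    upstreams = json.loads(row["BASE_OBJECTS_ACCESSED"])
    downstreams = json.loads(row["OBJECTS_MODIFIED"])
    edges: Set[Tuple[str, str]] = set()
    for down in downstreams:
        for up in upstreams:
            # Skip self-references such as an INSERT ... SELECT on the same table.
            if up["objectName"] != down["objectName"]:
                edges.add((up["objectName"], down["objectName"]))
    return edges


def column_names(objects_json: str) -> Dict[str, List[str]]:
    """Map each object name to the sorted list of its participating columns."""
    result: Dict[str, List[str]] = {}
    for obj in json.loads(objects_json):
        result[obj["objectName"]] = sorted(
            col["columnName"] for col in obj.get("columns", [])
        )
    return result


# Hypothetical row for a CREATE TABLE ... AS SELECT statement.
sample_row = {
    "BASE_OBJECTS_ACCESSED": json.dumps([
        {
            "objectName": "DB.RAW.CUSTOMERS",
            "objectDomain": "Table",
            "columns": [{"columnName": "ID"}, {"columnName": "EMAIL"}],
        }
    ]),
    "OBJECTS_MODIFIED": json.dumps([
        {
            "objectName": "DB.ANALYTICS.CUSTOMER_EMAILS",
            "objectDomain": "Table",
            "columns": [{"columnName": "EMAIL"}],
        }
    ]),
}

edges = table_edges(sample_row)
upstream_columns = column_names(sample_row["BASE_OBJECTS_ACCESSED"])
downstream_columns = column_names(sample_row["OBJECTS_MODIFIED"])
print(edges, upstream_columns, downstream_columns)
```

Each edge corresponds to one upstream/downstream pair in the lineage graph, while the two column maps correspond to the column sets surfaced as custom properties.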
An ingestion recipe with `include_table_lineage: True` in the Snowflake source configuration now populates Snowflake table-level lineage in DataHub.
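As a rough illustration, the recipe can be written as a plain Python dictionary with the same shape as a YAML recipe file. Only `include_table_lineage` comes from the text above. The connection fields (`account_id`, `username`, `password`, `warehouse`), the sink settings, and the helper name `build_snowflake_recipe` are assumptions for this sketch, so check the connector documentation for the options your DataHub version accepts.

```python
import json
from typing import Any, Dict


def build_snowflake_recipe(
    account_id: str,
    username: str,
    password: str,
    warehouse: str,
    sink_server: str = "http://localhost:8080",
) -> Dict[str, Any]:
    """Build an ingestion recipe dict; field names other than
    include_table_lineage are assumptions for illustration."""
    return {
        "source": {
            "type": "snowflake",
            "config": {
                "account_id": account_id,
                "username": username,
                "password": password,
                "warehouse": warehouse,
                # The flag described in the text: emit table-level lineage.
                "include_table_lineage": True,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": sink_server},
        },
    }


# Placeholder values; real credentials should come from a secret store.
recipe = build_snowflake_recipe(
    account_id="my_account",
    username="datahub_user",
    password="${SNOWFLAKE_PASS}",
    warehouse="COMPUTE_WH",
)
print(json.dumps(recipe, indent=2))
```

With the `acryl-datahub` package installed, a dictionary of this shape can typically be handed to `Pipeline.create(recipe)` from `datahub.ingestion.run.pipeline` and executed with `.run()`, or the same structure can be saved as YAML and run through the `datahub ingest` command.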
View full demo of the feature below:
Automated propagation of governance information
DataHub supports recording governance information through standardized business glossaries. Given the developer-friendly nature of DataHub, glossaries are version-controlled, checked-in artifacts. Here is an example of a simple business glossary file.
DataHub allows defining relationships between terms belonging to different glossaries. For example, here are two glossaries, one for Personal Information and one for Classification.

Personal Information Glossary

Classification Glossary
The Email term from Personal Information has an inherits relationship with the Confidential term from Classification.

Glossary Terms Personalization Email Configuration
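The inherits relationship can be pictured as a small graph over term names. The sketch below is a hypothetical in-memory model, not DataHub's storage format or API: the term identifiers and the `effective_terms` helper are invented for illustration. It resolves every term that a given term carries through inheritance, and shows that repointing a single relationship changes the outcome without touching any tagged dataset.

```python
from typing import Dict, List, Set

# Hypothetical glossary model: each term maps to the terms it inherits from.
INHERITS: Dict[str, List[str]] = {
    "PersonalInformation.Email": ["Classification.Confidential"],
    "Classification.Confidential": [],
    "Classification.HighlyConfidential": [],
}


def effective_terms(term: str, inherits: Dict[str, List[str]]) -> Set[str]:
    """Return the term plus everything it inherits, directly or transitively."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the glossary
        seen.add(current)
        stack.extend(inherits.get(current, []))
    return seen


email_terms = effective_terms("PersonalInformation.Email", INHERITS)

# Repoint Email at a stricter class by editing one relationship.
stricter = dict(INHERITS)
stricter["PersonalInformation.Email"] = ["Classification.HighlyConfidential"]
email_terms_after = effective_terms("PersonalInformation.Email", stricter)

print(sorted(email_terms))
print(sorted(email_terms_after))
```

Because datasets are tagged with Email rather than with a sensitivity level directly, the change is made once in the glossary and takes effect wherever the term is used.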
Modeling relationships between terms provides powerful flexibility. If the enterprise decides that Email is actually a Highly Confidential term, it is easy to change the inheritance of the Email term, with no need to re-classify data at the source.
Consider again the example of an enterprise ingesting data from external sources into AWS Glue tables. After data classification is run, manually or through automated means, terms from the standardized glossary in DataHub are applied to fields in these tables. At the table level, a Highly Confidential term may be applied if one of the fields inherits from Highly Confidential. As the datasets are loaded into Snowflake for further analysis and derived tables are generated, these tables should automatically be associated with the right classification terms.
DataHub Cloud (the Acryl Data managed version of DataHub) allows defining actions in response to metadata changes.
Actions are pre-packaged units of functionality that are extensible through both no-code (config) and low-code (Python) means. They execute within the DataHub Cloud platform and can respond to metadata changes within seconds. Using actions, one can listen for and react to any important change, such as schema changes, ownership updates, lineage edge changes, tags or terms being added or dropped, documentation being edited, etc.
The Term Propagator Action (available only in DataHub Cloud) detects changes in classification terms on all the Glue tables and then, using the lineage information present within DataHub, automatically propagates the terms to all the derived tables within Snowflake. This action can also navigate relationships between terms to ensure that traits from related terms are propagated forward as well.
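The core idea of propagating terms along lineage can be sketched as a graph traversal. The code below is an illustrative toy, not the Term Propagator Action itself, whose implementation is not described here. The dataset names, the `propagate_terms` function, and the dictionary-based lineage graph are all assumptions, and the sketch leaves out the navigation of relationships between terms.

```python
from collections import deque
from typing import Dict, List, Set


def propagate_terms(
    downstreams: Dict[str, List[str]],
    source_terms: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Push each dataset's terms to every dataset reachable downstream of it."""
    result: Dict[str, Set[str]] = {
        dataset: set(terms) for dataset, terms in source_terms.items()
    }
    queue = deque(source_terms)
    while queue:
        current = queue.popleft()
        for child in downstreams.get(current, []):
            child_terms = result.setdefault(child, set())
            before = len(child_terms)
            child_terms |= result[current]
            # Revisit the child only if it gained terms; this also
            # guarantees termination when the lineage graph has cycles.
            if len(child_terms) > before:
                queue.append(child)
    return result


# Hypothetical lineage: one Glue table feeding derived Snowflake tables.
lineage = {
    "glue.raw_customers": ["snowflake.customers"],
    "snowflake.customers": [
        "snowflake.customer_emails",
        "snowflake.customer_counts",
    ],
}
terms_at_source = {"glue.raw_customers": {"Classification.Confidential"}}

propagated = propagate_terms(lineage, terms_at_source)
for dataset in sorted(propagated):
    print(dataset, sorted(propagated[dataset]))
```

A dataset is re-queued only when it gains a term, so a new classification on a source table reaches all of its descendants while already up-to-date branches are left alone.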
Conclusion
Capturing complete lineage information within the Snowflake platform allows for an end-to-end understanding of how derived tables are generated. DataHub Cloud’s term propagation action leverages this information to propagate governance information automatically across multiple platforms. Try out the lineage feature for Snowflake in the open-source DataHub project, and sign up for DataHub Cloud from Acryl Data, which is currently in private beta.