From the Open Source Data Summit: Key Takeaways on Data Catalog Evolution

Earlier today, I joined an exciting panel discussion at the Open Source Data Summit alongside some incredible industry experts — Russell Spitzer (Apache Iceberg committer and Principal Engineer at Snowflake, representing the Polaris project), Lisa N. Cao (Product Manager at Datastrato, representing Apache Gravitino), Denny Lee (Sr. Staff Developer Advocate at Databricks, representing Unity Catalog), and moderator Kyle Weller, Head of Product at Onehouse (Apache Hudi). We dove into the topic of “The Rise of Open Source Data Catalogs,” discussing Unity Catalog, DataHub, Apache Gravitino, and Apache Polaris.

In this post, I’ll be recapping the key points from the panel discussion, along with additional insights and takeaways.

Who will catalog the catalogs?

The Evolution of Data Governance: How Open-Source Data Catalogs Are Leading the Charge

Over the past five years, the data landscape has increasingly favored independent, neutral storage that’s decoupled from compute layers like databases and warehouses. This shift toward a data lakehouse architecture brings greater flexibility but also presents challenges in terms of data governance.

Open-source data catalogs have emerged as key solutions, with platforms like DataHub, Unity Catalog, Apache Gravitino, and Apache Polaris gaining traction. In today’s discussion, we explored how these catalogs are driving data governance and why open-source approaches are critical for modern data management.

Unstructured Data Governance

The panel first discussed questions around unstructured data governance and Generative AI. With the rapid growth of GenAI, demand for handling unstructured data has increased. However, as this demand rises, so does the need for stronger governance to ensure that unstructured data is managed effectively and responsibly.

Question One: “Given the challenges associated with unstructured data, from storage to compliance, what initiatives or strategies are being implemented to address these governance challenges around unstructured data?”

Denny emphasized that the rise of GenAI has placed unstructured data at the forefront, making its governance a critical issue. With unstructured data becoming the foundation for many AI models, organizations must prioritize frameworks to ensure data accuracy, privacy, and compliance. This shift will play a significant role in mitigating risks and enhancing outcomes as GenAI adoption continues to grow.

Russell from Apache Iceberg added that they are in the early stages of incorporating unstructured data, such as binary large objects (blobs), into their metadata structures. Traditionally, Iceberg has focused on structured data, with clearly defined schemas and tables. However, the increasing demand for unstructured data, fueled by AI and machine learning, is pushing Iceberg to expand its capabilities. This process is still in the exploratory phase, with ongoing efforts to extend its metadata architecture to handle unstructured data.

From my perspective, it is clear that with the rapid rise of GenAI, converting unstructured data into structured data is now a huge opportunity for the industry and a critical challenge for metadata platforms. AI systems are being pointed at vast amounts of unstructured data, requiring platforms like DataHub to adapt and extend their responsibilities to include unstructured data. This expansion is necessary for mapping and comprehending the entire data lifecycle, from raw data generation to transformation and final usage. By broadening the scope, metadata systems like DataHub can provide a holistic view of the data journey in the AI era, ensuring better governance and deeper insights.

Data Lineage and Documentation

There is nothing worse when debugging a pipeline than trying to trace back the origin of certain columns or attributes of a table. This challenge often arises when data is transformed through multiple stages, leading to a convoluted lineage that can obscure the source and transformation logic. Without clear visibility into the data’s journey, identifying issues becomes a painstaking process.

Imagine a scenario where a discrepancy in data values prompts an investigation. If the lineage is unclear, you may find yourself sifting through layers of transformations, looking at various scripts, configurations, or even manual entries that contributed to the final output. This not only consumes valuable time but can also lead to frustration and misinterpretation of the data’s integrity.

Furthermore, the lack of a well-defined lineage can result in significant risks, such as misinformed decision-making or compliance violations, particularly in regulated industries. When teams cannot easily trace data back to its source, it hampers their ability to ensure data quality and governance.

Question Two: “Data lineage, documentation, data contracts…what is your community doing about this category of metadata management?”

Effective metadata management tools and data lineage solutions are crucial for overcoming these challenges. They provide insights into how data is transformed, where it comes from, and how it flows through the system. By investing in these resources, organizations can enhance their debugging processes, increase transparency, and ultimately foster greater trust in their data.

From my perspective, data lineage is absolutely vital — we realized early on in DataHub’s evolution that manually tracing it is impractical, especially in complex environments. That’s why we’ve prioritized developing connectors that automatically capture lineage across a wide array of systems, including major industry platforms. Our advanced catalog-assisted SQL parser allows us to offer detailed lineage insights, which are essential for understanding data flow and maintaining integrity.
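To make this concrete, here is a minimal sketch of emitting table-level lineage to DataHub programmatically using its Python SDK. The server URL and dataset names are placeholders, and in practice the automated SQL-parsing connectors mentioned above would produce this lineage for you; the snippet simply illustrates the shape of the metadata involved.

```python
# Minimal sketch: emitting table-level lineage to DataHub with its Python SDK.
# Assumes `pip install acryl-datahub`; the server URL and dataset names are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Declare that the curated orders table is derived from the raw events table.
upstream = UpstreamClass(
    dataset=make_dataset_urn(platform="hive", name="raw.events", env="PROD"),
    type=DatasetLineageTypeClass.TRANSFORMED,
)
lineage = UpstreamLineageClass(upstreams=[upstream])

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="hive", name="curated.orders", env="PROD"),
        aspect=lineage,
    )
)
```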

Then there’s the challenge of documentation. Without it, users are reluctant to engage with a data catalog. But the catch-22 is that no one enjoys documenting data. This was a hard lesson during my time at LinkedIn, particularly when implementing GDPR annotations (circa 2016–2018). We realized that a dual approach was necessary. On one hand, we had to make the documentation and annotation experiences easy, embedding them into the user interface for convenience. The emergence of GenAI offers us an opportunity to further reduce the friction of creating the first version of the data documentation. On the other hand, we needed to integrate documentation into the development process from the start, making it a seamless part of workflows.

By “shifting left” — embedding these practices earlier in the development cycle — we make sure that documentation is already in place by the time the data enters the catalog. This not only builds trust in the system but also ensures accountability and transparency, key components of a data-driven culture.
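As an illustration of what “shifting left” can look like in practice, here is a minimal sketch of documentation-as-code using DataHub’s Python SDK: the description is emitted by the same pipeline that builds the data, so it lands in the catalog alongside the dataset rather than being backfilled later. The dataset name, properties, and endpoint are placeholders.

```python
# Minimal sketch: documentation-as-code, emitted alongside the pipeline that builds the data.
# Assumes `pip install acryl-datahub`; names, properties, and the endpoint are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

docs = DatasetPropertiesClass(
    description="Curated orders, one row per order; rebuilt nightly from raw.events.",
    customProperties={"owning_team": "growth-data", "contains_pii": "false"},
)

emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=make_dataset_urn(platform="hive", name="curated.orders", env="PROD"),
        aspect=docs,
    )
)
```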

Ultimately, our focus on lineage and documentation goes beyond mere compliance. It’s about enabling users to extract the full value from their data. A culture that values both lineage and documentation leads to empowered teams and better decision-making; those teams can collaborate on creating meaningful data contracts, fostering an environment where data truly becomes a competitive asset.

“The Core” of a Catalog

Question Three: “The Iceberg community has created clear, concise specifications, including for tables, views, and now the REST catalog. Can you explain how Polaris builds on this foundation, particularly with the integration of the Iceberg REST catalog specification in the Polaris catalog implementation, and why it’s important to develop the specific features we’ve discussed?”

Russell noted that they have indeed focused on building a good spec around this in the Iceberg community as well as the Polaris project.

From my own perspective and experience, I’m thrilled to see that the previously extremely convoluted (and hard to understand, let alone implement) Hive metastore API is being replaced with a more user-friendly and implementable API. This transition opens up a wealth of opportunities for customers, providing them with a wider range of options to choose from. The shift toward a more approachable API means that we can expect significant innovation in implementations, and I genuinely hope that the best solutions will rise to the top, benefiting the entire data community.
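As a rough illustration of why this matters, here is a minimal sketch of a client talking to an Iceberg REST catalog (for example, a Polaris deployment) through PyIceberg. The URI, warehouse, credential, and table name are placeholders; the point is that any engine or tool speaking the REST specification can interact with the catalog the same way, without a Hive-metastore-style client.

```python
# Minimal sketch: connecting to an Iceberg REST catalog (e.g. a Polaris deployment) via PyIceberg.
# The URI, warehouse, credential, and table name below are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "rest",
    **{
        "type": "rest",
        "uri": "https://polaris.example.com/api/catalog",
        "warehouse": "analytics",
        "credential": "client-id:client-secret",
    },
)

# Discover namespaces and load a table purely through the open REST specification.
print(catalog.list_namespaces())
table = catalog.load_table("sales.orders")
print(table.schema())
```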

Interoperability and Standardization

Question Four: “How do you view the importance of ‘universal’ integration in open-source data catalogs for promoting adoption and community growth, particularly regarding data storage compatibility with formats like Delta Lake, Apache Hudi, and Apache Iceberg, as well as integration with various query engines?”

As I reflect on our role at DataHub, I recognize that our responsibility is to extract and represent metadata from a multitude of data tools. The challenge increases with the number of tools and APIs we encounter, making our work more laborious. This led us to create a scalable, easy-to-extend open-source ingestion connector framework, which has been the central reason we see so many connector contributions to the DataHub project and has inspired many others to build similar systems. Over the past several years, we’ve seen that integrations with systems boasting stable and well-maintained APIs are reliable and efficient. Conversely, dealing with rapidly evolving systems often leads to complications. It’s not about the absence of standards but rather how well these APIs are managed. If we can phase out certain tools or APIs, I wholeheartedly support that move!
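For readers unfamiliar with that framework, here is a minimal sketch of running one of DataHub’s ingestion connectors programmatically from a recipe; the same structure applies across the many source systems the project supports. The Postgres connection details and DataHub endpoint are placeholders.

```python
# Minimal sketch: running a DataHub ingestion connector programmatically from a recipe.
# Assumes `pip install 'acryl-datahub[postgres]'`; connection details are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": "localhost:5432",
                "database": "analytics",
                "username": "reader",
                "password": "example-password",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)

pipeline.run()
pipeline.raise_from_status()  # surface any errors reported during the ingestion run
```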

Catalog of Catalogs

Question Five: “What exactly is a ‘catalog of catalogs,’ and how should developers approach this concept differently from choosing a single data catalog?”

With the overwhelming number of catalogs available today, developers face a daunting task in selecting the right one. Lisa added some great commentary around this topic in our discussion.

From my perspective, to define a catalog, we must first address the fundamental question: What does it contain? Consider, for instance, the catalogs from Macy’s, which showcase the store’s product offerings. Similarly, Unity Catalog serves as an inventory of assets crucial to the Databricks ecosystem, such as tables and volumes. Thus, a catalog can be understood as an inventory of items specific to a particular platform or context.

By that definition, DataHub serves a broader purpose than just cataloging things that already exist somewhere. One can create completely logical concepts like Data Products or Domains that don’t exist anywhere else, or extend the metadata model to introduce an entirely new type of entity. It functions not merely as a catalog but as a “graph” that represents data assets along with the relationships between data-adjacent assets that matter to the entire organization. Some might refer to this category as a “Catalog of Catalogs,” but this label can be misleading. For instance, while Apache Gravitino might fit this definition, it doesn’t encompass lineage or the logical concepts tied to data products and domains, which are crucial for understanding how different elements relate to one another within the data landscape.

When developers consider which catalog to utilize, it’s crucial to first reflect on their needs: Why do I need a catalog, and what do I plan to accomplish with it? If your goal is simply to point a query engine at a table by name, then a table catalog like Iceberg or Unity is appropriate; if you want to federate that access, you might consider putting Apache Gravitino in front of those systems. However, if you aim to comprehend end-to-end lineage across systems to monitor data quality or assess the impact of changes before implementing them, then a cross-system metadata graph like DataHub is essential.

To dig further into this topic, check out my Decoding Data Live (DDL) chat with Vinoth Chandar, Founder/CEO at Onehouse — “Taming the Chaos: A Deep Dive into Table Formats and Catalogs.”

Looking to further the discussion? Join our DataHub Slack Community to ask your most pressing questions.

Why Open Matters

Question Six: “Can you elaborate on the significance of open-source data governance, specifically focusing on the unique benefits that come from developing and utilizing an open-source data catalog?”

Russell and Denny both emphasized why avoiding vendor lock-in is crucial for this category of problems. Your metadata layer represents your core business knowledge, policies and rules. As a leader who wants to always plan for the next change, the next innovation, you want to make sure that the system you are choosing to be the “keeper of your crown jewels” is not going to create a migration nightmare if you switch your data or AI platform provider.

I think that’s an excellent point, and there are three other top benefits in my view for why “open” is important when choosing a product in the metadata or data catalog space:

  1. Evolving problem space: As I observe the evolving problem space of data cataloging and metadata management, it becomes clear that this landscape is in a constant state of innovation that creates flux. Embracing an open approach enables us to adapt quickly to these changes, ensuring that we remain aligned with the dynamic nature of data. By involving the community, we can tap into a wealth of knowledge and experience, which ultimately fosters innovation and responsiveness in our solutions. This adaptability is crucial for staying relevant in a rapidly shifting data environment.
  2. Community contributions: An open-source project thrives on the diverse contributions from various companies and users, creating a collaborative ecosystem that accelerates innovation. This collective input not only enhances the quality and functionality of the product but also allows for rapid improvements that keep pace with user needs and industry trends. By harnessing a wide range of perspectives and expertise, open-source projects can evolve more quickly than traditional, closed-source alternatives, ultimately delivering greater value to the community. Through these contributions, the community establishes idioms that create a form of organic standardization of data management concepts and practices.
  3. Transparency: Open-source code empowers users by providing full transparency into how the system operates, which is essential for both security and compliance. Understanding the inner workings of the codebase enables organizations to identify vulnerabilities, assess risks, and ensure that they meet regulatory requirements. This level of insight fosters trust among users and stakeholders, as it allows them to verify the security measures in place and engage with the software confidently, knowing they have the ability to audit and modify the code as needed.

There are plenty of other reasons as well, but keeping these top of mind as you choose will benefit you greatly. And ultimately, the bottom line is that for this layer to work properly, it just has to be open.

Of course, just having “Open” in the project name, “OpenAPI” in the project SDK, or “Open Source” in the project’s marketing materials doesn’t truly mean it is open. Pay attention to whether industry leaders are actually adopting the project: are they contributing code back? Is the community vibrant and active? Are there multiple system integrators deploying the project far and wide? These are great signals that a project is truly open and is being continuously evolved and shaped for the benefit of the industry.

Shaping the Future: The Importance of Open-Source Data Catalogs

As we reflect on recent years, and on conversations like today’s panel, it’s clear that the data landscape has shifted toward independent, neutral storage solutions, exemplified by the rise of the data lakehouse architecture. While this offers significant flexibility, it also presents challenges in data governance. Consequently, open-source data catalogs like Unity Catalog, DataHub, Apache Gravitino, and Apache Polaris are becoming vital tools for organizations. These catalogs not only enhance governance but also foster community collaboration, driving innovation. I encourage everyone to engage in the ongoing discussions about the role of open-source data governance in shaping our industry.
