Understanding DataHub’s Role as the Control Plane for Data

Earlier this summer, Shirshanka Das, Co-Founder and CTO of Acryl Data, joined The Ravit Show live from Snowflake Summit 2024 to discuss how DataHub acts as a control plane for data, enabling teams to adopt best-in-class data tools without losing visibility into, or control over, their data.

Watch the full interview below and read on for a summary of Shirshanka’s breakdown of DataHub’s role as the control plane for data.

Ravit: What does the control plane of data mean? Please tell us a little more about it.

Shirshanka: Last year, during my Data Council keynote, I shared my thesis that the data and AI world remains, and will remain, fragmented.

Of course, I say that standing here on the floor of Snowflake Summit, where Snowflake just unveiled the platform to rule them all. Next week, Databricks will unveil something else. And then there’s AWS, Google, Microsoft, and the streaming ecosystem with Confluent. AI workflows and innovations will continue to create a best-in-class experience.

As these new offerings come out, data teams will continue to say, “I want the best-in-class stack” for specific use cases, and then assemble it. That quickly turns into an accumulation of fragmented best-in-class tools: Snowflake, maybe Tableau or Looker, PyTorch, OpenAI, and so on.

Then they’ll realize they lost control.

That’s when they’ll want to regain control — and that’s when they find us.

DataHub is the control plane that gives you two things: a visibility layer and a specification layer.

The visibility layer lets you see where all the data is, how it’s being transformed, what’s happening, and which models are being trained on what data. But it doesn’t let you control the chaos.
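Concretely, questions like “who owns this table?” and “what is it built from?” can be asked against the catalog programmatically. Here is a minimal sketch using the acryl-datahub Python client; the server address and the Snowflake table name are placeholder assumptions, so treat it as an illustration rather than a canonical recipe:

```python
# Illustrative sketch of the visibility layer: read ownership and upstream
# lineage for a dataset from DataHub. Server URL and table name are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import OwnershipClass, UpstreamLineageClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

dataset_urn = make_dataset_urn(
    platform="snowflake", name="analytics.public.customers", env="PROD"
)

# Who owns this table?
ownership = graph.get_aspect(entity_urn=dataset_urn, aspect_type=OwnershipClass)
if ownership:
    for owner in ownership.owners:
        print(f"owner: {owner.owner} ({owner.type})")

# Which upstream datasets is it derived from?
lineage = graph.get_aspect(entity_urn=dataset_urn, aspect_type=UpstreamLineageClass)
if lineage:
    for upstream in lineage.upstreams:
        print(f"upstream: {upstream.dataset}")
```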

The specification layer lets you tell the control plane what you want to happen. For example, “I want my externally trained LLM to never have access to PII,” or “I want this logical customer data product to be created, and to include this Tableau asset, this Snowflake table, and this AI model.”

Depending on who’s using it, there might be a data contract or a data product specification associated with it. We aim to be the storage and specification repository for all these logical concepts and translate them into the physical stuff.
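To make that concrete, a specification like “this table contains PII” can be recorded against the asset and then enforced downstream. Here is a minimal sketch using DataHub’s Python emitter; the server address and table name are placeholder assumptions, and writing the GlobalTags aspect this way replaces any existing tags on the dataset:

```python
# Illustrative sketch of the specification layer: record a "PII" tag on a
# dataset so downstream systems (e.g. tag-based masking) can enforce it.
# Server URL and table name are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

dataset_urn = make_dataset_urn(
    platform="snowflake", name="analytics.public.customers", env="PROD"
)

# Attach the PII tag as a logical specification on the dataset.
# Note: this overwrites the dataset's existing GlobalTags aspect.
tags_aspect = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("PII"))])
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=tags_aspect))
```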

If Snowflake offers tag-based masking and propagation, we’ll push all of that down. If Databricks offers it, we’ll push that down too. We give you peace of mind that you can adopt the best and fastest innovating technologies while retaining control and visibility into your entire data stack. That’s our promise.

Ravit: Love that. As companies think about AI use cases, how does the control plane help them?

Shirshanka: The scenarios are common across data and AI. They all start with search and discovery, governance, and quality. Those are evergreen areas that all these systems need. But AI brings more demands. AI use cases like fine-tuning and prompt engineering create new requirements not seen in the classic warehouse world.

I think that will drive new demands on infrastructure, and that’s what we’re building into our control plane implementation so companies can have governance, reproducibility, and explainability.

Towards the end of my time at LinkedIn, I transitioned from making data discoverable to making it governed, and I started working on MLOps. LinkedIn wanted our AI developers to be productive and compliant with regulations.

What we realized was that the same infrastructure should be reused for that purpose. But it isn’t a drop-in reuse: AI brings deeper demands, so you have to retool the platform to support those use cases.

We think the control plane will evolve to address AI use cases. But it has to be done in a unified way. If you fragment solutions for discovery, governance, observability, and AI into separate tools, you end up with the same problems all over again: Who owns this thing? Where was it produced? Is it safe to use? Was it trained on the right data?

With DataHub, those are problems people don’t have to solve.

Ravit: Absolutely. Someone needs to talk about these types of problems. I speak to a lot of enterprise leaders, and they say that the number of metadata solutions seems confusing and overwhelming. What does the future hold for AI use cases, and how do you solve that problem for the customers?

Shirshanka: Marketing departments are very aggressive and love to capitalize on buzzwords. Everyone touts data and AI and claims to be AI-ready in some form. I’m a builder at heart, so I like to look under the covers and say, “Sure, you’re marketing your system as AI-ready, but is it architected for it?”

I’ve seen how technology can’t keep up with marketing. AI has demands that are different from previous data workloads. Fine-tuning and prompt engineering require things like time travel, versioning, and a scale that hasn’t been seen before.

You really cannot bet on outdated solutions. You have to carefully look at the solution and ensure it’s not built on ancient technology. The system must be rapidly evolving and able to keep up with the pace of change in our industry.

Ravit: On that note, why do you think open source is important in this game? And tell us a little about the DataHub project as well.

Shirshanka: Absolutely.

Anytime there’s a messy problem to be solved, especially one dealing with fragmentation and complexity, the only way to really build it is in the open. If you’re trying to build an open standard for something, you have to build it in the open for two reasons:

First, if you’re a small team, there’s no way you can fully do justice to the breadth of complexity that exists.

Second, the community often leads in defining what the product needs to do.

The community tells us what’s needed and sometimes even pushes the project forward themselves. For example, PayPal, a big customer and contributor, added access management capabilities. Visa is pushing the platform forward by adding concepts like business attributes. Netflix, another big open source adopter, is adding runtime types and extensions. Without these contributions, we wouldn’t be able to build an industry-wide standard.

You don’t need to be open source if you’re building a simple CSV-to-Salesforce exporter. But if you’re building something central to the data stack, it has to be open source. This way, the community can move it forward. We have a lot of customers contributing to the project, and many contributors become customers. We now have 500 contributors to the DataHub project, driving it forward at a much faster rate than if it were bankrolled by a single company. That’s the strength of open source, and it will keep the project vibrant for years to come. We have over a thousand production installs and are continuously adding new capabilities. It’s an exciting time to build.

Watch the full episode here!
