Inside the Demandbase Migration to DataHub Iceberg REST Catalog: Q&A with a Data Systems Leader

Scaling data platforms often means juggling complexity, governance, and usability. 

In our latest Community Town Hall, Ryan Nowacoski, Senior Manager of Data Systems at Demandbase, shared how his team tackled these challenges head-on by migrating to DataHub Iceberg REST Catalog. 

Merging their business and technical catalogs into one unified operational catalog in DataHub has enabled Demandbase to simplify discovery, strengthen collaboration, and build a model ready for governance at scale.

In this article, we share highlights from our conversation with Ryan and spotlight how Demandbase is using DataHub to streamline data management and empower its teams.

Q&A with Demandbase Sr. Manager of Data Systems, Ryan Nowacoski

Note: The conversation has been lightly edited for clarity.

Q: Could you start by telling us a bit about yourself and your role at Demandbase?

A: I am the manager of our data systems team here at Demandbase. We oversee what we refer to as our unified data platform. The idea when we started building this platform was that it would be a centralized, reliable foundation for our internal data at Demandbase. 

When we set out to do this, we built the platform entirely around Apache Iceberg because we saw a lot of value in the new lakehouse technologies, the vast cost efficiency, the scalability, and all of that. That has now expanded to all of our internal data. With that, of course, we needed a data catalog, and we adopted the Iceberg REST catalog a short time ago. More recently, we moved to DataHub as our Iceberg REST catalog.
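For readers who want to picture the setup: an Iceberg REST catalog is simply an HTTP endpoint that every engine and client points at for table metadata. Here is a minimal PyIceberg sketch of connecting to such a catalog; the endpoint, warehouse name, and token below are hypothetical placeholders, not Demandbase’s actual configuration.

    from pyiceberg.catalog import load_catalog

    # All values here are hypothetical placeholders.
    catalog = load_catalog(
        "datahub",
        type="rest",
        uri="https://datahub.example.com/iceberg",
        warehouse="unified_data_platform",
        token="<personal-access-token>",
    )

    # Every Iceberg-aware engine pointed at the same endpoint sees the same tables.
    print(catalog.list_namespaces())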

The challenge: Data discoverability in a distributed platform

Q: What were some of the key challenges or pain points that you were facing before considering DataHub for your REST catalog?

A: The biggest one was data discoverability.

One of the biggest downsides of this Iceberg architecture is that it is very distributed amongst teams. There’s no central Snowflake, central BigQuery, or central database that everyone can log in to and immediately see everything. It just doesn’t exist.

But that’s kind of a feature of this, right? We have this very strong separation of storage and compute. We can store our data in any number of AWS accounts, in GCP, etc. But because of that, the platform has a very distributed nature. So there’s no single place you can go and say, “What data exists? Who owns it? What should it be used for? What’s the structure of it?” It makes it much more difficult to do that.

This means that we had very low data discoverability. Even though we were moving our data into this unified data platform, and that was making it much easier to utilize that data, streamline access, and streamline governance, the knowledge of what data was there and what it should be used for was very difficult to come by.

That was by far our biggest challenge. We had to ask, “Now that this is accessible, how do we democratize that information? And how do we enable teams to know what’s there, know how to use it, and keep that up-to-date regularly?”

“The knowledge of what data was there and what it should be used for was very difficult to come by.”

Q: Could you share how you came to Iceberg in the first place?

A: We were somewhat early to the Iceberg trend because we started building this about four years ago now, before the wave had really taken off. When we set out, we knew that we needed to build a unified platform that was eventually going to be used for all of our internal data.

One of the main things we knew we needed was cost efficiency. We have a ton of internal data at Demandbase. Data is the backbone of our application. We had just gone through an acquisition, and we knew that we were going to continue to expand our data. And we knew that, in the past, some of these other data warehouse solutions had become quite expensive. And that’s what really drove us to a file-based solution.

But then, looking at some of the drawbacks, we immediately pulled back from that because our very first use case was a change data capture (CDC) use case.

Doing change data capture into raw Parquet files is very difficult. You have to do all of the management and the reconciliation of the rows yourself. You have to keep track of everything. It becomes very cumbersome.
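To make the contrast concrete: with an Iceberg table, the engine applies an entire CDC batch as a single atomic commit instead of you reconciling rows and rewriting Parquet files by hand. A sketch in PySpark, where the catalog URI, table, and column names are hypothetical:

    from pyspark.sql import SparkSession

    # A sketch only: assumes the iceberg-spark-runtime jar is on the classpath
    # and that a REST catalog is reachable at the (hypothetical) URI below.
    spark = (
        SparkSession.builder
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "rest")
        .config("spark.sql.catalog.lake.uri", "https://catalog.example.com")
        .getOrCreate()
    )

    # Assume `cdc_batch` is a temp view of change events with an `op` column
    # (insert / update / delete). Iceberg applies the MERGE as one atomic
    # commit, so there is no manual row reconciliation or file rewriting.
    spark.sql("""
        MERGE INTO lake.crm.customers AS t
        USING cdc_batch AS s
        ON t.customer_id = s.customer_id
        WHEN MATCHED AND s.op = 'delete' THEN DELETE
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED AND s.op != 'delete' THEN INSERT *
    """)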

So, we moved to the lakehouse space with Iceberg, Hudi, and Delta. From there, it was really a matter of looking at the three of them and deciding which one was going to be the best for us long term. We looked at a number of different things. We looked at the functionality, but then more broadly asked: “What was the state of the project? How confident were we that we were going to be able to pick up this tool and use it? How confident were we in the long term?”

That’s what really drove us toward Iceberg. It had a flourishing community. It was truly open source. It had all the functionality we needed on its roadmap. We felt it was the right project for us at the right time.

That’s how we picked up Iceberg and implemented it for that initial use case, as we were doing this big migration, and then since have built this entire unified data platform around Iceberg.

The migration to DataHub

Q: Before DataHub, you were using a different Iceberg REST catalog. You decided to move all of that into DataHub to address the discoverability challenge. What was the motivation behind making that choice?

A: Our previous REST catalog was Tabular, which was acquired by Databricks, a much larger company that has its whole ecosystem. With that came, obviously, uncertainty.

It opened up this question of, “What should we do long term?” There was no path where we just did nothing and things continued to work. Databricks obviously wanted us to move to their catalog, but there was no zero-effort path here. So that really allowed us to take a step back and say, “Okay, we have to do a migration anyway. What really is the right long-term path for us?”

We looked at a variety of open-source options like Polaris, Lakekeeper, and Gravitino. But this is such a new space. There are a lot of projects out there, but all of them are pretty new—and immature in a lot of ways. Because of how critical the unified platform is for Demandbase and how big our use case is, we were a little bit hesitant to go with an unproven open-source option.

I’d always had it in the back of my mind that we had DataHub while also using this other REST catalog. And we hadn’t really gotten around to fully integrating all of our Iceberg data from our old REST catalog into DataHub using the DataHub integrations. So it struck me that maybe this capability would come around. If we could have both in one tool, that would be so much easier. It would unlock so much by not having to do those integrations—not having that extra step of complexity.

And, somewhat serendipitously, around two months before we started to execute on this migration, DataHub announced that Iceberg REST catalog support would be here soon.

So we immediately reached out to DataHub, and I was glad they were able to support us with a pretty quick POC before it had even been released.

It just made sense from there. We didn’t have to do that extra step of doing the Iceberg REST catalog integration. We got all of our data in DataHub for free. We were able to start utilizing it as our business catalog, get our schemas in there, and start documenting them and all of that.

What made DataHub the right choice for Demandbase

Q: Can you tell us what you were using DataHub for before the migration?

A: Primarily data documentation. On my side, it was mainly from the perspective of the data platform owners—trying to make sense of everything we had, and helping teams discover that data and democratize it more.

We also had a strong push from the security and governance side: being able to document what existed, where it came from, what data was stored, whether it contained PII, and what the retention periods were. Our governance team was driving a lot of that.

From my perspective, it was really about that business catalog—being able to document what was there and how to access it. But with Iceberg, progress was a little slow, since it required that extra step of integration. With so much else going on, we only got part of it done and didn’t quite get all of our warehouses integrated.

Q: Can you tell us about the migration process from the existing REST catalog to DataHub?

A: At this point, we had upwards of 200 distinct tables. There are three sets of those—Dev, Stage, and Prod—spread across roughly a dozen different warehouses. Each warehouse is generally owned by a team, though some teams manage more than one depending on their use cases. So, it’s a pretty distributed architecture.

That meant a lot of downstream implications and a lot of teams to coordinate with, which was definitely the hardest part of this migration. The migration itself was relatively straightforward. Most of it could be handled by the team that owns a given warehouse. The parts that needed to be done by other teams reading from that warehouse could often be done asynchronously, since the old catalog would still work as long as it pointed to a valid snapshot.

It was this process of coordinating with all of these different teams, helping them with tooling around registering their tables across catalogs, helping with access management, and thinking: “How do we make sure that we have one-to-one access management from our old catalog to our new catalog? How do we have the same kind of authorization?”

We built some things around DataHub to allow us to have the same authentication scheme as we had in our old catalog. So it was about helping teams understand what they needed to do, understanding failure modes, building tooling around the process, and then building some additional functionality to give us a one-to-one migration, so that we could keep using the system as seamlessly as possible.
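The cross-catalog registration tooling Ryan describes could look something like the following PyIceberg sketch, where register_table points the new catalog at each table’s existing metadata file without copying any data. The endpoints are hypothetical, and this is an illustration rather than Demandbase’s actual tooling.

    from pyiceberg.catalog import load_catalog
    from pyiceberg.exceptions import NamespaceAlreadyExistsError

    # Hypothetical endpoints; both catalogs point at the same files in object storage.
    old = load_catalog("old", type="rest", uri="https://old-catalog.example.com")
    new = load_catalog("datahub", type="rest", uri="https://datahub.example.com/iceberg")

    for namespace in old.list_namespaces():
        try:
            new.create_namespace(namespace)
        except NamespaceAlreadyExistsError:
            pass
        for identifier in old.list_tables(namespace):
            table = old.load_table(identifier)
            # Register the table's current metadata file in the new catalog.
            # This is a pointer swap -- no data is copied or rewritten.
            new.register_table(identifier, table.metadata_location)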

Q: How long did it take you to get through the migration of those 200 tables across the different environments?

A: It took us about two to three weeks, working with various teams. It was a tighter timeline than I wanted. That was mostly pushed by our need to get off our old catalog. 

But it ultimately went fairly smoothly. We had a few hiccups here and there, but they were relatively easy to get through. All in all, it was a major success. And has been great ever since.

“All in all, it was a major success. And has been great ever since.”

Unifying technical and business catalogs into one operational catalog

Q: People often talk about business catalogs and technical catalogs as separate things. Do you believe this is a meaningful distinction?

A: It’s a great question, and such an interesting one because I see LinkedIn posts all the time about data catalogs and their uses. Even before DataHub supported the Iceberg REST catalog, I’d see articles comparing DataHub to Polaris and other tools. And while they were in spirit somewhat similar, at that time, they were really quite different in how they were used and the problems they solved. So in principle, yes, there’s definitely a distinction in terms of functionality and purpose when it comes to business and technical catalogs.

But in practice is where it gets really interesting.

In my view, a business catalog is much more of a UI application. It’s a place you go to search for datasets, write documentation about them, catalog their quality, catalog their properties, all of these things.

A technical catalog is, as the name suggests, much more on the technical side of things, so really more about the operation of the data itself. Things like disambiguating writes, everything that Iceberg can do with ACID transactions, managing snapshots, and time travel are really what that technical catalog is for.

Those two are completely disjoint functionalities. But there’s no reason why they can’t be part of the same tool—and there’s a lot of value when they are. 

That’s why it’s been so nice utilizing DataHub in this way as both [a business and a technical catalog]. Because now, when we make any change to any of our tables in our unified data platform, it is immediately available in DataHub.

Teams can see what the change is. And as we get to more advanced functionality, like safe schema changes, it’ll be super helpful. It allows us to now have everything in sync at all times, and never have this lagging problem with information.

So, in principle, they’re different things. But when you can have them both in one place, it provides a lot of additional value.

Caption: Ryan on the value of unifying technical and business catalogs to align teams.

Q: Did you see any change in people’s workflows before and after the migration?

A: Definitely. A lot of it has to do with non-technical users trying to communicate with technical users. Previously, most of these conversations were very ad hoc over DMs or in Slack, and things would just get lost in translation. Someone would call a dataset one name, and nobody was quite sure what they meant or where the data came from.

There was no way, previously, for non-technical users to self-serve that information, because, given the platform’s distributed nature, you had to deeply integrate with technical tools to even understand and find what was there.

Whereas now, in the last month and a half, we’ve seen that non-technical users and technical users are both able to go into DataHub, speak the same language, and talk about the same datasets. 

They can see what is in our unified data platform and what is in Iceberg. They can also see where they’ve been using things that are not in the unified data platform. For example, we had a number of processes still running off raw BigQuery tables. That worked fine for us two years ago. But now someone asks, “How do I expand this? How can I operationalize the delivery of this data? How do I give another team access to this BigQuery table?” The answer is: “Well, it’s because it’s not in the unified data platform. We really need to get this into Iceberg and get it into the unified data platform so that it can be accessed in all of the great ways that we have.”

It has really made it easy for teams to have this single place to go, understand what’s in there, what they’re using, how it’s supposed to be used, and see documentation about it. In particular, for non-technical users it has been a huge leap forward.

“We’ve seen that non-technical users and technical users are both able to go into DataHub, speak the same language, and talk about the same datasets.”

Demo: How Demandbase implemented Iceberg REST Catalog in DataHub

Check out this demo where Ryan walks through Demandbase’s DataHub environment and shows how teams can:

  • Browse all data across warehouses effortlessly
  • Explore table structures in detail
  • Apply tags for data types and ownership
  • Understand partitioning and storage to keep queries efficient and cost-effective

Caption: A walkthrough of how Demandbase uses DataHub Iceberg REST Catalog.

“[With DataHub], it’s been really easy to understand what’s going on with the table, what’s the most recent snapshot of the table, and self-serve a lot of that information—even that technical information—out of the same catalog as we’re serving all of our non-technical users.”

What’s next for Demandbase

Q: What are the next steps for this integration from your point of view?

A: Building out more of the quality aspects is the next big step for us. Now that we can clearly see all of our datasets, it’s much easier to start building our version of the Medallion architecture or Airbnb’s Midas standard: different ways of categorizing data health and quality and flagging what should or shouldn’t be used by other teams. That’s the direction we’re moving in.

Now that we have a place where everything is documented, we can enhance it with additional information about quality, lineage, statistics, tagging, and glossary terms for columns to help us understand the relationships between datasets and what specific columns mean. From there, we can move into that next layer of richer documentation.
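As one illustration of that next layer, quality tiers can be attached to datasets programmatically with the DataHub Python SDK. The tag name, dataset name, and server URL below are hypothetical.

    from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

    # Hypothetical server and names throughout.
    emitter = DatahubRestEmitter(gms_server="https://datahub.example.com")

    dataset_urn = make_dataset_urn(platform="iceberg", name="crm.customers", env="PROD")

    # Overwrites the dataset's tag aspect with a single Medallion-style tier tag.
    tags = GlobalTagsClass(tags=[TagAssociationClass(tag=make_tag_urn("tier-gold"))])

    emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=tags))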

Q: If you could wave a magic wand and add one feature or capability to what you currently have set up, what would it be?

A: The next big thing for us would be automated lineage: the ability to have the Iceberg client, either by itself or with some plug-in, report when it performs a write which Iceberg datasets fed into that final data frame, and to track that automatically. Even going so far as column-based lineage would be amazing.

Even just at the dataset level, we previously struggled with understanding how data is being used and who is using it. Now, we have our technical catalog and our business catalog in one place. Being able to see both of those things and being able to have that automatic lineage for our Iceberg tables would be incredible.
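Until that kind of automated capture exists, upstream relationships can still be recorded by hand through the DataHub Python SDK. A sketch, with hypothetical dataset names:

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    # Hypothetical server and dataset names.
    emitter = DatahubRestEmitter(gms_server="https://datahub.example.com")

    # Declare that crm.customers_enriched is derived from crm.customers.
    upstream = UpstreamClass(
        dataset=make_dataset_urn("iceberg", "crm.customers", "PROD"),
        type=DatasetLineageTypeClass.TRANSFORMED,
    )
    emitter.emit(
        MetadataChangeProposalWrapper(
            entityUrn=make_dataset_urn("iceberg", "crm.customers_enriched", "PROD"),
            aspect=UpstreamLineageClass(upstreams=[upstream]),
        )
    )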

Watch the full August Town Hall recording for the complete conversation with Ryan.

Key takeaways from the Demandbase migration journey

Demandbase’s journey surfaces three lessons for building a stronger data foundation:

  1. Unify the source of truth. A consistent, reliable view of data is essential for decision-making at scale. DataHub’s unified context platform brings together rich metadata and context from across the stack into one accessible location.
  2. Make data discoverable. In distributed environments, simply knowing what exists, who owns it, and how to use it is often a bottleneck. DataHub removes that barrier by making this information visible and easy to act on.
  3. Build shared context. Business and technical users often operate in silos. By integrating technical and business catalogs into a single operational catalog, DataHub helps teams align, collaborate, and enable true self-service—even for non-technical users.

Looking for more insights and the latest DataHub updates? Dive into the recap from our August Town Hall. Or, watch the full recording on-demand.

Want to stay connected with real-world data stories, product updates, and expert advice? Join the DataHub Community. With 13,000+ members, you’ll gain direct access to product updates, exclusive content, and peer-to-peer knowledge sharing—while helping shape the future of DataHub.