DDL Ep 01: To Data Catalog or NOT to Catalog?

We are thrilled to introduce DDL — Decoding Data Leadership, our series featuring conversations with data leaders from around the globe and across diverse industries.

While discussions today often revolve around technologies and tools, DDL shifts the spotlight to the leaders navigating the complexities of the data landscape. Our show dives deep into the challenges, strategies, and lessons shared by those riding new technology waves and steering data teams through choppy waters.

From navigating complex infrastructures to making the most of the latest tools to cultivating a culture of data excellence, join us as data leaders share their journeys and insights on tackling the challenges of leading and running large data teams and initiatives.

And now, on to our first episode.

To Data Catalog or NOT to Catalog?

That’s the question we tackled head-on in our inaugural episode of DDL — Decoding Data Leadership. Data catalogs are a hot topic that truly divides data leaders.

In this session, I sat down with three seasoned data leaders for a spirited discussion and debate on the role of data catalogs in the modern data landscape — and the larger scheme of things, as they see it.

Meet our panelists:

  • Benn Stancil, founder of Mode: sees data catalogs as a bit of a money-hungry black hole.
  • Taylor Brownlow, Head of Data and Product at Count: makes a case for an alternative approach to data management.
  • Shirshanka Das, CTO & co-founder of DataHub: thinks it’s high time for a makeover of data catalogs.

Watch the video to hear the panel’s insights, perspectives, and recommendations on all things data cataloging.

Read on for a slightly edited version of the conversation.

Maggie: We all collectively bring decades of experience in the data space. Data catalogs aren’t new — they’ve been around since companies started using data. But how they manifest, how people deploy them, the tools used, and the problems they solve continue to evolve with our modern data offerings. To start, could you share your interpretation of what a data catalog means to you?

Taylor: A data catalog is a place to document data — what data columns mean, their lineage, and where the data is used. It helps people unfamiliar with the data understand how to use it, and use it well — ‘well’ being the keyword here.

Benn: It’s a theoretical place full of theoretical information that is theoretically useful.

Shirshanka: The data catalog represents the warehouse of all metadata that exists. It offers an exploration experience on top of this combined information. So, documentation is part of the story, but it’s more about information — literally building an information schema for your entire enterprise and making it accessible to both technical and non-technical personas. That’s my take on what a data catalog is — it’s about the information backing it and the user experience on top.

Maggie: Shirshanka, I’m curious — in the modern data catalog, who are the types of data practitioners that actually benefit from it? Benn mentioned theoretically, information is stored there. Who are the target personas involved in the implementation details, and who benefits from it?

Shirshanka: Multiple personas need to be targeted by this one tool. In my experience, that’s the secret to its success. If you only target one persona, then you only get limited success in one team or role.

I’ve seen multiple implementations of data catalogs that have been successful. In all those cases, there has been a focus on the non-technical user. Someone who says ‘I want to understand the definition of this metric.’

But you also have the technical persona that’s going in and saying, ‘Did my pipeline run? Did it produce good data? Did it do what I expected it to do?’

Further upstream, you could have a production engineer working on the online application, being able to say, ‘I’ve got this MySQL table over here and I’m about to make a small change to it. What’s it going to impact?’

If you can cater to all these personas, then you have something sticky and useful.

Maggie: Benn, keeping that in mind, or maybe you want to explore another angle, who are the personas that stand to benefit the most but typically benefit the least?

Benn: Well, in theory, every persona benefits because they’re the ones to do analysis on top of stuff and understand how it works and what the data means.

I think the reason BI people aren’t pro data catalog is that we’ve been burned by them a lot. It’s not that the abstract idea is bad; it’s that data catalogs have all the information, but it’s often out of date, can’t be relied on, or isn’t complete. The challenge isn’t that I think this is a bad product — the challenge is that it’s a dictionary that never quite matches reality. I feel burned because I’d love to have one; I’ve just never seen it work. You made a point about it being an information schema. There’s also an actual information schema in the database that tells you what the columns are, and that I trust, because it’s generated automatically by the database. If we had an information schema that was as reliable as information_schema.tables, I’d love that. But that’s not what we have.
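(A concrete aside: the ‘generated automatically by the database’ property Benn describes is easy to see in miniature. The sketch below uses Python’s built-in sqlite3 as a stand-in for a warehouse — an assumption for illustration, since a real warehouse would expose information_schema.tables instead — and shows schema metadata coming straight from the engine, which is why it can’t drift out of date the way hand-written documentation does.)

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative assumption;
# a real warehouse would expose information_schema.tables for this).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, revenue REAL, created_at TEXT)")

# SQLite's equivalent of querying the information schema: the metadata
# is produced by the engine itself, so it always matches reality.
columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]
print(columns)  # ['order_id', 'revenue', 'created_at']
```

The point of the toy: this kind of technical metadata is trustworthy precisely because no human had to remember to update it.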

Maggie: That must be a lived reality for folks — where you need information, go to a place, and what’s there doesn’t match reality — it’s outdated, stale, etc. So I’m curious, Benn and Taylor, how have you seen the ‘not to catalog’ approach solved from a tooling or culture perspective? The speed at which a company moves, the speed at which data is generated, new sources created, transformed, dumped somewhere, presented — that’s moving so rapidly. If a static catalog, ideally updated regularly, is unreliable, what’s an alternative? How have you seen folks address that problem?

Taylor: I think on the piece about it being incorrect, it’s more that it’s incomplete. When people come to a data catalog, they don’t just need technical definitions; they need context on why decisions were made. That’s hard to keep up to date and to explain at the right level of detail.

The best way I’ve seen that knowledge transfer happen is through calls or meetings. It seems very analog, but the teams I’ve seen succeed set up regular meetings with the business team. Every week, they go over metrics and explain them. Over time, this discussion develops into a deep and rich understanding of the data. There’s no replacing that interaction that needs to happen in person. If a catalog is used to replace that interaction, no one’s going to be happy. If it’s used to facilitate that discussion, it can be very helpful. But if it replaces relationships and conversations, that’s where many issues arise. The main thing is to find a way to have those discussions. Many issues with data catalogs become less important once you have that set up.

Benn: I agree. The one time I’ve seen this be successful is when it leans on tribal knowledge. Because tribal knowledge is difficult to pass around, you get a much narrower set of things that people are aware of. You recognize people aren’t going to remember a thousand metrics, so the vocabulary you use becomes much more restricted.

That’s a useful dynamic. It’s not ideal to rely on tribal knowledge, but it narrows the scope of what you’re trying to talk about, and I think that’s good in a lot of ways, primarily for focus. So there are some benefits to doing it that way. It’d be better if we had proper documentation and all that, but absent that, I don’t think it’s necessarily all that bad because those corrective mechanisms are actually kind of a good thing.

Shirshanka: I think, funnily enough, I’m actually in the same camp of saying that the data catalog is there just to remove unnecessary conversations and to ensure that only the highest quality conversations happen. So, the simple stuff, like where did this data come from — all the stuff you can get out of technical systems automatically today, right?

Let’s not try to discover that through tribal knowledge.

And who do we even talk to? Often, we see teams that are a couple of hops upstream of your direct team. But if you have a question about a particular event schema that’s five hops upstream, you can find the team that owns that schema and set up that high-value call. Those are the things we think about in terms of what we’re doing. We’re about connecting consumers to the producers, but only for the high-value conversations. For everything else, we can self-serve using the information that’s in there.

To Benn’s point, I think a lot of the complaints about information being stale/outdated/wrong are conflated between technical information and curated, documented, human-generated information. On the technical front, the industry has made a lot of improvements. Most modern data catalogs can pull metadata out or have metadata pushed into them as quickly as data gets generated. So, technically, we’re in a good spot as an industry.

Now, on the human context capturing part and reflecting that, that’s been an age-old challenge, even for software documentation. Some attempts at shifting left and having documentation laid alongside code have proven to be more durable than others. I think a combination of those techniques helps keep information as current and live as it needs to be.

Benn: What do you mean by technical metadata in that sense?

Shirshanka: Technical metadata is information about the assets in a system. For example, in Kafka, it would be the topics that exist. In a database, it would be the tables, schemas, etc.

Related to this is lineage, which you don’t usually get in an information schema. How was this table derived from an upstream table, topic, or data lake folder? What process was used to transform it? Was it a Spark job or an Airflow task, and what was the logic behind it? Those kinds of things I consider technical metadata.
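(A concrete aside: the impact question from earlier — ‘I’m about to change this MySQL table; what will it hit?’ — is exactly what lineage metadata answers. Below is a minimal sketch of that traversal in Python; the asset names and the edge list are invented for illustration and are not DataHub’s actual model.)

```python
from collections import deque

# Hypothetical lineage edges: upstream asset -> downstream assets.
# All names here are made up for illustration.
lineage = {
    "mysql.users": ["s3.raw_users"],
    "s3.raw_users": ["warehouse.dim_users"],
    "warehouse.dim_users": ["dashboard.weekly_actives", "warehouse.fct_sessions"],
}

def downstream_impact(asset):
    """Breadth-first walk of the lineage graph: everything a change could affect."""
    impacted, queue = set(), deque([asset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in impacted:
                impacted.add(child)
                queue.append(child)
    return sorted(impacted)

print(downstream_impact("mysql.users"))
# ['dashboard.weekly_actives', 's3.raw_users', 'warehouse.dim_users', 'warehouse.fct_sessions']
```

A production engineer asking ‘what does my MySQL change break?’ is just running this query over a much larger, automatically harvested graph.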

Maggie: There’s a vast variety of metadata to be cataloged — technical, business process, authorization, quality, etc. Shirshanka, can you touch on this a bit more? When we vaguely or broadly talk about technical metadata, how far does that reach?

Shirshanka: Generally, technical metadata refers to information that’s native to a system and means something within that system. If you look at a relational database, technical metadata includes the tables, schemas, privileges, roles, and all that stuff — things you get by querying the information schema. Moving slightly upstream into event-oriented or operational systems, technical metadata might even include API definitions and GraphQL schemas.

Maggie: With all that in mind, Taylor and Benn, do analysts care? All the technical metadata that we can capture — is that meaningful? Does it matter?

Taylor: I think a piece of all this is an emotional need to control and manage all this information, to save it and keep it. It’s overwhelming to think about all the knowledge we have about our data. With that overwhelming feeling, we want to commit and save it, which makes us think it’s more manageable. We have to give up on that objective. We’re never going to have it all or keep all the information accessible. Instead, we need to think about what the most important thing is to keep.

So, to your point, no, probably not all technical metadata is important. We need to ask ourselves in different situations what is the most important thing. If it’s about helping stakeholders be confident in using data, what do they need? If it’s about analysts understanding new tables or columns, what information is most important to them? We can’t have it all, so pick the most important thing and optimize for that.

Maggie: In the absence of a catalog, what have you seen be successful in terms of collecting and using metadata in practice?

Taylor: I think it comes back to which interactions you’re trying to optimize for. On the side of working with stakeholders so they feel confident, the tool itself matters less than the process you have in place. It’s more about opening the door to have discussions at the right time. Based on those discussions, you build up what you need.

For instance, if they don’t understand a metric, you explain it. If they ask why a metric had a weird spike, you add that explanation. You build it up over time. Many tools now allow embedding, so you can start with basic things and see what you need and what works before scaling. Maybe that means getting a full-blown data catalog, or maybe it doesn’t. It’s about building what you need.

Maggie: With that in mind, I’m curious, Benn and Taylor, at what points would you consider bringing in a catalog? How do you know it’s time?

Benn: That’s a tough question. The point of all of this is to answer two questions: What does it mean? Can I trust it?

Those things are overwhelmingly answered by the behaviors other people follow.

The technical information you get isn’t always enough because you still need to know the source. For instance, if there’s an event schema, you need more than just its cataloged presence.

And then, can I trust it? That comes down to whether other people you trust also trust it. For example, if it’s in an executive dashboard and has been there for the last nine months, you can probably trust that the number is right. Trust gets built on top of existing trust.

So, when do you use a data catalog? When you’ve already solved some of those trust and understanding issues. The organization should already have some inherent trust and understanding of the data. You need a place to write down all these things for reference.

A data catalog is useful when the problem is already somewhat solved, and it becomes a resource for finding things people already trust. It’s not about solving trust issues from scratch. Trust comes from within the organization, and a catalog is useful when the groundwork has already been laid.

Maggie: Here’s a question from one of our listeners. With capabilities like GenAI and semantic models, do we really need to catalog that technical metadata at all?

Is there a way — from what you are seeing with GenAI in particular — that catalogs might be able to expedite some of that? We’re still building consensus, thinking through how we build that trust metric or that knowledge bank.

Benn: No. I don’t see how GenAI helps here. In some ways, it makes things more unhinged. The places where I’ve seen people attempt to use GenAI in this world are for automatically creating a bunch of documentation or creating a data catalog on the fly.

I guess the only place where I can see GenAI helping is in describing data nicely — like a general AI type of thing. But if we’re applying it to the data cataloging world, it could be useful as a search function across a bunch of different unstructured data sources like dbt jobs, things people wrote down in Google Docs, dashboards, etc., where you can ask it to find something.

So, this question is basically about a synthesis layer across disparate things. I could see it being roughly useful there. I don’t really see that as being a catalog; it’s more like a GenAI-based search across data assets — which is roughly what I think Snowflake and Databricks try to do with their cataloging approach.
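(A concrete aside: the ‘synthesis layer across disparate things’ Benn describes can be caricatured as a ranking function over heterogeneous sources. Everything below — the source names and their descriptions — is invented for illustration; a real GenAI search would use embeddings or an LLM rather than word overlap.)

```python
# Toy corpus of disparate metadata sources (all names are hypothetical).
docs = {
    "dbt:models/revenue.sql": "sum of order totals net of refunds",
    "gdoc:metric-definitions": "revenue is recognized monthly, net of refunds",
    "dashboard:exec-weekly": "weekly revenue and active users",
}

def search(query):
    """Rank sources by how many of the query's words they mention."""
    terms = set(query.lower().split())
    scores = {key: len(terms & set(text.lower().split())) for key, text in docs.items()}
    return [key for key, score in sorted(scores.items(), key=lambda kv: -kv[1]) if score > 0]

print(search("revenue net of refunds"))
# ['gdoc:metric-definitions', 'dbt:models/revenue.sql', 'dashboard:exec-weekly']
```

Even this crude version shows the shape of the idea: one query surfaces the dbt model, the Google Doc, and the dashboard at once, which is the ‘search across data assets’ role rather than a catalog proper.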

Shirshanka: Benn, when you said the fundamental problem that data catalogs are here to solve is ‘Can I trust it?’ I think that’s true for one persona, the analyst. But there’s also the central data team asking questions like ‘Why are my Snowflake bills so high this month compared to last month?’ and ‘Where does my sensitive data live?’ These centralized concerns are also things that good data catalogs address.

Depending on the persona we focus on, we can debate the utility of the information in the catalog. For the analyst persona, the question ‘Can I trust it?’ and your assertion that trust must already exist for a catalog to add value are important. However, sometimes trust doesn’t exist because there’s no visibility into how things flow. This is more true for larger organizations.

If your data team is one person with a mono repo using DBT, and they know the five people upstream, a data catalog might not be necessary. But even at Acryl, with just three people, we’re having debates about our community stats and analytics. We found the need to document and trace the lineage of our data from Fivetran to the lake and back into Snowflake. Despite having a simple, modern data stack with only two hops, we see the need for technical lineage to build trust.

So, my current experience says that while human trust is easy to get, it’s much easier to achieve when it’s based on technical trust signals. A robust catalog provides that foundation.

Benn: I think that’s true in a bit of an atmospheric way where there is a sense if something is monitored and there are dashboards around it. Then people go, ‘We can trust this a little bit more’ because it’s reliable and it’s data-driven.

To your point about using a data catalog to track activities and establish trust in the system: that, to me, expands the scope of a data catalog beyond just cataloging data assets. It then becomes a single pane of glass for your data ecosystem, showing how everything is interconnected. It’s not so much about cataloging as it is a control plane for the entire data environment.

That’s certainly valuable — but it stretches the definition of a data catalog — into a data management platform.

Taylor: On the lack of transparency, is that a failure of BI tools rather than a need for another tool? If our data presentation tools lack transparency, and that’s crucial for trusting the numbers, it seems more like a tool failure than a need for something else.

Shirshanka: I wouldn’t blame a BI tool for lacking visibility into upstream processes. BI tools typically connect to data warehouses and offer lineage for what they manage. Upstream, data might be transformed or sourced from various places.

The BI tool shouldn’t be expected to understand where third-party data is coming from and transiting through multiple layers before landing in the warehouse.

Taylor: I think a lot of what we still look to catalogs for today — like column definitions — should ideally be integrated within the same context. It’s part of the user experience. It’s disjointed and unhelpful to have to build trust in one tool by relying on another. So, do you see data catalogs differently in terms of where they fit if you’re using them to trust numbers? Ideally, that trust should start where you directly engage with the data.

Benn: I think BI tools are all great (laughs). But, I understand your point and it makes sense to me. I wouldn’t say it’s a failure of BI tools. I think it’s more of a failure of the data team.

It’s more about how data teams often publish broken content in BI tools and fail to maintain it. If I visit a website that doesn’t work, it’s not the fault of the front-end technology — it’s usually because the people who built it allowed something broken to reach users. That’s the core problem here. Expecting BI tools to manage complex upstream data flows, as Shirshanka pointed out, is asking too much. The real issue is that data teams have often operated with unclear responsibilities, sending out dashboards and reports without ensuring they stay accurate and maintained. There are no clear production lines or SLAs. It’s a mix of consulting and product-building roles that leads to this situation where we often accept data that’s only half-working.

A BI tool, data catalog, ETL tool, or warehouse won’t fix that.

Shirshanka: I would argue that the data catalog is in the best position to fix it — because it’s the only one with the express intent of trying to understand the entire journey of data, not just the data as it lives in the analytics plane.

So, I think, you know, this debate, this problem that we’ve had is the classic OLTP/OLAP divide, right?

You’ve got operational systems that are meant for transactional processing, and that’s what they’re optimized for. There’s an entire line of business that’s focused on that work.

And then you’ve got the entire data team that’s responsible for taking the downdraft of all of that data that’s generated by operational systems, and then making sense of it all for the business. That’s been the responsibility of data teams forever. And what we continuously find — and it shows up in data quality problems as well — is that the stuff you can control, you try to make it as good as you can.

Taylor, to your point about making the BI tool more transparent and making sure we can trust the computations — that gets you locally correct definitions and locally correct realities, but they’re very much impacted by incorrect assumptions about the data feeding into them. So, of course, 80% of problems can be fixed there, but maybe 20% of them come from upstream, where someone stopped producing to a particular column because they moved from, say, user ID to user agent, and now that’s the new column that carries all the definitions.

No analyst could have prevented it because it just came in as a new column in the table, and you can only react to it later.

So data catalogs are in a better place to solve these problems — if they go upstream of the warehouse. And that’s why, Benn, to your point, I’m fully in favor of the catalog as an app on top of the control plane. If you don’t have a layer that can actually connect to all of these things and pull all this metadata into one place, then you cannot have a catalog that works for the enterprise. You only have an analytics-focused catalog.

In which case, you’re absolutely right — perhaps the BI tool is just good enough. Or, if the BI tool is responsible for all the transformation anyway, it can also function as that internal catalog.

Benn: That’s an interesting point. What’s driving my skepticism of data catalogs and data contracts, etc., is that data breaking is a relatively small problem. If pipelines stop working, it’s obvious because the number went from 146 to zero. I think that’s a data problem.

What makes things much messier is less that the data broke and more that revenue got defined in eight different ways across a bunch of different dashboards because people did it at different points in time. Everybody had a slightly different version of doing it, and it changed along the way too.

It’s not a problem with the data pipeline. We never really agreed on something — or we did, and then changed it by the time it was actually implemented. It’s like we’re trying to transform a bunch of data into a complicated business concept whose definition is always changing. So stuff moves around a ton, and none of it is consistent. That’s the much, much bigger problem.

Taylor: I completely agree. On the point earlier about giving out broken things, in my experience, it’s similar. It’s not necessarily that the numbers are wrong; they could be correct, but there’s no way of confirming it. So understanding becomes the true value. Often in analytics, we come together, agree on how to define a metric, work through it, and establish consensus. That agreement often lasts longer than any dashboard or documentation. While those are helpful afterward, the real value lies in that initial discussion where everyone aligns on what they’re measuring. Anything that facilitates that discussion is crucial, in my opinion. It’s where it all comes back — you just have to dive into that discussion.

Maggie: How do you drive consensus in building these practices or these standards to define a metric, to agree on how you’re going to define it? Catalog or otherwise, what do we see as successful in either culture-building tooling, automation, etc.?

Benn: I think it’s 100% about culture. Tooling reflects culture, period. It doesn’t matter what tools you use; it’s entirely cultural. Words have dictionary meanings, but you see how youngsters use them as they see fit. We can tell them otherwise, but that’s how they interpret them.

Rules don’t matter; what matters is people’s actions. You have to reinforce culture repeatedly to ensure everyone understands what metrics mean and what is required. The tools simply reflect that.

Taylor: It’s also about understanding why people resist. Underneath that, there’s usually mistrust — doubts that it will matter, or a sense that it will be forgotten. They might not see the value. You need to have that honest conversation to explore their concerns, and then you can reassure them about its importance and how it benefits them.

If you try to change culture, you need to start small — where they agree to try it, see its usefulness, and then build from there.

Maggie: I wonder if there is a place for a catalog in reinforcing that culture or understanding the ‘why.’

Benn: I think it’s useful to do that. It shows that you care more. Having things written down is a signal of what you prioritize. It’s part of our culture to say, ‘This is what matters to us.’ Writing it down continues to push that culture forward.

Taylor: I broadly agree. It’s all about how it’s used. If someone asks a question and you refuse to answer because ‘it’s in the catalog,’ that’s a poor demonstration. On the other hand, if you bring the catalog into conversations and use it to enhance understanding, then it becomes a valuable tool.

It really depends on how it’s integrated into your workflows and how people choose to use it.

Maggie: Shirshanka, what insights can you share on enabling people to fill in documentation gaps and transfer institutional knowledge effectively? Any guidance on rallying teams around the catalog?

Shirshanka: Most of my guidance is based on what I’ve seen at companies like LinkedIn, Etsy, Pinterest, and Stripe, which have seen successful adoption of data catalogs with thousands of monthly active users. They didn’t try to make a catalog a platform for data discussions. Instead, they began by offering something useful. It could be as simple as using it to access a data set. For example, at LinkedIn, analysts began liking and using DataHub after we built out a feature — a metrics platform — to compute metrics. DataHub would be the place they would go for metric refreshes, and to request data backfills, etc. Stripe used DataHub for visibility into its pipelines — so data engineers could monitor if their pipelines were running on time.

Different teams have found different ways to make it useful first before making it the central place where documentation lives and where this higher-order human context starts living.

If you are trying to roll out a data catalog, first give your stakeholders early value for things they’re already doing, or critical workflows they just need to accomplish, period. That, I think, is step one. Step two is what Taylor was talking about: you start building that culture in the act of answering questions.

We had a setup at LinkedIn that made it somewhat promotion-driven. There was a committee that would evaluate engineers — and one of the things that was part of the promo packet was their DataHub profile. Basically looking at what they own, how well-documented things were, etc.

And that, again, led to a lot of people caring about how that profile looked.

So, there are a few different hacks. Pick your favorite hack as your first adoption and stickiness vector, and then build from that.

Maggie: Promotions and evaluations tied to metadata — that’s definitely a first.

Tons of ideas and insights here today — thank you so much, Benn, Taylor, and Shirshanka, for joining us for our first episode. It’s been an honor.