Humans of DataHub: Atul Saurav

We are back with another round of Humans of DataHub! This week we are joined by Atul Saurav, Senior IT Architecture Design Manager, Data Governance at Genworth. Atul shares his journey to DataHub, his learnings along the way, and what features and use cases he’s most excited about.


Humans of DataHub interview with Atul Saurav

Conversation Transcript & Highlights

Edited for brevity & clarity

Maggie Hays: Hey folks, welcome back to another round of Humans of DataHub. This week, we are joined by one of our community members, Atul. Atul, go ahead and introduce yourself. Tell us about who you are, where you work, and what you do.

Atul Saurav: Hi everyone, and thank you Maggie for having me. I am Atul Saurav, I work for Genworth Financial. We are a financial services organization, which means we have offerings in insurance, we have offerings in annuities, we have offerings in mortgage insurance, and so on. In the organization, I am the architect of the Data Governance team. What that means is that in financial services, or insurance, which has always been a somewhat dated industry, the concept of governance is not new; it’s always been around. But what’s new, and what I’m particularly excited about, is how we’re approaching it. It has always been very manual, collaborative work, and it continues to remain collaborative. But we are slowly trying to remove that manual aspect of it and get most of it automated, so people can start seeing value in the data that they have.

Maggie Hays: That’s awesome. How did you even find out about DataHub? What brought you to the community? Obviously, if you’re working towards automated data governance, you have come to the right place. We’re so happy to have you. But yeah, how did you find out about us?

Atul Saurav: It was the Strata Conference of 2019, and I witnessed Shirshanka’s talk. I was thinking to myself, “This is what we should be doing!” because Shirshanka talked about the [data] warehouse, and the name was interesting: “WhereHows.” And what particularly excited me about the talk was that it covered some of the challenges that we [Genworth] were going through. Those challenges were things like, “Hey, we have this process, this pipeline that feeds this table, and we want to be able to understand the lineage, we want to understand better what’s going on. So, you do governance; how about you figure out all this stuff and demystify it for me?”

Maggie Hays: Should be easy, right?

Atul Saurav: Right. So what Shirshanka shared was that LinkedIn had already tried the full aspect of things: going into the content and parsing things out. And that’s precisely where we were at that point in time. And all the challenges that he spoke about clearly resonated with me. So I was like, “Yeah, I got to do this.” I was totally thrilled with the talk, and I met Shirshanka right after it. At that point, we had already signed a contract with one of the commercial products out there. We kept looking at DataHub from that point in time, but we were like, this contract extends for many years, which means that if we want to go in a different direction, it’s not going to happen sooner than that… so we thought, “Let’s see where we can get with what we have, right?” Because we’ve invested so much. So that’s what was going on for quite some time. But we were able to leverage a lot of that learning from the other commercial offering. And we were so happy that [DataHub addressed] some of those challenges that we had with the product.

More specifically, there are a lot of data catalogs and data governance products out there in the market today. And for a lot of the ones we have evaluated so far, the pricing model is such that, okay, if you have one execution of this, that counts, right? Which means that if I do repeat executions, I’m paying more money out. Or if you have X number of data sources, then every connection that you make to a data source to pull some metadata out of it counts towards your license. And Genworth is a company that grew through acquisitions for many years; we have a lot of data across lots of different data sources, which is just there because of the nature of the company. So all those pricing models just didn’t make any economic sense.

Maggie Hays: You’re almost punished for your company growing and going through acquisitions.

Atul Saurav: Yeah, so the thing that excited us about DataHub was, now I can have one catalog where pretty much everything that exists in the organization is discoverable. And that was our challenge, right? Initially, when we did our work with other products, we would constantly meet users. And they would say, “Okay, this is great, but where is my stuff?” [We’d respond] “Here it is, you can look for it.” And it would always be things that are not there, because they’re always focused on saying, “Okay, what is important for the organization? Because we can only have X number of connections based on the share.” So they will always come up with something which we have not cataloged. And that brought down the level of trust in whatever you’re trying to offer to them. And that was like a huge setback every time: two steps forward, one step back… two steps forward, one step back.

And we were able to get rid of that bottleneck with DataHub because we can totally connect to all different kinds of sources, bringing everything together. Now if someone says, “Hey, where’s my content?” I say, “Search for it, you should find it. Let me know, because everything has to be there!” That’s quite compelling in itself, because any data user (Analysts, ETL Developers, BI report writers, Data Scientists, whoever) usually comes in with a perspective of “I know this data, and this is the area that I work in.” But is that correct? Or is that even true? Because there’s probably more data out there. And if I don’t do that due diligence, then I may just be portraying a partial picture with the limited knowledge of the data that I have. So being able to create that full picture of the entire data landscape in the organization, that’s really exciting.

Maggie Hays: That’s amazing. I mean, I don’t love the idea of losing trust. But I hear this a lot from Community members: we need a high level of trust when end users come in; it’s so important for that investment. And it’s also extremely easy to start losing trust very fast, right? And so it’s like, you get a few chances to get people to understand what’s going on and to capture their attention. If there are these glaring gaps in what data is available without an easy-to-understand reason why, users will quickly lose trust in the product. They won’t care that you had a limited number of connectors you could set up; that’s not an intuitive reason why data may be missing from the platform.

Atul Saurav: Not to say a particular piece of data is not important, right? I mean, what may be totally important to you may not be important to me, and in that meeting, that day, you will not trust my product.

Maggie Hays: Super interesting. So thinking about your work within DataHub now and rolling it out within your organization, what are some of the most powerful use cases or features within DataHub that you’re seeing?

Atul Saurav: Just the sheer fact that I can have all the data in the organization discoverable, that in itself is quite powerful. People have been asking for column-level lineage. We had that in the previous product that we were working with, and we had mixed reviews about it from our users: some users totally loved it, and some users were like, “This is not true. It can’t be so complicated.”

Then we’re like, “So if you feel this is complicated, it’s kind of a good eye-opener at this point.” Because if you, as a business user, think there need to be only three steps in this process, then maybe it warrants a discussion with the developer who built it. They probably were trying to solve some problem that you’re not aware of, or maybe they overcomplicated [it] because they did not understand something. And you could say, for that person, that the data becomes much more robust, more usable, and much more trustworthy, just based on that discussion. So with column-level lineage, even though it’s very powerful, we are kind of stepping slowly, as I believe it will slowly mature. And we’re happy to contribute ideas on that as well, based on our past learnings.

The other things which came in this year, some of which we are using, like being able to model new concepts into DataHub with minimal code, that’s pretty powerful. DataHub comes with the most commonly observed kinds of concepts, but it doesn’t prohibit us from creating something new; that’s really exciting. The Actions Framework that came in earlier this year, that’s pretty powerful. We intend to use it, but we have not used it yet. The Actions Framework is like, if I tag something here, and there’s an action looking for that tag, then based on that, it triggers some events. That’s totally powerful and we want to leverage that.
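[To make the tag-triggers-an-action pattern concrete, here is a minimal Python sketch. It is not the DataHub Actions Framework API; `TagEvent`, `ActionRegistry`, and `request_review` are hypothetical names invented for illustration, and the "action" is just a function that returns a message.]

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TagEvent:
    """Hypothetical 'a tag was added' event; not a real DataHub event type."""
    entity: str
    tag: str


@dataclass
class ActionRegistry:
    """Maps a tag name to the callbacks that should run when that tag is applied."""
    handlers: Dict[str, List[Callable[[TagEvent], str]]] = field(default_factory=dict)

    def on_tag(self, tag: str, handler: Callable[[TagEvent], str]) -> None:
        # Register a handler that is "looking for" this tag.
        self.handlers.setdefault(tag, []).append(handler)

    def dispatch(self, event: TagEvent) -> List[str]:
        # Run every handler registered for the event's tag; other tags trigger nothing.
        return [handler(event) for handler in self.handlers.get(event.tag, [])]


def request_review(event: TagEvent) -> str:
    # Stand-in for a real side effect, such as opening a ticket or sending a message.
    return f"review requested for {event.entity}"


registry = ActionRegistry()
registry.on_tag("pii", request_review)

triggered = registry.dispatch(TagEvent(entity="policies.applicant", tag="pii"))
ignored = registry.dispatch(TagEvent(entity="policies.applicant", tag="deprecated"))
```

[Here `triggered` holds the result of the one registered action, while `ignored` is empty because no action is watching the `deprecated` tag.]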

Maggie Hays: I’m curious about the Model Extension. I think the no-code metadata modeling is, to me, one of the most powerful things we can offer, right? Where it’s like, we’re gonna start with the bare bones of common entities: we have datasets, dashboards, charts, glossary terms, you know, tags, all of these things that we know are common across data stacks. But maybe your organization talks about, or thinks about, entities in a slightly different way. So let’s just make that easy for you to introduce so that it’s an intuitive and more native-feeling experience for your end users.

I’m curious, can you share some examples of the customization that you’ve done?

Atul Saurav: Data processes and data pipelines are great. That’s the reality, or that’s the physical reality of how things are laid out in the organization. But there’s always an abstraction on top of it, which people usually talk about, like business processes and such.

If you think about it from that perspective, you know, our business process is: policies come in, then they go through the underwriting process, an underwriter approves it, and then it becomes a signed policy. That is the entire lifecycle of an insurance policy.

Now, if you look at it from a data lineage perspective, that’s a fairly complicated lineage on the backend. So someone who is totally aware of all those intricacies can benefit a lot from the lineage, because they can go in and say, “Okay, if we are capturing this new attribute from a user, then where does it impact? What does it change?” and things like that. But for a person who is maybe new to that space, or who wants to understand things at a higher level, or maybe even an executive, right? They want to say, “Okay, so we launched this new feature, new coverage, new product, whatnot. And I want to see how it impacts things, or how things are going with this new launch.” Okay, it’s a new field in some form, and who knows, bad data may be going through all of that. And it’s not sensible for us to show an executive full-blown column-level lineage, saying, “Hey, you know what, the thing flows here. There’s some bad data here. So things are blowing up.” It doesn’t make sense. So we also believe in higher-level, abstracted business processes. And that is one of the cases that we’re trying to model out with DataHub; the idea is that I should still be able to attach data assets to the process so that, let’s say, if a particular data asset has a quality issue, or has an operational issue going on (it has not been refreshed for the past five days, or whatnot), those things can start to be surfaced in real time.

It then becomes a place where you can look at the data and understand what your level of trust in that data is. So that is really pertinent. Something I forgot to include earlier is that it’s all about the enterprise metadata graph that you’re sort of trying to build indirectly. For that reason, we use Neo4j as our graph backend, because we want to be able to annotate the graph, make it rich, and be able to query data in different ways. So I think it’s about being able to introduce new concepts into the graph, being able to form those connections, and being able to portray that picture at different levels in the organization, so people can still gain value out of it.
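[One way to picture the business-process abstraction described here is the small Python sketch below. It is an illustration only, not Genworth’s model or a DataHub entity definition: `DataAsset`, `BusinessProcess`, the asset names, and the dates are all hypothetical, and the five-day staleness threshold simply echoes the example in the conversation.]

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DataAsset:
    """Hypothetical physical asset (table, file, report) with basic health metadata."""
    name: str
    last_refreshed: date
    quality_ok: bool = True


@dataclass
class BusinessProcess:
    """Hypothetical higher-level concept that data assets are attached to."""
    name: str
    assets: List[DataAsset] = field(default_factory=list)

    def issues(self, today: date, max_age_days: int = 5) -> List[str]:
        # Roll asset-level problems up to the process, without showing lineage details.
        found = []
        for asset in self.assets:
            age = (today - asset.last_refreshed).days
            if age >= max_age_days:
                found.append(f"{asset.name}: not refreshed for {age} days")
            if not asset.quality_ok:
                found.append(f"{asset.name}: quality issue")
        return found


underwriting = BusinessProcess(
    name="Underwriting",
    assets=[
        DataAsset("applications", last_refreshed=date(2022, 10, 4)),
        DataAsset("underwriter_decisions", last_refreshed=date(2022, 10, 9), quality_ok=False),
        DataAsset("signed_policies", last_refreshed=date(2022, 10, 10)),
    ],
)

report = underwriting.issues(today=date(2022, 10, 10))
```

[An executive-facing view would read only `report` for the “Underwriting” process, while the column-level lineage behind each asset stays available to the people who need it.]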

Maggie Hays: That’s really amazing. I love the idea of that. You know, in my experience, working as an analyst but then also building out resources for analysts, or building out resources based on other applications’ output, and building all these really intricate dependencies, it can become really easy to lose the conceptual mapping to the business. And so I really like that idea of how you tie a data entity back to a workflow or, you know, really the inner workings of the business, right? And make it a little bit more concrete? And not just, “We have thousands of pipelines, and they’re all equally important.”

Atul Saurav: I mean, those things are important, right? All pipelines are important. But what does it mean at the end of the day? Where does that rubber meet the road? And one of the discussions we were recently having was all about, okay, we’re all starting to use AI and machine learning and stuff. Because now, regulation is just catching up on the data engineering space. So it’s like, you’ve got to be aware, that thing is going to learn, it’s going to learn. We’re all trying to figure out what that is gonna look like. And from our perspective, it’s like: if they’re able to capture enough metadata to create a true picture and story out of what really happened, then we should be able to meet any regulators. Again, that notion of no-code, low-code extensions, and being able to annotate the entire enterprise graph with the concepts that may be relevant to my organization: very powerful, and it enables some really cool things.

Maggie Hays: It sounds like you have plenty of existing use cases and places that you want to go and ways you want to leverage DataHub. I’m curious, thinking forward… I mean, we’re now in Q4 of 2022, so thinking through the next six months, 12 months, specifically within the DataHub product space, are there any use cases you want to see us tackle?

Atul Saurav: Like I said, we are closely observing how the column-level lineage matures, because we have seen both sides of it; some users in our organization are still actively looking for it, so that is one thing we are closely following. On the low-code side, we have some ideas, which we’ll probably share, around enhancing the low-code, no-code extension as well, because there’s some scope for improvement there where it can get much better, because usable patterns occur all the time. And it’s about how we add more of those, so it becomes much more powerful. So that is that space. One of the other areas is scale and performance. Because of how we are, and where we are today, we do bring in a lot of content. So just based on pure ingestion of metadata at a dataset level, we are a pretty decent-sized deployment of DataHub. As we go into column-level lineage, that is going to explode.

One other area which we want to keep close tabs on is scalability for column-level lineage. The architecture is very well thought out, and I’m really happy about how you guys have done that. But then, as we go into column-level, it’ll get really big and hairy. And with that, we’re gonna be [working with] users like us who want to say, “Hey, I want to bring in my own concepts, and I want to enrich this further.” That will be very exciting.

Maggie Hays: I think with the column-level lineage stuff, in particular, I’m excited to understand how we surface it in a way that’s meaningful for end users. Because it’s not only the performance part of it, right? There’s a human at the end of it that needs to make sense of it all. So how do we surface the intricacies between column-level dependencies in a meaningful way, especially for organizations that are dealing with hundreds of thousands of datasets? How do we actually show that in a way that people can look at and say, “Cool, I’m now better informed and not more confused,” right? So yeah, I think that’s going to be a really exciting challenge for us: not just to take column-level lineage off of the list, but to make it an impactful and reliable resource for folks. Yeah, very excited about that.

Atul Saurav: Let me know, I’d be happy to collaborate on that!

Maggie Hays: Yeah, of course. I mean, you’re just like, you’ll talk to me about this stuff all day. I love it. All right, one final question for you. And I love this one: If you met someone on the street or someone at work, and they were thinking about joining the DataHub Community, or maybe spinning up DataHub, trying to understand what it is, what’s happening, what advice would you give them?

Atul Saurav: The area and the realm of metadata, it’s still very fertile and growing. There are a lot of problems, which are unsolved problems, which have to be solved. So it’s a really exciting space to be in, the DataHub Community in particular. I have [also] found it to be very responsive, supportive and welcoming. I mean, there’s a #getting-started channel or #introduce-yourself. And those two channels are really great. I came in, I introduced myself, and there were a couple of folks who said, “Hey!” and that’s always nice to hear from someone [and actually get the acknowledgment].

The point is that there is still a lot of exciting stuff happening and a lot that is still to be done, and being a part of a community that is supportive, that is open, willing to take you in, willing to collaborate and discuss, *that* you cannot take for granted. So I would say you have to go in with an open mind and collaborate, be part of the history that is being created, and have fun.

Maggie Hays: I love that. I really take a lot of pride in the fact that, as our Community grows, people are still feeling that sense of welcoming and that sense of, you know, if I post a question, I’ll get a response. But also if I just say “Hi, I’m so and so,” I’ll get a response, too. There’s just that collaborative nature, and it’s really, really wonderful.

Atul Saurav: And not to discount the monthly Town Halls, they’re always amazing. Every time I attend, my mind is like totally blown away. What was this? I was like, “Okay, now what else can I do? What else has been enabled?”

Maggie Hays: That’s amazing. Well, thank you so much for taking the time to chat with us. It’s just been a pleasure; you’re a fantastic partner in the community. And I’m personally so pleased to be working with you, and I know that goes for the rest of the core team as well.

Atul Saurav: Likewise, in my view, I think I’m able to work or do what I’m supposed to do better because of all the support I get from you all the time.

Maggie Hays: That makes me so happy. Well, thank you so much. And we’ll see you on Slack!


If you are new to DataHub, just beginning to understand what “metadata” and “modern data stack” mean, or you’ve just read these words for the first time (howdy, friends! 🤠), let us take a moment to introduce ourselves and share a little history:

DataHub is an extensible metadata platform, enabling data discovery, data observability, and federated governance to tame the complexity of increasingly diverse data ecosystems. Originally built at LinkedIn, DataHub was open-sourced under the Apache 2.0 License in 2020. It now has a thriving community with over 4.8k members and 280+ code contributors, and many companies are actively using DataHub in production.

We believe that data-driven organizations need a reimagined developer-friendly data catalog to tackle the diversity and scale of the modern data stack. Our goal is to provide the most reliable and trusted enterprise data graph to empower data teams with best-in-class search and discovery and enable continuous data quality based on DataOps practices. This allows central data teams to scale their effectiveness and companies to maximize the value they derive from data.

Want to learn more about DataHub and how to join our community? Visit https://datahubproject.io and say hello on Slack. 👋
