Humans of DataHub: Harvey Li

Before we dive into this week’s wonderful conversation with Harvey Li, we want to take a moment to appreciate and celebrate the DataHub Community. Humans of DataHub launched five months ago (how time flies!) and we have had such a blast connecting and learning about our community members, how their teams discovered DataHub, what DataHub has enabled, and *so* much more. If you are following these conversations, you probably know by now how special our community is. Still, I think it’s always worth repeating – the DataHub Community is out of this world. We can’t wait for you to join us!


This week, we had the pleasure of speaking with Harvey Li, Senior Data Engineer at Grab, the Everyday Everything App, on a mission to drive Southeast Asia forward by creating economic empowerment for everyone.

Our conversation with Harvey was insightful and full of joy; Maggie and I couldn’t help but smile the entire time. Harvey shares how DataHub is the secret weapon that’s driving Grab’s adoption of Data Mesh principles, his love for the DataHub Community, and MORE!

Partial transcription available below; closed captioning available on YouTube.

Conversation Transcript & Highlights

Edited for brevity & clarity

Maggie Hays: Welcome to another round of Humans of DataHub! Today we are joined by Harvey Li from Grab. Harvey, please introduce yourself, tell us where you work, what you do with your team — Just a little bit about who you are.

Harvey Li: Thanks, Maggie. And by the way, I love the DataHub Community; it’s so vibrant, welcoming, and lively. My name is Harvey and I work at Grab, Southeast Asia’s leading Super App. We use data and technology to improve everything from transportation to payments and logistics across a region of over 620 million people. We offer services like ride-hailing, food delivery, e-payment, last-mile logistics, and more.

I work in the data engineering team that builds data applications, query platforms, and governance tooling to serve the entire data lake ecosystem at Grab. I’m now leading a metadata management dev team to develop the next-gen metadata management platform by leveraging DataHub. And so far, it’s been a very exciting journey.

Maggie: That’s awesome. I’m curious — how did you and your team discover DataHub? What led you to the community?

Harvey: Metadata management is a problem that any data-driven organization needs to tackle at some point in time. Fortunately for us, we actually started pretty early; we introduced a third-party proprietary data catalog over three years ago to reap the benefits of data discovery, break data silos, and make data easily discoverable by anyone that needs data. However, with the incredible growth of our data scale, more and more use cases surfaced. We saw an increasing need to have a metadata management platform that not only provides off-the-shelf features but, more importantly, offers us the building blocks to fine-tune and tailor for our use cases.

Last year we explored a few open-source data catalog solutions and found that DataHub’s extensible architecture was best suited to our needs. We are moving toward data mesh (it’s such a hot topic right now), and DataHub is our secret weapon for getting to the data mesh architecture at Grab.

Elizabeth Cohen: So since you and your team have adopted DataHub, what has it enabled within your organization?

Harvey: First, DataHub enables data discovery for us. We use the Presto on Hive plugin (which is the one that we contributed back to the community) to ingest the metadata for over 80,000 tables into DataHub. And the amazing part is that we managed to ingest this huge amount of metadata in less than 15 minutes. Of course, we put some parallelism in place, but the performance is amazing.

Maggie: That’s outstanding!
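To make the parallelism Harvey mentions a little more concrete, here is a minimal sketch of batching table names and fetching their metadata on a thread pool. It is not Grab's implementation or the Presto on Hive plugin's code; `fetch_batch_metadata`, `ingest_in_parallel`, the batch size, and the worker count are all hypothetical names and values chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_batch_metadata(tables: List[str]) -> Dict[str, dict]:
    """Hypothetical stand-in for one bulk metadata query against a metastore."""
    return {name: {"name": name, "columns": []} for name in tables}


def ingest_in_parallel(
    tables: List[str],
    fetch: Callable[[List[str]], Dict[str, dict]] = fetch_batch_metadata,
    batch_size: int = 1000,  # assumed value, tune for the real source
    workers: int = 8,        # assumed value, tune for the real source
) -> Dict[str, dict]:
    """Fetch metadata batch by batch on a thread pool and merge the results."""
    merged: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map runs the batches concurrently and yields results in order.
        for result in pool.map(fetch, chunked(tables, batch_size)):
            merged.update(result)
    return merged
```

Fetching in batches rather than one table at a time keeps the number of round trips small, and the thread pool lets several batches wait on the source at once. Whether this matches the speed-up Harvey describes depends entirely on the real metadata source.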

Harvey: Another use case is data governance. We are now a public company and there are a lot more data governance use cases that we need to fulfill. We are also moving towards Data Mesh and one of the key principles is federated data governance. We’re taking a tech-first mindset; instead of introducing new processes or adding more overhead to the data users, we want to develop tools and platforms that basically govern their assets and make sure that they access the data in a very governed manner. We use Glossary Terms for proper data classification and to define data access rules.

DataHub has a lot of possibilities for how systems can integrate with it: OpenAPI, GraphQL API, and the new addition of Actions Framework. We see a lot of exciting possibilities with this particular framework for us to monitor metadata changes and react to them in real-time.

Maggie: Absolutely. Oh, that is so exciting. John Joyce from our team is going to be over the moon to hear that you guys are looking into the Actions Framework. There’s just an endless amount of potential with it and we’re so excited to see what the community ends up doing.

So earlier you said you love the DataHub Community. Obviously, we’re biased — we love the DataHub Community as well — but I’m curious: what about the community keeps you coming back?

Harvey: This community is very helpful. When I started to use DataHub, to be frank, my first impression was that it was so complex. There are many new concepts, like aspects, recipes, MAE/MCE, just to name a few. But when I am stuck, I know that the community has my back and is able to offer some advice.

I want to give a quick shout-out to Gabe and John [from the Core DataHub Team] for helping me and my organization in getting DataHub to its current status.

Let me share a story about how we got the idea of developing [and contributing] the Presto on Hive [connector]. At that time, we saw a performance issue when ingesting metadata from our data lake because there are simply so, so many tables. We raised these concerns in the DataHub Slack workspace and… shared some of our proposals on how we could optimize it. Then, fortunately, one community member saw this thread and said, “Hey, we also encountered a similar performance issue in another open-source catalog; we did this & that to improve it.” We thought this was a great idea and that’s why we [contributed] a new plugin to make it more performant than the original one …

I believe there are many stories like that happening in the DataHub workspace every day. This is what made me love this community so much; it is very intellectually stimulating. There are data practitioners across the globe in this community leveraging their own experience and contributing their expertise to help each other and stimulate meaningful discussion. This is truly amazing.

Maggie: Harvey, you just made my entire month; I am so happy to hear all of this. I love it!

Harvey: [Another] amazing part is that almost every [Slack] thread gets an answer; like, if there’s any question in the #troubleshoot channel, it’s answered. That’s really amazing.

Maggie: We try! The volume of questions is growing so we’re trying to keep up with it. But yeah, we really want to make sure that, regardless of how big the community grows, everyone has that same experience when they come in… [regardless of] the questions that they ask, people have a whole gang of support behind them so that they’re not spinning their wheels for too long.

Elizabeth: Thinking about the DataHub Slack community, what is your favorite Slack channel and why?

Harvey: In the past, my favorite channel was actually #troubleshoot… it’s been my savior every time I was stuck. Usually, when I got stuck, I asked a question there, and almost the next day, I got some advice in the thread. So it’s truly helpful… Now I’m [getting better] at troubleshooting myself… I try to see if there are any questions I can help answer as well.

Now my favorite channel is #announcements. There are a lot of juicy and exciting updates in this particular channel. And because I’m based in Singapore, with the time difference, I cannot attend the Town Halls live. So usually, I just tune into [#announcements], get the recording the next morning, and watch it during my breakfast.

Maggie: What are you excited to see within the DataHub project over the next six months? 12 months?

Harvey: I think the pace at which new features are rolled out in DataHub is already really impressive. I really can’t ask for more. But having said that…I would love to see more collaborative features incorporated into DataHub. As companies are now moving toward a Data Mesh world, data is no longer a byproduct; it’s a product by itself. Similar to products in e-commerce websites, there are ratings and reviews with upvotes and downvotes; ratings and reviews [on entities within DataHub] can actually become an additional dimension to data quality as well. Ratings and reviews also establish a healthy feedback loop between the data producers and the data users.

I imagine DataHub to be a one-stop marketplace for data. Everyone in the organization that needs data can go to DataHub and search for it without relying on the tribal knowledge by asking around in Slack or checking some documentation gathered somewhere. That was the old way. I think DataHub has enabled this new way of data discovery; it’s very innovative. I really hope to see more collaborative features, just like we are shopping for data products.

Elizabeth: Thinking about DataHub’s current features and use cases, what’s your favorite that you’ve come across so far?

Harvey: As a platform team at Grab, we provide tooling to cover every dimension of data quality, from data freshness to completeness to correctness. [The Great Expectations profiling and validation integration] helps us to enhance our existing observability for data quality as a whole for the data lake.

Speaking from an engineering perspective, I also really like how DataHub managed to break things down into standard and simple abstractions, for example, the metadata ingestion framework. Now, it’s actually very easy to develop a new plugin just to support new data sources. And when I checked last night, there were over 40 plugins supported out of the box by DataHub… and a majority of them were contributed by the community.

The Actions Framework is [another] such example: it abstracts away the complexities and allows developers to just develop new action plugins to support a myriad of new use cases. For example, it could enable Slack notifications when the technical schema of some key dataset has changed.
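The schema-change notification Harvey describes can be sketched in plain Python. This is only an illustration of the logic such an action might contain, not the Actions Framework API: `diff_schema`, `build_notification`, and the dataset name are hypothetical, and actually posting the message to Slack is left out.

```python
from typing import Dict, List, Optional, Set


def diff_schema(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Describe column-level differences between two {column: type} mappings."""
    changes: List[str] = []
    for column in sorted(old.keys() - new.keys()):
        changes.append(f"removed column `{column}`")
    for column in sorted(new.keys() - old.keys()):
        changes.append(f"added column `{column}` ({new[column]})")
    for column in sorted(old.keys() & new.keys()):
        if old[column] != new[column]:
            changes.append(f"changed `{column}`: {old[column]} -> {new[column]}")
    return changes


def build_notification(
    dataset: str,
    old: Dict[str, str],
    new: Dict[str, str],
    key_datasets: Set[str],
) -> Optional[str]:
    """Return a message for key datasets whose schema changed, else None."""
    if dataset not in key_datasets:
        return None  # only key datasets trigger a notification
    changes = diff_schema(old, new)
    if not changes:
        return None  # nothing changed, nothing to report
    lines = "\n".join(f"- {change}" for change in changes)
    return f"Schema change on {dataset}:\n{lines}"


# Hypothetical usage with made-up schemas for a made-up dataset.
old_schema = {"id": "bigint", "fare": "double", "city": "string"}
new_schema = {"id": "bigint", "fare": "decimal(10,2)", "currency": "string"}
message = build_notification(
    "rides.bookings", old_schema, new_schema, {"rides.bookings"}
)
```

Keeping the diff logic separate from message delivery means the same function could feed Slack, email, or a ticketing system, in the spirit of the plugin-style design Harvey praises.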

Maggie: If you were to meet someone who had just joined the DataHub Community or was interested in just learning more about its implementation, what advice would you give them?

Harvey: I think metadata management is an evergreen problem. But, if you see it as a big problem, you will never get started. Fortunately, with open source solutions like DataHub, we no longer need to start from zero to manage metadata. So my advice is to try to identify the key use cases in your organization first; start small and try it out.

The DataHub documentation is great, but please join the DataHub Slack workspace; there are a ton of useful resources there. And don’t be afraid: if you get stuck, the community has your back. So yeah, just try! Go for it!

Elizabeth: Amazing advice. This was such a joyful conversation!

Maggie: At one point, my eyes started watering a little bit.. is it allergies?! Or am I just overcome with emotion?! I think it’s allergies if I’m being honest…

Elizabeth: It can serve both purposes!

Harvey: You folks really do a great job, from the bottom of my heart.

Maggie: Thank you so much, Harvey; this has just been a total joy to chat with you today. Thank you so much for taking the time. We really appreciate you.

Harvey: Thank you for having me and thank you for driving this community forward.

Harvey Li of Grab, Elizabeth Cohen & Maggie Hays of Acryl Data



What is Humans of DataHub?

Humans of DataHub is a series highlighting the wonderful people that are helping define how the DataHub Community collaborates in 2022.

What’s DataHub?

If you are new to DataHub, just beginning to understand what “metadata” and “modern data stack” mean, or you’ve just read these words for the first time (Welcome, friends! 🌈), let us take a moment to introduce ourselves and share a little history:

DataHub is an extensible metadata platform, enabling data discovery, data observability, and federated governance to tame the complexity of increasingly diverse data ecosystems. Originally built at LinkedIn, DataHub was open-sourced under the Apache 2.0 License in 2020. It now has a thriving community with over 3.3k (🚀) members and 240+ code contributors, and many companies are actively using DataHub in production.

We believe that data-driven organizations need a reimagined developer-friendly data catalog to tackle the diversity and scale of the modern data stack. Our goal is to provide the most reliable and trusted enterprise data graph to empower data teams with best-in-class search and discovery and enable continuous data quality based on DataOps practices. This allows central data teams to scale their effectiveness and companies to maximize the value they derive from data.

Want to join the DataHub Community? Visit https://datahubproject.io and say hello on Slack. 👋
