Etsy’s DataHub Adoption Journey

Presentation originally shared at DataHub August 2022 Town Hall by Vishal Shah, Senior Data Engineer.

In a large, fast-growing organization, the sheer variety and complexity of data make it hard for users to find the right dataset for their needs. Etsy is no different in that regard, so we embarked on a journey to find the right tool to help users “Find the data they need, and Trust the data they find”.

Enter DataHub!

Etsy is a two-sided global marketplace for handmade and vintage items. Founded in 2005, we are an e-commerce platform based in Brooklyn, NY, with offices around the world. To give an idea of the size and complexity of our data: Etsy has over 7.7 million active sellers, 93 million active buyers, and over 100 million unique listings. Production data is stored across hundreds of MySQL shards. In addition, we have our data warehouse in BigQuery and use several other data sources as well. Our platform has seen tremendous growth over the years, and as a result our data, and in turn our metadata, have grown significantly too. That brings me to our journey to a new data catalog.

The Journey to DataHub Begins

Our journey goes back nine years, to 2013, when we built our first in-house data catalog, called Schemer. Over the years, Schemer became an integral tool for engineers, analysts, and more. Users could search for tables from MySQL and Vertica, our data warehouse at the time; there were links to the source code for those tables, and users could add table and column descriptions as well. In 2020, we built a proof of concept for a data lineage tool to start understanding the complex relationships across our vast data landscape. However, both tools soon went into “maintenance mode”, and we knew it was time to reinvest in data discovery.

Cut to April of 2021: the data engineering teams at Etsy wrapped up our data warehouse migration from Vertica to BigQuery, and we formed a new Data Discovery team. We knew there was a need for a better data discovery experience, but we wanted to understand the problem space before we jumped into a solution. Over the course of a month, our team of six engineers, one product manager, and one engineering manager conducted over 30 user interviews across Engineering, Product Management, and Analytics to learn about their experience with our existing catalog and lineage tools, and where they hit pain points along the way. We learned that the main issues were around data discovery and trust: it was hard to find the right datasets, and it was unclear which datasets were viable, where they came from, and how they were being used. Much of this information lived in tribal knowledge, which at this stage of growth was no longer sustainable for us. I would like to call out that while the interview phase wasn’t complex on the engineering front (there really wasn’t much engineering at all that entire month), these interviews helped the team understand the problem space and become invested in our mission.

The next step was to find a solution! We split up into squads to investigate all of our options, and there were quite a few. What if we extended Schemer? What if we built a new in-house tool? What if we paid for a fully managed solution that we could get off the ground faster? Or what if we used an open-source solution? We dove into 30 tools over the course of a month and POC-ed a couple of them. Our proof of concept for DataHub involved setting up a local instance, adding a custom source, and adding a custom aspect to the metadata model. While it was a bit complex, DataHub stood out for its flexibility in metadata modeling, its integrations with many of our existing data sources at Etsy, and an active, growing community. After months of research, we were finally ready to move forward with DataHub.
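For a sense of what the custom-source part of that POC involves, here is a minimal sketch of a custom ingestion source written against DataHub’s Python framework. The class and config names are illustrative rather than our actual internal source, and the exact base-class interface may differ slightly between DataHub versions:

```python
from typing import Iterable

from datahub.configuration.common import ConfigModel
from datahub.ingestion.api.common import PipelineContext
from datahub.ingestion.api.source import Source, SourceReport
from datahub.ingestion.api.workunit import MetadataWorkUnit


class CustomSourceConfig(ConfigModel):
    # Hypothetical config option; real sources define connection details here.
    env: str = "PROD"


class CustomSource(Source):
    """Illustrative source that would emit metadata for an internal system."""

    def __init__(self, config: CustomSourceConfig, ctx: PipelineContext):
        super().__init__(ctx)
        self.config = config
        self.report = SourceReport()

    @classmethod
    def create(cls, config_dict: dict, ctx: PipelineContext) -> "CustomSource":
        return cls(CustomSourceConfig.parse_obj(config_dict), ctx)

    def get_workunits(self) -> Iterable[MetadataWorkUnit]:
        # A real source would build metadata change events here and yield
        # them wrapped in MetadataWorkUnit objects; this sketch emits nothing.
        return []

    def get_report(self) -> SourceReport:
        return self.report

    def close(self) -> None:
        pass
```

Custom aspects, the other half of the POC, are defined in the metadata model (PDL schemas) rather than in Python, so they are not shown here.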

DataHub: From POC to Rollout

Now, for the fun engineering part! Along with DataHub itself came several technologies our team did not have much experience with. However, we had the advantage of other teams at Etsy who already owned GKE and Kafka instances and had Buildkite expertise to guide us along the way. We also set up a Cloud SQL database and used managed Elasticsearch in our infrastructure setup. When ingesting our metadata, we started small and iterated along the way. By prioritizing BigQuery and MySQL, we were able to reach parity with Schemer and roll out an MVP for data lineage connecting MySQL to BigQuery to Looker, a widely used data pipeline at Etsy that was also displayed in our in-house lineage tool.
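To give a flavor of what an ingestion run looks like, here is a minimal sketch of kicking off a DataHub pipeline programmatically; the same recipe can also be expressed in YAML and run through the CLI. The project ID and server address are placeholders, not our real configuration:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Minimal recipe: pull metadata from a BigQuery project and push it to a
# DataHub instance over REST. All values below are placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "bigquery",
            "config": {"project_id": "example-warehouse-project"},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if the run reported errors
```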

Launching DataHub at Etsy

After months of implementation work, we were ready to launch DataHub at Etsy by April of this year. To date, we have over 600 total users, about 45% of whom are active each month. We’ve ingested over 11,000 datasets across five data platforms, and the number increases each day.

Since we already had a data catalog in production, we had to be extremely thoughtful in our transition to DataHub, making sure we had addressed any parity issues before deprecating Schemer. I’m happy to say that we migrated onto DataHub smoothly and turned off Schemer last month with minimal disruption. We are also well on our way to turning off our in-house lineage tool so that all of our data discovery efforts are consolidated in DataHub.

Implementation Learnings

It would not have been the journey it was without some bumps along the way, so I want to highlight some of our learnings around implementation and governance. For instance, one question came up a few times: “Why doesn’t this work as expected?” This was no fault of DataHub or of our data ecosystem; it is simply the natural process of integrating two systems.

The cool part about using an open-source solution was that we weren’t left in the dark, having to solve every problem by ourselves.

We were able to fit DataHub into our ecosystem and even benefit from upstream contributions along the way. The contribution with perhaps the most impact for us was around BigQuery ingestion. We have BigQuery projects at Etsy that are used only for storage, and other projects that are used for running queries and jobs. This didn’t immediately work with DataHub’s BigQuery ingestion because, at the time, the recipe only accepted a single project ID; we ran into all sorts of problems when trying to query lineage information or profile data from projects where we did not have permission to run queries. With help from the DataHub team, we were able to contribute back and add an additional recipe field for a Lineage Client Project ID. Similarly, for Airflow, the version we currently run at Etsy is behind the minimum version required by the lineage backend, but we were able to extend the lineage functionality with a small change to the open-source code that made Airflow lineage compatible with our instance.
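To illustrate the BigQuery change, here is roughly the shape of the source configuration with that field in place, written as a Python recipe fragment. The project names are placeholders, and the exact field name and availability should be checked against the DataHub version you run:

```python
# Sketch of a BigQuery source config that separates the storage-only project
# from the project where lineage/usage queries are allowed to run.
bigquery_source = {
    "type": "bigquery",
    "config": {
        # Storage-only project that owns the datasets being ingested.
        "project_id": "example-storage-project",
        # Contributed option: run lineage queries in a separate project
        # where we actually have job-execution permissions.
        "lineage_client_project_id": "example-compute-project",
    },
}
```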

I should add, however, that not all sources were as seamless to ingest. For LookML and some custom sources that read from GitHub repositories, we were surprised to find that we had to clone those repos into a custom image. And lastly, for MySQL ingestion, a hurdle we came across was how to profile sharded databases: profiling against a single host would not capture the full statistics for a table spread across hundreds of shards. In addition, profiling against our production database was turning out to be complex and expensive, as well as a concern for our production systems. For now, this one is still on hold and continues to be a discussion on our team.

Making Data Governance a Priority

Outside of implementation, we learned that data discovery and data lineage are both highly dependent on data governance. How do we display all this helpful metadata in DataHub if we don’t know where to find it? Our approach is to start with ownership, with the goal of making it extremely easy to add an owner to a dataset directly in the code that creates it; we can layer on more governance rules from there. Even for Airflow lineage, questions come up: Do we ask users to add inlets and outlets to DAGs manually, as sketched below? What if they change a data job and don’t update the outlet? What if there’s a typo? Or should we find a way to add them programmatically inside the operators themselves?
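For context, here is a hedged sketch of that manual approach: declaring inlets and outlets on an Airflow operator using DataHub’s Dataset entities. It assumes Airflow 2.x and the datahub_provider package from around the time of this post; the task, script, and table names are all placeholders:

```python
from airflow.operators.bash import BashOperator  # assumes Airflow 2.x
from datahub_provider.entities import Dataset    # DataHub's Airflow entities

# Manually declared lineage: simple to write, but easy to forget or
# mistype when the underlying job changes.
transform_listings = BashOperator(
    task_id="transform_listings",
    bash_command="run_transform.sh",  # placeholder job
    inlets=[Dataset("mysql", "example_shard.listings")],
    outlets=[Dataset("bigquery", "example-project.analytics.listings_daily")],
)
```

The programmatic alternative mentioned above would derive inlets and outlets from the job’s own configuration inside the operators themselves, removing the manual step entirely.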

We believe that data governance will play a crucial role in the success of DataHub at Etsy. So, if you have any success stories around Data Governance, Ownership, or Airflow lineage, please reach out to us. We would love to hear about them!

Looking Ahead with DataHub

Finally, I want to share a few wishlist items that we would be thrilled to see in DataHub.

  • Firstly, column-level lineage, especially for BigQuery, would be a huge win.
  • Subscriptions and favorites would be another significant addition, which would allow for more team collaboration and standardization. For example, if I were to join a new team, I would already know which are my team’s most important datasets, and I could also get notified of any changes, even if I were not the dataset owner.
  • Lastly, a list of dataset recommendations based on my most viewed or favorited datasets would lead to a boost in data discovery for end users.

I want to give a huge thank you to the DataHub team for all of the help and support along our journey. Also, a special shout out to the Data Discovery team at Etsy for their support with this presentation.

I cannot wait to see all of the cool work that continues in DataHub!


Interested in becoming a contributor to the DataHub Blog?

Inspire others and spark meaningful conversations. The DataHub Community is a one-of-a-kind group of data practitioners who are passionate about enabling data discovery, data observability, and federated data governance. We all have so much to learn from one another as we collectively address modern metadata management and data governance; by sharing your perspective and lived experiences, we can create a living repository of lessons learned to propel our Community toward success.

Check out more details on how to become a DataHub Blog Contributor; we can’t wait to speak with you! 👋
