LinkedIn DataHub Project Updates
Introduction
This is the second post covering monthly project updates for the open-source metadata platform, LinkedIn DataHub. This post captures the developments for March 2021. To read the February 2021 edition, go here. Since everyone is observing Spring festivities, we have also included an Easter egg in this post. Read all the way to the end to discover it!
Community Update
Almost 150 PRs were submitted and merged over the last 30 days, representing a 3x growth in commit activity over the previous month! More than 50 people attended the monthly town hall, where the data team from Wolt presented their journey in adopting DataHub as the metadata platform for their data mesh implementation. Our Slack community grew by 20% in the last month! Join us on Slack and subscribe to our YouTube channel.
The big updates in the last month were:
- New homepage, improved documentation at datahubproject.io
- Live demo environment at demo.datahubproject.io hosted by Acryl
- Official roadmap published at datahubproject.io/docs/roadmap
- SSO (OIDC) support
- Tags
- Themes
- Search and discovery for Dashboards
- Metadata Platform Implementations for ML Models, DataFlows (Pipelines) and Jobs
- Deprecation of ElasticSearch 5 and migration to ElasticSearch 7
- An official release 0.7.0 that packages all these improvements
Read on to find out more!
Support for OIDC-based Single Sign-On (SSO)
SSO was one of the most requested features in DataHub, so we are excited to announce that it is finally here! We believe that data workers should spend their time discovering and working with data safely, not entering credentials into every data tool deployed at their company.

Google SSO Setup

Okta SSO Setup
The feature has already been tested with Google, Okta and Azure AD. Read the docs here to configure it for your environment.
Case Study: Data Discovery in Data Mesh at Wolt
This month, the data team at Wolt described their journey in adopting DataHub as their data discovery solution. Their goal is to map their entire data ecosystem, from operational databases and third-party APIs to ML models and dashboards. Their data stack includes Kafka, Airflow, Snowflake and Looker among other technologies, and they are moving towards a data mesh implementation. They have built internal tools that make it easy to integrate metadata with DataHub and connect their ecosystem to it. They have also made important contributions back to the project, including driving the Tags RFC and support for DAG metadata storage (e.g. Airflow) in the DataHub backend.

Data Mesh Architecture Use Cases
Fredrik Sannholm, who leads the data engineering and core ML team at Wolt, had this to say:
A metadata platform, like Datahub, that’s able to track ownership and stakeholder relationships between entities is crucial when we at Wolt scale and move towards a data mesh architecture. Datahub allows our data consumers to not only discover datasets but also track their lineage, while data producers have a single, purpose-built place to put documentation regarding the data. At the same time the core data engineering team can build useful observability and alerting features around the Datahub APIs. Datahub is a central part of our data platform as we scale both in terms of data volume and number of data producers and consumers.
Check out the video below:
Tags
Social, global, easy
Another of the most requested features was support for a lightweight tagging mechanism for datasets, fields, dashboards and well… anything! Thanks to work done by the community spanning multiple teams, we now have support for tags. Read the RFC here and watch the video demo-ing how to use tags below!
Themes
Your DataHub, your way
As companies deploy DataHub inside their enterprise environments, one common request we’ve received is the ability to customize the look and feel of DataHub. Some companies are all business and some companies are all fun. We don’t want to let your style preferences get in the way of enjoying great metadata!
Enter Themes. Now you can customize the look and feel of DataHub to your heart’s content. Read the documentation here and check out the video of Gabe Lyons demo-ing how he customized DataHub to look like Airbnb Dataportal below.
New Connectors
Since we launched the new metadata ingestion framework last month, the community has added some new connectors. Thanks, Pedro Silva and Thomas Larsson!
DataHub + Observability = Trusted Data
We’ve heard time and time again that when searching for data, people want to know which data they can trust. Finding a well-documented dataset is a good start, finding the owners of this dataset is better, and knowing which dashboards are powered by this dataset is amazing, but there’s still something missing.
The key ingredient in unlocking the next level of trust is operational metadata. This includes information like when a dataset was last updated, how often it is updated, which pipeline produced this dataset, how the shape (profile) of the dataset has changed in the last update and much more.
Armed with this level of information, a data scientist who is about to take a dependency on this dataset to build an important analysis can be secure in the knowledge that they are building on a solid foundation.
We released some exciting mocks for what a great data observability product, built on DataHub, might look like.
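To make the idea of operational metadata a little more concrete, here is a minimal Python sketch of how such a record and two simple trust checks could be modelled. Everything in it is an assumption made for illustration: the `OperationalMetadata` class, its field names, and the helper functions are hypothetical and are not DataHub's actual metadata model or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class OperationalMetadata:
    """Hypothetical record; field names are illustrative, not DataHub's schema."""
    dataset: str
    last_updated: datetime                 # when the dataset was last updated
    expected_update_interval: timedelta    # how often it is expected to update
    produced_by_pipeline: str              # which pipeline produced it
    row_count: int                         # one simple facet of the dataset's profile
    previous_row_count: Optional[int] = None


def is_fresh(meta: OperationalMetadata, now: datetime) -> bool:
    """Return True when the last update falls within the expected interval."""
    return now - meta.last_updated <= meta.expected_update_interval


def row_count_change(meta: OperationalMetadata) -> Optional[float]:
    """Relative change in row count since the previous update, if known."""
    if not meta.previous_row_count:
        return None  # no baseline (missing or zero), so no ratio can be computed
    return (meta.row_count - meta.previous_row_count) / meta.previous_row_count
```

A consumer could, for example, call `is_fresh(meta, datetime.now())` before depending on a dataset, and flag an unusually large `row_count_change` for review. Real observability tooling would track far richer profiles than a single row count.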
Here is a sneak peek!

Dataset Operational Health Summary
Check out the full set of mocks here, and give us your feedback!
Looking Forward
The pace of innovation and development continues to accelerate. We’re working on an improved lineage visualization and a deeper integration with Apache Airflow. Meanwhile, we’re expecting more integrations with popular systems like dbt, Looker, AWS Glue and others to land in the next month. Our roadmap for Q2 is packed and we’re excited to be building with all of you. Until next time!