DataHub Project Updates

Introduction
We’re back with our eighth post covering monthly project updates for the open-source metadata platform, LinkedIn DataHub. This post captures the developments for the month of September 2021. To read the August 2021 edition, go here. To learn about DataHub’s latest developments, read on!
Community Update
September was an exciting month for the DataHub Community. We welcomed Maggie Hays as the DataHub Community Product Manager (learn more about her journey here) and our Slack community grew by 250 members (1,350 total!). We saw 18 contributors from 11 companies contribute to the DataHub project, and 68 people joined our DataHub community town hall. There’s so much momentum growing within this group and I know we will have a strong Q4 to round out 2021.
Project Update
We had 112 commits in September, continuing our 100+ commits/month rate. These came from 18 contributors across 11 companies (3 new contributors!).
The September 2021 town hall drew 68 attendees. We saw demos of the new Faceted Search experience, Stateful Ingestion, and improvements to the Looker connector, plus a case study from the team at Adevinta about why they are adopting DataHub within their company. Join us on Slack and subscribe to our YouTube channel to keep up to date with this fast-growing project.
Read on to find out more about the September highlights!
Product and Integration Improvements
We saw some very exciting improvements in the DataHub user experience. Let’s dig in!
Improvements to Glossary Term management in the UI
As a reminder, we partnered with Saxo Bank this summer to introduce Glossary Terms as a way to manage governed terms via YAML, and to link terms to related entities within DataHub. This is separate from the existing free-form Tags which provide more flexibility for logical grouping of similar entities as use cases evolve.
Based on community feedback, we’ve made it possible for DataHub users to add and remove Glossary Terms via the DataHub UI. We’ve also visually separated Tags and Glossary Terms to emphasize the difference between the two and to minimize confusion.

DataHub Tags and Glossary Terms
Support for Redshift Usage
We now support the ingestion of query history and usage stats for Redshift in addition to Snowflake and BigQuery. This helps DataHub users to better understand the popularity and relevance of datasets during discovery by displaying top users, monthly query count, and recent queries.
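To make the three usage signals concrete, here is a minimal sketch of how top users, monthly query count, and recent queries could be derived from a query history. It is only an illustration, not DataHub's implementation: the `summarize_usage` function and the record fields (`user`, `sql`, `timestamp`) are hypothetical names chosen for this example.

```python
from collections import Counter


def summarize_usage(query_log, top_n=3, recent_n=2):
    """Summarize a query history for one dataset.

    Assumption: each record is a dict with hypothetical keys "user", "sql",
    and "timestamp" (an ISO-8601 string, so string order is time order).
    """
    newest_first = sorted(query_log, key=lambda q: q["timestamp"], reverse=True)
    return {
        # Users ranked by how many queries they ran.
        "top_users": Counter(q["user"] for q in query_log).most_common(top_n),
        # Queries bucketed by the "YYYY-MM" prefix of the timestamp.
        "monthly_query_count": dict(Counter(q["timestamp"][:7] for q in query_log)),
        # The latest queries, newest first.
        "recent_queries": [q["sql"] for q in newest_first[:recent_n]],
    }


QUERY_LOG = [
    {"user": "alice", "sql": "SELECT 1", "timestamp": "2021-08-30T10:00:00"},
    {"user": "alice", "sql": "SELECT 2", "timestamp": "2021-09-01T09:00:00"},
    {"user": "bob", "sql": "SELECT 3", "timestamp": "2021-09-15T12:00:00"},
]

summary = summarize_usage(QUERY_LOG, top_n=1)
```

In a real deployment the query history comes from the warehouse itself (Redshift, Snowflake, or BigQuery) through the ingestion framework rather than from an in-memory list.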
Looker Integration Improvements
The DataHub and Looker integration is only getting better! DataHub v0.8.14.2 introduced:
- Fixes to view naming conventions to resolve naming collisions
- Improvements to extracting Explores from LookML and Owners from the Looker API
- Organization of DataHub entities to mimic Looker’s structure of Explores & Views
Watch the video below to check it out:
Primary Key/Foreign Key Mapping
We recently rolled out support to display primary key and foreign key relationships that are defined within any data store that supports these constraints. Want to see more? Check out this demo from Gabe Lyons (Acryl Data), or see it live in the DataHub Demo site here.
Developer Tooling and Operations Improvements
Official Release of our GraphQL API
September marked the launch of our GraphQL API! This will be the primary programmatic interface for the metadata graph, and is where we will be building out an ecosystem of client SDKs to make it very easy to interact with DataHub wherever you might be.
Check out our GraphQL API docs here for rich documentation on all GraphQL queries, mutations, and types.
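As a rough illustration of calling the API from a script, the sketch below builds a GraphQL search request using only the Python standard library. The host and port, the `search` query shape, and its field names are assumptions made for this example and should be checked against the GraphQL API docs; authentication is omitted because it depends on your deployment.

```python
import json
import urllib.request

# Assumption: a locally running DataHub frontend; adjust for your deployment.
GRAPHQL_URL = "http://localhost:9002/api/graphql"

# Assumption: query and field names are illustrative; verify them in the API docs.
SEARCH_QUERY = """
query searchDatasets($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
        type
      }
    }
  }
}
"""


def build_search_request(text, start=0, count=10):
    """Build (but do not send) an HTTP POST request carrying the GraphQL query."""
    payload = {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {"type": "DATASET", "query": text, "start": start, "count": count}
        },
    }
    return urllib.request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(text):
    """Send the request and decode the JSON response (needs a running instance)."""
    with urllib.request.urlopen(build_search_request(text)) as response:
        return json.load(response)
```

Calling `run_search("orders")` against a reachable instance would return the decoded response; `build_search_request` can be inspected on its own without any network access.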
Additional Improvements
- DataHub CLI now supports environment variables: no more sitting at your terminal and confirming all prompts
- Bootstrap common data platforms on startup, so logos are available when you ingest metadata
- Build out frontend and backend monitoring through Prometheus + Grafana: watch Dexter Lee (Acryl Data) give a demo here
New User Experience: Faceted Search
During the September 2021 Community Town Hall, Gabe Lyons from Acryl Data gave us a walk-through of the new, faceted search experience within DataHub.
Our main goal was to help folks find the information they need in as few clicks as possible and to condense search into a unified experience across all entity types (Datasets, Pipelines, Dashboards, etc.).
Search results are no longer separated by entity type; they are now blended together so the top-ranked results appear first. Users can refine their search by filtering by entity type, platform, environment, tags, and more.
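The idea of blended results with facet filters can be sketched in a few lines. This is a toy model rather than DataHub's search implementation: `SearchHit`, `faceted_search`, the sample hits, and their scores are all hypothetical, and facet counts are computed over the already-filtered results for simplicity.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchHit:
    name: str
    entity_type: str
    platform: str
    score: float


def faceted_search(hits, filters=None):
    """Blend hits of every entity type, apply facet filters, and rank by score.

    `filters` maps a field name to the set of allowed values,
    e.g. {"entity_type": {"DATASET"}}.
    """
    filters = filters or {}
    matched = [
        hit
        for hit in hits
        if all(getattr(hit, field) in allowed for field, allowed in filters.items())
    ]
    # One ranked list across all entity types: the best score comes first.
    ranked = sorted(matched, key=lambda hit: hit.score, reverse=True)
    # Facet counts tell the user how many results each further filter would keep.
    facets = {
        field: Counter(getattr(hit, field) for hit in matched)
        for field in ("entity_type", "platform")
    }
    return ranked, facets


HITS = [
    SearchHit("orders", "DATASET", "redshift", 0.92),
    SearchHit("orders_dashboard", "DASHBOARD", "looker", 0.88),
    SearchHit("orders_etl", "PIPELINE", "airflow", 0.75),
    SearchHit("orders_raw", "DATASET", "kafka", 0.61),
]

all_ranked, all_facets = faceted_search(HITS)
datasets_only, dataset_facets = faceted_search(HITS, {"entity_type": {"DATASET"}})
```

Without filters, the dashboard ranks between the two datasets because only the score matters; adding the `entity_type` filter narrows the list to datasets and updates the platform counts accordingly.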
Watch the full demo here:
Case Study: DataHub + Adevinta
Iker Martinez de Apellaniz joined our Town Hall to share Adevinta’s journey in evaluating and adopting DataHub to support their international family of localized digital marketplaces. Adevinta has thousands of data assets spanning Kafka, S3, Athena, and Redshift, plus a legacy data inventory system to assist with self-service access requests, GDPR compliance, and managing data ownership.

DataHub Architecture at Adevinta
Introducing DataHub to the company offered additional functionality: a way to manage data lineage, documentation, glossary terms, and dashboards while providing robust search functionality and indicators of data health from their thriving community of data practitioners.
Watch Iker’s full presentation here:
New Functionality: Stateful Ingestion
As DataHub adopters expand the volume of integrations and ingestion jobs they support, it becomes increasingly important to optimize run time and to minimize redundancy.

I teamed up with Surya Lanka from Acryl Data to introduce the incubating Stateful Ingestion feature during the September Town Hall. It is DataHub’s mechanism for letting Sources remember where they left off in the last ingestion run. You can check out the video here.
We shared our design and implementation considerations and gave a live demo to show how Stateful Ingestion can reduce load on source systems by extracting only the most relevant changes. This feature will be rolled out to different ingestion sources in the upcoming releases.
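The general idea of remembering where the last run left off can be illustrated with a small checkpoint sketch. This is not how DataHub implements Stateful Ingestion; it is a toy model that assumes a local JSON file for state, a hypothetical `modified_at` field on each table record, and made-up function names.

```python
import json
from pathlib import Path


def load_checkpoint(path):
    """Return the newest modification time seen so far, or 0 on the first run."""
    checkpoint = Path(path)
    if not checkpoint.exists():
        return 0
    return json.loads(checkpoint.read_text())["last_modified"]


def save_checkpoint(path, last_modified):
    """Persist the newest modification time for the next run."""
    Path(path).write_text(json.dumps({"last_modified": last_modified}))


def ingest_changes(tables, checkpoint_path):
    """Return the names of tables changed since the previous run.

    Assumption: each table is a dict with hypothetical keys "name" and
    "modified_at" (a number that grows over time, such as an epoch timestamp).
    """
    since = load_checkpoint(checkpoint_path)
    changed = [table for table in tables if table["modified_at"] > since]
    if changed:
        # Advance the checkpoint only when something new was actually processed.
        save_checkpoint(checkpoint_path, max(t["modified_at"] for t in changed))
    return [table["name"] for table in changed]
```

On the first run every table is returned; a second run over the same input returns nothing, and only tables modified after the stored checkpoint are picked up later. That is the sense in which a checkpoint reduces load on the source system.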
Looking Forward
I’m so excited to see the continuing growth in momentum in this project, and I’m looking forward to delivering big things in Q4 of 2021. From the level of engagement on Slack to the increased velocity of contributions from the community, it has been great to build together.
Over the next month, we expect to roll out some of these improvements to the project and start building out new capabilities like recommendations, improving our support for nested structs in systems like Hive and Trino, and providing more controls to the operator for managing data profiling using DataHub. Until next time!