DataHub Community Update

DataHub October 2021 Town Hall

Hello, DataHub Enthusiasts!

The DataHub Community continues to be abuzz with activity, and the month of October was no exception. Want to see what has happened in prior months? Head over to the Project Updates section to check them out.

Let’s get you up to speed on all-things-DataHub!

Community & Project Updates

New Ways to Collaborate in DataHub Slack

We rolled out some new channels dedicated to collaboration in the DataHub Slack, including:

  • #office-hours: we now host open office hours every Tuesday at 9 am US PT — join the channel for reminders & Zoom details!
  • #contribute: looking for ways to contribute back to the DataHub Community? Post your ideas here!
  • #show-and-tell: are you excited about something you/your team has done with DataHub? Tell us all about it, and consider me your personal hype woman — I love celebrating y’all’s successes, especially when they have to do with :teamwork: 😎

We also announced that we are using the Hey Taco! Slack app to show gratitude to one another. Whenever you’d like to show appreciation to someone who went above and beyond for you in the DataHub Community, give them a 🌮! Just mention their username, write your message, and add the :taco: emoji.

  • Is it silly to send virtual tacos to say thanks? Yes! We love silly. And tacos.
  • Will we be rolling out a program to redeem 🌮s for limited edition DataHub Swag? Also Yes! We love swag. And tacos.
  • Will the rewards store have a ✨limited edition DataHub fanny pack ✨? Can neither confirm nor deny 🤐😬

Q4 Roadmap Updates

Here’s what the Core DataHub team is working on in Q4 2021:

  • Updates to DataHub metadata model — we are targeting schema history, column-level lineage, and data quality (specifically Great Expectations in the first pass)
  • Support for multiple data platform instances — we’ll be making it easier for you to uniquely identify datasets across the multiple instances
  • Improved support for dbt — we will be fully leveraging the catalog & manifest JSON files in dbt projects and improving how we organize these entities in the lineage graph
  • Handling stale metadata — when data is deleted in the source environment, we will support removing/soft-deleting them in DataHub
  • Integrations with DeltaLake & Spark — woohoo!

Call for DataHub Community Support!

We are looking for Community Members to help in the design and/or development of Tableau and Clickhouse connectors — please reach out to @maggie in DataHub Slack to learn how you can contribute!

Project Updates

We’re excited to announce that three more companies have adopted DataHub!

Peloton, DFDS, and Uphold have adopted DataHub!

(I asked — no, it doesn’t mean we all get a Peloton.. sorry y’all, I tried 🤷‍♀)

The DataHub Project saw 161 commits from 30+ people, spanning ~20 companies. We’re so excited to see contributions increase from a growing group of DataHub Enthusiasts!

Here are the biggest highlights from our v0.8.16 release:

Product / Feature Updates

  • Unified Search & Recommendations (read more about Recommendations below!)
  • Improvements to Primary / Foreign Key Support
  • Lineage Performance Improvements

User and Group Management Screens:

  • View all users & groups
  • Remove users & groups
  • Create a new group
  • Add & remove members from groups

Metadata Ingestion

  • Redshift Usage, External Tables
  • BigQuery Dataset Lineage
  • Ingestion performance improvements by enabling parallelism and max threads configurations
  • Nested field support for Hive & Trino (available as of v0.8.16.2)
  • Adding Owners through Ingestion Transformers
  • Want to dig into the DataHub Metadata Model? You can now view it on the demo site!
Navigating the Dashboard Entity from the DataHub Metadata Model

Map of DataHub’s Metadata Model

Check out the full video below —

Landing Page Recommendations

During the October 2021 DataHub Town Hall, John Joyce and Dexter Lee from Acryl Data revealed a brand new landing page for the DataHub UI, including Recommendations to help users find the metadata they care about with fewer clicks.

Available as of v0.8.17, the new user experience provides guided navigation to high-value metadata and is personalized on a per-user basis.

John and Dexter walked us through the design and implementation of an extensible framework for generating and displaying personalized recommendations, which surfaces “Most Viewed”, “Recently Viewed”, “Top Tags”, and “Top Glossary Terms” on the new Landing Page.
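
To make the idea concrete, here is a minimal Python sketch of how “Most Viewed” and “Recently Viewed” lists could be derived from a stream of view events. This is not DataHub's implementation: the `ViewEvent` type, its fields, and both functions are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ViewEvent:
    """Hypothetical record of one user viewing one entity."""
    user: str
    entity_urn: str
    timestamp: int  # seconds since epoch


def most_viewed(events: List[ViewEvent], limit: int = 3) -> List[str]:
    """Entities ranked by total views across all users."""
    counts = Counter(e.entity_urn for e in events)
    return [urn for urn, _ in counts.most_common(limit)]


def recently_viewed(events: List[ViewEvent], user: str, limit: int = 3) -> List[str]:
    """Distinct entities a single user viewed, newest first."""
    seen: List[str] = []
    for e in sorted(events, key=lambda ev: ev.timestamp, reverse=True):
        if e.user == user and e.entity_urn not in seen:
            seen.append(e.entity_urn)
    return seen[:limit]
```

The first function is global while the second is scoped to one user, which mirrors the mix of popular and personalized modules described in the talk.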

This is only the beginning of what we plan to do with recommendations. Have feedback or ideas of how we can expand this? Tell us all about it in Slack!

Data Profiling Performance Improvements

Data Profiling is an extremely powerful tool to help Analysts, Data Scientists, and other consumers of data understand the shape & distribution of a dataset, including how it has changed over time. Not surprisingly, this can become a very costly operation to run every day against every column of every table in your data warehouse.

In the October DataHub Town Hall, Surya Lanka from Acryl Data gave us a preview of fine-grained control over data profiling, including:

  • High-level Performance controls, including disabling expensive operations and setting a default number of rows to sample
  • Column-level filtering to include or exclude specific columns, or to turn off column-level profiling altogether
  • Column-level metric filtering to specify which types of metrics to capture for each column (e.g. min, max, mean, stddev, quantile, histogram)
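
As a rough illustration of how these three kinds of controls can combine, here is a small Python sketch. Everything in it is an assumption made for the example: `ProfilingSettings`, `plan_profile`, and the set of "expensive" metrics are hypothetical and do not mirror DataHub's actual configuration keys. Row sampling is left out to keep it short.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Metrics treated as expensive in this sketch (an assumption, not DataHub's list).
EXPENSIVE_METRICS = {"quantile", "histogram"}


@dataclass
class ProfilingSettings:
    """Hypothetical settings object; names do not mirror DataHub's config."""
    column_profiling: bool = True    # False turns off column-level profiling
    disable_expensive: bool = False  # high-level performance switch
    allow_columns: List[str] = field(default_factory=lambda: [".*"])
    deny_columns: List[str] = field(default_factory=list)
    metrics: List[str] = field(
        default_factory=lambda: ["min", "max", "mean", "stddev", "quantile", "histogram"]
    )


def plan_profile(columns: List[str], settings: ProfilingSettings) -> Dict[str, List[str]]:
    """Return the metrics to compute for each column that passes the filters."""
    if not settings.column_profiling:
        return {}
    metrics = [
        m for m in settings.metrics
        if not (settings.disable_expensive and m in EXPENSIVE_METRICS)
    ]
    plan: Dict[str, List[str]] = {}
    for col in columns:
        allowed = any(re.fullmatch(p, col) for p in settings.allow_columns)
        denied = any(re.fullmatch(p, col) for p in settings.deny_columns)
        if allowed and not denied:
            plan[col] = list(metrics)
    return plan
```

For example, denying `raw_.*` and disabling expensive metrics would leave only the cheap aggregates on the remaining columns.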

Watch the full demo below —

Improvements to Lineage Support in DataHub

Gabe Lyons and Varun Bharill from Acryl Data gave an update on Lineage support within DataHub during the October Town Hall.

BigQuery users — this one’s for you! As of v0.8.17, you can fully leverage your Google Audit Logs to infer dataset lineage, making it easier than ever to build a lineage graph for your transformed datasets. Note: this requires access to Google Cloud API.
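
Conceptually, audit-log-based lineage boils down to pairing the table a query wrote to with the tables it read from. The Python sketch below shows that idea on simplified, made-up log entries. The `destination` and `referenced` keys and the `infer_lineage` function are hypothetical, and real Google Audit Log entries are more deeply nested than this.

```python
from typing import Dict, Iterable, Set


def infer_lineage(entries: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map each destination table to the set of upstream tables that fed it."""
    upstreams: Dict[str, Set[str]] = {}
    for entry in entries:
        dest = entry.get("destination")
        if not dest:
            continue  # e.g. a plain SELECT that wrote nothing
        # A table that reads from itself is not its own upstream.
        sources = set(entry.get("referenced", [])) - {dest}
        if sources:
            upstreams.setdefault(dest, set()).update(sources)
    return upstreams
```

Merging the sources across many log entries is what lets repeated scheduled queries build up a stable picture of each table's upstreams.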

For those of you that have faced issues loading complex lineage views, we’ve rolled out improvements to make the page much more responsive and easier to navigate.

We also introduced drag-and-drop functionality so you can move lineage nodes around as you’re navigating the graph! Try it out here

Watch the full presentation below —

Community Case Study: DataHub at hipages

During our October Town Hall, we had the honor of hearing from Chris Coulson from hipages as he shared the company’s experience adopting DataHub. He and his team have leveraged DataHub to supercharge data-related workflows for Analysts, Data Scientists, Data Engineers, and Senior Stakeholders.

Chris gave us an overview of the workflows they are able to power using DataHub:

  • Discovery — search for concepts, share ideas, and find work other people have done previously
  • Lineage — understand the data processing chain to debug problems and notify relevant stakeholders
  • Quality — define Glossary Terms to collaborate and communicate using an agreed-upon lexicon
  • Ownership — enforce and encourage responsibility and accountability for data sources

Check out the full presentation to learn how they leveraged the DataHub lineage graph to identify influential tables for data profiling, and to infer ownership by identifying clusters of datasets that commonly operate together.
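
One plausible way to read "influential" is "has many downstream dependents". The Python sketch below ranks tables by how many other tables are reachable downstream of them in a lineage graph. It is only an illustration under that assumption, not the method hipages presented, and the function names and the adjacency-dict input format are hypothetical.

```python
from typing import Dict, List, Set


def downstream_count(lineage: Dict[str, List[str]], table: str) -> int:
    """Count every table reachable downstream of `table` (iterative DFS)."""
    seen: Set[str] = set()
    stack = list(lineage.get(table, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(lineage.get(node, []))
    seen.discard(table)  # ignore cycles that lead back to the start
    return len(seen)


def rank_by_influence(lineage: Dict[str, List[str]]) -> List[str]:
    """All tables, most downstream dependents first; ties broken by name."""
    tables = set(lineage) | {t for ds in lineage.values() for t in ds}
    return sorted(tables, key=lambda t: (-downstream_count(lineage, t), t))
```

Tables at the top of such a ranking would be natural candidates to profile first, since problems in them affect the most consumers.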

That’s it for this round! Questions? Comments? Post them below — I can’t wait to hear from you!

Join us on Slack, follow us on Twitter, subscribe to our YouTube channel, and RSVP for our monthly Town Hall!
