Introducing DataHub Integration with Dremio

We’re excited to announce that Dremio, a powerful data lakehouse platform, is now supported as a first-class ingestion source in DataHub! Starting in v0.14.1.12, you can seamlessly integrate metadata from Dremio into your data catalog, unlocking advanced data discovery, governance, and observability across your entire data ecosystem.

Why This Matters

Dremio provides a modern approach to querying data directly in your data lake, making it a favorite for organizations that prioritize flexibility and performance.

By integrating Dremio with DataHub, you can:

  • Centralize metadata from your Dremio Spaces, Folders, and Datasets for unified discovery & governance.
  • Gain visibility into data lineage and ownership to improve trust in your data assets.
  • Profile datasets for deep insights, enabling better data governance and decision-making.

Whether you’re using Dremio Cloud or Dremio Enterprise (Self-Managed), this connector will enhance your metadata management capabilities by allowing you to view and manage Dremio assets alongside everything else in the enterprise.

What’s Supported?

Comprehensive Metadata Extraction

The Dremio source extracts a wide array of metadata for discovery & governance within DataHub, including:

  • Spaces: Extract top-level organizational containers.
  • Folders: Extract sub-level organizational containers.
  • Sources: Extract external data connections.
  • Physical Datasets: Extract schema and column metadata for tables.
  • Virtual Datasets: Extract metadata for views, including dependencies and transformations.
  • Asset Descriptions: Extract descriptions of datasets, tables, and columns.
  • Asset Owners: Extract dataset ownership metadata, where present.
  • (Optional) Table & Column Lineage: Track dependencies and transformations between datasets using the history of queries executed on Dremio (see the configuration excerpt after this list).
  • (Optional) Data Profiling: Extract data statistics like row counts and column distributions. A word of caution: table profiling provides valuable statistics to DataHub users, but may result in long-running queries for large datasets.
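
Both optional features are toggled in the source configuration. Here is a minimal sketch of how they might look under source.config in the starter recipe shown below; include_query_lineage appears in that recipe, while the profiling block follows DataHub's common profiling pattern, so confirm the exact option names in the connector documentation:

# under source.config in your recipe
include_query_lineage: true   # derive table/column lineage from Dremio query history
profiling:
  enabled: true               # assumed standard DataHub profiling toggle; verify in the Dremio connector docs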

Stateful Ingestion
Automatically detect and remove datasets and containers that have been deleted from Dremio using the stateful_ingestion.remove_stale_metadata feature, keeping DataHub clean and up-to-date.
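
As a minimal sketch, this typically looks as follows under source.config (remove_stale_metadata is named above; the enabled flag follows DataHub's usual stateful-ingestion pattern):

# under source.config in your recipe
stateful_ingestion:
  enabled: true                 # assumed standard toggle for stateful ingestion
  remove_stale_metadata: true   # soft-delete assets that no longer exist in Dremio

Note that DataHub's stateful ingestion generally requires a pipeline_name at the top level of the recipe so successive runs can be compared.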

Cloud & On-Prem Support
Whether you’re leveraging Dremio Cloud or Dremio Enterprise, this connector supports both setups with easy configuration.

Setup Made Simple

Getting started with Dremio ingestion is straightforward! Just follow these three steps:

1. Generate an API Token

  • Log in to your Dremio instance and navigate to your user profile.
  • Select Generate API Token and ensure it has the necessary permissions to access metadata, lineage, and datasets.
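
Rather than pasting the token directly into a recipe, a common approach is to export it as an environment variable (for example DREMIO_PAT, a hypothetical name) and reference it from the recipe; DataHub recipes support environment-variable expansion:

# under source.config in your recipe
password: "${DREMIO_PAT}"   # expanded from the environment at ingestion time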

2. Configure the Ingestion Source

Set up your Dremio ingestion source using the following starter recipe:

# dremio-recipe.yml
source:
  type: dremio
  config:
    hostname: "<your-dremio-host>"
    port: 9047
    authentication_method: PAT
    password: "<your-api-token>"
    include_query_lineage: true  # Set to true to include query log lineage
    is_dremio_cloud: false       # Set to true for Dremio Cloud instances

sink:
  type: datahub-rest
  config:
    server: "http://<your-datahub-instance>:8080"
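
Before pointing the pipeline at a live DataHub instance, it can be useful to write the extracted metadata to a local file and inspect it first. A minimal sketch using DataHub's file sink (swap it in for the datahub-rest sink above; the filename is arbitrary):

sink:
  type: file
  config:
    filename: "./dremio_metadata.json"   # review this output before pushing to DataHub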

3. Run the Ingestion

Use the CLI or UI to ingest metadata into DataHub:

datahub ingest -c dremio-recipe.yml

And that’s it! You’re on your way to a consistent, well-governed, high-quality data ecosystem.

Learn More

Explore all the configuration options in the Dremio Connector Documentation.

Join the DataHub Community

We’re excited to see how you leverage Dremio integration in your DataHub workflows. Have questions or feedback? Join our Slack or browse the Dremio source code for more insights.
