Launching our Connector to GCP Knowledge Catalog

Sync metadata from your entire GCP estate through our Knowledge Catalog integration. Work in DataHub or in the GCP Console. 

Over the past few months, we’ve had the distinct pleasure of working closely with the Knowledge Catalog product team to build a way for teams to gain greater control and visibility of complex ML, analytics, and AI pipelines on GCP. The outcome of this work is our new connector to GCP Knowledge Catalog, the successor to Dataplex. 

“We’ve deepened our collaboration with DataHub to further help joint customers bring richer context and governance to their data on Google Cloud,” said Maggy Hu, Product Manager at Google. “The expanded Knowledge Catalog connector to DataHub makes it even easier for enterprises to build confidently with AI on GCP.”

Previously, our Knowledge Catalog connector pulled metadata from Knowledge Catalog into DataHub, but it was limited to BigQuery and GCS assets. That was great for teams building traditional BI pipelines on BigQuery and Looker, but it didn’t offer enough coverage for teams running ML workloads, streaming pipelines, or multi-service architectures on GCP.

Today, our connector to the newly launched GCP Knowledge Catalog covers Vertex AI, Bigtable, Spanner, Pub/Sub, BigQuery, Cloud SQL, and GCS. For BigQuery assets, DataHub Cloud also provides bidirectional sync: the enrichment you add in DataHub (documentation, tags, glossary terms) flows back into Knowledge Catalog, while DataHub won’t overwrite governance policies or labels that originate in Knowledge Catalog.
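
To make that direction of flow concrete, here’s a minimal sketch using DataHub’s Python REST emitter to tag a BigQuery table. The server URL, table name, and tag are placeholders, and bidirectional sync itself is configured in DataHub Cloud rather than in this snippet; the snippet only shows the kind of enrichment that would sync back.

```python
from datahub.emitter.mce_builder import make_dataset_urn, make_tag_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import GlobalTagsClass, TagAssociationClass

# Placeholder server URL and table name: point these at your own DataHub
# instance and your own BigQuery asset.
emitter = DatahubRestEmitter(gms_server="https://your-instance.acryl.io/gms")
dataset_urn = make_dataset_urn(
    platform="bigquery", name="my-project.analytics.daily_orders", env="PROD"
)

# Attach a tag in DataHub. For BigQuery assets with bidirectional sync enabled,
# enrichment like this is what flows back into Knowledge Catalog.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=GlobalTagsClass(
            tags=[TagAssociationClass(tag=make_tag_urn("gold-tier"))]
        ),
    )
)
```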

In the coming months we’ll keep expanding the connector to cover the full breadth of what Knowledge Catalog offers: one integration point, consistent metadata across both platforms.

Why Knowledge Catalog, and why now?

Google’s metadata catalog offering has matured significantly over the past year. In response to more teams building ML and AI workloads on GCP, Knowledge Catalog now functions as a genuine metadata fabric across GCP, aggregating catalog information, lineage signals, and governance policies from services that previously had no unified API surface. For teams that have been pulling metadata from BigQuery, GCS, Vertex AI, and Pub/Sub through separate connectors, each with its own configuration and schedule, Knowledge Catalog now offers a single extraction point that’s maintained by Google and kept consistent with the underlying services.

As Knowledge Catalog becomes a richer product, our users will want to bring that richness, and that simplicity of ingestion, into DataHub. And as more DataHub users build the context layer for their ML pipelines within DataHub, they want to sync that enrichment back to Knowledge Catalog so it’s visible in the GCP Console.

Improved management of ML pipelines  

This update matters most for teams running ML and streaming workloads. Consider a common GCP ML architecture: training data lives in BigQuery, a Vertex AI pipeline reads from it, trains a model, registers it in the Vertex Model Registry, and writes predictions back to BigQuery or GCS. 

With Vertex AI metadata now flowing through Knowledge Catalog into DataHub, you get the full graph from one connector: source tables, pipeline executions, model versions, endpoints, and output destinations, whether they live on GCP or another cloud. The same applies to streaming architectures where Dataflow jobs consume from Pub/Sub or Kafka, call Vertex endpoints for real-time inference, and write results to Bigtable or back to Pub/Sub.

For example, Trustpilot is building ML pipelines on GCP. Before this connector, their foundational ML models appeared as isolated BigQuery tables in their catalog, with no visibility into what produced them or what consumed them downstream.

“DataHub’s Knowledge Catalog integration was the missing piece for us. We can finally see our ML pipelines, models and data assets across GCP and AWS in one place, end to end.”

— David Walker, Staff Data Engineer, Trustpilot

Trustpilot ML engineers will be able to trace lineage from source data through Vertex pipelines to model registration and inference endpoints, and their data scientists can discover which upstream tables feed into foundational models. The entire team will have visibility into what models exist, where they’re deployed, and what data they depend on, without needing project-level GCP access.

Getting started

The Knowledge Catalog connector is available now in DataHub v1.5.0.2 and later. If you’re on DataHub Cloud, you’ll find it in your ingestion UI; if you’re on DataHub Core, update to the latest ingestion package. Read the full docs for details.
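
For DataHub Core, a programmatic ingestion run might look roughly like the sketch below. The source type name `gcp-knowledge-catalog` and its config keys are placeholders, not the connector’s actual schema; take the real names from the connector docs.

```python
# pip install --upgrade acryl-datahub  (plus the connector's extra, per the docs)
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            # Placeholder type and config keys; see the connector docs for the
            # actual source name and required fields.
            "type": "gcp-knowledge-catalog",
            "config": {
                "project_ids": ["my-gcp-project"],
                "credential_path": "/path/to/service-account.json",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "https://your-instance.acryl.io/gms"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```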

If you were previously running separate BigQuery, GCS, and Vertex AI connectors, you can run the Knowledge Catalog connector alongside them during a migration period. DataHub deduplicates entities by URN, so there’s no risk of creating duplicate assets. When you enable Knowledge Catalog’s native integration with a GCP service, DataHub picks it up on the next ingestion run. Once you’re confident the Knowledge Catalog connector covers what you need, you can retire the individual connectors.
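
The deduplication comes down to URNs: assuming the Knowledge Catalog source maps a BigQuery table to the same platform and fully-qualified name as the dedicated BigQuery connector (which is what the merge behavior implies), both resolve to the same entity. A small sketch with DataHub’s URN builder, using a placeholder table name:

```python
from datahub.emitter.mce_builder import make_dataset_urn

# Whether the metadata arrives via the dedicated BigQuery connector or the
# Knowledge Catalog connector, the same platform + fully-qualified table name
# yields the same URN, so DataHub merges the two rather than duplicating them.
urn_from_bigquery_connector = make_dataset_urn(
    platform="bigquery", name="my-project.analytics.daily_orders", env="PROD"
)
urn_from_knowledge_catalog = make_dataset_urn(
    platform="bigquery", name="my-project.analytics.daily_orders", env="PROD"
)
assert urn_from_bigquery_connector == urn_from_knowledge_catalog
# urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.analytics.daily_orders,PROD)
```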

For GCP services that aren’t yet covered, you can still use the dedicated service connectors where available. They will not be deprecated.

If you’re running GCP workloads that span multiple services and you’ve been stitching together connector configs to get coverage, this is a great time to consolidate. 

To learn more, book some time with our sales team or join the conversation in DataHub’s Slack community.

For details on how we work with Google Cloud, visit our Google Cloud partners page.
