Introducing DataHub Integration with Cassandra

We’re thrilled to announce that Cassandra, one of the most popular NoSQL databases, is now supported as a first-class ingestion source in DataHub!

Starting in v0.14.1.12 of the DataHub CLI, you can integrate metadata from Cassandra directly into your data catalog, unlocking data discovery and data governance on top of your Cassandra ecosystem. 🚀

Why This Matters

Cassandra is known for its high scalability and fault tolerance, making it a favorite for organizations managing large volumes of data.

With the new connector, DataHub empowers teams to extract, visualize, and manage Cassandra metadata in a centralized place, ensuring that no data asset goes untracked. Whether you use Cassandra Enterprise Edition (EE) or DataStax Astra DB, this integration brings unparalleled metadata management capabilities to your DataHub environment.

What’s Supported?

Comprehensive Metadata Extraction
The Cassandra source extracts a wide array of metadata for discovery & governance within DataHub, including:

  • Tables
  • Materialized views
  • Columns
  • Keyspaces

Figure: Manage and discover keyspaces, tables, and materialized views in DataHub.

In addition, the connector supports Table profiling, which allows you to collect rich statistics about tables and columns stored in Cassandra.

A word of caution: Table profiling provides valuable statistics to DataHub users, but may result in long-running queries for large datasets. Be sure to configure limits to avoid performance bottlenecks.
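
As a rough illustration of the caution above, the recipe fragment below turns profiling on while keeping it narrow. It is a sketch rather than a definitive configuration: it assumes the Cassandra source accepts the `profiling.enabled`, `profiling.profile_table_level_only`, and `profile_pattern` options found in other DataHub sources, and the keyspace and table names are hypothetical. Check the connector documentation for the exact option names and pattern format before relying on it.

```yaml
# cassandra-profiling-recipe.yml (illustrative sketch)
source:
  type: "cassandra"
  config:
    contact_point: "<your-cassandra-contact-point>"
    port: 9042
    profiling:
      enabled: true
      # Assumption: limits profiling to table-level statistics,
      # skipping the more expensive per-column queries.
      profile_table_level_only: true
    # Assumption: only tables matching these regexes are profiled.
    # "analytics" and "events" are hypothetical names.
    profile_pattern:
      allow:
        - "analytics\\.events"
sink:
  type: "datahub-rest"
  config:
    server: "http://<your-datahub-gms-server>:8080"
```

Starting with table-level profiling on a small allow-list and widening it gradually is one way to see how long the profiling queries take before running them against large tables.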

Stateful Ingestion
Automatically detect and remove tables & keyspaces deleted inside Cassandra using the stateful_ingestion.remove_stale_metadata feature, keeping DataHub clean and up-to-date.
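
A minimal sketch of a recipe with stale-metadata removal enabled might look like the following. The `pipeline_name` value is a hypothetical placeholder; stateful ingestion in DataHub generally relies on a stable pipeline name to compare one run against the previous one, so keep it unchanged between runs.

```yaml
# cassandra-stateful-recipe.yml (illustrative sketch)
# Hypothetical name; keep it stable across runs so state can be compared.
pipeline_name: "cassandra_ingestion"
source:
  type: "cassandra"
  config:
    contact_point: "<your-cassandra-contact-point>"
    port: 9042
    stateful_ingestion:
      enabled: true
      # Soft-deletes entities in DataHub that no longer exist in Cassandra.
      remove_stale_metadata: true
sink:
  type: "datahub-rest"
  config:
    server: "http://<your-datahub-gms-server>:8080"
```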

Cloud & On-Prem Support
Whether you’re leveraging Astra DB’s managed cloud offering or running Cassandra EE on-premises, this connector supports both setups with easy configuration.

Setup Made Simple

Getting started with Cassandra ingestion is a breeze! Here’s a quick overview:

1. Set Up Credentials

  • For Astra DB, generate an Application Token with the required permissions and download the Secure Connect Bundle.
  • For Cassandra EE, ensure you have a user with SELECT permissions on the necessary keyspaces.
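
For Astra DB, the token and bundle from the first bullet take the place of a contact point and username/password. The fragment below is a sketch that assumes the connector reads them from a `cloud_config` block with `secure_connect_bundle` and `token` keys; confirm the exact key names in the connector documentation. The path and token values are placeholders.

```yaml
# cassandra-astra-recipe.yml (illustrative sketch)
source:
  type: "cassandra"
  config:
    # Assumption: Astra DB credentials are supplied under cloud_config.
    cloud_config:
      secure_connect_bundle: "<path-to-secure-connect-bundle.zip>"
      token: "<your-application-token>"
sink:
  type: "datahub-rest"
  config:
    server: "http://<your-datahub-gms-server>:8080"
```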

2. Configure the Ingestion Source

Use the following starter recipe to connect your Cassandra instance to DataHub:

# cassandra-recipe.yml
source:
  type: "cassandra"
  config:
    contact_point: "<your-cassandra-contact-point>"
    port: 9042
    username: "admin"
    password: "password"
sink:
  type: "datahub-rest"
  config:
    server: "http://<your-datahub-gms-server>:8080"

3. Run the Ingestion

Run the recipe from the DataHub CLI or UI to start ingesting metadata from Cassandra. With the CLI:

datahub ingest -c cassandra-recipe.yml

And that’s it! We’re on our way to a consistent, well-governed, high-quality data ecosystem. 🎉

Learn More
Explore all configuration options in our Cassandra Connector Documentation.

Join the DataHub Community

We couldn’t be more excited about this release, and we can’t wait to see how you use it! Have questions or feedback? Ping us on Slack or check out the source code.
