DataHub Integrations

Services that integrate with DataHub
  • Airflow

    Airflow is an open-source data orchestration tool used for scheduling, monitoring, and managing complex data pipelines.

  • Apache Hudi

    Apache Hudi is an open-source data lake framework that provides ACID transactions, efficient upserts, time travel queries, and incremental data processing for large-scale datasets.

  • Athena

    Athena is a serverless interactive query service that enables users to analyze data in Amazon S3 using standard SQL.

  • Azure AD

    Azure AD is a cloud-based identity and access management tool that provides secure authentication and authorization for users and applications.

  • BigQuery

    BigQuery is a cloud-based data warehousing and analytics tool that allows users to store, query, and analyze large datasets quickly and efficiently.

  • Business Glossary

    A source provided by DataHub for ingesting glossary metadata that provides a comprehensive list of business terms and definitions used within an organization.

  • ClickHouse

    ClickHouse is an open-source column-oriented database management system designed for high-performance data processing and analytics.

  • CSV

    An ingestion source provided by DataHub for enriching metadata from CSV files.

  • Dagster

    Dagster is a next-generation open-source orchestration platform for the development, production, and observation of data assets.

  • Databricks

    Databricks is a cloud-based data processing and analytics platform that enables data scientists and engineers to collaborate and build data-driven applications.

  • DataHub

    Integrate your open-source DataHub instance with DataHub Cloud or other on-prem DataHub instances.

  • dbt

    dbt is a data transformation tool that enables analysts and engineers to transform data in their warehouses through a modular, SQL-based approach.

  • Delta Lake

    Delta Lake is an open-source data lake storage layer that provides ACID transactions, schema enforcement, and data versioning for big data workloads.

  • Demo Data

    Demo Data is a data tool that provides sample data sets for demonstration and testing purposes.

  • Druid

    Druid is an open-source data store designed for real-time analytics on large datasets.

  • Elasticsearch

    Elasticsearch is a distributed, open-source search and analytics engine designed for handling large volumes of data.

  • Feast

    Feast is an open-source feature store that enables teams to manage, store, and discover features for machine learning applications.

  • File

    An ingestion source provided by DataHub for single files.

  • File Based Lineage

    File Based Lineage is an ingestion source provided by DataHub for lineage information declared in a file.

  • Glue

    Glue is a data integration service that allows users to extract, transform, and load data from various sources into a data warehouse.

  • Great Expectations

    Great Expectations is an open-source data validation and testing tool that helps data teams maintain data quality and integrity.

  • Hive

    Hive is a data warehousing tool that facilitates querying and managing large datasets stored in Hadoop Distributed File System (HDFS).

  • Hive Metastore

    Hive Metastore (HMS) is a service that stores metadata for Hive, Presto, Trino, and other services in a backend relational database management system (RDBMS).

  • Iceberg

    Iceberg is an open-source table format for managing and querying large-scale analytic datasets.

  • JSON Schemas

    JSON Schema is a vocabulary used to define the structure, format, and validation rules for JSON data.

  • Kafka

    Kafka is a distributed streaming platform that allows for the processing and storage of large amounts of data in real-time.

  • Kafka Connect

    Kafka Connect is an open-source data integration tool that enables the transfer of data between Apache Kafka and other data systems.

  • LDAP

    LDAP (Lightweight Directory Access Protocol) is a protocol used for accessing and managing distributed directory information services over an IP network.

  • Looker

    Looker is a business intelligence and data analytics platform that allows users to explore, analyze, and share data insights in real-time.

  • MariaDB

    MariaDB is an open-source relational database management system that is a fork of MySQL.

  • Metabase

    Metabase is an open-source business intelligence and data visualization tool that allows users to easily query and visualize their data.

  • Microsoft SQL Server

    Microsoft SQL Server is a relational database management system designed to store, manage, and retrieve data efficiently and securely.

  • Microsoft Teams

    Send notifications to Teams channels on updates to entities in DataHub.

  • MLflow

    MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.

  • Mode

    Mode is a cloud-based data analysis and visualization platform that enables businesses to explore, analyze, and share data in a collaborative environment.

  • MongoDB

    MongoDB is a NoSQL database that stores data in flexible, JSON-like documents, making it easy to store and retrieve data for modern applications.

  • MySQL

    MySQL is an open-source relational database management system that allows users to store, organize, and retrieve data efficiently.

  • NiFi

    NiFi is a data integration tool that allows users to automate the flow of data between systems and applications.

  • Okta

    Okta is a cloud-based identity and access management tool that enables secure and seamless access to applications and data across multiple devices and platforms.

  • OpenAPI

    OpenAPI is a specification for building and documenting RESTful APIs.

  • Oracle

    Oracle is a relational database management system that provides a comprehensive and integrated platform for managing and analyzing large amounts of data.

  • Postgres

    Postgres is an open-source relational database management system that provides a powerful tool for storing, managing, and analyzing large amounts of data.

  • PowerBI

    PowerBI is a business analytics service by Microsoft that provides interactive visualizations and business intelligence capabilities with an interface simple enough for end users to create their own reports and dashboards.

  • Prefect

    Prefect is a modern workflow orchestration tool for data and ML engineers.

  • Presto

    Presto is an open-source distributed SQL query engine designed for fast and interactive analytics on large-scale data sets.

  • Protobuf Schemas

    Protobuf Schemas is a data tool used for defining and serializing structured data in a compact and efficient manner.

  • Pulsar

    Pulsar is a real-time data processing and messaging platform that enables high-performance data streaming and processing.

  • Redash

    Redash is a data visualization and collaboration platform that allows users to connect and query multiple data sources and create interactive dashboards and visualizations.

  • Redshift

    Redshift is a cloud-based data warehousing tool that allows users to store and analyze large amounts of data in a scalable and cost-effective manner.

  • S3 Data Lake

    S3 Data Lake is a cloud-based data storage and management tool that allows users to store, manage, and analyze large amounts of data in a scalable and cost-effective manner.

  • SageMaker

    SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.

  • Salesforce

    Salesforce is a cloud-based customer relationship management (CRM) platform that helps businesses manage their sales, marketing, and customer service activities.

  • SAP HANA

    SAP HANA is an in-memory data platform that enables businesses to process large volumes of data in real-time.

  • Slack

    Send notifications to Slack channels on updates to entities in DataHub.

  • Snowflake

    Snowflake is a cloud-based data warehousing platform that allows users to store, manage, and analyze large amounts of structured and semi-structured data.

  • Spark

    Spark is a data processing tool that enables fast and efficient processing of large-scale data sets using distributed computing.

  • SQLAlchemy

    SQLAlchemy is a Python SQL toolkit that provides a set of high-level APIs for connecting to relational databases and performing SQL operations.

  • Superset

    Superset is an open-source data exploration and visualization platform that allows users to create interactive dashboards and perform ad-hoc analysis on various data sources.

  • Tableau

    Tableau is a data visualization and business intelligence tool that helps users analyze and present data in a visually appealing and interactive way.

  • Teradata

    Teradata is a data warehousing and analytics tool that allows users to store, manage, and analyze large amounts of data in a scalable and cost-effective manner.

  • Trino

    Trino is an open-source distributed SQL query engine designed to query large-scale data processing systems, including Hadoop, Cassandra, and relational databases.

  • Vertica

    Vertica is a high-performance, column-oriented, relational database management system designed for large-scale data warehousing and