DataHub Integrations
Services that integrate with DataHub
-
Airflow
Airflow is an open-source data orchestration tool used for scheduling, monitoring, and managing complex data pipelines.
-
Apache Hudi
Apache Hudi is an open-source data lake framework that provides ACID transactions, efficient upserts, time travel queries, and incremental data processing for large-scale datasets.
-
Athena
Athena is a serverless interactive query service that enables users to analyze data in Amazon S3 using standard SQL.
-
Azure AD
Azure AD is a cloud-based identity and access management tool that provides secure authentication and authorization for users and applications.
-
BigQuery
BigQuery is a cloud-based data warehousing and analytics tool that allows users to store, query, and analyze large datasets quickly and efficiently.
-
Business Glossary
A source provided by DataHub for ingesting glossary metadata that provides a comprehensive list of business terms and definitions used within an organization.
-
ClickHouse
ClickHouse is an open-source column-oriented database management system designed for high-performance data processing and analytics.
-
CSV
An ingestion source, provided by DataHub, for enriching metadata from CSV files.
-
Dagster
Dagster is a next-generation open-source orchestration platform for the development, production, and observation of data assets.
-
Databricks
Databricks is a cloud-based data processing and analytics platform that enables data scientists and engineers to collaborate and build data-driven applications.
-
DataHub
Integrate your open-source DataHub instance with DataHub Cloud or other on-prem DataHub instances.
-
dbt
dbt is a data transformation tool that enables analysts and engineers to transform data in their warehouses through a modular, SQL-based approach.
-
Delta Lake
Delta Lake is an open-source data lake storage layer that provides ACID transactions, schema enforcement, and data versioning for big data workloads.
-
Demo Data
Demo Data is a data tool that provides sample data sets for demonstration and testing purposes.
-
Druid
Druid is an open-source data store designed for real-time analytics on large datasets.
-
Elasticsearch
Elasticsearch is a distributed, open-source search and analytics engine designed for handling large volumes of data.
-
Feast
Feast is an open-source feature store that enables teams to manage, store, and discover features for machine learning applications.
-
File
An ingestion source, provided by DataHub, for single files.
-
File Based Lineage
File Based Lineage is a data tool that tracks the lineage of data files and their dependencies.
-
Glue
Glue is a data integration service that allows users to extract, transform, and load data from various sources into a data warehouse.
-
Great Expectations
Great Expectations is an open-source data validation and testing tool that helps data teams maintain data quality and integrity.
-
Hive
Hive is a data warehousing tool that facilitates querying and managing large datasets stored in the Hadoop Distributed File System (HDFS).
-
Hive Metastore
Hive Metastore (HMS) is a service that stores metadata related to Hive, Presto, Trino, and other services in a backend relational database management system (RDBMS).
-
Iceberg
Iceberg is an open table format for managing and querying large-scale analytic datasets.
-
JSON Schemas
JSON Schemas define the structure, format, and validation rules for JSON data.
-
Kafka
Kafka is a distributed streaming platform that allows for the processing and storage of large amounts of data in real-time.
-
Kafka Connect
Kafka Connect is an open-source data integration tool that enables the transfer of data between Apache Kafka and other data systems.
-
LDAP
LDAP (Lightweight Directory Access Protocol) is a protocol for accessing and managing distributed directory information services over an IP network.
-
Looker
Looker is a business intelligence and data analytics platform that allows users to explore, analyze, and share data insights in real-time.
-
MariaDB
MariaDB is an open-source relational database management system that is a fork of MySQL.
-
Metabase
Metabase is an open-source business intelligence and data visualization tool that allows users to easily query and visualize their data.
-
Microsoft SQL Server
Microsoft SQL Server is a relational database management system designed to store, manage, and retrieve data efficiently and securely.
-
Microsoft Teams
Send notifications to Teams channels on updates to entities in DataHub.
-
MLflow
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle.
-
Mode
Mode is a cloud-based data analysis and visualization platform that enables businesses to explore, analyze, and share data in a collaborative environment.
-
MongoDB
MongoDB is a NoSQL database that stores data in flexible, JSON-like documents, making it easy to store and retrieve data for modern applications.
-
MySQL
MySQL is an open-source relational database management system that allows users to store, organize, and retrieve data efficiently.
-
NiFi
NiFi is a data integration tool that allows users to automate the flow of data between systems and applications.
-
Okta
Okta is a cloud-based identity and access management tool that enables secure and seamless access to applications and data across multiple devices and platforms.
-
OpenAPI
OpenAPI is a specification for building and documenting RESTful APIs.
-
Oracle
Oracle is a relational database management system that provides a comprehensive and integrated platform for managing and analyzing large amounts of data.
-
Postgres
Postgres is an open-source relational database management system that provides a powerful tool for storing, managing, and analyzing large amounts of data.
-
PowerBI
PowerBI is a business analytics service by Microsoft that provides interactive visualizations and business intelligence capabilities with an interface simple enough for end users to create their own reports and dashboards.
-
Prefect
Prefect is a modern workflow orchestration tool for data and ML engineers.
-
Presto
Presto is an open-source distributed SQL query engine designed for fast and interactive analytics on large-scale data sets.
-
Protobuf Schemas
Protobuf Schemas define structured data for compact and efficient serialization.
-
Pulsar
Pulsar is a real-time data processing and messaging platform that enables high-performance data streaming and processing.
-
Redash
Redash is a data visualization and collaboration platform that allows users to connect and query multiple data sources and create interactive dashboards and visualizations.
-
Redshift
Redshift is a cloud-based data warehousing tool that allows users to store and analyze large amounts of data in a scalable and cost-effective manner.
-
S3 Data Lake
S3 Data Lake is a cloud-based data storage and management tool that allows users to store, manage, and analyze large amounts of data in a scalable and cost-effective manner.
-
SageMaker
SageMaker is a fully managed platform for building, training, and deploying machine learning models at scale.
-
Salesforce
Salesforce is a cloud-based customer relationship management (CRM) platform that helps businesses manage their sales, marketing, and customer service activities.
-
SAP HANA
SAP HANA is an in-memory data platform that enables businesses to process large volumes of data in real-time.
-
Slack
Send notifications to Slack channels on updates to entities in DataHub.
-
Snowflake
Snowflake is a cloud-based data warehousing platform that allows users to store, manage, and analyze large amounts of structured and semi-structured data.
-
Spark
Spark is a data processing tool that enables fast and efficient processing of large-scale data sets using distributed computing.
-
SQLAlchemy
SQLAlchemy is a Python library that provides a set of high-level APIs for connecting to relational databases and performing SQL operations.
-
Superset
Superset is an open-source data exploration and visualization platform that allows users to create interactive dashboards and perform ad-hoc analysis on various data sources.
-
Tableau
Tableau is a data visualization and business intelligence tool that helps users analyze and present data in a visually appealing and interactive way.
-
Teradata
Teradata is a data warehousing and analytics tool that allows users to store, manage, and analyze large amounts of data in a scalable and cost-effective manner.
-
Trino
Trino is an open-source distributed SQL query engine designed to query large-scale data processing systems, including Hadoop, Cassandra, and relational databases.
-
Vertica
Vertica is a high-performance, column-oriented, relational database management system designed for large-scale data warehousing and