In the age of AI, the hunger for fresh, reliable data to power machine learning (ML) models and real-time analytics is insatiable. Yet, organizations frequently hit roadblocks when trying to bridge their operational data in motion, typically flowing through Apache Kafka®, with their data at rest in data lakehouses. On one side, you have the data streaming platform, the central nervous system managing the real-time flow of business events. On the other, you have the Databricks Data Intelligence Platform, the premier destination for large-scale AI and analytics workloads.
Historically, connecting these two worlds has been a story of friction. The process involved a tangled web of brittle, hand-coded ETL (Extract, Transform, Load) pipelines, often held together by custom scripts and nightly batch jobs. The consequences were real: data scientists worked with stale data, business leaders made decisions based on outdated dashboards, and operational applications remained blind to the powerful insights being generated by AI models.
This architectural bottleneck created a significant drag on an organization's ability to innovate. To bridge this gap, Confluent developed Tableflow, announcing its general availability at Current Bengaluru. Today, Confluent is proud to announce major new Tableflow capabilities that help truly operationalize AI, alongside a landmark expansion of its Databricks partnership.
At the heart of this expansion lies support for Delta Lake (Open Preview), announced at Current London, and a robust integration between Confluent’s Tableflow and Databricks Unity Catalog (Open Preview expected by end of June). This isn't just an incremental improvement; it's a reimagining of the data architecture that creates a seamless highway between data in motion and data at rest, making it easier than ever to feed the data lakehouse with Delta Lake tables.
Ready to see Tableflow in action?
Historically, feeding raw operational data from Kafka into Databricks and other data lakehouses in Delta format has been a complex, expensive, and error-prone process that requires building custom data pipelines. In these pipelines, you need to transfer data (using sink connectors), clean data, manage schemas, materialize change data capture streams, transform and compact data, and store it in Apache Parquet™ and Delta Lake table formats. This intricate workflow demands significant effort and expertise to ensure data consistency and usability.
What if you could eliminate all that hassle and have your Kafka topics automatically materialized into analytics-ready Delta Lake tables in your data lake or lakehouse? That’s precisely what Tableflow does: it changes how you feed data lakes and lakehouses by seamlessly materializing Kafka topics as Delta Lake tables. Here are the key capabilities of Tableflow:
Data Conversion: Converts Kafka segments and schemas in Avro, JSON, or Protobuf into Delta Lake-compatible schemas and Parquet files, using Schema Registry in Confluent Cloud as the source of truth.
Schema Evolution: Detects schema changes, such as adding fields or widening types, and automatically applies them to the corresponding table (see the sketch after this list).
Catalog Syncing: Syncs Tableflow-created tables as external tables in AWS Glue, Snowflake Open Catalog, and Databricks Unity Catalog (coming soon).
Table Maintenance and Metadata Management: Compacts small files automatically once a detection threshold is reached and handles snapshot and version expiration.
Choose Your Storage: Store the data in your own Amazon S3 bucket or let Confluent host and manage the storage for you.
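For example, a backward-compatible schema change registered in Confluent Schema Registry is all Tableflow needs in order to evolve the downstream table. The sketch below is a minimal illustration, assuming the confluent-kafka Python client and a hypothetical "orders-value" subject; the Schema Registry endpoint and credentials are placeholders.

```python
# Minimal sketch: register an evolved Avro schema so Tableflow can evolve the Delta Lake table.
# Assumes the confluent-kafka Python client and a hypothetical "orders-value" subject;
# the endpoint and credentials are placeholders.
from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

schema_registry = SchemaRegistryClient({
    "url": "https://<schema-registry-endpoint>",
    "basic.auth.user.info": "<sr-api-key>:<sr-api-secret>",
})

# New schema version: adds an optional "coupon_code" field with a default value,
# a backward-compatible change that Tableflow can detect and apply to the table.
evolved_schema = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "Order",
      "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "coupon_code", "type": ["null", "string"], "default": null}
      ]
    }
    """,
    schema_type="AVRO",
)

schema_id = schema_registry.register_schema("orders-value", evolved_schema)
print(f"Registered evolved schema with id {schema_id}")
```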
With just the push of a button, you can now represent your Kafka data in Confluent Cloud as Delta Lake tables to feed your data lake, data lakehouse, or any analytical engine.
Unity Catalog provides a fine-grained, unified governance solution for all data and AI assets on the Databricks Lakehouse Platform. When Tableflow populates Delta Lake tables, Unity Catalog steps in to:
Centralize Data Discovery & Access Control: Easily discover the streaming data ingested by Tableflow and manage permissions with fine-grained access controls for these Delta Lake tables (see the sketch after this list), ensuring security and compliance.
Provide End-to-End Lineage: Gain comprehensive visibility into your data’s journey. Unity Catalog can track lineage from Kafka topics, through Tableflow’s materialization into Delta Lake, and all the way to its consumption in Databricks notebooks, jobs, dashboards, and ML models. This is vital for impact analysis, troubleshooting, and meeting regulatory requirements.
Simplify Data Sharing: Securely share governed data assets, including those sourced from real-time streams, across different teams and workspaces.
Unify Governance: Apply consistent governance policies across all your data assets, whether they originate from batch sources or real-time streams made analytics-ready by Tableflow.
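To make the access-control piece concrete, here is a minimal sketch of granting read access to a Tableflow-created table, assuming the databricks-sql-connector Python package and a running SQL warehouse; the connection details, catalog, schema, table, and the "analysts" group are all placeholders.

```python
# Minimal sketch: fine-grained access control on a Tableflow-created Delta Lake table.
# Assumes the databricks-sql-connector package and a running SQL warehouse;
# hostname, HTTP path, token, and all object names are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        # Allow the hypothetical "analysts" group to read the table that
        # Tableflow registered in Unity Catalog, and nothing more.
        cursor.execute(
            "GRANT SELECT ON TABLE streaming_catalog.kafka_cluster_schema.orders TO `analysts`"
        )
```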
Let’s take a closer look at how you can seamlessly materialize real-time Kafka data into analytics-ready Delta Lake tables using Tableflow, Delta Lake, and Databricks Unity Catalog.
The process begins by creating a Kafka topic and publishing data to it, preferably using a defined schema in Confluent Schema Registry. When enabling Tableflow, choose Delta as the target table format. Then you’ll configure an AWS S3 bucket to store the tables and set up a provider integration in Confluent Cloud to grant Tableflow the necessary write access to that S3 location.
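To make that first step concrete, here is a minimal sketch of publishing schema-backed records with the confluent-kafka Python client; the "orders" topic, the Avro schema, and the Confluent Cloud endpoints and credentials are placeholders. Enabling Tableflow for that topic and choosing Delta as the table format can then be done, for example, from the Confluent Cloud Console.

```python
# Minimal sketch: publish schema-backed records to a Kafka topic that Tableflow will materialize.
# Assumes the confluent-kafka Python client; the bootstrap endpoint, credentials,
# and the "orders" topic and schema are placeholders.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer

ORDER_SCHEMA = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""

schema_registry = SchemaRegistryClient({
    "url": "https://<schema-registry-endpoint>",
    "basic.auth.user.info": "<sr-api-key>:<sr-api-secret>",
})

producer = SerializingProducer({
    "bootstrap.servers": "<bootstrap-endpoint>:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<cluster-api-key>",
    "sasl.password": "<cluster-api-secret>",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": AvroSerializer(schema_registry, ORDER_SCHEMA),
})

# Each record is serialized against the registered schema, which Tableflow later
# uses (via Schema Registry) to derive the Delta Lake table schema.
producer.produce(topic="orders", key="order-1001", value={"order_id": "order-1001", "amount": 42.5})
producer.flush()
```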
Once enabled, Tableflow automatically materializes the Kafka topic data into the specified S3 bucket as Delta Lake tables. To make these tables accessible from Databricks, you’ll configure Unity Catalog integration within Tableflow. Before doing that, you must first register your S3 storage location as an External Location in Databricks.
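Registering the storage location can be done in the Databricks UI or with a single SQL statement. The sketch below runs that statement through the databricks-sql-connector Python package and assumes a pre-created storage credential named tableflow_s3_cred with access to the bucket; the connection details and bucket path are placeholders.

```python
# Minimal sketch: register the Tableflow S3 bucket as a Unity Catalog external location.
# Assumes an existing storage credential ("tableflow_s3_cred") that can reach the bucket;
# hostname, HTTP path, token, and the bucket path are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(
            """
            CREATE EXTERNAL LOCATION IF NOT EXISTS tableflow_s3
            URL 's3://<your-tableflow-bucket>/'
            WITH (STORAGE CREDENTIAL tableflow_s3_cred)
            """
        )
```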
Once the external location is in place, you can establish the Unity Catalog integration at the Kafka cluster level in Confluent Cloud. During setup, you’ll provide your Databricks workspace URL along with the client ID and secret of a service principal that has permission to access Unity Catalog, as well as the name of the catalog you want to integrate with.
Once the integration is successful, Tableflow syncs all Delta Lake tables associated with the Kafka cluster to the specified Unity Catalog. A new schema, named after the cluster ID, will be created automatically, and all topics with Delta Lake format enabled in Tableflow will be published as external tables under that schema.
Once the Delta Lake tables are registered as external tables in Unity Catalog, you can query them directly from Databricks using familiar tools like Databricks SQL.
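For instance, assuming a catalog named streaming_catalog, a schema named after the Kafka cluster ID (lkc_abc123 here, purely hypothetical), and an orders topic synced by Tableflow, a query from Python via the databricks-sql-connector package might look like the sketch below; the connection details are placeholders.

```python
# Minimal sketch: query a Tableflow-synced Delta Lake table with Databricks SQL.
# Assumes the databricks-sql-connector package; the catalog ("streaming_catalog"),
# the schema named after the Kafka cluster ID ("lkc_abc123"), the "orders" table,
# and the connection details are all placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute(
            """
            SELECT order_id, SUM(amount) AS total_amount
            FROM streaming_catalog.lkc_abc123.orders  -- schema is named after the Kafka cluster ID
            GROUP BY order_id
            ORDER BY total_amount DESC
            LIMIT 10
            """
        )
        for row in cursor.fetchall():
            print(row)
```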
This also unlocks powerful downstream capabilities, such as leveraging Mosaic AI to build and deploy machine learning and generative AI applications, using LakeFlow to further transform data for various business needs, and integrating with BI and AI tools to drive real-time insights across your organization.
The strategic partnership between Confluent and Databricks, with Tableflow at its core, marks a pivotal advancement in the journey towards real-time AI and analytics. By seamlessly connecting data in motion with the Databricks Lakehouse Platform, this integrated solution empowers organizations to build a robust, scalable, and governed real-time data foundation. Tableflow is more than just an ingestion tool—it’s a key enabler that simplifies architectures, accelerates insights, and allows data teams to focus on innovation rather than data plumbing.
With that said, our journey doesn’t stop here. Throughout the remainder of the year and beyond, Confluent will continue to invest heavily in Tableflow, enhancing its capabilities, performance, and integration points based on customer feedback and evolving market needs.
Our product roadmap includes continued enhancements such as the general availability of Delta Lake support and Unity Catalog integration, upsert capabilities, dead letter queue (DLQ) functionality, Apache Flink® integration, bidirectional data flow, and support for Microsoft Azure and Google Cloud Platform, all aimed at further streamlining data transfer from operational systems to Databricks for analytics and AI-driven initiatives.
Together, Confluent and Databricks will continue to empower organizations to build next-generation, real-time applications and analytics in the cloud.
Ready to unlock the full, transformative potential of your streaming data for cutting-edge AI and advanced analytics within the Databricks ecosystem? Explore Tableflow today!
Learn More: Dive into the Tableflow product documentation.
See It in Action: Watch our short introduction video or Tim Berglund's lightboard explanation.
Get Started: If you're already using Confluent Cloud, navigate to the Tableflow section for your cluster. New users can get started with Confluent Cloud for free and explore Tableflow's capabilities.
Additionally, contact us today for a personalized demo and start unlocking the full potential of your data on Confluent Cloud and Databricks. We are incredibly excited to see how you leverage Tableflow and Databricks to turn your real-time data streams into tangible business value!
The preceding outlines our general product direction and is not a commitment to deliver any material, code, or functionality. The development, release, timing, and pricing of any features or functionality described may change. Customers should make their purchase decisions based upon services, features, and functions that are currently available.
Confluent and associated marks are trademarks or registered trademarks of Confluent, Inc.
Apache®, Apache Kafka®, Kafka®, Apache Flink®, Flink®, Apache Iceberg™️, Iceberg™️, and the Kafka, Flink, and Iceberg logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by the Apache Software Foundation is implied by using these marks. All other trademarks are the property of their respective owners.