
50 shades of Iceberg CDC

There is a plethora of ways to transfer data from common relational SQL databases (the source) to Apache Iceberg (the destination). This post covers only 100% open source software, with a few key points:

  • Debezium parses the database's binlog in every variant, so no alternative is JVM-free, and each one leaves a small footprint on the SQL server machine.
  • CDC levels are raw, historization, per-table transactional, and cross-table transactional.
  • Database flavours are MySQL, MariaDB, Oracle, and Microsoft SQL Server: everything that Debezium supports.
  • Components that may be involved are Kafka, Pulsar, and Flink.
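
To make the first two CDC levels more concrete, here is a minimal Python sketch. It assumes simplified Debezium-style change events (the `op`, `before`, `after` and `ts_ms` fields follow Debezium's event envelope; the sample rows and function names are made up for illustration). It reads "raw" as an append-only log of events and "historization" as one row per row version with a validity interval, which is one possible interpretation rather than a formal definition.

```python
# Simplified Debezium-style events for one hypothetical table (primary key: id).
events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "name": "Ann"}, "ts_ms": 1000},
    {"op": "u", "before": {"id": 1, "name": "Ann"},
     "after": {"id": 1, "name": "Anna"}, "ts_ms": 2000},
    {"op": "d", "before": {"id": 1, "name": "Anna"},
     "after": None, "ts_ms": 3000},
]


def raw_level(events):
    """Raw: append every change event unchanged, one row per event."""
    return [dict(e) for e in events]


def historized_level(events, key="id"):
    """Historization: one row per row version with valid_from / valid_to."""
    history = []
    open_version = {}  # key value -> index of the currently open version
    for e in sorted(events, key=lambda e: e["ts_ms"]):
        row = e["after"] if e["after"] is not None else e["before"]
        k = row[key]
        # Close the currently valid version of this key, if there is one.
        if k in open_version:
            history[open_version.pop(k)]["valid_to"] = e["ts_ms"]
        # Inserts and updates open a new version; deletes only close the old one.
        if e["op"] != "d":
            history.append({**e["after"], "valid_from": e["ts_ms"], "valid_to": None})
            open_version[k] = len(history) - 1
    return history


for version in historized_level(events):
    print(version)
```

The raw level keeps all three events as they arrived, while the historized level turns them into two versions of row 1, each with the interval during which it was valid. The transactional levels additionally need the events grouped by source transaction, which this sketch does not attempt.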

This post is not about the just-released Iceberg 1.20 CDC feature, where Iceberg reports changes through Spark procedures. The metadata Iceberg emits there is useful and deserves a post of its own.

The following picture summarizes the deployment variants and how they execute.

Below are the software projects and companies involved in "parsing" the binlog and "moving" the data:

0. Debezium

Debezium does the binlog parsing at the source database server.

Debezium
Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly an…

1. memiiso/debezium-server-iceberg

The first connector, used as a base and inspiration for the other connectors. It runs without Kafka and saves to Iceberg directly.

PRO: no Kafka needed

CON: No automatic schema evolution

GitHub - memiiso/debezium-server-iceberg: Replicates database CDC events to Apache Iceberg Tables

2. Getindata/kafka-connect-iceberg-sink

Based on memiiso's connector, it introduces Kafka as the source, which makes it a sink connector.

PRO: Enables Kafka

PRO: Supports creation of new tables and extending them with new columns.
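
As a rough picture of what "extending them with new columns" means, the Python sketch below compares an incoming record with the columns a table already has and lists the columns that would have to be added. It is not the connector's code: the function name, the sample table, and the simple Python-to-Iceberg type mapping are assumptions made for illustration.

```python
# Hypothetical mapping from Python value types to Iceberg primitive type names.
ICEBERG_TYPE_BY_PYTHON_TYPE = {
    bool: "boolean",
    int: "long",
    float: "double",
    str: "string",
}


def columns_to_add(table_columns, record):
    """Return {column: type} for record fields the table does not have yet."""
    missing = {}
    for name, value in record.items():
        if name not in table_columns:
            # Unknown or null values fall back to string in this sketch.
            missing[name] = ICEBERG_TYPE_BY_PYTHON_TYPE.get(type(value), "string")
    return missing


table_columns = {"id": "long", "name": "string"}
record = {"id": 1, "name": "Ann", "email": "ann@example.com", "active": True}
print(columns_to_add(table_columns, record))
```

A sink with schema evolution would then apply those additions to the table before writing the record. A sink without it has to fail or drop the extra fields.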

Real-time ingestion to Iceberg with Kafka Connect - Apache Iceberg Sink
At GetInData we have created an Apache Iceberg sink that can be deployed on a Kafka Connect instance. Run a Kafka Connect instance then deploy Debezium source and our Apache Iceberg sink.

3. 10xfuturetechnologies/kafka-connect-iceberg

Based on GetInData's connector; it appears to be an open-sourced variant of an initial customer project, which is good.

PRO: Supports schema evolution and table auto-creation.

CON: Append mode only.

CON: Not written or tested for Debezium, but it seems it can be configured for it using the transform called New Record State Extraction.
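
For readers unfamiliar with that transform: Debezium wraps every change in an envelope with `before`, `after` and `op` fields, and New Record State Extraction flattens it to the new row state. The Python sketch below only imitates that idea for illustration. It is not Debezium's implementation, the function name `extract_new_record_state` is made up, and the `__deleted` marker mirrors the transform's rewrite-style delete handling as I understand it.

```python
def extract_new_record_state(event, rewrite_deletes=True):
    """Flatten a Debezium-style envelope into a plain row (illustrative only)."""
    if event["op"] == "d":
        if not rewrite_deletes:
            return None  # drop the delete entirely
        # Keep the last known row and flag it as deleted.
        return {**event["before"], "__deleted": "true"}
    # Creates, updates and snapshot reads carry the new state in "after".
    row = dict(event["after"])
    if rewrite_deletes:
        row["__deleted"] = "false"
    return row


update = {"op": "u",
          "before": {"id": 7, "city": "Oslo"},
          "after": {"id": 7, "city": "Bergen"}}
delete = {"op": "d",
          "before": {"id": 7, "city": "Bergen"},
          "after": None}

print(extract_new_record_state(update))
print(extract_new_record_state(delete))
```

With an append-only sink, a flattened delete like this would presumably arrive as one more row carrying the deleted marker, so readers of the table would have to filter on it themselves.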

10x Future Technologies

4. Spark structured streaming

Use this for a do-it-yourself solution.

CON: You need to implement deduping etc. yourself.
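
As a rough idea of what deduping yourself involves, the sketch below collapses a micro-batch of change events to the latest event per primary key and splits the result into rows to upsert and keys to delete. It is plain Python rather than Spark code so that it runs anywhere. The field names `id`, `ts_ms` and `pos` (a log position used to break ties) and both function names are assumptions for illustration.

```python
def latest_per_key(batch, key="id"):
    """Keep only the newest change event for each primary key in a batch."""
    latest = {}
    for e in batch:
        row = e["after"] if e["after"] is not None else e["before"]
        k = row[key]
        order = (e["ts_ms"], e["pos"])
        # A redelivered event has the same order and does not replace the kept one.
        if k not in latest or order > (latest[k]["ts_ms"], latest[k]["pos"]):
            latest[k] = e
    return latest


def split_upserts_and_deletes(latest):
    """Turn the surviving events into rows to upsert and keys to delete."""
    upserts = [e["after"] for e in latest.values() if e["op"] != "d"]
    deletes = [k for k, e in latest.items() if e["op"] == "d"]
    return upserts, deletes


batch = [
    {"op": "c", "before": None, "after": {"id": 1, "v": "a"}, "ts_ms": 10, "pos": 1},
    {"op": "u", "before": {"id": 1, "v": "a"}, "after": {"id": 1, "v": "b"}, "ts_ms": 10, "pos": 2},
    {"op": "c", "before": None, "after": {"id": 2, "v": "x"}, "ts_ms": 11, "pos": 3},
    {"op": "d", "before": {"id": 2, "v": "x"}, "after": None, "ts_ms": 12, "pos": 4},
    # The same update delivered twice (at-least-once delivery).
    {"op": "u", "before": {"id": 1, "v": "a"}, "after": {"id": 1, "v": "b"}, "ts_ms": 10, "pos": 2},
]

upserts, deletes = split_upserts_and_deletes(latest_per_key(batch))
print(upserts, deletes)
```

In a Spark Structured Streaming job, logic of this kind would typically run per micro-batch, for example inside a `foreachBatch` function, before the result is merged into the Iceberg table.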

5. Flink CDC

PRO: Supports sharding and deduping

Using Flink CDC to synchronize data from MySQL sharding tables and build real-time data lake — CDC Connectors for Apache Flink® documentation

6. StreamNative/Apache Pulsar

Iceberg Sink Connector for Apache Pulsar | StreamNative
Read about the Iceberg Sink connector for Apache Pulsar that allows you to move data from Pulsar to Iceberg without requiring user code.