X14 kod & infrastruktur AB

May 24, 2023 3 min read

50 shades of Iceberg CDC

There is a plethora of ways to transfer data from common relational SQL databases (the source) to Apache Iceberg (the destination). This post covers only 100% open-source software, with a few key points:

Debezium parses the database's binlog in every variant, so no alternative is JVM-free, and each one leaves a small footprint on the SQL server machine.
The CDC levels are raw, historization, per-table transactional, and cross-table transactional.
The database flavours are MySQL, MariaDB, Oracle, and Microsoft SQL Server: everything that Debezium supports.
Other components that may be involved are Kafka, Pulsar, and Flink.

This post is not about the CDC feature in the just-released Iceberg 1.2.0, where Iceberg reports changes through Spark procedures. The metadata that Iceberg emits there is useful and deserves a post of its own.

The following picture summarizes the deployment variants and how they execute.

Below are the software projects and companies involved in "parsing" the binlog and "moving" the data:

0. Debezium

Debezium does the binlog parsing at the source database server.
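For orientation, here is a minimal sketch of the change-event envelope that the downstream connectors have to deal with. The field names `before`, `after`, `op` and `ts_ms` follow Debezium's documented envelope; the `customers` columns and the helper `row_image` are made up for illustration.

```python
# Debezium operation codes as documented for the change-event envelope.
OP_NAMES = {"c": "create", "u": "update", "d": "delete", "r": "snapshot read"}


def row_image(event):
    """Return (operation name, row state the destination table should see)."""
    op = event["op"]
    # Deletes carry the last known row in "before"; all other ops use "after".
    row = event["before"] if op == "d" else event["after"]
    return OP_NAMES[op], row


# Hypothetical payload for an update on a "customers" table.
update_event = {
    "before": {"id": 1, "email": "old@example.com"},
    "after": {"id": 1, "email": "new@example.com"},
    "op": "u",
    "ts_ms": 1684915200000,
}

print(row_image(update_event))
```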

1. memiiso/debezium-server-iceberg

The first connector, used as a base and inspiration for the other connectors. It runs without Kafka and writes to Iceberg directly.

PRO: no Kafka needed

CON: No automatic schema evolution

2. Getindata/kafka-connect-iceberg-sink

Based on memiiso's connector, it introduces Kafka as the source, which makes it a sink connector.

PRO: Enables Kafka

PRO: Supports creation of new tables and extending them with new columns.

3. 10xfuturetechnologies/kafka-connect-iceberg

Based on Getindata's connector. It seems to be an open-sourced variant of an initial customer project, which is good.

PRO: Supports schema evolution and table auto-creation.

CON: Append mode only.

CON: Not written or tested for Debezium, but it seems it can be configured to work with it using the transform called New Record State Extraction.
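As a rough sketch of what that configuration could look like, the snippet below builds a Kafka Connect connector definition that applies Debezium's New Record State Extraction transform (class `io.debezium.transforms.ExtractNewRecordState`). The connector name, the topic and the `connector.class` placeholder are assumptions; take the real values from the sink's README, and check the transform options against the Debezium version in use.

```python
import json

# Kafka Connect sink configuration expressed as a Python dict.
# "name", "connector.class" and "topics" are placeholders for illustration.
sink_config = {
    "name": "iceberg-sink",
    "config": {
        "connector.class": "<sink connector class>",
        "topics": "dbserver1.inventory.customers",
        # Debezium's SMT that replaces the envelope with the "after" row state.
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        # Keep the operation code as an extra field (added as "__op").
        "transforms.unwrap.add.fields": "op",
    },
}

# JSON body as one would POST it to the Kafka Connect REST endpoint /connectors.
print(json.dumps(sink_config, indent=2))
```

The Debezium transform jar has to be on the Kafka Connect plugin path for this to load.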

4. Spark structured streaming

Use this for a do-it-yourself solution.

CON: You need to implement deduping etc. yourself.
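To give an idea of what "deduping yourself" involves, here is a minimal plain-Python sketch of the per-batch logic: keep only the newest change event per primary key, then apply it as an upsert or a delete. It assumes Debezium-style events, that `ts_ms` is a good enough ordering within a batch, and uses a dict as a stand-in for the target table; all function names are hypothetical.

```python
def latest_per_key(events, key="id"):
    """Keep only the newest change event per primary key within one batch."""
    latest = {}
    for event in events:
        # Deletes have no "after" image, so fall back to "before" for the key.
        row = event["after"] or event["before"]
        pk = row[key]
        # Assumption: ts_ms orders events; on ties the later arrival wins.
        if pk not in latest or event["ts_ms"] >= latest[pk]["ts_ms"]:
            latest[pk] = event
    return latest


def apply_batch(table, events, key="id"):
    """Apply deduplicated events to an in-memory stand-in for the target table."""
    for pk, event in latest_per_key(events, key).items():
        if event["op"] == "d":
            table.pop(pk, None)
        else:
            table[pk] = event["after"]
    return table


batch = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "a"}, "ts_ms": 10},
    {"op": "u", "before": {"id": 1, "name": "a"}, "after": {"id": 1, "name": "b"}, "ts_ms": 20},
    {"op": "c", "before": None, "after": {"id": 2, "name": "x"}, "ts_ms": 15},
    {"op": "d", "before": {"id": 2, "name": "x"}, "after": None, "ts_ms": 30},
]

print(apply_batch({}, batch))
```

In a real Spark job the same idea would typically live inside a `foreachBatch` function, with the final step expressed as a `MERGE INTO` against the Iceberg table.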

5. Ververica/Apache Flink

PRO: Supports sharding and deduping

6. StreamNative/Apache Pulsar