50 shades of Iceberg CDC
There is a plethora of ways to transfer data from common relational SQL databases (source) to Apache Iceberg (destination). This post covers only 100% open-source software, with a few key points:
- Debezium parses the database's binlog in every variant, so none of the alternatives is JVM-free, and each leaves a small footprint on the SQL server machine.
- CDC levels are raw, historization, per-table transactional, and cross-table transactional.
- Database flavours are MySQL/MariaDB/Oracle/Microsoft SQL Server, in short everything Debezium supports.
- Components that may be involved are Kafka/Pulsar/Flink.
This post is not about the just-released Iceberg 1.2.0 CDC feature, where Iceberg reports changes through Spark procedures. The change metadata Iceberg emits is useful and deserves a post of its own.
The following picture summarizes the deployment variants and how they execute.
Below are the software, and the projects/companies behind them, involved in "parsing" the binlog and "moving" the data:
0. Debezium
Parses the binlog at the source database server.
1. memiiso/debezium-server-iceberg
The first connector, used as a base and inspiration for the other connectors. It runs without Kafka and saves to Iceberg directly.
PRO: no Kafka needed
CON: No automatic schema evolution
2. Getindata/kafka-connect-iceberg-sink
Based on memiiso's connector, it introduces Kafka as the source, which makes it a sink connector.
PRO: Enables Kafka
PRO: Supports creation of new tables and extending them with new columns.
3. 10xfuturetechnologies/kafka-connect-iceberg
Based on Getindata's connector; it seems to be an open-sourced variant of what began as a customer project, which is good.
PRO: Supports schema evolution and table auto-creation.
CON: Append mode only.
CON: Not written or tested for Debezium, but it seems it can be configured to work with it using the transform called New Record State Extraction.
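To illustrate what the New Record State Extraction transform mentioned above does: Debezium wraps each row change in an envelope with `before`, `after`, `op` and `ts_ms` fields, and the transform (class `io.debezium.transforms.ExtractNewRecordState`) flattens that envelope down to the new row state. The Python below is only a simplified model of that behaviour, not the transform's actual code. The function name `unwrap` and the sample events are made up for illustration, and dropping deletes is an assumption of this sketch (the real transform makes delete handling configurable).

```python
from typing import Any, Dict, Optional


def unwrap(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Simplified model: reduce a Debezium change-event envelope to the new row state."""
    op = event.get("op")
    # "c" = create, "u" = update, "r" = snapshot read: all carry the new row in "after"
    if op in ("c", "u", "r"):
        return event["after"]
    # Deletes ("d") have no new row state; this sketch simply drops them (assumption)
    return None


# Hypothetical sample events shaped like Debezium envelopes
sample_update = {
    "before": {"id": 1, "name": "old"},
    "after": {"id": 1, "name": "new"},
    "op": "u",
    "ts_ms": 1700000000000,
}
sample_delete = {
    "before": {"id": 1, "name": "new"},
    "after": None,
    "op": "d",
    "ts_ms": 1700000001000,
}

print(unwrap(sample_update))
print(unwrap(sample_delete))
```

A sink that only understands flat rows, such as an append-only connector, would then receive the unwrapped row instead of the full envelope.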
4. Spark structured streaming
Use this for a do-it-yourself solution.
CON: You need to implement deduping etc. yourself.
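As a rough idea of what the deduping above involves: a CDC stream can carry several events for the same primary key, and the destination should end up with only the latest state per key. The sketch below shows that logic in plain Python rather than Spark code. The event shape (`key`, `seq`, `op`, `row`) and the function name `latest_state` are hypothetical, and `seq` stands in for whatever ordering column (binlog position, timestamp) the real stream provides.

```python
from typing import Any, Dict, Iterable


def latest_state(events: Iterable[Dict[str, Any]]) -> Dict[Any, Dict[str, Any]]:
    """Keep the most recent event per key, then drop keys whose last event is a delete."""
    newest: Dict[Any, Dict[str, Any]] = {}
    for ev in events:
        current = newest.get(ev["key"])
        # Take the event if the key is new or the event is at least as recent;
        # on equal seq (duplicate delivery) the later copy wins
        if current is None or ev["seq"] >= current["seq"]:
            newest[ev["key"]] = ev
    # A key whose latest event is a delete ("d") must not appear in the result
    return {k: ev["row"] for k, ev in newest.items() if ev["op"] != "d"}


# Hypothetical sample stream with a duplicate delivery and a delete
events = [
    {"key": 1, "seq": 1, "op": "c", "row": {"id": 1, "name": "a"}},
    {"key": 2, "seq": 2, "op": "c", "row": {"id": 2, "name": "b"}},
    {"key": 1, "seq": 3, "op": "u", "row": {"id": 1, "name": "a2"}},
    {"key": 1, "seq": 3, "op": "u", "row": {"id": 1, "name": "a2"}},
    {"key": 2, "seq": 4, "op": "d", "row": None},
]

print(latest_state(events))
```

In a streaming job the same reduction would typically be applied to each micro-batch before the result is merged into the Iceberg table; how that merge is done is left open here.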
5. Ververica/Apache Flink
PRO: Supports sharding and deduping.