
Kafka Delta Ingest

A first look at a Delta sink meant to transfer data from Kafka to Delta Lake. It is implemented in Rust on top of the latest Apache Arrow (Parquet) and delta.rs, which makes for an efficient, robust ingest solution with no dependency on a Spark cluster or the JVM.

Questions and answers:

  • Q: Does it run out of the box?
  • A: Yes!

  • Q: Are updates handled/configurable?
  • A: Ongoing work...

  • Q: Can this Delta sink operate in parallel with a Spark cluster accessing the same data?
  • A: Ongoing work...

Does it run out of the box?

Make sure Git LFS is installed for your Git client before cloning, since the repo contains large test data files. On Ubuntu:

sudo apt-get install git-lfs

Clone the repo, extract the test data, and fire up a local S3/Kafka stack:

git clone https://github.com/delta-io/kafka-delta-ingest
cd kafka-delta-ingest
./bin/extract-example-json.sh
sudo docker-compose up

From a separate shell, start the Delta sink itself. I copied the example start script from the repo, named it start_delta_sink.sh, and placed it in the root of the cloned repo.

sh start_delta_sink.sh
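For reference, here is a sketch of what such a start script can look like, based on the example in the repo's README. The localstack endpoint, dummy credentials, and flags below are assumptions; check the current README for the exact invocation (it also includes `--transform` flags omitted here).

```shell
#!/bin/bash
# Sketch of start_delta_sink.sh -- values are assumptions, not verified.
# Point the AWS SDK at the local S3 (localstack) started by docker-compose:
export AWS_ENDPOINT_URL=http://0.0.0.0:4566
export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test

# Consume the web_requests topic and write a Delta table under
# ./tests/data/web_requests (see the repo README for transform flags):
RUST_LOG=info cargo run ingest web_requests ./tests/data/web_requests \
  --allowed_latency 60 \
  --app_id web_requests
```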

Ingest JSON lines into Kafka from a separate shell:

/opt/kafka_2.13-3.2.1/bin/kafka-console-producer.sh --bootstrap-server=localhost:9092 --topic web_requests < tests/json/web_requests-100K.json
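The console producer sends one Kafka message per input line, so the test file is in JSON Lines format: one complete JSON object per line. A minimal sketch of that format (the field names here are illustrative, not the real test schema):

```shell
# Write two illustrative JSON Lines records (hypothetical fields):
cat <<'EOF' > /tmp/sample_web_requests.json
{"method": "GET", "status": 200, "url": "/"}
{"method": "GET", "status": 404, "url": "/missing"}
EOF
# Each line would become one Kafka message; count them:
wc -l < /tmp/sample_web_requests.json
```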

Finally, verify that the sink has written files to the Delta table location:

ls -l tests/data/web_requests/
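The listing should reveal the standard Delta table layout: parquet data files next to a _delta_log/ directory holding ordered JSON commit files. A throwaway mock of that layout, for illustration only (file names are hypothetical):

```shell
# Build a mock Delta table directory in /tmp to show the expected
# shape; the real table lives in tests/data/web_requests/.
mkdir -p /tmp/web_requests/_delta_log
touch /tmp/web_requests/_delta_log/00000000000000000000.json  # first commit
touch /tmp/web_requests/part-00000-example.snappy.parquet     # data file
ls -R /tmp/web_requests
```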

