
Brighter future

New compute engines are being implemented in efficient languages, and functionality is being pushed down into hardware by organizations like SNIA/CMSI, aimed at computational storage devices. Compare this to Hadoop, an early compute & storage stack implemented in interpreted and JVM languages. Apache Spark, which evolved from Hadoop, fares much better and is getting substantial improvements from Delta/Iceberg and acceleration from Arrow/Pandas.

New compute engines

Apache Arrow is the base project for the table format/datatypes used in the emerging Rust-based compute engine DataFusion and its distributed engine Ballista. Apache Arrow, with its JVM-free implementation, squeezes out every last drop of your CPU/cores and accelerates every engine that utilizes it. In a recent customer project parsing a fixed-column file to Parquet, the Arrow-based solution on one node outgunned 49 Spark workers in throughput. For those wondering what to do with a 49 times faster ingest on a simplified infrastructure, see my upcoming posts. More, and better explained, Apache Arrow performance gains can be found in the excellent book "In-Memory Analytics with Apache Arrow" by Matthew Topol.
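
To give a feel for how lightweight this is, here is a minimal sketch of querying a Parquet file with DataFusion from Rust. The file path and table name are hypothetical examples; this assumes the `datafusion` and `tokio` crates.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // An in-process query engine: no cluster, no JVM.
    let ctx = SessionContext::new();

    // Register a (hypothetical) Parquet file as a SQL table.
    ctx.register_parquet("events", "data/events.parquet", ParquetReadOptions::default())
        .await?;

    // Plain SQL, executed vectorized over Arrow record batches.
    let df = ctx.sql("SELECT count(*) FROM events").await?;
    df.show().await?;
    Ok(())
}
```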

Functionality pushed down into hardware

An example of software functionality being pushed down into hardware is key/value storage, which is now part of the NVMe 2.0 spec. Many use cases exist for this, even outside the obvious key/value databases. The organizations driving this are best presented by themselves, from their website:

"The member companies of the SNIA Compute, Memory, and Storage Initiative (CMSI) support the industry drive to combine processing with memory and storage, and to create new compute architectures and software to analyze and exploit the explosion of data creation over the next decade."

For those not buying a datacenter with upcoming computational storage devices, this can't be tried at home yet. I myself ordered the just-released NVMe 2.0 enabled Samsung 990 Pro SSD, but have learned that it doesn't implement the optional KV Command Set.

Additionally, existing software implementations become a bottleneck when new hardware emerges. The excellent and well-written xNVMe library/tool works out of the box and shows how to bypass existing Linux drivers and talk directly to NVMe SSDs up to NVMe v1.4. The main author, Simon Lund, explains this very well in his presentation "SDC2020: xNVMe Programming Emerging Storage Interfaces for Productivity and Performance": https://www.youtube.com/watch?v=80mokqrjyxs

Effective languages

A language like Rust allows deterministic applications (no GC pauses) and comes with functional out-of-the-box tooling, without the overwhelming number of options of C/C++. Compared to Go, Rust is tedious (at least for a beginner like me) but has very good compiler hints. I suspect this combination makes Rust the choice for the new compute engines mentioned above.
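
The "deterministic" part is easy to demonstrate. A minimal sketch: in Rust a resource is released at a known point in the code (end of scope), not whenever a garbage collector decides to pause the world.

```rust
struct Buffer(Vec<u8>);

impl Drop for Buffer {
    fn drop(&mut self) {
        // Runs at a deterministic point: when the value leaves scope.
        println!("freeing {} bytes", self.0.len());
    }
}

fn main() {
    {
        let b = Buffer(vec![0u8; 1024]);
        println!("using {} bytes", b.0.len());
    } // `b` is dropped exactly here; no GC pause, no finalizer thread.
    println!("buffer is already gone");
}
```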

Transactional

Delta and Apache Iceberg compete at fixing all that Apache Hive cannot, driving lakes to become much more performant, robust, and feature rich, with time traveling, snapshots, table transactions, row updates, etc. A quick pros and cons of Delta vs Iceberg:

  • Good: Iceberg is totally free and also supports the slightly less JVM-dependent/damaged engine Flink.
  • Bad: Iceberg is totally JVM based, with no Rust in sight except an early plan/prototype, iceberg-rs. Iceberg would benefit from starting to use Apache Arrow as a base.
  • Good: Delta has its Rust initiative called delta-rs, and since yesterday (2022-09-02) Arrow/DataFusion can read Delta lakes, both from remote stores like S3 and from local files (see the sketch after this list). In theory this enables JVM-free Delta lake storage and processing. A slight downside is the ML library and other functional libraries of Apache Spark, hindering fully JVM-free processing of a Delta lake.
  • Bad: Delta was/is a bit less "free" than Iceberg, and vendors only support the non-free versions in the cloud, making it a bit isolated on-prem for non-technical organizations.
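
A minimal sketch of that JVM-free combination, assuming the `deltalake` crate (delta-rs) built with its `datafusion` feature; the table path and query are hypothetical examples:

```rust
use std::sync::Arc;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a (hypothetical) local Delta table; an s3:// URI also works
    // when the crate's S3 feature is enabled.
    let table = deltalake::open_table("./data/my_delta_table").await?;

    // DeltaTable implements DataFusion's TableProvider, so it can be
    // queried like any other table, entirely without a JVM.
    let ctx = SessionContext::new();
    ctx.register_table("demo", Arc::new(table))?;

    let df = ctx.sql("SELECT count(*) FROM demo").await?;
    df.show().await?;
    Ok(())
}
```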

Cross-table Transactional

Currently (2022-09-03) neither Delta nor Iceberg implements cross-table transactions. This is mitigated with the "Nessie" catalog/metadata handler, which is called "a git-like experience" for data lakes, supporting not only Spark but also Flink etc. Nessie can use RocksDB as a backend, as well as Spark 3.2.1 for checkpoints. There exist multiple initiatives around RocksDB, as well as an "API compatible" Speedb claiming to be magnitudes faster (I have not seen this!). With "slight" modification, RocksDB could benefit from NVMe 2.0 and its hardware-based key/value API. One such initiative is https://github.com/OpenMPDK/KVRocks. At the time of writing I don't know of a client SSD implementing the KV Command Set.
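
For context, this is the API shape that a KV Command Set SSD would take over from software: put/get/delete on byte keys and values. A minimal sketch using the `rocksdb` crate; the database path and keys are hypothetical examples.

```rust
use rocksdb::{Options, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut opts = Options::default();
    opts.create_if_missing(true);

    // Open a (hypothetical) local database directory.
    let db = DB::open(&opts, "/tmp/kv-demo")?;

    // The whole contract is byte keys and byte values; this is what
    // the NVMe KV Command Set exposes directly in hardware.
    db.put(b"user:42", b"alice")?;
    if let Some(value) = db.get(b"user:42")? {
        println!("user:42 -> {}", String::from_utf8_lossy(&value));
    }
    db.delete(b"user:42")?;
    Ok(())
}
```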

Grinding Halt

Hive & Spark 2.0 are beyond end of life as a base for new solutions. Multiple migration paths to Delta/Iceberg exist, and one could be as "simple" as using Kafka connectors/Spark streaming to transfer the data, or keeping the existing Hive as a data source for Delta/Iceberg-enabled Spark.

What did I leave out

A zillion pieces of software exist, spanning emerging, still-functional, and dying initiatives, like:

  • Apache Hudi?
  • Apache Impala, JVM-free and improving.
  • Apache Crail, which looked promising as an RDMA-based HDFS drop-in, but has retired.

Planned follow-up posts:

  1. Path/system map for the three combinations:
  • Companies going for a pure on-prem lake (yes, do it!)
  • Companies going for a hybrid cloud and on-prem lake
  • Companies going pure cloud

  2. How to coexist/share data in a "hybrid cloud - on-prem" setup:

  • Delta Sharing
  • Rclone


References