1 min read

Fixed column to parquet


The original solution, parsing fixed-column-width files to Parquet with Spark v2.3.2 on JDK 8, was suspiciously slow. Trying the same operation with a parser based on Apache Arrow outgunned 49 Spark nodes!

A simple multi-core, schema-driven parser in Go, using Arrow v8.0.0 and its ability to write Parquet, ran on a single machine and performed seven times faster than a 7-node Spark cluster, an effective per-node ratio of 1:49. The on-prem cluster, with its name nodes, HDFS, and so on, consists of 12+ machines in total and keeps one hard-working system operator fully occupied.
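To make the approach concrete, here is a minimal sketch of a schema-driven fixed-width parse that writes Parquet through Arrow's Go v8 module. The schema, the field widths, the sample lines, and the output file name are all illustrative assumptions, not the actual code; the full implementation is linked below.

```go
package main

import (
	"os"
	"strconv"
	"strings"

	"github.com/apache/arrow/go/v8/arrow"
	"github.com/apache/arrow/go/v8/arrow/array"
	"github.com/apache/arrow/go/v8/arrow/memory"
	"github.com/apache/arrow/go/v8/parquet"
	"github.com/apache/arrow/go/v8/parquet/pqarrow"
)

func main() {
	// Illustrative schema for a fixed-width record: a 6-char id, then a 10-char name.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "id", Type: arrow.PrimitiveTypes.Int64},
		{Name: "name", Type: arrow.BinaryTypes.String},
	}, nil)

	b := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
	defer b.Release()

	// Each line is sliced by byte offset according to the schema's column widths.
	for _, line := range []string{
		"000001alice     ",
		"000002bob       ",
	} {
		id, _ := strconv.ParseInt(strings.TrimSpace(line[0:6]), 10, 64)
		b.Field(0).(*array.Int64Builder).Append(id)
		b.Field(1).(*array.StringBuilder).Append(strings.TrimSpace(line[6:16]))
	}

	rec := b.NewRecord()
	defer rec.Release()
	tbl := array.NewTableFromRecords(schema, []arrow.Record{rec})
	defer tbl.Release()

	f, err := os.Create("out.parquet")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Write the Arrow table out as a single Parquet row group.
	if err := pqarrow.WriteTable(tbl, f, rec.NumRows(),
		parquet.NewWriterProperties(), pqarrow.DefaultWriterProps()); err != nil {
		panic(err)
	}
}
```

In the real parser each core would run a loop like this over its own chunk of the input file, which is what makes the approach scale across cores without any cluster coordination.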

GitHub code here

Java version

Spark v2.3.2 using a 10 Gb network and HDFS on a SAN. In short: "not great, not terrible".

  • JDK 8 (no SIMD support)
  • A non-performant, typical-Java-style implementation using multiple chained generics for datatype lookup: fancy-looking code instead of a far less resource-hungry fixed-size map keyed by an integer.

Golang version

In short, the 1-node Arrow version fully utilizes all of the CPU's resources and avoids the network and disk.

  • A simplified fixed-size map keyed by an integer, giving an O(1) datatypeParse(type) lookup per cell: about 1 ns on an L1 cache hit, or around 100 ns on a cache miss (see the sketch after this list).
  • SIMD is utilized under the hood by Arrow, for example for date parsing.
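As a sketch of the lookup idea, the dispatch table below indexes a fixed-size array by an integer type ID, so each cell costs one indexed load plus the parse itself. The type IDs, the parser set, and the datatypeParse signature here are assumptions made for illustration; only the integer-keyed, fixed-size table is the point.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// Illustrative integer type IDs; the real schema would define these.
const (
	typeInt64 = iota
	typeFloat64
	typeDate
	numTypes
)

// parseFunc converts one raw cell to its typed value.
type parseFunc func(s string) (interface{}, error)

// A fixed-size array keyed by the integer type ID: the lookup is a single
// indexed load, O(1) per cell, and the small table stays hot in L1 cache.
var parsers = [numTypes]parseFunc{
	typeInt64:   func(s string) (interface{}, error) { return strconv.ParseInt(s, 10, 64) },
	typeFloat64: func(s string) (interface{}, error) { return strconv.ParseFloat(s, 64) },
	typeDate:    func(s string) (interface{}, error) { return time.Parse("2006-01-02", s) },
}

// datatypeParse dispatches on the type ID instead of walking chained generics.
func datatypeParse(typeID int, cell string) (interface{}, error) {
	return parsers[typeID](cell)
}

func main() {
	v, _ := datatypeParse(typeDate, "2022-05-01")
	fmt.Println(v)
}
```

Compared with a chained-generics lookup, there is no type-erasure indirection or hashing on the hot path, which is where the per-cell nanoseconds go when parsing billions of cells.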

References