Fixed-width columns to Parquet
The original solution, parsing fixed-width column files to Parquet with Spark v2.3.2 on JDK 8, was suspiciously slow. Running the same operation with a parser based on Apache Arrow outgunned 49 Spark nodes!
A simple multi-core, schema-driven parser in Go, using Arrow 8.0.0 and its ability to write Parquet, ran 7 times faster on a single machine than a 7-node Spark cluster. That gives a ratio of 1:49. The on-prem cluster, with its name nodes, HDFS and so on, consists of 12+ machines in total and keeps one hard-working system operator fully occupied.
Java version
Spark v2.3.2 using a 10 Gb network and HDFS on SAN. In short: "not great, not terrible".
- JDK 8 (no SIMD)
- A non-performant, typical-Java-programmer implementation of the datatype lookup, built from multiple chained generics. Fancy-looking code, instead of a much less resource-hungry fixed-size map keyed by an integer.
Golang version
In short, the 1-node Arrow version fully utilized all of the CPU's resources and avoided the network and disk.
- A simplified fixed-size map keyed by integer, giving an O(1) datatypeParse(type) lookup per cell: roughly 1 nanosecond on an L1 cache hit, around 100 nanoseconds on a cache miss.
- SIMD is utilized under the hood in Arrow, for example for date parsing.