How Fast Can Java Parse 1 Billion Rows of Temperature Data?
Gunnar Morling nerd-snipes the internet into building Arrow in Java.
Gunnar Morling launched 1brc on his blog this week. 1brc, or “one billion row challenge”, is a seemingly simple task: use Java to parse 1 billion rows of temperature data and calculate the min, max, and mean for each city. The format is:
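Each line of the challenge's input pairs a station name with a single reading, separated by a semicolon (for example `Hamburg;12.0`). A minimal baseline sketch, assuming that layout, might look like the following. The class name `Baseline` and the sample rows are made up for illustration, and nothing here is tuned for speed; it is only meant to show the shape of the task before any optimization.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.DoubleSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class Baseline {
    // Groups lines shaped like "Hamburg;12.0" by station and tracks min/mean/max per station.
    static Map<String, DoubleSummaryStatistics> aggregate(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.collect(Collectors.groupingBy(
                    line -> line.substring(0, line.indexOf(';')),   // station name
                    TreeMap::new,                                   // sorted output
                    Collectors.summarizingDouble(
                            line -> Double.parseDouble(line.substring(line.indexOf(';') + 1)))));
        }
    }

    public static void main(String[] args) throws IOException {
        Path file;
        if (args.length > 0) {
            file = Path.of(args[0]);
        } else {
            // No input given: fall back to a tiny made-up sample so the sketch runs on its own.
            file = Files.createTempFile("measurements", ".txt");
            Files.writeString(file, "Hamburg;12.0\nOslo;-3.5\nHamburg;8.0\n");
        }
        aggregate(file).forEach((station, s) -> System.out.printf(
                "%s=%.1f/%.1f/%.1f%n", station, s.getMin(), s.getAverage(), s.getMax()));
    }
}
```

Every line here becomes a `String`, gets scanned twice for the separator, and goes through `Double.parseDouble`, which is exactly the kind of per-row overhead the faster entries try to avoid.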
Frederic Branczyk, founder of Polar Signals, pointed out that 1brc is re-implementing part of Arrow. But Gunnar’s goal is to get us using new Java features like SIMD, virtual threads, and ZGC. He previously shared an update on Java modernization efforts (most of which is available in Java 21):
NVMe drives are insanely fast: it takes only a couple of seconds to read the 12 gigs of data on my M2. With I/O that cheap, most of the focus has been on speeding up the text parsing.
Developers are having a lot of fun trying different strategies:
Processing the file in parallel (one chunk per core) (code)
Memory mapping the temperature file (code)
Implementing custom hash tables/data structures (code)
Using SIMD instead of .split() for text parsing (code)
Using virtual threads (code)
Lots of GC tuning (comment)
GraalVM instead of the standard JVM
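To make the first two strategies concrete, here is a rough sketch that splits the file into newline-aligned chunks, memory-maps each chunk, and parses the bytes by hand instead of calling `.split()`. It is not taken from any submission. The class name `ChunkedMmap`, the 128-byte name buffer, and the assumption that every reading has exactly one fractional digit are mine, and a plain `HashMap` stands in where the faster entries use custom hash tables. SIMD, virtual threads, and GC tuning are left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class ChunkedMmap {
    // Running aggregate for one station; readings are stored as integer tenths of a degree.
    static final class Stats {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE, count;
        long sum;
        void add(int t) { min = Math.min(min, t); max = Math.max(max, t); sum += t; count++; }
        void merge(Stats o) {
            min = Math.min(min, o.min); max = Math.max(max, o.max); sum += o.sum; count += o.count;
        }
        double mean() { return sum / 10.0 / count; }
    }

    static Map<String, Stats> aggregate(Path file, int chunks) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            // Cut the file into roughly equal ranges, each ending just after a newline.
            List<long[]> ranges = new ArrayList<>();
            long start = 0;
            for (int i = 1; i <= chunks && start < size; i++) {
                long target = (i == chunks) ? size : size / chunks * i;
                long end = alignToNewline(ch, Math.max(target, start), size);
                if (end > start) ranges.add(new long[] {start, end});
                start = end;
            }
            // Each range is parsed independently; partial results are merged afterwards.
            List<Map<String, Stats>> partials = ranges.parallelStream()
                    .map(r -> processChunk(ch, r[0], r[1]))
                    .collect(Collectors.toList());
            Map<String, Stats> total = new TreeMap<>();
            for (Map<String, Stats> part : partials) {
                part.forEach((k, v) -> total.merge(k, v, (a, b) -> { a.merge(b); return a; }));
            }
            return total;
        }
    }

    // Moves pos forward to the byte just after the next '\n' (or to end of file).
    static long alignToNewline(FileChannel ch, long pos, long size) throws IOException {
        ByteBuffer one = ByteBuffer.allocate(1);
        while (pos < size) {
            one.clear();
            ch.read(one, pos++);
            if (one.get(0) == '\n') break;
        }
        return pos;
    }

    static Map<String, Stats> processChunk(FileChannel ch, long start, long end) {
        try {
            // A single mapping is limited to 2 GB, so very large files need enough chunks.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, start, end - start);
            Map<String, Stats> local = new HashMap<>();
            byte[] name = new byte[128]; // assumed upper bound on station name length
            while (buf.hasRemaining()) {
                int n = 0;
                byte b;
                while ((b = buf.get()) != ';') name[n++] = b;
                boolean neg = false;
                int t = 0; // assumes exactly one fractional digit, e.g. "-3.5" becomes -35
                while (buf.hasRemaining() && (b = buf.get()) != '\n') {
                    if (b == '-') neg = true;
                    else if (b >= '0' && b <= '9') t = t * 10 + (b - '0');
                }
                String station = new String(name, 0, n, StandardCharsets.UTF_8);
                local.computeIfAbsent(station, k -> new Stats()).add(neg ? -t : t);
            }
            return local;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file;
        if (args.length > 0) {
            file = Path.of(args[0]);
        } else {
            // No input given: fall back to a tiny made-up sample so the sketch runs on its own.
            file = Files.createTempFile("measurements", ".txt");
            Files.writeString(file, "Hamburg;12.0\nOslo;-3.5\nHamburg;8.0\nOslo;1.5\n");
        }
        aggregate(file, Runtime.getRuntime().availableProcessors()).forEach((station, s) ->
                System.out.printf("%s=%.1f/%.1f/%.1f%n", station, s.min / 10.0, s.mean(), s.max / 10.0));
    }
}
```

Keeping temperatures as integer tenths sidesteps `Double.parseDouble` entirely, and giving each chunk its own map means the worker threads never contend on shared state until the final merge. The sketch still allocates a `String` per row, which is one of the costs a custom hash table keyed on raw bytes would remove.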
I was initially drawn to the project because of the opportunity to try out SIMD, something I’ve written about before. My own implementation clocks in at a measly 53 seconds. I never got around to adding SIMD.
Gunnar opened up a Show and tell section for those wishing to participate in other languages.
I occasionally invest in infrastructure startups. Companies that I’ve invested in are marked with a [$] in this newsletter. See my LinkedIn profile for a complete list.