One Billion Row Challenge
Published on 17.01.2026
Before I get started: this is my own take on the original Java challenge. It's not the best solution out there, but it's the result of 10-12 hours of fun non-vibe coding.

Problem

Given a text file with 1,000,000,000 rows of weather measurements in the form <station_name>;<temperature>, compute per-station min, mean, and max temperatures as fast as possible, and print the result as a dictionary:

{StationA=min/mean/max, StationB=min/mean/max, ...}

Instructions to generate the dataset are on the original repo.

Rules (roughly)

Constraints
- No external libraries
- Single source file
- Computation must happen at runtime

Input format & limits
- Station name: non-null UTF-8 string, 1-100 bytes, no `;` or `\n`
- Temperature: non-null double in [-99.9, 99.9], exactly one fractional digit
- Max 10,000 unique station names
- Line endings are `\n` on all platforms
- Must support any valid station name and any data distribution

So basically it's an aggregation of data points, much like what an OLAP database does. OK, so how do you even approach a problem like this? With the obvious naive approach, of course!

Naive v0

The first version of the code took around 142 seconds. It was a mess, but it worked and that's how it should be for the first version. Pseudocode:

for line in file:
    city, temp = line.split(";")
    stats[city].update(min, mean, max)
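For reference, a minimal runnable version of that naive approach in Rust might look like this (reading from any `BufRead` so it works on a string as well as the real file; names and structure here are illustrative, not the actual v0):

```rust
use std::collections::HashMap;
use std::io::BufRead;

// Running (min, sum, count, max) per station, all in plain f64.
struct Stats {
    min: f64,
    sum: f64,
    count: u64,
    max: f64,
}

fn naive_aggregate(input: impl BufRead) -> HashMap<String, Stats> {
    let mut stats: HashMap<String, Stats> = HashMap::new();
    for line in input.lines() {
        let line = line.unwrap();
        let (city, temp) = line.split_once(';').unwrap();
        let temp: f64 = temp.parse().unwrap();
        stats
            .entry(city.to_string())
            .and_modify(|s| {
                s.min = s.min.min(temp);
                s.max = s.max.max(temp);
                s.sum += temp;
                s.count += 1;
            })
            .or_insert(Stats { min: temp, sum: temp, count: 1, max: temp });
    }
    stats
}
```

Every line allocates a `String` for the city, hashes it with the default hasher, and parses the temperature with the general-purpose float parser - all of which get optimized away below.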

Now that you have something that works, you can start to optimize it. Meta steps are pretty simple:

1. Profile and measure
2. Optimize
3. Repeat

I was on a Linux machine and mostly used perf report to measure performance. It has a nice TUI showing how much CPU time is spent in each function, plus an assembly view with the source interleaved, so you can see exactly what Rust compiles down to and tweak things.

For benchmarking, I ran each version 5 times, removed the fastest and slowest runs, and averaged the remaining 3. All times mentioned below follow this approach.

oh i use(d) rust btw ;p

I'm still learning Rust, so I'm not sure whether it's decent code or not - feel free to judge.

mmap

The first optimization I tried was mmap-ing the file into memory. Instead of reading the full file into a buffer, we map it into virtual memory and let the OS handle page caching and faulting pages in on demand.

#[link(name = "c")]
unsafe extern "C" {
    fn mmap(
        addr: *mut std::ffi::c_void,
        len: usize,
        prot: i32,
        flags: i32,
        fd: i32,
        offset: i64,
    ) -> *mut std::ffi::c_void;
}
Since we can't use external libraries, we go through FFI and declare libc's mmap ourselves.

const PROT_READ: i32 = 0x1;
const MAP_PRIVATE: i32 = 0x2;

unsafe {
    // map file into virtual memory
    let ptr = mmap(
        ptr::null_mut(),        // let kernel choose address
        len,                    // map entire file
        PROT_READ,              // read-only
        MAP_PRIVATE,            // private copy-on-write
        file.as_raw_fd(),       // file descriptor
        0,                      // offset from start
    );

    // convert raw pointer to rust slice for easy access
    let data = slice::from_raw_parts(ptr as *const u8, len);

    // work with data...
}

And I kinda skipped calling munmap to avoid an extra syscall. Life of a performance hacker… 142s to 94s.

Fast Hashmap

I used HashMap from the std library, but it was slow, so I wrote a tiny custom hasher. Rust's default hasher (SipHash) is designed to resist hash-flooding attacks, which is overkill for this use case.

I went with FNV-1a but with a twist - sample every alternate letter instead of hashing the entire city name. City names have enough entropy in alternate letters, and testing showed this was faster.

struct FastHasher(u64);
impl Hasher for FastHasher {
    fn write(&mut self, bytes: &[u8]) {
        let mut hash: u64 = 0xcbf29ce484222325;
        let mut i = 0;
        while i < bytes.len() {
            hash = (hash ^ bytes[i] as u64).wrapping_mul(0x100000001b3);
            i += 2;  // skip every other byte
        }
        self.0 = hash;
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

Collision probability for city names with this approach is around 0.004% from quick napkin math on my dataset. Good enough for the dataset I generated (kinda cheating). I also pre-allocate the hashmap's capacity.
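Wiring the custom hasher into a map goes through BuildHasherDefault; a self-contained sketch (the hasher body is repeated so the snippet compiles on its own, and the `FastMap`/`make_map` names are just for illustration):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

#[derive(Default)]
struct FastHasher(u64);

impl Hasher for FastHasher {
    fn write(&mut self, bytes: &[u8]) {
        // FNV-1a over every other byte, same as above.
        // Each call recomputes from scratch; for byte-slice keys the
        // last write is the name bytes, which is what we want here.
        let mut hash: u64 = 0xcbf29ce484222325;
        let mut i = 0;
        while i < bytes.len() {
            hash = (hash ^ bytes[i] as u64).wrapping_mul(0x100000001b3);
            i += 2;
        }
        self.0 = hash;
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

// Map keyed by raw city-name bytes, pre-sized for the
// 10,000-station maximum from the rules.
type FastMap<V> = HashMap<Vec<u8>, V, BuildHasherDefault<FastHasher>>;

fn make_map<V>() -> FastMap<V> {
    FastMap::with_capacity_and_hasher(10_000, BuildHasherDefault::default())
}
```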

Raw bytes

At this point, the code looked like this:

let data = slice::from_raw_parts(ptr as *const u8, len);
let file_str = std::str::from_utf8(data).unwrap();

for line in file_str.lines() {
    let mut parts = line.split(';');
    let city = parts.next().unwrap().to_string();
    let value = parts.next().unwrap().parse::<f32>().unwrap();
    // ... update stats
}

Converting the entire mmap'd data to a UTF-8 string upfront is wasteful. We're validating UTF-8 for the entire file and allocating strings for every city name. Instead, work directly with raw bytes.

let data = slice::from_raw_parts(ptr as *const u8, len);
let mut file_start = 0;

for i in 0..data.len() {
    if data[i] == b'\n' {
        let line = &data[file_start..i];

        let semi_index = line.iter().position(|&b| b == b';').unwrap();
        let city = &line[..semi_index];  // &[u8] slice, no allocation
        let value = std::str::from_utf8(&line[semi_index + 1..])
            .unwrap()
            .parse::<f32>()
            .unwrap();

        file_start = i + 1;
    }
}

This went from 87s to 52s. We only validate UTF-8 for the temperature values (which we need to parse), not the entire file or city names.

Temperature parsing

Still doing std::str::from_utf8().parse::<f32>() for temperatures. Rust's f32 parser is general purpose and handles scientific notation, infinity, NaN, etc. We don't need any of that - our temps are simple: -12.3 or 45.6.

Started with a hand-rolled parser using loops. But we know the exact format: (-)(I)I.F where I is an integer digit and F is the fractional digit. This is basically a deterministic state machine with only a few cases.

After a few iterations, the final version has no loops, just straight-line code:

fn fast_parse_f32_from_bytes(bytes: &[u8]) -> f32 {
    let neg = bytes[0] == b'-';
    let mut i = neg as usize;

    // integer part (1 or 2 digits)
    let mut result = (bytes[i] - b'0') as f32;
    i += 1;

    if bytes[i] != b'.' {
        result = result * 10.0 + (bytes[i] - b'0') as f32;
        i += 1;
    }

    i += 1; // skip '.'

    // fractional part (always 1 digit)
    result += (bytes[i] - b'0') as f32 * 0.1;

    if neg { -result } else { result }
}

52s to 45s initially, then further optimized down to 31.4s.

Chunked processing

Instead of scanning the entire file byte by byte, process it in 8MB chunks. This improves cache locality and sets up the foundation for threading later.

for chunk in data.chunks(8 * 1024 * 1024) {
    // find first newline so we don't start mid-line
    // (the partial line at each chunk edge is carried over in the full version)
    let first_newline = chunk.iter().position(|&b| b == b'\n');
    let mut chunk_start = first_newline.map(|pos| pos + 1).unwrap_or(0);

    for i in chunk_start..chunk.len() {
        if chunk[i] == b'\n' {
            let line = &chunk[chunk_start..i];
            // process line...
            chunk_start = i + 1;
        }
    }
}

The key trick is finding the first newline in each chunk to avoid splitting lines across chunk boundaries. 

I tried a few different chunk sizes on different machines. 8MB gave the best results for most cases. Later bumped to 16MB after more testing. 45s to 38.6s.

SWAR byte matching

Instead of scanning byte by byte for newlines and semicolons, process 8 bytes at a time using SWAR (SIMD Within A Register). It's a bitwise trick to find a target byte in a u64 without loops.

The idea: replicate the target byte across all 8 bytes, XOR with input to find matches, then use arithmetic to detect zero bytes (matches).

// replicate byte across u64
let pattern = (byte as u64).wrapping_mul(0x0101010101010101);

// XOR makes matching bytes zero
let xor_result = val ^ pattern;

// arithmetic trick detects zero bytes
let match_mask = xor_result.wrapping_sub(0x0101010101010101) 
                    & !xor_result & 0x8080808080808080;

// find position using trailing_zeros
(match_mask != 0).then_some((match_mask.trailing_zeros() >> 3) as usize)

If we're looking for \n (0x0a), multiply by 0x0101010101010101 to get 0x0a0a0a0a0a0a0a0a. XOR with input - matching bytes become 0x00. The subtraction trick detects these zero bytes and sets the high bit. The final mask has 0x80 at each matching byte position. We then use trailing_zeros() >> 3 to find the byte position (shifting right by 3 divides bit position by 8 to get byte offset).
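Wrapped up as a self-contained function (a sketch; `u64::from_le_bytes` gives the little-endian load that makes trailing_zeros map directly to byte offsets):

```rust
// Find the first occurrence of `needle` in an 8-byte word that was
// loaded little-endian, returning its byte offset.
fn swar_find(word: u64, needle: u8) -> Option<usize> {
    // replicate the needle across all 8 bytes
    let pattern = (needle as u64).wrapping_mul(0x0101010101010101);
    // XOR makes matching bytes zero
    let xor_result = word ^ pattern;
    // arithmetic trick: 0x80 lands in each byte that was zero
    let match_mask = xor_result.wrapping_sub(0x0101010101010101)
        & !xor_result
        & 0x8080808080808080;
    // lowest set bit / 8 = byte offset of the first match
    (match_mask != 0).then_some((match_mask.trailing_zeros() >> 3) as usize)
}
```

Bytes past the first zero can produce false positives from borrow propagation, but `trailing_zeros` always picks the first (correct) match.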

AVX-512 SIMD

SWAR processes 8 bytes at a time. AVX-512 can process 64 bytes at a time using special CPU instructions.

The trick is to load 64 bytes into a 512-bit register, compare all bytes in parallel with _mm512_cmpeq_epi8_mask, get a bitmask showing which bytes matched. Unroll this 4 times to scan 256 bytes per loop iteration.

let pattern = _mm512_set1_epi8(needle as i8);

while i + 256 <= end {
    // load 4 chunks of 64 bytes each
    let chunk0 = _mm512_loadu_si512(...);
    let chunk1 = _mm512_loadu_si512(...);
    let chunk2 = _mm512_loadu_si512(...);
    let chunk3 = _mm512_loadu_si512(...);

    // compare all 64 bytes at once, get bitmask
    let mask0 = _mm512_cmpeq_epi8_mask(chunk0, pattern);
    if mask0 != 0 {
        return Some(i + mask0.trailing_zeros() as usize);
    }
    // ... check other chunks
    
    i += 256;
}

// fallback: SWAR for remaining, then byte by byte

For the remaining bytes that don't fit in a 256-byte block, fall back to SWAR (8 bytes at a time), then byte by byte for the last 1-7 bytes. Instead of checking every byte individually, we check 256 bytes at once.

Threading

Time to throw cores at the problem. Split the file into segments (one per CPU core), process each segment in parallel, then merge the results.

let num_threads = std::thread::available_parallelism()
    .map(|n| n.get())
    .unwrap_or(1);

// pre-compute boundaries aligned to newlines
let mut segment_boundaries = vec![0];
for thread_id in 1..num_threads {
    let target_pos = thread_id * (data_len / num_threads);
    let boundary = find_previous_newline(data, 0, target_pos);
    segment_boundaries.push(boundary + 1);
}

let handles: Vec<_> = (0..num_threads)
    .map(|thread_id| {
        thread::spawn(move || {
            // each thread aggregates its own segment into a local map
            let mut local_stats = HashMap::new();
            // process data[segment_boundaries[thread_id]..] ...
            local_stats
        })
    })
    .collect();

// merge results from all threads
for handle in handles {
    let local_stats = handle.join().unwrap();
    // fold local_stats into the global map...
}

The key insight is to precompute boundaries aligned to newlines so threads don't split lines. Each thread builds its own hashmap, then we merge at the end. 31.4s to 27.25s.
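The whole split/process/merge shape can be sketched compactly with scoped threads (a simplified, safe version: string-based parsing instead of the byte-level fast paths, and the boundary snap done inline rather than through a helper):

```rust
use std::collections::HashMap;
use std::thread;

#[derive(Clone, Copy)]
struct Stats { min: i16, max: i16, sum: i64, count: u32 }

// Aggregate "name;T.F" lines from one segment into a local map,
// keyed by the raw name bytes (temps stored as tenths).
fn process_segment(seg: &[u8]) -> HashMap<&[u8], Stats> {
    let mut local = HashMap::new();
    for line in seg.split(|&b| b == b'\n').filter(|l| !l.is_empty()) {
        let semi = line.iter().position(|&b| b == b';').unwrap();
        let temp: f32 = std::str::from_utf8(&line[semi + 1..]).unwrap().parse().unwrap();
        let v = (temp * 10.0).round() as i16;
        let e = local
            .entry(&line[..semi])
            .or_insert(Stats { min: v, max: v, sum: 0, count: 0 });
        e.min = e.min.min(v);
        e.max = e.max.max(v);
        e.sum += v as i64;
        e.count += 1;
    }
    local
}

fn parallel_aggregate(data: &[u8], num_threads: usize) -> HashMap<&[u8], Stats> {
    // Snap each boundary back to the previous newline so no line is split.
    let mut bounds = vec![0];
    for t in 1..num_threads {
        let mut pos = t * (data.len() / num_threads);
        while pos > 0 && data[pos - 1] != b'\n' {
            pos -= 1;
        }
        bounds.push(pos);
    }
    bounds.push(data.len());

    thread::scope(|s| {
        let handles: Vec<_> = bounds
            .windows(2)
            .map(|w| s.spawn(move || process_segment(&data[w[0]..w[1]])))
            .collect();

        // Merge per-thread maps into one.
        let mut merged: HashMap<&[u8], Stats> = HashMap::new();
        for h in handles {
            for (city, st) in h.join().unwrap() {
                merged
                    .entry(city)
                    .and_modify(|m| {
                        m.min = m.min.min(st.min);
                        m.max = m.max.max(st.max);
                        m.sum += st.sum;
                        m.count += st.count;
                    })
                    .or_insert(st);
            }
        }
        merged
    })
}
```

`thread::scope` lets the workers borrow the mmap'd slice directly, so no data is copied into the threads.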

CPU prefetch hints

Tell the CPU to load data into cache before we actually need it. Prefetch 32KB and 64KB ahead while processing each chunk.

while file_pos < segment_end {
    // prefetch data we'll need soon (T0 = into all cache levels)
    _mm_prefetch::<_MM_HINT_T0>(data.as_ptr().add(file_pos + 32 * 1024) as *const i8);
    _mm_prefetch::<_MM_HINT_T0>(data.as_ptr().add(file_pos + 64 * 1024) as *const i8);

    // process current chunk...
}

CPUs can often prefetch sequential data automatically, so manual prefetching isn't always needed. But in my testing, adding prefetch hints in a few strategic places helped. The idea is to warm up the cache right before we need the data.

Backwards temperature parsing

Instead of searching forward for the semicolon, parse backwards from the newline. The temperature format is fixed, so the semicolon is always 3-5 bytes from the end.

I ran a DuckDB query on the data to confirm:

SELECT len(line) - instr(line, ';') as semicolon_from_right, count(*)
FROM read_csv('measurements.txt', ...)
GROUP BY semicolon_from_right;

┌─────────────────────┬───────────┐
│ semicolon_from_right│   count   │
├─────────────────────┼───────────┤
│          3          │ 178772818 │
│          4          │ 801065020 │
│          5          │  20162162 │
└─────────────────────┴───────────┘

Only 3 possible positions. We can parse backwards by grabbing the fractional digit, skipping the dot, grabbing integer digits, and checking for a minus sign. There are only 4 patterns: ;I.F, ;-I.F, ;II.F, ;-II.F.

let ptr = data.as_ptr().add(end);

let float = (*ptr.sub(1) - b'0') as i16;  // last byte
let int_l = (*ptr.sub(3) - b'0') as i16;  // skip dot
let byte_4 = *ptr.sub(4);

let mut value = int_l * 10 + float;

if byte_4 == b';' {
    // ;I.F
    semicolon_pos = end - 4;
} else if byte_4 == b'-' {
    // ;-I.F
    semicolon_pos = end - 5;
    value = -value;
} else {
    // ;II.F or ;-II.F
    let int_h = ((byte_4 - b'0') as i16) * 100;
    value = int_h + value;
    // check byte 5 for semicolon or minus...
}

No searching, just direct memory access at known offsets. This is faster than scanning forward with AVX-512.
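All four patterns put together as a self-contained sketch (a safe, slice-based version for illustration - the article's hot loop does the same thing on raw pointers without bounds checks; `end` is the index of the line's trailing newline, and the return value is the semicolon index plus the temperature scaled by 10):

```rust
// Parse the temperature backwards from the newline at `end`.
// Returns (semicolon_index, temperature * 10).
fn parse_temp_backwards(data: &[u8], end: usize) -> (usize, i16) {
    let frac = (data[end - 1] - b'0') as i16;  // fractional digit
    let int_l = (data[end - 3] - b'0') as i16; // low integer digit (skip '.')
    let byte4 = data[end - 4];
    let value = int_l * 10 + frac;

    match byte4 {
        b';' => (end - 4, value),  // ;I.F
        b'-' => (end - 5, -value), // ;-I.F
        _ => {
            // two integer digits: ;II.F or ;-II.F
            let value = value + ((byte4 - b'0') as i16) * 100;
            if data[end - 5] == b'-' {
                (end - 6, -value)
            } else {
                (end - 5, value)
            }
        }
    }
}
```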

Integer arithmetic with standard rounding

Instead of using floats throughout, store temperatures as i16 (multiplied by 10). This avoids floating-point arithmetic during aggregation and uses standard rounding for the average calculation.

struct CityStats {
    sum: i64,    // sum of all temps * 10
    count: u32,
    min: i16,    // temp * 10
    max: i16,
}

fn round_toward_positive_int(sum: i64, count: u32) -> f32 {
    let scaled = (sum * 100 / count as i64 + 5) / 10;
    scaled as f32 / 100.0
}

Only convert to f32 at the end for display. Integer arithmetic is faster and avoids rounding errors during aggregation.
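A quick check of the rounding direction, re-stating the helper from above: Rust's integer division truncates toward zero, so on a negative scaled value the +5 bump ends up nudging the result upward - toward positive infinity - which is presumably where the function's name comes from.

```rust
// Mean from (sum of temps * 10, count); the +5 then /10 rounds the
// scaled value, with negatives pulled toward positive infinity by
// Rust's truncating division.
fn round_toward_positive_int(sum: i64, count: u32) -> f32 {
    let scaled = (sum * 100 / count as i64 + 5) / 10;
    scaled as f32 / 100.0
}
```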

Small wins

A few micro-optimizations that add up:

Fast path for hashmap updates. Most lookups hit existing cities, so optimize for that case.

// fast path: city already exists
if let Some(city_stat) = city_stats.get_mut(city_name) {
    city_stat.sum += city_value as i64;
    city_stat.count += 1;
    city_stat.min = i16::min(city_stat.min, city_value);
    city_stat.max = i16::max(city_stat.max, city_value);
} else {
    // slow path: new city
    city_stats.insert(city_name, CityStats { ... });
}

Avoids the entry API overhead for the hot path.

Pre-allocate output string capacity to avoid reallocations during formatting.

let mut output = String::with_capacity(12_000);

The final code has #[inline(always)] on hot functions. But while debugging with perf, it's useful to turn it off so you can see how much time each function actually takes.

When printing the final output, skip UTF-8 validation with from_utf8_unchecked. We know city names are valid UTF-8 from the input, so no need to validate again.


There are many other better implementations out there in Java and Rust. Some use really clever tricks I haven't thought of. I might have to take another jab at it, maybe exploring NUMA-aware allocation, GPU offloading, or other approaches.

But the final number I got is 2.45s on GCP n4-standard-16. Started at 142s, ended at 2.45s. Not bad for 8-12 hours of fun iterations.

Full code is on GitHub.

DM me on twitter if you have ideas or questions - @matmul.