Memory hierarchy

Most performance arguments end up being arguments about which tier of memory you have to reach for. Touching L1 cache is essentially free; touching another datacenter is essentially forever. Both numbers are real. The slope between them is what your engine spends its day climbing.

Memory hierarchy by latency

Six memory tiers from L1 cache to cross-region network, each shown with size, real latency, a log-scale bar of relative cost, and a scaled latency assuming an L1 cache hit took one second.

tierlog-scale costreal timeif L1 = 1 secondL1 cache~32 KB / core1 ns1 secondL2 cache~256 KB / core4 ns4 secondsL3 cache~8 MB shared13 ns13 secondsMain memory64 GB and up100 ns1.7 minutesLocal SSD1 TB and up100 μs1.2 daysCross-region networkessentially unlimited70 ms2.2 years

Six tiers, real latency on the right, scaled to a human-relatable second on the far right. Bars are log-scaled because the absolute range spans eight orders of magnitude.

Reading the picture

If reaching L1 cache took one second, reaching another datacenter would take two years. That ratio is what every performance argument is implicitly about. When someone says "this is slow," they mean "this is making me reach further down the hierarchy than I want to be reaching."

The bars are on a log scale because the absolute numbers span eight orders of magnitude. Linear bars would show L1 through L3 as identical and everything else as a wall. Log bars show the actual gaps.

The cliffs that matter

Three cliffs in the hierarchy do most of the practical damage.

L3 to RAM. About 7x slower. This is the cliff a cache miss falls off. A program that does pointer chasing through a tree of small heap allocations spends most of its time here. The fix is data layout: pack what the hot loop touches together, in cache-line-sized chunks. Columnar layouts win partly for this reason; pointer-chasy ADT-heavy code loses partly for this reason.

RAM to SSD. About 1000x slower. This is the cliff a page fault falls off, or a query that has to read from disk because the working set didn't fit in memory. Engines that "fit your data in RAM" are fast because they never take this cliff. Engines that page from SSD pay it on every miss.

Local datacenter to cross-region network. About 1000x slower. This is the cliff a distributed query falls off when it has to shuffle data between nodes. The cost of shuffle at scale is dominated by this number, not by any work the engine itself does.

Where Rust patterns sit on the chart

Rust's safety story is mostly upstream of memory hierarchy, but its performance story is downstream of it.

Vec<T> is one heap allocation containing contiguous elements. A hot loop over a Vec<T> stays in L1 for the elements it's processing. This is the cache-friendly case.

Vec<Box<T>> is a vector of pointers, each pointing to a separate heap allocation. The vector itself is contiguous; the values are scattered. A hot loop pointer-chases across cache lines. Slower for reasons that have nothing to do with the language.

HashMap<K, V> is a hash table where colliding entries land in nearby slots. Lookups are fast when the load factor is low and the keys hash uniformly. They get slow when the table is too big to fit in L3.

Arc<T> adds an atomic reference count next to the value. Cloning an Arc is a single atomic increment, which costs a memory barrier and possibly a cache-line invalidation on other cores. Cheap, but not free.

The honest substrate read

The JVM, C++, and Rust all live on the same chart. None of them can teleport you from RAM to L1. What they can do is help (or hurt) your ability to keep the hot loop in the cache.

The JVM's escape analysis, when it works, gets short-lived objects out of the GC heap and onto the stack, which is L1-friendly. When escape analysis doesn't kick in, you eat heap allocation and the cache-miss tax that follows. Rust and C++ let you choose stack or heap explicitly; the cost is on the page when you write it.

Profile-guided optimization is mostly about cache and branch behavior. A profile tells the compiler which functions are hot, so it inlines them and arranges the binary to keep hot code paths adjacent. The JVM's JIT does this continuously, with live data. AOT compilers do it once, with a profile run.

Further reading