Field guide · the wider vocabulary

The lexicon.

63 terms for the substrate around Rust. Runtimes, compilation, memory and ownership, query engine mechanics, hardware, concurrency, and the engines actually shipping in production. The same problems get solved on the JVM and in C++; this page exists so the comparisons elsewhere on the site have somewhere to land.

Section I·10 terms

Runtimes

What runs alongside the program to make it work. Where each language sits between heavy and thin.

  1. The set of code that runs alongside your program to make it work. Garbage collector, thread scheduler, exception handling, reflection support. The JVM has a large one. C++ ships with a small one (exception unwinding, stack guards, the standard library). Rust's is thinner still: a panic unwinder, plus an async runtime if you opt in.
  2. Compile to machine code while the program runs, after observing which paths are hot. The JVM's HotSpot does this. The advantage is that the compiler can specialize on actual runtime types and call frequencies; the cost is a warmup period and the memory the compiler itself occupies. See What the JIT knows.
  3. Compile to machine code before the program starts. C, C++, Go, and Rust all do this by default. Predictable from the first request, no warmup, but the compiler has to make every decision without runtime data. See What the JIT knows.
  4. The runtime tracks which objects are still reachable and frees the rest. Java, Go, Python, and JavaScript all do this. The historical costs were a throughput tax, extra memory headroom, and pauses when the collector ran. Modern low-pause collectors like ZGC and Shenandoah keep pauses in the low milliseconds or below on most workloads, which closes most of the pause gap; the throughput and memory costs remain.
  5. A JVM optimization that proves an object never escapes the method or thread that created it, so the JIT can eliminate the heap allocation. HotSpot does this through scalar replacement, keeping the object's fields in registers or on the stack. Without it, every Java object goes to the GC heap. With it, allocation-heavy hot loops can approach C speed after warmup. This is one of several places where the JVM performs better at runtime than the language looks on paper.
  6. Tokio · Rust
    The de facto async runtime for Rust. Provides a thread pool, an I/O event loop, and the primitives async code reaches for. Rust ships no async runtime by default; you opt in to Tokio (or smol, or async-std) explicitly. The decoupling is deliberate so embedded Rust can skip it entirely.
  7. A program that schedules futures, polls them when they can make progress, and manages I/O events. Rust has no built-in async runtime; you pick one. Tokio is the common choice; smol and async-std are alternatives. Java's equivalent is the JDK's executors and virtual-thread scheduler plus libraries; Go's is the goroutine scheduler baked into the runtime. See async, illustrated.
  8. A multi-threaded task scheduler where idle threads steal tasks from busy threads' queues. Tokio's default multi-thread runtime uses this. Same shape as Java's ForkJoinPool. Because each worker mostly pushes and pops on its own queue, the pattern keeps cores busy under uneven workloads without every thread contending on one shared queue.
  9. Many lightweight tasks scheduled by a userspace runtime onto few OS threads. Go's goroutines, Java's virtual threads (Project Loom), Erlang's processes. Rust deliberately does not have green threads in the standard library; async/await covers the same use cases without a mandatory runtime or a separate stack per task.
  10. An interpreter that runs bytecode instead of machine code. The JVM, V8, CPython, and Ruby's MRI all do this. The JVM and V8 add JIT compilation; CPython traditionally does not (3.13 added an experimental JIT that must be enabled at build time, and PyPy is the long-standing JIT implementation). The bytecode VM step is the cost most natively compiled languages skip.
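
A minimal sketch of what an async runtime does, using only std. The names block_on, ThreadWaker, and YieldOnce are hypothetical and are not Tokio's API. block_on polls one future; when the future returns Pending, it parks the thread until the future's waker unparks it, then polls again. There is no thread pool, no I/O event loop, and no work stealing here.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Hypothetical waker: waking the task just unparks the thread in block_on.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

/// Drive a single future to completion on the current thread.
fn block_on<F: Future>(fut: F) -> F::Output {
    // Futures must be pinned before they can be polled.
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(output) => return output,
            // Not ready: sleep until some waker unparks us, then poll again.
            // Spurious unparks are harmless; we simply poll once more.
            Poll::Pending => thread::park(),
        }
    }
}

/// Hypothetical future that returns Pending on its first poll,
/// after asking to be polled again.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Request another poll; here that unparks the block_on thread.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}
```

For example, block_on(async { YieldOnce { yielded: false }.await; 40 + 2 }) returns 42 after two polls. Tokio adds the parts this leaves out: many tasks, a thread pool with work-stealing queues, and an I/O driver that fires wakers when sockets become ready.
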
Section II·6 terms

Compilation

What the compiler does before your program runs. Where Rust spends the time the JVM spends at runtime.

  1. The compiler backend Rust, Clang, Swift, Julia, and several others share. LLVM takes a generic intermediate representation and produces machine code for x86, ARM, WASM, and more. Most of rustc's optimization passes are actually LLVM doing its job.
  2. The compiler generates a separate copy of a generic function for each concrete type it is called with. Vec<u32> and Vec<String> produce different machine code. The win is zero-cost generics; the cost is binary size and compile time. C++ templates work the same way.
  3. A function call resolved at runtime through a vtable lookup. Rust uses dynamic dispatch when you write dyn Trait. The cost is one indirection per call and no inlining across the boundary. Java's virtual methods do this by default; Rust makes it opt-in.
  4. Substituting a function's body at its call site instead of generating a call instruction. The compiler decides; you can hint with #[inline]. Inlining unlocks downstream optimizations (constant propagation, dead code elimination). The JVM's JIT inlines aggressively because it has runtime data.
  5. Optimization passes that run across crate boundaries at link time. Without LTO, cross-crate inlining is limited to generic and #[inline] functions, whose bodies are available to the calling crate. With LTO, any call across crates can be inlined. The cost is much longer link times; the win is usually smaller, faster binaries.
  6. Compile twice. First with instrumentation, then run a representative workload to gather a profile, then compile again using the profile to guide inlining and branch layout. AOT compilers opt in to PGO; the JVM's JIT does the same work continuously at runtime, paying for it in warmup and compiler overhead instead of a separate build step.
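
Monomorphization and dynamic dispatch side by side, as a hedged sketch (Shape, total_area, and total_area_dyn are illustrative names). The generic version gets one compiled copy per concrete type and can inline area; the dyn version is compiled once and calls area through a vtable.

```rust
/// Illustrative trait; both dispatch styles below call the same method.
trait Shape {
    fn area(&self) -> u64;
}

struct Square {
    side: u64,
}

struct Rect {
    w: u64,
    h: u64,
}

impl Shape for Square {
    fn area(&self) -> u64 {
        self.side * self.side
    }
}

impl Shape for Rect {
    fn area(&self) -> u64 {
        self.w * self.h
    }
}

/// Static dispatch: rustc emits a separate copy of this function for each
/// concrete T it is called with, and each copy can inline T::area.
fn total_area<T: Shape>(shapes: &[T]) -> u64 {
    shapes.iter().map(|s| s.area()).sum()
}

/// Dynamic dispatch: compiled once; every area() call is an indirect call
/// through the vtable carried alongside each &dyn Shape.
fn total_area_dyn(shapes: &[&dyn Shape]) -> u64 {
    shapes.iter().map(|s| s.area()).sum()
}
```

The tradeoff shows at the call site: total_area needs every element to be the same T, while total_area_dyn accepts a mix of Square and Rect at the cost of one indirect call per element.
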
Section III·8 terms

Memory and ownership

Where values live, who owns them, and what the borrow checker is checking. The mechanism behind Rust's safety guarantee.

  1. The stack is the thread's per-call scratch space, freed when the function returns. The heap is the allocator's pool; memory there stays allocated until its owner frees it, which in Rust happens when the owning value is dropped. Rust values live on the stack by default; Vec, String, and Box keep a small handle on the stack and their contents on the heap. Cost shape: stack is essentially free; heap costs an allocator call.
  2. Resource acquisition is initialization. The pattern where a value's destructor (Drop in Rust) frees the resource it owns when the value goes out of scope. Memory, file handles, locks, sockets all use the same pattern. C++ invented it; Rust adopted it and added ownership-tracking on top.
  3. The static analyzer that verifies Rust's ownership rules. It runs after type-checking, before code generation. The output is either a program that compiles or an error such as “borrowed value does not live long enough”, “use of moved value”, or “cannot borrow x as mutable more than once at a time”. The single feature that makes Rust safe at compile time.
  4. The compile-time scope during which a reference is valid. Lifetimes are inferred in most cases. When you write 'a, you are naming a scope so the compiler can match it against another. Lifetimes are not runtime entities; they exist only for the borrow checker.
  5. A struct that wraps a pointer and adds behavior such as ownership or reference counting. Box owns a heap allocation. Rc reference-counts shared ownership in single-threaded code. Arc does the same for multi-threaded code (atomic counter). C++'s unique_ptr is the same shape as Box, and shared_ptr the same shape as Arc.
  6. Mutating data through a shared reference, mediated by a type that enforces the rules at runtime. RefCell for single-threaded (panics on conflict). Mutex for multi-threaded (blocks). Atomics for lock-free. The escape hatch from “shared means immutable.”
  7. A request to the heap allocator for a block of memory. Each one costs cycles plus the bookkeeping the allocator does. Most performance work in Rust and C++ is allocation-reduction work. See where allocations happen for the practical version.
  8. The rules governing what one thread sees when another thread writes. Rust's memory model is inherited from C++: its atomic orderings are documented as the same as C++20's. Sequential consistency, acquire-release, and relaxed are the orderings you will meet. The model exists so compilers and CPUs can reorder freely as long as the rules are obeyed.
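
RAII, Rc, and RefCell together in one hypothetical example (Guard, nested_scopes, and borrow_conflict_is_detected are illustrative names). Each Guard logs its acquisition when constructed and its release in Drop; the shared log is an Rc<RefCell<...>> so several guards can append to it through shared handles.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared, single-threaded, mutable log: Rc for shared ownership,
/// RefCell for mutation through a shared handle.
type Log = Rc<RefCell<Vec<String>>>;

/// Hypothetical resource guard: acquiring is construction, releasing is Drop.
struct Guard {
    name: &'static str,
    log: Log,
}

impl Guard {
    fn acquire(name: &'static str, log: &Log) -> Guard {
        log.borrow_mut().push(format!("acquire {name}"));
        Guard { name, log: Rc::clone(log) }
    }
}

impl Drop for Guard {
    fn drop(&mut self) {
        // Runs automatically when the guard goes out of scope.
        self.log.borrow_mut().push(format!("release {}", self.name));
    }
}

fn nested_scopes() -> Vec<String> {
    let log: Log = Rc::new(RefCell::new(Vec::new()));
    {
        let _outer = Guard::acquire("outer", &log);
        {
            let _inner = Guard::acquire("inner", &log);
        } // _inner is dropped here
    } // _outer is dropped here
    // Each guard dropped its Rc clone, so `log` is the sole owner again.
    Rc::try_unwrap(log).expect("no guard still alive").into_inner()
}

/// RefCell enforces the borrow rules at runtime instead of compile time.
fn borrow_conflict_is_detected() -> bool {
    let cell = RefCell::new(0);
    let _reader = cell.borrow();
    // A mutable borrow while `_reader` is alive breaks the rules:
    // borrow_mut() would panic; try_borrow_mut() reports it as an Err.
    let conflict = cell.try_borrow_mut().is_err();
    conflict
}
```

nested_scopes returns the log in the order acquire outer, acquire inner, release inner, release outer: each guard releases at the close of the scope that owns it, with no explicit free anywhere.
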
Section IV·12 terms

Query engine mechanics

The internal vocabulary of an analytical engine: how a query becomes a plan, becomes a tree of operators, becomes a stream of batches.

  1. A tree of operations that describes what a query should produce, independent of how. Filter, project, join, group-by. DataFusion's LogicalPlan, Spark's LogicalPlan, DuckDB's LogicalOperator all play the same role. See anatomy of a query.
  2. The tree of operations the engine actually executes. In DataFusion, a logical join typically becomes a HashJoinExec, and a filter becomes a FilterExec over a stream of record batches. The physical planner lowers logical to physical, picking a concrete algorithm for each operator.
  3. The thing that rewrites a plan tree to be cheaper without changing what it computes. Push filters down, drop dead projections, reorder joins by estimated cardinality. The bulk of database research over the last forty years lives here. Graefe's Volcano paper (1993) is still required reading.
  4. Move filter operations as close to the data source as possible. A WHERE country = 'US' should reach the Parquet reader, not a filter applied after loading, so the engine can skip row groups whose statistics rule out a match and discard failing rows before they are materialized. The absence of pushdown is one of the biggest performance gaps in handwritten data code.
  5. Only read the columns the query actually needs. A SELECT id, name against a fifty-column table should read two columns from disk, not fifty. Columnar layouts make this natural; row layouts force the reader to pull whole rows anyway.
  6. Skip entire files or partitions when their metadata proves no row matches the query. A query for date = '2026-05-26' reads only that partition's files. Combined with predicate pushdown, the engine can touch only a small fraction of the dataset, sometimes under one percent.
  7. Process many values per loop iteration instead of one. At the hardware level, SIMD instructions on packed data. At the engine level, operating on a RecordBatch (an Arrow columnar chunk) rather than one row at a time. Most modern engines are vectorized by default.
  8. Store all values of one field next to each other in memory, not interleaved with the rest of the row. Better for analytical queries (sum this column, average that one) and for compression. Worse for whole-row lookups. Arrow is the most widely adopted columnar in-memory format. See columnar vs row.
  9. Move data between processes, threads, or languages without copying the bytes. Within one process, Arrow's C Data Interface hands a record batch across a Python-Rust boundary by passing pointers to the same buffers; across processes, Arrow IPC files can be memory-mapped and read in place. The savings show up in benchmarks but also in operational simplicity.
  10. The columnar in-memory format that became the lingua franca of the modern data stack. pyarrow, arrow-rs, arrow-java, arrow-cpp all interoperate. If two engines speak Arrow, they hand data back and forth without converting it to another format.
  11. Generate machine code or bytecode at runtime, specialized to the current query. Spark's whole-stage code generation (part of Project Tungsten) emits Java source for each stage's hot loop and compiles it to bytecode with Janino. HyPer and Impala generate machine code through LLVM; ClickHouse can JIT-compile some expressions the same way. DuckDB deliberately skips code generation and relies on vectorized interpretation. The win is removing per-row interpretation overhead; the cost is a compilation pause before execution.
  12. The network transfer of data between nodes during a distributed join or group-by. Each row's key is hashed, and the row is sent to the node that owns that hash partition. Shuffles are often the dominant cost of large distributed queries.
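
A toy sketch of predicate pushdown and projection pushdown working together in a columnar scan. RowGroup, ScanStats, and sum_amount_for_year are hypothetical; the min/max fields are loosely modeled on the per-row-group statistics Parquet files carry, not on any particular reader's API. The query is SELECT sum(amount) WHERE year = target.

```rust
/// Hypothetical columnar row group, loosely modeled on Parquet: each column
/// is its own Vec, plus min/max statistics for `year`. A real file would hold
/// more columns; the scan below never touches them.
struct RowGroup {
    year: Vec<i32>,
    amount: Vec<i64>,
    year_min: i32,
    year_max: i32,
}

impl RowGroup {
    fn new(year: Vec<i32>, amount: Vec<i64>) -> RowGroup {
        assert_eq!(year.len(), amount.len(), "columns must be the same length");
        let year_min = *year.iter().min().expect("row group must not be empty");
        let year_max = *year.iter().max().expect("row group must not be empty");
        RowGroup { year, amount, year_min, year_max }
    }
}

#[derive(Debug, PartialEq)]
struct ScanStats {
    groups_skipped: usize,
    rows_scanned: usize,
}

/// SELECT sum(amount) WHERE year = target
fn sum_amount_for_year(groups: &[RowGroup], target: i32) -> (i64, ScanStats) {
    let mut total: i64 = 0;
    let mut stats = ScanStats { groups_skipped: 0, rows_scanned: 0 };
    for g in groups {
        // Predicate pushdown: the statistics prove no row can match,
        // so the group's column data is never read.
        if target < g.year_min || target > g.year_max {
            stats.groups_skipped += 1;
            continue;
        }
        // Projection pushdown: only the two columns the query names are read.
        stats.rows_scanned += g.year.len();
        total += g
            .year
            .iter()
            .zip(&g.amount)
            .filter(|&(&y, _)| y == target)
            .map(|(_, &a)| a)
            .sum::<i64>();
    }
    (total, stats)
}
```

With three row groups whose year ranges are 2022 to 2023, 2024 to 2024, and 2023 to 2024, a query for 2024 skips the first group on its statistics alone and scans four rows in total. Real Parquet readers apply the same comparison to row group and page statistics before decoding anything.
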
Section V·10 terms

Hardware and observation

What the chip actually does, and how to see what your program does on it. Most performance work happens here.

  1. The chunk of memory the CPU loads at once, typically 64 bytes. Algorithms that touch contiguous bytes hit a hot cache line; algorithms that hop around RAM eat cache misses. Columnar layouts win partly because the hot loop stays inside a few cache lines per column.
  2. Three tiers of SRAM between the CPU cores and main memory. L1 is per-core, roughly 32KB, accessible in about four cycles. L2 is per-core, roughly 256KB, about twelve cycles. L3 is shared across cores, roughly 8MB, about forty cycles. Main memory is about two hundred cycles. These figures are typical of a Skylake-era desktop core; newer designs carry larger L2s, but the order-of-magnitude gaps between tiers hold. See memory hierarchy.
  3. A small cache of virtual-to-physical address translations. A TLB miss adds a page table walk to the cache miss cost. Programs whose many small allocations are spread across many memory pages suffer frequent TLB misses.
  4. On multi-socket systems, each CPU socket has its own attached RAM. Accessing the other socket's RAM crosses an interconnect and costs more cycles. NUMA-aware code keeps data local to the thread that uses it.
  5. The CPU's guess about which side of an if will be taken next. Right guesses are nearly free; wrong guesses flush the pipeline. Tight loops that always go one way are fast for reasons that have nothing to do with the language.
  6. Single Instruction, Multiple Data. CPU instructions that operate on a vector of values at once. AVX2 on x86, NEON on ARM. The hot inner loops of most modern analytical engines compile down to these. LLVM, and therefore rustc, auto-vectorizes more aggressively than most expect, though default x86-64 targets assume only SSE2; AVX2 code needs -C target-cpu or -C target-feature.
  7. The work the OS does to swap one thread for another on a CPU core. Register save and restore, a possible TLB flush when the switch crosses into another process, plus the cache pollution from the next thread's working set. Roughly one to ten microseconds. Async runtimes exist mostly to avoid context switches.
  8. A visualization of which functions were on the call stack, weighted by time. Generated from a stack-sampling profiler (perf, dtrace, pprof). Reading one is the fastest way to see where your program actually spends time. See flame graphs are mostly runtime.
  9. A flame graph for memory: which call paths allocated how much. In Rust and C++ every allocation is an explicit call into the allocator, so this view points straight at the allocations worth removing. dhat for Rust, jemalloc's profiler for C/C++.
  10. A Linux kernel feature that lets you attach small verified programs to kernel events. Used for observability (perf, bpftrace), networking (Cilium), and security (Falco). The modern replacement for SystemTap and DTrace on Linux.
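
A sketch of structuring a reduction so LLVM can auto-vectorize it. Floating-point addition is not associative, so the compiler will not split a plain running sum of f32 values into SIMD lanes on its own; eight independent accumulators make the reordering explicit. Whether this actually compiles to SSE, AVX2, or NEON depends on the target and flags, so check the generated assembly rather than assuming. sum_lanes is an illustrative name.

```rust
/// Illustrative reduction written so LLVM may auto-vectorize it.
pub fn sum_lanes(xs: &[f32]) -> f32 {
    const LANES: usize = 8;
    // Eight independent running sums, one per SIMD lane. A single
    // accumulator would force strictly sequential float additions.
    let mut acc = [0.0f32; LANES];
    let chunks = xs.chunks_exact(LANES);
    // Elements left over when xs.len() is not a multiple of LANES.
    let tail = chunks.remainder();
    for chunk in chunks {
        // Fixed trip count over exact-size chunks typically lets the
        // optimizer drop the bounds checks.
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }
    // Combine the lanes, then add the tail sequentially.
    let mut total: f32 = acc.iter().sum();
    for x in tail {
        total += x;
    }
    total
}
```

Because the additions happen in a different order than a sequential loop, the result can differ from a naive sum in the last few bits when intermediate sums round; for small whole numbers such as 1.0 through 20.0, both give exactly 210.0.
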
Section VI·6 terms

Concurrency

The vocabulary of threads and atomics. Where Rust's compile-time guarantees end and the hardware's reordering rules begin.

  1. Rust's two marker traits encoding thread safety. Send means a value can be moved between threads. Sync means a &T can be shared between threads. The compiler enforces both. Misuse does not compile.
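  2. A concurrent algorithm where no thread can block the others by holding a lock; some thread always makes progress. Usually built on atomic primitives like CAS. crossbeam provides lock-free queues and the epoch-based memory reclamation they need; lock-free counters are plain std atomics. (dashmap, often mentioned alongside, is a sharded map guarded by locks, not lock-free.) Hard to get right, and easy to get wrong in a way that looks correct.
  3. The atomic primitive: if memory at address X equals A, set it to B, otherwise report the actual value. Most lock-free algorithms are built on CAS loops. On x86 it is a single instruction, lock cmpxchg; on ARM it is a load-linked/store-conditional pair (LDXR/STXR) or, from ARMv8.1, a single CAS instruction.
  4. A CPU instruction that prevents the processor from reordering memory operations across that point. Required when implementing lock-free algorithms or interacting with hardware. Rust's Ordering::Acquire, Ordering::Release, and similar compile down to whatever each platform needs: on x86, ordinary loads and stores already have acquire and release semantics, so often no extra instruction; on ARM, dedicated acquire/release instructions or explicit barriers.
  5. The strongest memory ordering: all threads agree on a single total order of sequentially consistent operations, consistent with each thread's program order. Usually the most expensive ordering, because it forbids most reordering. It is the default for C++'s std::atomic; Rust has no default and requires an explicit Ordering on every atomic operation.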
  6. The database technique where each write creates a new version of the row instead of overwriting. Readers see a consistent snapshot; writers don't block readers. Postgres, MySQL InnoDB, and most modern databases use this. The cost is keeping old versions around until no one needs them.
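
A sketch of the CAS loop that most lock-free code is built from, using only std. cas_increment and parallel_count are illustrative names; a real counter would just call fetch_add, and the explicit loop is here to show the retry pattern. Every atomic call names its Ordering, since Rust has no default. The counter is shared across scoped threads as a plain &AtomicU64, which compiles because AtomicU64 is Sync; swapping in Cell<u64> would be rejected at compile time, because Cell is not.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Increment with an explicit compare-and-swap retry loop.
fn cas_increment(counter: &AtomicU64) {
    let mut current = counter.load(Ordering::Relaxed);
    loop {
        // If the counter still holds `current`, replace it with current + 1.
        // AcqRel on success; Relaxed would also be correct for a bare counter.
        match counter.compare_exchange_weak(
            current,
            current + 1,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return,
            // Another thread got there first (or the weak CAS failed
            // spuriously): retry from the value actually observed.
            Err(actual) => current = actual,
        }
    }
}

/// Spawn `threads` scoped threads that each increment a shared counter.
fn parallel_count(threads: usize, per_thread: u64) -> u64 {
    let counter = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..threads {
            // Shares &AtomicU64 across threads; compiles because AtomicU64: Sync.
            s.spawn(|| {
                for _ in 0..per_thread {
                    cas_increment(&counter);
                }
            });
        }
    }); // scope joins every thread before returning
    counter.into_inner()
}
```

parallel_count(4, 1000) returns 4000: a failed compare_exchange_weak hands back the value another thread wrote, and the loop retries from there instead of losing an update.
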
Section VII·11 terms

Engines in the wild

Production data engines, tagged by substrate. The same problems get solved on each side; the tradeoffs are different.

  1. The reference relational database. Written in C. MVCC, single-node by default (Citus adds sharding; Patroni adds high-availability failover). A great deal of query optimization research lands on Postgres first. The substrate against which most newer engines are measured.
  2. The reference distributed analytics engine. The RDD paper (Zaharia et al., 2012), then DataFrame, then Spark SQL on top of the Catalyst optimizer. Mature, battle-tested, surrounded by an enormous connector ecosystem. The substrate question of “what would Spark look like in Rust” is part of why this site exists.
  3. The de facto distributed log. Most data infrastructure flows in or out of a Kafka topic at some point. The broker runs on the JVM; client libraries exist in every language. The JVM's mature networking and concurrency runtime carry a lot of weight here.
  4. A distributed SQL query engine on the JVM. Federates across data sources (Hive, S3, MySQL, Kafka). Built for interactive analytical queries over heterogeneous storage. Trino began as a fork of Presto; Velox plugs under Presto's execution layer, not Trino's.
  5. A single-node analytical database, embedded like SQLite. C++ internals, exhaustively vectorized. The CIDR 2020 paper is short and worth reading. DuckDB is the answer when “your laptop is big enough” and you want one binary, no cluster, no JVM.
  6. A columnar OLAP database with industrial-strength vectorization. Used heavily for observability and large-volume aggregate workloads. Sometimes faster than anything else at the things it is good at.
  7. Meta's vectorized execution engine, designed to be reused under existing systems (Presto, Spark). C++. The interesting bet is sharing one well-tuned execution kernel across multiple engines instead of building each in isolation.
  8. Polars · Rust
    A DataFrame library that competes with pandas. Internally a query engine with lazy evaluation, query planning, and SIMD-heavy execution. The Python bindings are a thin layer over the Rust core.
  9. A SQL and plan-execution engine on top of Apache Arrow. Modular: pluggable catalogs, custom operators, custom function registries. Many Rust query engines build on it.
  10. Sail · Rust
    A Spark-compatible distributed query engine on top of DataFusion. Open source under Apache 2.0. The Sail tour in this guide walks through real source.
    Disclosure: LakeSail, the company behind Sail, is where the author of this site works on go-to-market, not engineering. Sail appears across this site as a real Rust codebase to read; the field guide does not depend on it.
Anchor essay
What the JIT knows

Three computation curves and where Rust sits. The honest read on why the JVM is faster on a warm hot loop and why Rust is more predictable on the cold one.

Apply the lexicon
Rust in query engines

Where these terms show up in a real engine. Arrow, columnar memory, zero-copy, the four crates worth knowing.