Rust in query engines
The modern data stack rewrote itself in Rust between roughly 2020 and 2025. The reason is not ideology. It is that a query engine is the worst-case workload for a language: huge data volumes, hot inner loops, memory pressure, multi-threaded execution, no tolerance for pauses. Java pays GC. Python pays the interpreter. C++ pays at the human level (memory safety bugs, build complexity). Rust pays for none of them and keeps you honest about the costs you do pay.
The four crates that anchor the ecosystem
| Crate | Role |
|---|---|
arrow-rs | The columnar memory format. Every modern engine speaks Arrow. Zero-copy interop with Python, Java, C++, Go. |
datafusion | A SQL + plan execution engine on top of Arrow. The base layer most Rust engines build on. |
polars | A DataFrame library that competes with pandas. Internally a query engine with lazy evaluation, query planning, and SIMD. |
duckdb-rs | Rust bindings to DuckDB. C++ internals, but the Rust ecosystem treats it as a peer. |
If you know these four, you know the surface area. Sail builds on DataFusion. Polars uses Arrow. DuckDB and Polars are competitors-in-spirit at the single-node DataFrame layer.
Where Sail sits
Sail is a distributed query engine that speaks the Spark Connect protocol. It is built on DataFusion 53, which gives it the logical plan, optimizer, and physical execution. Sail adds:
- A Spark-compatible SQL dialect (
sail-sql-parserusing Chumsky parser combinators). - A distributed actor system for driver / worker coordination (
sail-execution). - Arrow Flight transport for shuffles and result streaming (
sail-flight). - A pluggable catalog (memory, Hive Metastore, Glue, Unity, OneLake, Iceberg).
- A PyO3 bridge for Python UDFs (
sail-python-udf).
If you only learn one query-engine codebase in Rust, learn Sail or DataFusion. They share most patterns: typed plans, optimizer rules as data, Arrow record batches as the unit of work, async over Tokio at the transport layer.
Why Rust specifically
Three properties of the language matter for query engines:
- No GC. A latency-sensitive sort cannot pause for arbitrary GC time. Rust's deterministic destructors keep tail latency tight.
- Zero-copy and explicit allocation. Arrow's columnar buffers can be shared between threads, languages, and processes without copies. Rust's
Arc<T>and slices make this safe. - Generics and monomorphization. Hot inner loops (sum, filter, hash) get specialized per data type at compile time. No interpreter overhead per row.
The cost is the borrow checker and longer compile times. The Sail repo's target/ directory is around 80GB and a cold build takes about thirty minutes. That is the price.
Where the JVM and C++ still lead
Honesty section. Three places where the other substrates are genuinely ahead, today, in production.
Mature distributed engines on the JVM. Spark, Flink, and Kafka are decades of compounded work. The connector ecosystems alone are something the Rust side has not replicated. Spark's Catalyst optimizer has been refined since 2014. Flink's exactly-once streaming with checkpoint recovery is the canonical answer to correctness-under-failure. Sail aims for Spark protocol compatibility specifically because rewriting the application surface is real work, even when the engine underneath is faster.
Single-node aggregation on C++. DuckDB and ClickHouse are exhaustively vectorized in ways the Rust ecosystem is still catching up to. The DuckDB CIDR 2020 paper is a short and honest read on the cost model. For workloads where your laptop is big enough, the C++ engines are often the closest thing to a finished product.
Reuse across engines on C++. Velox is Meta's bet that one well-tuned execution kernel under Presto, Spark, and others will outperform a fragmented landscape. The Rust equivalent of that bet has not landed yet.
The Rust side wins on memory safety, deployment artifacts, and predictable cold-path latency. The JVM and C++ sides win on maturity and on workloads that reward warmup. Both observations are true at the same time. The full read on why each substrate wins where it does is in What the JIT knows.
What to look for as an orchestrator
When an agent proposes a data-engine-adjacent change in Rust, watch for these specific patterns:
| Pattern | Why it matters |
|---|---|
RecordBatch flowing through the operator | Arrow is the unit of work. If the agent invented a per-row API, push back. |
Stream<Item = Result<RecordBatch>> | Operators are async streams of batches. |
ExecutionPlan trait implementations | DataFusion's contract for a physical operator. |
SchemaRef = Arc<Schema> | Schemas are shared, immutable, refcounted. |
LogicalPlan -> PhysicalPlan lowering | The optimizer transforms the plan; the executor runs it. |
The closer agent-written engine code stays to these idioms, the less you have to rewrite later.
Recommended reading
- DataFusion in Apache Arrow — the canonical reference.
- Sail GitHub — the production Rust query engine this site biases toward.
- Polars User Guide — the DataFrame side of the ecosystem.
- Arrow Columnar Format — read this before writing engine code.