Rust in AI inference
Training happens in Python. Inference increasingly does not. Latency, memory footprint, and the deployment story for serving models push teams toward Rust faster than in almost any other domain. Three things to know: what the libraries are, why Rust wins on inference, and what to watch for in agent-written ML-adjacent code.
The libraries worth knowing
| Crate | What it does |
|---|---|
| candle | Hugging Face's pure-Rust ML framework. Minimal, fast, supports CUDA + Metal + WASM. Run Llama-class models from a single binary. |
| burn | A more flexible ML framework with multiple backends (Wgpu, NDArray, LibTorch, Candle). Comfortable for both training and inference. |
| ort | ONNX Runtime bindings. The pragmatic choice when your model already exports to ONNX. |
| tokenizers | HF's tokenizer library. The Python side is a thin wrapper around this Rust core. |
| llama-cpp-rs | Rust bindings to llama.cpp. The fastest path to running quantized models locally. |
| safetensors | The model weight format that replaces pickle for safety + speed. |
For most projects, the choice is: Candle (if you want pure Rust), ort (if your model exports to ONNX), or llama-cpp-rs (if you want quantized local inference).
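Part of why safetensors is safer than pickle is that there is nothing to execute when loading it: a file is an 8-byte little-endian header length, a JSON header describing each tensor, and then raw bytes. The sketch below splits a buffer along those lines using only the standard library. It is an illustration of the layout, not a replacement for the safetensors crate; the function names and the tensor name "w" are hypothetical, and the JSON is left unparsed.

```rust
use std::convert::TryInto;

/// Splits a safetensors byte buffer into its JSON header and raw tensor data.
/// Layout: 8-byte little-endian u64 header length N, then N bytes of UTF-8 JSON,
/// then the tensor bytes. (Hypothetical helper, for illustration only.)
fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("file shorter than the 8-byte length prefix".to_string());
    }
    let (len_bytes, rest) = bytes.split_at(8);
    let header_len = u64::from_le_bytes(len_bytes.try_into().unwrap()) as usize;
    if header_len > rest.len() {
        return Err("header length exceeds file size".to_string());
    }
    let (header, data) = rest.split_at(header_len);
    let header = std::str::from_utf8(header).map_err(|e| e.to_string())?;
    Ok((header, data))
}

/// Builds a tiny in-memory file holding one made-up tensor "w" with two f32 values.
fn example_file() -> Vec<u8> {
    let header = r#"{"w":{"dtype":"F32","shape":[2],"data_offsets":[0,8]}}"#;
    let mut file = (header.len() as u64).to_le_bytes().to_vec();
    file.extend_from_slice(header.as_bytes());
    file.extend_from_slice(&1.0f32.to_le_bytes());
    file.extend_from_slice(&2.0f32.to_le_bytes());
    file
}

fn main() {
    let file = example_file();
    let (header, data) = split_safetensors(&file).expect("well-formed file");
    println!("header: {}", header);
    println!("tensor bytes: {}", data.len());
}
```

Loading is bounds checks plus JSON parsing, which is also why the weights can be memory-mapped rather than deserialized.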
Why Rust matters for inference
Five reasons, in order of how much they matter in practice:
- No Python overhead. A 100ms model call in Python can carry 5 to 20ms of interpreter and FFI overhead. In Rust that overhead is essentially zero. At scale (millions of requests), this is real money.
- Smaller deployment artifacts. A Rust binary that includes the model and the runtime can be around 50MB. The equivalent Python deployment is closer to a gigabyte before model weights.
- Deterministic memory. No GC pauses, no Python reference cycles, no surprise allocations. Predictable tail latency.
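- Cross-platform out of the box. Cargo cross-compiles for x86, ARM, and WASM targets, and GPU backends are selected through crate features, without Docker gymnastics.
- Type-safe tensor operations (in Candle and Burn). In Burn, tensor rank is part of the type, so rank mismatches fail at compile time. In Candle, a mismatched shape returns an Err at construction or at the op that received it, rather than failing silently.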
The cost: the ML ecosystem in Rust is younger and less complete than Python's. Some specialized layers, optimizers, and quantization tricks exist only in Python.
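The type-safety point in the list above has two levels, and a standard-library sketch can show both: dimensions carried in the type (checked by the compiler) and shape validated when a value is built (checked at runtime). This does not use the Candle or Burn APIs; `Matrix` and `DynTensor` are hypothetical types written only for illustration.

```rust
/// A matrix whose dimensions are part of its type.
#[derive(Debug, Clone, PartialEq)]
struct Matrix<const R: usize, const C: usize> {
    data: [[f32; C]; R],
}

impl<const R: usize, const C: usize> Matrix<R, C> {
    /// The inner dimension `C` must match on both sides, or the call does not compile.
    fn matmul<const K: usize>(&self, rhs: &Matrix<C, K>) -> Matrix<R, K> {
        let mut out = [[0.0f32; K]; R];
        for i in 0..R {
            for j in 0..K {
                for k in 0..C {
                    out[i][j] += self.data[i][k] * rhs.data[k][j];
                }
            }
        }
        Matrix { data: out }
    }
}

/// A dynamically shaped tensor that validates its shape when constructed.
#[derive(Debug)]
struct DynTensor {
    data: Vec<f32>,
    shape: (usize, usize),
}

impl DynTensor {
    fn new(data: Vec<f32>, shape: (usize, usize)) -> Result<Self, String> {
        let expected = shape.0 * shape.1;
        if data.len() != expected {
            return Err(format!(
                "shape {:?} needs {} elements, got {}",
                shape,
                expected,
                data.len()
            ));
        }
        Ok(DynTensor { data, shape })
    }
}

fn main() {
    let a = Matrix::<2, 3> { data: [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]] };
    let b = Matrix::<3, 1> { data: [[1.0], [0.0], [1.0]] };
    let c = a.matmul(&b); // the result type is Matrix<2, 1>
    println!("{:?}", c);
    // `a.matmul(&a)` is rejected by the compiler: the argument must be a Matrix<3, _>.

    // Runtime check: five elements cannot fill a 2x3 shape.
    println!("{:?}", DynTensor::new(vec![0.0; 5], (2, 3)));
}
```

The compile-time version only works when shapes are known statically; dynamic shapes (variable sequence lengths, for example) fall back to the runtime check.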
Where C++ and Python still lead
Three honest tradeoffs.
llama.cpp and the quantized model world. The single best path to running quantized large models locally is still C++. llama.cpp is the reference implementation; the Rust binding (llama-cpp-rs) is a thin wrapper. For 4-bit and 8-bit quantization, the precision tricks live on the C++ side, and the Rust ecosystem mostly calls into them.
Python plus CUDA, for the actual training. Training and most experimentation happen in PyTorch on Python with CUDA underneath. The ecosystem of pretrained checkpoints, fine-tuning scripts, distributed-training plumbing, and reproducible-research culture all lives there. Rust handles inference well; nobody trains in Rust by choice yet.
vLLM and the high-throughput serving layer. vLLM (Python plus custom CUDA kernels) is currently the most efficient open-source LLM serving framework. The continuous-batching and paged-attention tricks at the heart of it are not yet matched by anything in Rust. Inference at scale today usually routes through systems like vLLM, TensorRT-LLM, or SGLang first; Rust comes in around the edges (tokenizers, request routing, model loading, the binary that boots the whole thing).
Rust's win in inference is on cold start, deployment size, and tail latency for small-to-medium models. It is not yet the right choice for serving frontier-class models at frontier-class throughput. See What the JIT knows for the broader read on which curve Rust sits on and why.
What "AI in Rust" actually looks like in code
Two snippets to set expectations. First, a Candle inference call:
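```rust
use candle_core::{DType, Device, Tensor};
use candle_nn::{Module, VarBuilder};

// Uses CUDA device 0 when available, otherwise falls back to the CPU.
let device = Device::cuda_if_available(0)?;
// Safetensors weights are memory-mapped; from_pth is only for PyTorch pickle checkpoints.
// The call is unsafe because the file must not change while it is mapped.
let vb = unsafe { VarBuilder::from_mmaped_safetensors(&["model.safetensors"], DType::F32, &device)? };
// MyModel and sample_argmax stand in for your own model and sampling code.
let model = MyModel::new(vb)?;
// prompt_tokens is a Vec<u32> of token ids; the shape is (batch, sequence).
let input = Tensor::from_slice(&prompt_tokens, (1, prompt_tokens.len()), &device)?;
let logits = model.forward(&input)?;
let next_token = sample_argmax(&logits)?;
```

Idiomatic Rust: explicit device, explicit dtype, every fallible op returns Result and propagates with ?. No autograd by default (Candle is inference-first).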
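The snippet leaves `sample_argmax` undefined. As a rough idea of what greedy sampling involves, here is a minimal sketch that works on a plain `&[f32]` slice of logits for the last position. That signature is an assumption made to keep the example standard-library only; the snippet above passes a tensor, so real code would first copy the logits to host memory or use the framework's own argmax op.

```rust
/// Greedy sampling: returns the index of the largest logit, or None for an
/// empty slice. NaN values are skipped, and ties go to the earliest index.
/// (Hypothetical helper operating on host-side logits.)
fn sample_argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        match best {
            // Keep the current best when the new value is not strictly larger.
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i as u32)
}

fn main() {
    let logits = [0.1f32, 2.5, -1.0, 2.5];
    println!("{:?}", sample_argmax(&logits));
}
```

Returning `Option` instead of panicking on empty input keeps the caller in the same explicit-error style as the rest of the snippet.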
Second, a tokenizer call:
```rust
use tokenizers::Tokenizer;

let tokenizer = Tokenizer::from_file("tokenizer.json")?;
let encoding = tokenizer.encode("Hello, world!", false)?;
let ids: Vec<u32> = encoding.get_ids().to_vec();
```

This is the same Rust crate that backs Hugging Face's Python tokenizers library. Calling it from Rust is the same code minus the FFI layer.
What to look for as an orchestrator
When an agent writes ML-adjacent Rust:
| Pattern | Watch for |
|---|---|
| Loading model weights | safetensors over pickle-based formats. Pickle is unsafe by design. |
| Tensor allocation in a loop | Allocations dominate latency. Reuse output buffers. |
| Async around model calls | Model inference is sync CPU/GPU work; wrap with tokio::task::spawn_blocking. |
| Device selection | Should be explicit (Device::Cuda vs Device::Cpu), not a silent default. |
| Quantization claims | Verify the numerics; agents will hand-wave precision losses. |
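For the "tensor allocation in a loop" row, the pattern to look for is an output buffer created once outside the loop and passed in by mutable reference. The sketch below uses a plain `Vec<f32>` and a softmax as a stand-in for a per-step op; `softmax_into` is a hypothetical name, and real tensor libraries expose their own in-place or preallocated variants.

```rust
/// Writes softmax(logits) into `out`, reusing its existing allocation.
fn softmax_into(logits: &[f32], out: &mut Vec<f32>) {
    out.clear(); // drops the old values but keeps the capacity
    // Subtract the maximum before exponentiating for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    out.extend(logits.iter().map(|&x| (x - max).exp()));
    let sum: f32 = out.iter().sum();
    for p in out.iter_mut() {
        *p /= sum;
    }
}

fn main() {
    // Made-up logits for two decoding steps over a three-token vocabulary.
    let steps = vec![vec![1.0f32, 2.0, 3.0], vec![0.5, 0.5, 0.5]];
    let mut probs: Vec<f32> = Vec::with_capacity(3); // allocated once
    for logits in &steps {
        softmax_into(logits, &mut probs); // no allocation inside the loop
        println!("{:?}", probs);
    }
}
```

In review, the red flag is the opposite shape: a function that returns a fresh `Vec` or tensor on every call inside the decoding loop.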
A note on the ecosystem direction
The Rust ML story is moving. Candle adds backends, Burn ships training improvements, ort 2.x is rebuilding the API. If the agent suggests a specific version, check the current state, not what was true six months ago.
For a personal or product project, the safe bet for inference today is one of: ort + ONNX, Candle + safetensors, or llama-cpp-rs. Pick based on whether the model exports to ONNX, has weights you can load directly, or needs quantization.