DataFrame ops in Sail
Sail represents a Spark query plan as a Rust enum. Each variant is a node in the query DAG: read, project, filter, join, sort, aggregate. This is the canonical "tagged union" use of Rust enums, applied at production scale, and it teaches you most of how Rust handles polymorphic data.
The plan as an enum
The top of the plan hierarchy:
/// Unresolved logical plan node for Sail.
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase", rename_all_fields = "camelCase")]
pub enum Plan {
    Query(QueryPlan),
    Command(CommandPlan),
}
Two variants at the root: queries (read data) and commands (mutate catalog). Each carries one struct of data.
The interesting work happens inside QueryNode:
#[derive(Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase", rename_all_fields = "camelCase")]
pub enum QueryNode {
    Read {
        #[serde(flatten)]
        read_type: ReadType,
        is_streaming: bool,
    },
    Project {
        input: Option<Box<QueryPlan>>,
        expressions: Vec<Expr>,
    },
    Filter {
        input: Box<QueryPlan>,
        condition: Expr,
    },
    Join(Join),
    SetOperation(SetOperation),
    Sort {
        input: Box<QueryPlan>,
        order: Vec<SortOrder>,
        is_global: bool,
    },
    Limit {
        input: Box<QueryPlan>,
        skip: Option<Expr>,
        limit: Option<Expr>,
    },
    Aggregate(Aggregate),
    // ~40 more variants
}
Four things to notice:
1. Recursive variants use Box
input: Box<QueryPlan> and input: Option<Box<QueryPlan>> are the way Rust represents a recursive type. Without Box, the size of QueryNode would be infinite (a Filter contains a QueryPlan, which contains a Filter, which contains a QueryPlan…). Box<T> is a heap allocation that breaks the cycle. The variant stores a pointer, not the data.
Option<Box<QueryPlan>> for Project::input means "this project might be a leaf (no input) or have an upstream plan." That is what Option is for.
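To make the sizing argument concrete, here is a minimal sketch using a hypothetical three-variant Node enum. The names Node, chain_len, and child_slot_size are illustrative assumptions, not Sail's actual types. It shows that a boxed child occupies one pointer-sized slot no matter how deep the plan below it is, and that wrapping the Box in Option costs nothing extra.

```rust
use std::mem::size_of;

/// Hypothetical miniature plan node; NOT Sail's actual definition.
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    /// Leaf: no child plan at all.
    Read { table: String },
    /// Recursive: the child lives on the heap, the variant holds a pointer.
    Filter { input: Box<Node>, condition: String },
    /// Optional child: `None` means this projection has no upstream plan.
    Project { input: Option<Box<Node>>, expressions: Vec<String> },
}

/// Number of nodes on the chain from `node` down to its leaf.
pub fn chain_len(node: &Node) -> usize {
    match node {
        Node::Read { .. } => 1,
        // `&Box<Node>` coerces to `&Node` at the call site.
        Node::Filter { input, .. } => 1 + chain_len(input),
        // `as_deref` turns `&Option<Box<Node>>` into `Option<&Node>`.
        Node::Project { input, .. } => 1 + input.as_deref().map_or(0, chain_len),
    }
}

/// Size in bytes of the slot that stores an optional boxed child.
pub fn child_slot_size() -> usize {
    size_of::<Option<Box<Node>>>()
}
```

On any target, child_slot_size() equals the size of a plain Box<Node>, which in turn equals the size of a usize: Rust guarantees that Option<Box<T>> uses the null pointer to encode None.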
2. Two variant styles, two purposes
Project { // struct variant
    input: Option<Box<QueryPlan>>,
    expressions: Vec<Expr>,
},
Join(Join), // tuple variant with a named struct
When the data has structure (multiple named fields), use a struct variant. When it is one named "payload" struct, use a tuple variant with the struct name. Sail does this consistently. Reading agent-written enums against this rubric quickly flags inconsistencies.
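One practical consequence of the two styles can be sketched as follows, assuming a hypothetical Op enum and JoinSpec payload struct (neither is from Sail). A tuple variant's payload is a type in its own right, so helpers can accept it directly, while a struct variant's fields only exist inside the match arm that destructures them.

```rust
/// Hypothetical payload struct, standing in for a type like `Join`.
#[derive(Debug, Clone, PartialEq)]
pub struct JoinSpec {
    pub left: String,
    pub right: String,
    pub on: String,
}

/// Hypothetical two-variant enum mixing both styles.
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    /// Struct variant: the fields are only reachable by matching on `Op`.
    Limit { skip: usize, limit: usize },
    /// Tuple variant: wraps a named struct that can travel on its own.
    Join(JoinSpec),
}

/// Takes the payload struct directly; the enum never appears in the signature.
pub fn describe_join(join: &JoinSpec) -> String {
    format!("{} JOIN {} ON {}", join.left, join.right, join.on)
}

/// Dispatches on the variant and renders a short description.
pub fn describe(op: &Op) -> String {
    match op {
        // Struct variant: destructure the named fields inline.
        Op::Limit { skip, limit } => format!("LIMIT {limit} OFFSET {skip}"),
        // Tuple variant: hand the whole payload to a dedicated helper.
        Op::Join(join) => describe_join(join),
    }
}
```

A reasonable rule of thumb under these assumptions: once a struct variant grows enough fields that you want to pass them around together, promote it to a named struct in a tuple variant.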
3. #[serde(rename_all = "camelCase")] for JSON interop
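Rust enums use PascalCase for variants and snake_case for fields. JSON usually wants camelCase. The Serde derive bridges them at the boundary: on an enum, rename_all renames the variants, and rename_all_fields renames the fields inside struct variants, which is why the derives above set both. The Rust code stays idiomatic, the JSON stays clean.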
4. #[serde(flatten)] for nested-but-not-nested
Read {
    #[serde(flatten)]
    read_type: ReadType,
    is_streaming: bool,
},
flatten says "when serializing this field, do not nest it under read_type in the JSON; merge its fields into the parent." Useful when the wire format and the Rust type want different shapes.
Pattern matching on plan nodes
The natural way to operate on this enum is match. From the Spark Connect server:
let stream = match op {
    plan::OpType::Root(relation) => {
        service::handle_execute_relation(&ctx, relation, metadata).await?
    }
    plan::OpType::Command(Command { command_type: command }) => {
        let command = command.required("command")?;
        handle_command(&ctx, command, metadata).await?
    }
    plan::OpType::CompressedOperation(_) => {
        return Err(Status::unimplemented("compressed operation plan"));
    }
};
Three patterns to notice:
- The match is exhaustive. Add a variant to OpType and this code fails to compile until you handle it. That is the single most useful safety net in Rust enum-driven design.
- Struct destructuring inside variants. Command { command_type: command } pulls out one named field. No op.command_type.unwrap() chains.
- Unhandled cases return a typed error, not a panic. Status::unimplemented(...) is the gRPC contract for "I see this, I cannot do it yet."
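The three patterns can be reproduced without any gRPC machinery. The sketch below is a synchronous stand-in, not the server's code: OpType and Command here are simplified hypothetical types, PlanError is an assumed replacement for the gRPC Status, and ok_or plays the role of the required(...) helper.

```rust
/// Hypothetical error type standing in for a gRPC status.
#[derive(Debug, PartialEq)]
pub enum PlanError {
    Unimplemented(&'static str),
    MissingField(&'static str),
}

/// Simplified stand-in for the protobuf `Command` message.
#[derive(Debug)]
pub struct Command {
    pub command_type: Option<String>,
}

/// Simplified stand-in for the protobuf `OpType` oneof.
#[derive(Debug)]
pub enum OpType {
    Root(String),
    Command(Command),
    CompressedOperation(Vec<u8>),
}

/// Dispatches on the operation. There is no wildcard arm, so adding a
/// variant to `OpType` makes this function fail to compile until handled.
pub fn dispatch(op: OpType) -> Result<String, PlanError> {
    let out = match op {
        OpType::Root(relation) => format!("relation: {relation}"),
        // Destructure the struct inside the variant and rename the field.
        OpType::Command(Command { command_type: command }) => {
            // A missing field becomes a typed error, not a panic.
            let command = command.ok_or(PlanError::MissingField("command"))?;
            format!("command: {command}")
        }
        // Recognized but unsupported: return early with a typed error.
        OpType::CompressedOperation(_) => {
            return Err(PlanError::Unimplemented("compressed operation plan"));
        }
    };
    Ok(out)
}
```

Note that the exhaustiveness guarantee only holds while the match has no catch-all arm; a `_ => ...` arm would silently absorb any new variant.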
A small iterator chain on the plan
When you walk the plan, the natural Rust idiom is iterators:
pub fn quote_names_if_needed<T: AsRef<str>>(names: &[T]) -> String {
    names
        .iter()
        .map(|name| quote_name_if_needed(name.as_ref()))
        .collect::<Vec<_>>()
        .join(".")
}
T: AsRef<str> is the generic bound that lets this work on &[String], &[&str], and &[Arc<str>] without three separate impls. .iter().map(...).collect::<Vec<_>>().join(".") is the canonical "transform each, then join with a separator."
For agent-written Rust, the smell test on a five-line for-loop is "could this be one chained iterator pipeline?" Often the answer is yes.
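As an illustration of that smell test, here is the same operation written both ways. The body of quote_name_if_needed is a hypothetical stand-in (backtick-quote anything that is not a plain identifier); the source does not show the real quoting rules, so treat it purely as an assumption that makes the sketch self-contained.

```rust
/// Hypothetical stand-in for the quoting helper: wraps a name in backticks
/// unless it is a non-empty run of ASCII alphanumerics or underscores.
fn quote_name_if_needed(name: &str) -> String {
    let is_plain = !name.is_empty()
        && name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_');
    if is_plain {
        name.to_string()
    } else {
        // Escape embedded backticks by doubling them.
        format!("`{}`", name.replace('`', "``"))
    }
}

/// Loop version: manual accumulator plus separator bookkeeping.
pub fn quote_names_loop<T: AsRef<str>>(names: &[T]) -> String {
    let mut out = String::new();
    for (i, name) in names.iter().enumerate() {
        if i > 0 {
            out.push('.');
        }
        out.push_str(&quote_name_if_needed(name.as_ref()));
    }
    out
}

/// Iterator version: transform each element, then join with a separator.
pub fn quote_names_iter<T: AsRef<str>>(names: &[T]) -> String {
    names
        .iter()
        .map(|name| quote_name_if_needed(name.as_ref()))
        .collect::<Vec<_>>()
        .join(".")
}
```

Both functions return the same string for the same input. The iterator version has no index and no mutable accumulator, which removes the two places where the loop version could go wrong.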
What this teaches you
When you read agent-written enums, hold them up to this template. Most of the time, the gap is between "the agent wrote a struct with a kind: String field" and "the right shape is an enum with one variant per kind."
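The gap described above can be sketched side by side. LooseNode, TypedNode, and parse are hypothetical names invented for this illustration. The struct version allows combinations that make no sense (a "limit" with no limit, an unknown kind); the enum version makes them unrepresentable, so validation happens once, at the boundary.

```rust
/// Stringly-typed shape: every field is optional because the struct
/// has to cover every kind at once.
#[derive(Debug, Clone)]
pub struct LooseNode {
    pub kind: String,
    pub limit: Option<usize>,
    pub condition: Option<String>,
}

/// Enum shape: each variant carries exactly the data that kind needs.
#[derive(Debug, Clone, PartialEq)]
pub enum TypedNode {
    Limit { limit: usize },
    Filter { condition: String },
}

/// Converts the loose shape into the enum, rejecting invalid combinations.
pub fn parse(node: LooseNode) -> Result<TypedNode, String> {
    let LooseNode { kind, limit, condition } = node;
    match (kind.as_str(), limit, condition) {
        ("limit", Some(limit), None) => Ok(TypedNode::Limit { limit }),
        ("filter", None, Some(condition)) => Ok(TypedNode::Filter { condition }),
        // Unknown kinds and mismatched fields all land here.
        (other, _, _) => Err(format!("invalid node of kind `{other}`")),
    }
}
```

Code downstream of parse matches on TypedNode and never has to ask whether limit is present, because the type already answers that.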