
Streamline query internal representation #1226

Merged
zonotope merged 35 commits into main from refactor/streamline-query on May 10, 2026

@zonotope zonotope commented May 8, 2026

This branch reshapes fluree-db-query's intermediate representation to push more invariants into the type system, peel the catch-all QueryOptions apart so each modifier lives where it's actually used, reorganize a couple of misnamed modules, and fix a SELECT DISTINCT * correctness regression along the way. Net change: ~75 files, +1.9k / -1.9k. The runtime executor is unchanged in behavior; the API surface around Query, QueryOutput, ReasoningConfig, and SubqueryPattern shifts noticeably and is documented under "Migration notes" below.

Why

The old Query IR mixed three different kinds of state in one place and let the type system express several illegal combinations:

  • QueryOptions was a grab-bag carrying LIMIT, OFFSET, ORDER BY, GROUP BY, aggregates, HAVING, post-binds, DISTINCT, and the rewriter's reasoning configuration. The first eight are surface modifiers; the last is rewriter input. They had no business riding together, and the bag made it possible to construct nonsense (e.g. an ASK query carrying GROUP BY).
  • selectOne and selectDistinct were both modeled as boolean flags on the parser surface, which let an internal caller construct "selectOne distinct" — meaningless because Restriction::One changes the output shape (bare row vs. one-element array) while Distinct does not.
  • The aggregation phase was modeled as four parallel Vec-typed fields (group_by, aggregates, having, post_binds) that had to be checked for joint validity at every consumer. "GROUP BY non-empty implies aggregates may be empty (dedup mode)", "HAVING requires aggregates", "post-binds require aggregates" were prose contracts.
  • The expression-evaluation code lived under a module called expression that also contained a re-exported submodule eval — two names for the same thing.

The cumulative effect was that operator-tree construction and dependency analysis spent a lot of code on if !x.is_empty() && y.is_some() && ... gates, and a single conceptual change ("does this query group?") required updating four parallel data structures plus their cross-checks.
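To make that cost concrete, here is a minimal sketch of the kind of hand-written validity gate the parallel-field encoding requires. `LooseAggregationFields`, its `String` field types, and `is_well_formed` are hypothetical stand-ins rather than names from fluree-db-query; the rules encoded are only the three prose contracts quoted in the list above.

```rust
/// Hypothetical stand-in for the old parallel-field encoding.
/// Real field types are replaced by `String` to keep the sketch self-contained.
#[derive(Debug, Default, Clone)]
pub struct LooseAggregationFields {
    pub group_by: Vec<String>,
    pub aggregates: Vec<String>,
    pub having: Option<String>,
    pub post_binds: Vec<(String, String)>,
}

/// Hand-written joint-validity check (assumed shape). Every consumer of the
/// loose fields has to remember to apply the same rules.
pub fn is_well_formed(f: &LooseAggregationFields) -> bool {
    // "HAVING requires aggregates"
    if f.having.is_some() && f.aggregates.is_empty() {
        return false;
    }
    // "post-binds require aggregates"
    if !f.post_binds.is_empty() && f.aggregates.is_empty() {
        return false;
    }
    // GROUP BY with no aggregates is allowed (dedup mode), so it needs no rule.
    true
}
```

Nothing stops a caller from building the invalid value in the first place; the check only reports it after the fact, which is the gap the type-level changes are meant to close.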

What changed

Type-level invariants

  • Restriction enum (fluree_db_query::ir::Restriction) replaces the selectDistinct/selectOne boolean pair on QueryOutput::Select. Variants Distinct and One are mutually exclusive, expressed as Option<Restriction>. Restriction::One carries an explicit doc note distinguishing it from query.limit = Some(1) because they differ in output shape (bare row vs. one-element array).
  • Grouping enum (fluree_db_query::ir::Grouping) replaces the parallel-Vec encoding of the aggregation phase. Variants:
    • Implicit { aggregation, having } — single implicit group, always carries an Aggregation.
    • Explicit { group_by, aggregation, having } — partitioned by NonEmpty<VarId>; aggregation is Option<Aggregation> because GROUP BY without aggregates is a legal dedup-by-key form.
      Helpers: Grouping::assemble(group_by, aggregates, binds, having) -> Option<Self> selects the right variant from loose lowering pieces; Grouping::aggregates(), Grouping::binds(), Grouping::having(), Grouping::aggregation(), Grouping::group_by_vars() give variant-agnostic iteration. The previous "is HAVING legal here?" / "is binds.is_empty() ok?" checks at consumer sites collapse into pattern matches that the compiler checks for exhaustiveness.
  • Aggregation sub-struct bundles aggregates: NonEmpty<AggregateSpec> with binds: Vec<(VarId, Expression)> so that post-aggregation binds can only exist where an aggregation stage is present. Eliminates the "GROUP BY without aggregates plus post-binds" representable-but-meaningless combination.
  • NonEmpty<T> promoted to fluree-db-core (fluree_db_core::NonEmpty) with structural non-emptiness via head: T, tail: Vec<T>. Used by Grouping for group_by and aggregate lists. Public field representation is intentional — supports destructure-rebuild in revert.rs::RevertSource::Commits.
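The types in this list fit together roughly as sketched below. This is a simplified illustration, not the crate's code: `VarId`, `AggregateSpec`, and `Expression` are replaced by placeholder aliases, `NonEmpty::from_vec` is a hypothetical constructor name, and the body of `assemble` is an assumed variant-selection rule (in particular, it drops `binds` when there are no aggregates).

```rust
/// Simplified stand-in for `fluree_db_core::NonEmpty`: non-emptiness is
/// structural because `head` always exists.
#[derive(Debug, Clone, PartialEq)]
pub struct NonEmpty<T> {
    pub head: T,
    pub tail: Vec<T>,
}

impl<T> NonEmpty<T> {
    /// Hypothetical constructor: returns `None` for an empty input.
    pub fn from_vec(mut items: Vec<T>) -> Option<Self> {
        if items.is_empty() {
            return None;
        }
        let head = items.remove(0);
        Some(NonEmpty { head, tail: items })
    }
}

// Placeholder aliases; the real IR has dedicated types for each of these.
pub type VarId = u32;
pub type AggregateSpec = String;
pub type Expression = String;

/// Post-aggregation binds can only exist next to at least one aggregate.
#[derive(Debug, Clone, PartialEq)]
pub struct Aggregation {
    pub aggregates: NonEmpty<AggregateSpec>,
    pub binds: Vec<(VarId, Expression)>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Grouping {
    Implicit {
        aggregation: Aggregation,
        having: Option<Expression>,
    },
    Explicit {
        group_by: NonEmpty<VarId>,
        aggregation: Option<Aggregation>,
        having: Option<Expression>,
    },
}

impl Grouping {
    /// Assumed selection logic: explicit if there are group keys, implicit if
    /// there are only aggregates, and no grouping phase otherwise.
    pub fn assemble(
        group_by: Vec<VarId>,
        aggregates: Vec<AggregateSpec>,
        binds: Vec<(VarId, Expression)>,
        having: Option<Expression>,
    ) -> Option<Self> {
        let aggregation = NonEmpty::from_vec(aggregates)
            .map(|aggregates| Aggregation { aggregates, binds });
        match (NonEmpty::from_vec(group_by), aggregation) {
            (Some(group_by), aggregation) => Some(Grouping::Explicit {
                group_by,
                aggregation,
                having,
            }),
            (None, Some(aggregation)) => Some(Grouping::Implicit { aggregation, having }),
            (None, None) => None,
        }
    }
}
```

With this shape, "post-binds without aggregates" has no representation at all, so consumers match on the variant instead of re-checking emptiness.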

Query field reshuffle

QueryOptions's modifiers were peeled off one axis at a time and lifted onto Query directly:

pub struct Query {
    pub context: ParsedContext,
    pub orig_context: Option<serde_json::Value>,
    pub output: QueryOutput,
    pub patterns: Vec<Pattern>,
    pub grouping: Option<Grouping>,         // GROUP BY + aggregates + HAVING phase
    pub ordering: Vec<SortSpec>,             // ORDER BY (was QueryOptions.order_by)
    pub limit: Option<usize>,                // was QueryOptions.limit
    pub offset: Option<usize>,               // was QueryOptions.offset
    pub reasoning: ReasoningConfig,          // rewriter input (renamed from `options`)
    pub post_values: Option<Pattern>,
}

The wrapper that remained — only two fields, both rewriter-side configuration — was renamed to express its actual purpose:

pub struct ReasoningConfig {
    pub modes: ReasoningModes,                     // was `reasoning`
    pub schema_bundle: Option<Arc<SchemaBundleFlakes>>,
}

ExecutableQuery.options: QueryOptions became ExecutableQuery.reasoning: ReasoningConfig to match.

Module reorganization

  • fluree-db-query/src/expression* → fluree-db-query/src/eval/. The expression module was a kitchen sink containing both the AST (which is IR) and the evaluator (which is runtime); the IR types moved up to crate::ir::expression, and the runtime eval directory now contains just the dispatcher and operator implementations (arithmetic, cast, compare, datetime, string, numeric, logical, conditional, geo, hash, rdf, fluree, fulltext, etc.).
  • fluree-db-query/src/ir/options.rs → fluree-db-query/src/ir/reasoning.rs. Module name now matches the struct it exports (ReasoningConfig).
  • FilterValue enum removed in favor of the existing FlakeValue — the duplicate value-domain type is gone.
  • Projection type hierarchy was reshaped to make invalid projections unrepresentable (Tuple / Scalar / Wildcard carry their column shape directly).
  • graph_select was moved off the top-level into QueryOutput::Select, where it actually applies.

Operator-tree cleanup

  • implicit_single_aggregate(query) -> Option<&AggregateSpec> helper replaces the let Some(Grouping::Implicit { aggregation: Aggregation { aggregates, binds }, having: None }) = … else { return None }; if aggregates.len() != 1 || … gate that was duplicated across eleven fast-path detectors. Each detector is now ~5 lines shorter and shares the gate definition.
  • build_operator_tree(query, stats, planning) lost its options: &QueryOptions parameter — the function (and its inner) never read it. Same drop applied to all 33 detect_* private helpers and try_build_count_plan.
  • compute_variable_deps(query) is now single-arg for the same reason.
  • Grouping::aggregates() drops a Box<dyn Iterator> allocation per call in favor of impl Iterator, matching the shape of binds().
  • dependency.rs::compute_variable_deps reads post-aggregation binds via &[(VarId, Expression)] borrowed off Aggregation::binds instead of materializing a Vec<&(VarId, Expression)>.
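The shared gate from the first bullet can be sketched as follows. The types are reduced stand-ins (a plain `Vec` replaces `NonEmpty`, and `AggregateSpec` has made-up fields), and the "no post-aggregation binds" condition is an assumption of this sketch, since the original gate's remaining conditions are elided in the description.

```rust
// Simplified stand-ins; the real types live in fluree_db_query::ir.
pub type VarId = u32;
pub type Expression = String;

#[derive(Debug, PartialEq)]
pub struct AggregateSpec {
    pub function: String, // hypothetical field
    pub output: VarId,    // hypothetical field
}

pub struct Aggregation {
    pub aggregates: Vec<AggregateSpec>, // NonEmpty<AggregateSpec> in the real IR
    pub binds: Vec<(VarId, Expression)>,
}

pub enum Grouping {
    Implicit {
        aggregation: Aggregation,
        having: Option<Expression>,
    },
    Explicit {
        group_by: Vec<VarId>,
        aggregation: Option<Aggregation>,
        having: Option<Expression>,
    },
}

pub struct Query {
    pub grouping: Option<Grouping>,
}

/// Shared fast-path gate: one implicit group, exactly one aggregate, no HAVING.
/// Rejecting non-empty `binds` is an assumption made for this sketch.
pub fn implicit_single_aggregate(query: &Query) -> Option<&AggregateSpec> {
    let Some(Grouping::Implicit {
        aggregation: Aggregation { aggregates, binds },
        having: None,
    }) = &query.grouping
    else {
        return None;
    };
    if aggregates.len() != 1 || !binds.is_empty() {
        return None;
    }
    aggregates.first()
}
```

Each detector then starts from the returned `&AggregateSpec` and only adds its own, detector-specific checks.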

SPARQL lowering

  • lower_base_modifiers(modifiers) -> Result<BaseModifiers> replaces the prior (modifiers, &mut QueryOptions) mutation pattern. Returns BaseModifiers { limit, offset, ordering } for the caller to lift onto Query directly. All four call sites (SELECT, CONSTRUCT, DESCRIBE, ASK) destructure cleanly. The redundant lower_construct_modifiers wrapper went away.
  • ASK lowering now explicitly forces limit: Some(1), offset: None, ordering: Vec::new(), grouping: None and discards the parsed solution modifiers (per SPARQL semantics — ASK answers a single boolean and the modifiers are inert).
  • Grouping::assemble(...) is used uniformly across all three lowering paths (parse/lower.rs, sparql/lower/mod.rs, sparql/lower/select.rs) to construct the grouping phase from loose pieces, replacing three near-identical variant-selection ladders.
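A rough sketch of the return-a-value shape described in the first two bullets is shown below. `ParsedModifiers`, the `String` error type, the `String` sort keys, and `ask_modifiers` are hypothetical; only the `BaseModifiers { limit, offset, ordering }` field names and the forced ASK values come from the description above.

```rust
use std::convert::TryFrom;

/// Hypothetical parsed form of SPARQL solution modifiers.
#[derive(Debug, Default, Clone)]
pub struct ParsedModifiers {
    pub limit: Option<i64>,
    pub offset: Option<i64>,
    pub order_by: Vec<String>,
}

/// Returned to the caller instead of being written into a shared options struct.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct BaseModifiers {
    pub limit: Option<usize>,
    pub offset: Option<usize>,
    pub ordering: Vec<String>, // Vec<SortSpec> in the real IR
}

/// Pure lowering step: no `&mut` parameter, so each call site decides where
/// the three values go.
pub fn lower_base_modifiers(m: &ParsedModifiers) -> Result<BaseModifiers, String> {
    let to_usize = |name: &str, v: i64| {
        usize::try_from(v).map_err(|_| format!("{name} must be non-negative, got {v}"))
    };
    Ok(BaseModifiers {
        limit: m.limit.map(|v| to_usize("LIMIT", v)).transpose()?,
        offset: m.offset.map(|v| to_usize("OFFSET", v)).transpose()?,
        ordering: m.order_by.clone(),
    })
}

/// ASK discards whatever was parsed and uses fixed values (hypothetical helper).
pub fn ask_modifiers() -> BaseModifiers {
    BaseModifiers {
        limit: Some(1),
        offset: None,
        ordering: Vec::new(),
    }
}
```

A caller would destructure the result, for example `let BaseModifiers { limit, offset, ordering } = lower_base_modifiers(&parsed)?;`, and place the three values on the query it is building.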

SubqueryPattern field naming

SubqueryPattern.order_by → ordering and SubqueryPattern::with_order_by → with_ordering, mirroring the top-level Query.ordering rename. The remaining SubqueryPattern.distinct: bool is left as a bool intentionally: selectOne doesn't apply to subqueries (subqueries always produce a row stream), so the binary Restriction axis isn't meaningful there.

Bugfix: SELECT DISTINCT * regression

fluree-db-sparql/src/lower/mod.rs was silently downgrading SELECT DISTINCT * to SELECT * because the wildcard match arm dropped the distinct flag. Added a QueryOutput::wildcard_distinct() constructor and split the match so (SelectVariables::Star, true) produces a wildcard with Restriction::Distinct.
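The shape of the fix can be illustrated with a reduced model. Everything below is a simplified stand-in: `SelectVariables::Explicit`, the `Projection` enum layout, `QueryOutput::wildcard()`, and `lower_select` are assumptions for illustration; only `Restriction`, `SelectVariables::Star`, and `QueryOutput::wildcard_distinct()` are named in the description.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Restriction {
    Distinct,
    One,
}

/// Hypothetical, reduced stand-in for the parsed SELECT clause.
pub enum SelectVariables {
    Star,
    Explicit(Vec<String>),
}

#[derive(Debug, PartialEq)]
pub enum Projection {
    Wildcard,
    Tuple(Vec<String>),
}

#[derive(Debug, PartialEq)]
pub enum QueryOutput {
    Select {
        projection: Projection,
        restriction: Option<Restriction>,
    },
}

impl QueryOutput {
    pub fn wildcard() -> Self {
        QueryOutput::Select { projection: Projection::Wildcard, restriction: None }
    }

    pub fn wildcard_distinct() -> Self {
        QueryOutput::Select {
            projection: Projection::Wildcard,
            restriction: Some(Restriction::Distinct),
        }
    }
}

/// Matching on the (variables, distinct) pair means the wildcard arm can no
/// longer ignore the flag without the omission being visible in the pattern.
pub fn lower_select(vars: SelectVariables, distinct: bool) -> QueryOutput {
    match (vars, distinct) {
        (SelectVariables::Star, true) => QueryOutput::wildcard_distinct(),
        (SelectVariables::Star, false) => QueryOutput::wildcard(),
        (SelectVariables::Explicit(vars), distinct) => QueryOutput::Select {
            projection: Projection::Tuple(vars),
            restriction: distinct.then_some(Restriction::Distinct),
        },
    }
}
```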

Drive-by cleanup

  • Removed unused input_vars methods after confirming referenced_vars covers every consumer.
  • decode_lookup_error lifted from a method into a free function to break a circular dep through expression.
  • Several manual Debug/Clone/PartialEq impls replaced by #[derive(...)].
  • is_bound consolidated to a single function across two consumers.
  • Imports tightened: operator_tree.rs collapsed eight separate use crate::ir::* lines into one grouped import; dependency.rs test module dedup'd.
  • Module-level doc comments were updated wherever they still referenced the old QueryOptions/expression framing — including ir.rs, ir/query.rs, ir/reasoning.rs, and the Restriction::One doc.

Migration notes (for external consumers)

The public API of fluree-db-query shifts in several ways. None of these are subtle — the compiler will tell you about each one — but the rename targets are worth knowing in advance:

| Before | After |
| --- | --- |
| query.options.limit | query.limit |
| query.options.offset | query.offset |
| query.options.order_by | query.ordering |
| query.options.reasoning | query.reasoning.modes |
| query.options.schema_bundle | query.reasoning.schema_bundle |
| QueryOptions | ReasoningConfig |
| QueryOptions::with_limit(n) | set Query.limit = Some(n) directly |
| QueryOptions::with_offset(n) | set Query.offset = Some(n) directly |
| QueryOptions::with_order_by(specs) | set Query.ordering = specs directly |
| QueryOptions::with_reasoning(modes) | ReasoningConfig::with_modes(modes) |
| QueryOptions::has_modifiers() | removed (callers checked these inline) |
| ExecutableQuery.options | ExecutableQuery.reasoning |
| build_operator_tree(query, options, …) | build_operator_tree(query, …) (3 args) |
| compute_variable_deps(query, options) | compute_variable_deps(query) (1 arg) |
| SubqueryPattern.order_by | SubqueryPattern.ordering |
| SubqueryPattern::with_order_by(specs) | SubqueryPattern::with_ordering(specs) |
| crate::expression::* (eval runtime) | crate::eval::* |
| crate::ir::options::ReasoningConfig | crate::ir::reasoning::ReasoningConfig |
| FilterValue | FlakeValue (already existed) |
| QueryOutput::Select { distinct: bool, … } | QueryOutput::Select { restriction: Option<Restriction>, … } |

New IR types you'll see at use sites: Restriction, Grouping, Aggregation, AggregateFn, AggregateSpec, ReasoningConfig, ReasoningModes (renamed), BaseModifiers (SPARQL lower-internal), and fluree_db_core::NonEmpty<T>.

Query literals carry more fields than before (every modifier moved up). Test fixtures and with_patterns-style copies were updated accordingly.
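For callers that previously chained the QueryOptions builders, the replacement is plain field assignment on Query. The sketch below uses a cut-down `Query` with only the three moved modifier fields and `String` sort keys in place of `SortSpec`; the `paginate` helper is hypothetical and exists only to show the assignments.

```rust
/// Cut-down stand-in for the reshaped `Query`; the real struct has more fields.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Query {
    pub ordering: Vec<String>, // Vec<SortSpec> in the real IR
    pub limit: Option<usize>,
    pub offset: Option<usize>,
}

/// Hypothetical helper showing the "after" style: the modifiers are set on the
/// query itself rather than through with_limit / with_offset / with_order_by.
pub fn paginate(mut query: Query, page: usize, page_size: usize, order: Vec<String>) -> Query {
    query.limit = Some(page_size);
    query.offset = Some(page * page_size);
    query.ordering = order;
    query
}
```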

Known limitations / follow-ups

  • JSON-LD ASK doesn't scrub modifiers: SPARQL ASK lowering forces limit=1, ordering=[], grouping=None to honor the boolean-output semantics. The JSON-LD lowering path (parse/lower.rs::lower_query) doesn't yet apply the same scrub — a JSON-LD ASK with groupBy or orderBy will carry those fields through to execution where they execute and produce wasted work (the format layer ignores all but "any non-empty batch?"). The semantics are preserved; only efficiency is affected. Parser-side cleanup is planned as a follow-up.
  • SubqueryPattern still parallel to Query: SubqueryPattern carries its own select, patterns, limit, offset, distinct: bool, ordering, grouping fields rather than embedding a Query. Convergence into a shared "QueryBody" struct was discussed but not pursued — the two surfaces have different invariants (subqueries are always row streams, never ASK; their projection is a Vec<VarId> rather than the richer QueryOutput).
  • SPARQL subselect HAVING is not lowered: documented in-place (lower/select.rs::lower_subselect); pre-existing limitation, not introduced by this branch.
  • build_operator_tree's top-of-function aggregates_vec / group_by_vec / post_binds_vec extraction: still materializes loose Vecs for downstream operator constructors that consume owned data. Acceptable trade-off for one-shot setup; flagged for a future round when the operator constructors gain Arc<[T]>-shaped APIs.
  • Subquery per-parent-row clones in subquery.rs::execute_subquery_for_row: still rebuilds the operator stack each parent row. Real fix requires hoisting grouping/aggregation setup into SubqueryOperator construction with shared ownership. Out of scope for this branch.

zonotope added 30 commits May 5, 2026 16:09
This mirrors the actual query ast and makes the invalid state of a wildcard
select and a Some graph_select unrepresentable. Also updates nomenclature and
changes the name of the "graph crawl" concept to "hydration" to match other
databases and ORMs
@zonotope zonotope requested review from aaj3f and bplatz May 8, 2026 08:51
@zonotope zonotope changed the title from "Refactor/streamline query" to "Streamline query internal representation" May 8, 2026

@bplatz bplatz left a comment


This is great!

@zonotope zonotope merged commit 7566839 into main May 10, 2026
16 of 17 checks passed
@zonotope zonotope deleted the refactor/streamline-query branch May 10, 2026 03:53