Streamline query internal representation by zonotope · Pull Request #1226 · fluree/db

zonotope · 2026-05-08T08:51:43Z

This branch reshapes fluree-db-query's intermediate representation to push more invariants into the type system, peel the catch-all QueryOptions apart so each modifier lives where it's actually used, reorganize a couple of misnamed modules, and fix a SELECT DISTINCT * correctness regression along the way. Net change: ~75 files, +1.9k / -1.9k. The runtime executor is unchanged in behavior; the API surface around Query, QueryOutput, ReasoningConfig, and SubqueryPattern shifts noticeably and is documented under "Migration notes" below.

Why

The old Query IR mixed three different kinds of state in one place and let the type system express several illegal combinations:

QueryOptions was a grab-bag carrying LIMIT, OFFSET, ORDER BY, GROUP BY, aggregates, HAVING, post-binds, DISTINCT, and the rewriter's reasoning configuration. The first eight are surface modifiers; the last is rewriter input. They had no business riding together, and the bag made it possible to construct nonsense (e.g. an ASK query carrying GROUP BY).
selectOne and selectDistinct were both modeled as boolean flags on the parser surface, which let an internal caller construct "selectOne distinct" — meaningless because Restriction::One changes the output shape (bare row vs. one-element array) while Distinct does not.
The aggregation phase was modeled as four parallel Vec-typed fields (group_by, aggregates, having, post_binds) that had to be checked for joint validity at every consumer. "GROUP BY non-empty implies aggregates may be empty (dedup mode)", "HAVING requires aggregates", "post-binds require aggregates" were prose contracts.
The expression-evaluation code lived under a module called expression that also contained a re-exported submodule eval — two names for the same thing.

The cumulative effect was that operator-tree construction and dependency analysis spent much of their code on `if !x.is_empty() && y.is_some() && ...` gates, and a single conceptual change ("does this query group?") required updating four parallel data structures plus their cross-checks.

What changed

Type-level invariants

Restriction enum (fluree_db_query::ir::Restriction) replaces the selectDistinct/selectOne boolean pair on QueryOutput::Select. Variants Distinct and One are mutually exclusive, expressed as Option<Restriction>. Restriction::One carries an explicit doc note distinguishing it from query.limit = Some(1) because they differ in output shape (bare row vs. one-element array).
Grouping enum (fluree_db_query::ir::Grouping) replaces the parallel-Vec encoding of the aggregation phase. Variants:
- Implicit { aggregation, having } — single implicit group, always carries an Aggregation.
- Explicit { group_by, aggregation, having } — partitioned by NonEmpty<VarId>; aggregation is Option<Aggregation> because GROUP BY without aggregates is a legal dedup-by-key form.
  Helpers: Grouping::assemble(group_by, aggregates, binds, having) -> Option<Self> selects the right variant from loose lowering pieces; Grouping::aggregates(), Grouping::binds(), Grouping::having(), Grouping::aggregation(), Grouping::group_by_vars() give variant-agnostic iteration. The previous "is HAVING legal here?" / "is binds.is_empty() ok?" checks at consumer sites collapse into pattern matches that the compiler exhausts.
Aggregation sub-struct bundles aggregates: NonEmpty<AggregateSpec> with binds: Vec<(VarId, Expression)> so that post-aggregation binds can only exist where an aggregation stage is present. Eliminates the "GROUP BY without aggregates plus post-binds" representable-but-meaningless combination.
NonEmpty<T> promoted to fluree-db-core (fluree_db_core::NonEmpty) with structural non-emptiness via head: T, tail: Vec<T>. Used by Grouping for group_by and aggregate lists. Public field representation is intentional — supports destructure-rebuild in revert.rs::RevertSource::Commits.
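To make the shape of these types concrete, here is a minimal sketch of how Grouping, Aggregation, and NonEmpty could fit together. The variant and field names follow the description above; everything else is an assumption for illustration: `VarId` as `u32`, the placeholder `Expression` and `AggregateSpec` bodies, `having` as `Option<Expression>`, the `NonEmpty::from_vec` and `len` helpers, and the behavior of `assemble` when binds are supplied without aggregates (this sketch simply drops them). The real definitions in fluree_db_query::ir and fluree_db_core may differ.

```rust
// Illustrative stand-ins only: the real types live in fluree_db_query::ir and
// fluree_db_core and carry more detail than shown here.
pub type VarId = u32;

#[derive(Debug, Clone, PartialEq)]
pub struct Expression(pub String); // placeholder for the IR expression AST

#[derive(Debug, Clone, PartialEq)]
pub struct AggregateSpec {
    pub output: VarId, // placeholder field
}

/// Structurally non-empty list: `head` always exists.
#[derive(Debug, Clone, PartialEq)]
pub struct NonEmpty<T> {
    pub head: T,
    pub tail: Vec<T>,
}

impl<T> NonEmpty<T> {
    /// Hypothetical constructor: returns `None` for an empty input.
    pub fn from_vec(v: Vec<T>) -> Option<Self> {
        let mut items = v.into_iter();
        let head = items.next()?;
        Some(NonEmpty { head, tail: items.collect() })
    }

    /// Hypothetical helper: always at least 1.
    pub fn len(&self) -> usize {
        1 + self.tail.len()
    }
}

/// Post-aggregation binds can only exist next to at least one aggregate.
#[derive(Debug, Clone, PartialEq)]
pub struct Aggregation {
    pub aggregates: NonEmpty<AggregateSpec>,
    pub binds: Vec<(VarId, Expression)>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Grouping {
    /// Single implicit group: always aggregates.
    Implicit { aggregation: Aggregation, having: Option<Expression> },
    /// GROUP BY keys; aggregation is optional (dedup-by-key form).
    Explicit {
        group_by: NonEmpty<VarId>,
        aggregation: Option<Aggregation>,
        having: Option<Expression>,
    },
}

impl Grouping {
    /// Picks the variant from loose lowering pieces; `None` means "no grouping phase".
    /// Assumption: binds without aggregates are dropped here rather than rejected.
    pub fn assemble(
        group_by: Vec<VarId>,
        aggregates: Vec<AggregateSpec>,
        binds: Vec<(VarId, Expression)>,
        having: Option<Expression>,
    ) -> Option<Self> {
        let aggregation = NonEmpty::from_vec(aggregates)
            .map(|aggregates| Aggregation { aggregates, binds });
        match (NonEmpty::from_vec(group_by), aggregation) {
            (Some(group_by), aggregation) => {
                Some(Grouping::Explicit { group_by, aggregation, having })
            }
            (None, Some(aggregation)) => Some(Grouping::Implicit { aggregation, having }),
            (None, None) => None,
        }
    }

    /// Variant-agnostic access to the aggregation stage, if any.
    pub fn aggregation(&self) -> Option<&Aggregation> {
        match self {
            Grouping::Implicit { aggregation, .. } => Some(aggregation),
            Grouping::Explicit { aggregation, .. } => aggregation.as_ref(),
        }
    }
}
```

With this encoding, "post-binds without aggregates" has no value that represents it, so consumers match on the variant instead of re-checking emptiness.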

Query field reshuffle

QueryOptions's modifiers were peeled off one axis at a time and lifted onto Query directly:

```rust
pub struct Query {
    pub context: ParsedContext,
    pub orig_context: Option<serde_json::Value>,
    pub output: QueryOutput,
    pub patterns: Vec<Pattern>,
    pub grouping: Option<Grouping>,  // GROUP BY + aggregates + HAVING phase
    pub ordering: Vec<SortSpec>,     // ORDER BY (was QueryOptions.order_by)
    pub limit: Option<usize>,        // was QueryOptions.limit
    pub offset: Option<usize>,       // was QueryOptions.offset
    pub reasoning: ReasoningConfig,  // rewriter input (renamed from `options`)
    pub post_values: Option<Pattern>,
}
```

The wrapper that remained — only two fields, both rewriter-side configuration — was renamed to express its actual purpose:

```rust
pub struct ReasoningConfig {
    pub modes: ReasoningModes,  // was `reasoning`
    pub schema_bundle: Option<Arc<SchemaBundleFlakes>>,
}
```

ExecutableQuery.options: QueryOptions became ExecutableQuery.reasoning: ReasoningConfig to match.

Module reorganization

fluree-db-query/src/expression* → fluree-db-query/src/eval/. The expression module was a kitchen sink containing both the AST (which is IR) and the evaluator (which is runtime); the IR types moved up to crate::ir::expression, and the runtime eval directory now contains just the dispatcher and operator implementations (arithmetic, cast, compare, datetime, string, numeric, logical, conditional, geo, hash, rdf, fluree, fulltext, etc.).
fluree-db-query/src/ir/options.rs → fluree-db-query/src/ir/reasoning.rs. Module name now matches the struct it exports (ReasoningConfig).
FilterValue enum removed in favor of the existing FlakeValue — the duplicate value-domain type is gone.
Projection type hierarchy was reshaped to make invalid projections unrepresentable (Tuple / Scalar / Wildcard carry their column shape directly).
graph_select was moved off the top-level into QueryOutput::Select, where it actually applies.

Operator-tree cleanup

implicit_single_aggregate(query) -> Option<&AggregateSpec> helper replaces the let Some(Grouping::Implicit { aggregation: Aggregation { aggregates, binds }, having: None }) = … else { return None }; if aggregates.len() != 1 || … gate that was duplicated across eleven fast-path detectors. Each detector is now ~5 lines shorter and shares the gate definition.
build_operator_tree(query, stats, planning) lost its options: &QueryOptions parameter — the function (and its inner) never read it. Same drop applied to all 33 detect_* private helpers and try_build_count_plan.
compute_variable_deps(query) is now single-arg for the same reason.
Grouping::aggregates() drops a Box<dyn Iterator> allocation per call in favor of impl Iterator, matching the shape of binds().
dependency.rs::compute_variable_deps reads post-aggregation binds via &[(VarId, Expression)] borrowed off Aggregation::binds instead of materializing a Vec<&(VarId, Expression)>.
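The shared gate behind implicit_single_aggregate can be sketched roughly as follows. The types are trimmed stand-ins, and the second half of the condition is elided in the description above, so the empty-binds check used here is only an assumed placeholder for it, not the actual rule.

```rust
// Trimmed stand-ins for the IR types; the real definitions are richer.
pub type VarId = u32;
pub struct Expression; // placeholder

pub struct AggregateSpec {
    pub output: VarId, // placeholder field
}

pub struct NonEmpty<T> {
    pub head: T,
    pub tail: Vec<T>,
}

pub struct Aggregation {
    pub aggregates: NonEmpty<AggregateSpec>,
    pub binds: Vec<(VarId, Expression)>,
}

pub enum Grouping {
    Implicit { aggregation: Aggregation, having: Option<Expression> },
    Explicit {
        group_by: NonEmpty<VarId>,
        aggregation: Option<Aggregation>,
        having: Option<Expression>,
    },
}

pub struct Query {
    pub grouping: Option<Grouping>,
}

/// Returns the lone aggregate of an implicit, HAVING-free grouping.
/// Assumption: "no post-aggregation binds" stands in for the elided
/// remainder of the original condition.
pub fn implicit_single_aggregate(query: &Query) -> Option<&AggregateSpec> {
    let Some(Grouping::Implicit {
        aggregation: Aggregation { aggregates, binds },
        having: None,
    }) = &query.grouping
    else {
        return None;
    };
    // NonEmpty guarantees `head`; exactly one aggregate means an empty tail.
    if !aggregates.tail.is_empty() || !binds.is_empty() {
        return None;
    }
    Some(&aggregates.head)
}
```

A fast-path detector can then start with `let spec = implicit_single_aggregate(query)?;` and go straight to its own shape checks.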

SPARQL lowering

lower_base_modifiers(modifiers) -> Result<BaseModifiers> replaces the prior (modifiers, &mut QueryOptions) mutation pattern. Returns BaseModifiers { limit, offset, ordering } for the caller to lift onto Query directly. All four call sites (SELECT, CONSTRUCT, DESCRIBE, ASK) destructure cleanly. The redundant lower_construct_modifiers wrapper went away.
ASK lowering now explicitly forces limit: Some(1), offset: None, ordering: Vec::new(), grouping: None and discards the parsed solution modifiers (per SPARQL semantics — ASK answers a single boolean and the modifiers are inert).
Grouping::assemble(...) is used uniformly across all three lowering paths (parse/lower.rs, sparql/lower/mod.rs, sparql/lower/select.rs) to construct the grouping phase from loose pieces, replacing three near-identical variant-selection ladders.
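As a rough illustration of the return-a-value shape described above, the sketch below lowers a hypothetical parsed-modifier struct into BaseModifiers and shows an ASK path that overrides the result. `ParsedModifiers`, its `i64` fields, the `String` error type, `lower_ask_modifiers`, and the choice to validate before discarding are all assumptions made for the example; only the BaseModifiers field names and the forced ASK values come from the description.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, PartialEq)]
pub struct SortSpec {
    pub var: u32, // placeholder for a VarId
    pub descending: bool,
}

/// Hypothetical parser output; the real AST type is not shown in this PR text.
#[derive(Debug, Default)]
pub struct ParsedModifiers {
    pub limit: Option<i64>,
    pub offset: Option<i64>,
    pub order_by: Vec<SortSpec>,
}

#[derive(Debug, PartialEq)]
pub struct BaseModifiers {
    pub limit: Option<usize>,
    pub offset: Option<usize>,
    pub ordering: Vec<SortSpec>,
}

// Converts an optional signed count, rejecting negative values.
fn to_count(name: &str, value: Option<i64>) -> Result<Option<usize>, String> {
    value
        .map(|n| {
            usize::try_from(n).map_err(|_| format!("{} must be non-negative, got {}", name, n))
        })
        .transpose()
}

/// Returns the lowered pieces instead of mutating a shared options bag.
pub fn lower_base_modifiers(m: &ParsedModifiers) -> Result<BaseModifiers, String> {
    Ok(BaseModifiers {
        limit: to_count("LIMIT", m.limit)?,
        offset: to_count("OFFSET", m.offset)?,
        ordering: m.order_by.clone(),
    })
}

/// ASK answers a single boolean, so the parsed modifiers are discarded.
/// Assumption: they are still validated first.
pub fn lower_ask_modifiers(m: &ParsedModifiers) -> Result<BaseModifiers, String> {
    let _discarded = lower_base_modifiers(m)?;
    Ok(BaseModifiers { limit: Some(1), offset: None, ordering: Vec::new() })
}
```

The caller destructures the result (`let BaseModifiers { limit, offset, ordering } = ...`) and places each piece on the Query it is building.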

`SubqueryPattern` field naming

SubqueryPattern.order_by → ordering and SubqueryPattern::with_order_by → with_ordering, mirroring the top-level Query.ordering rename. The remaining SubqueryPattern.distinct: bool is left as a bool intentionally — selectOne doesn't apply to subqueries (subqueries always produce a row stream), so the binary Restriction axis isn't meaningful there.

Bugfix: `SELECT DISTINCT *` regression

fluree-db-sparql/src/lower/mod.rs was silently downgrading SELECT DISTINCT * to SELECT * because the wildcard match arm dropped the distinct flag. Added a QueryOutput::wildcard_distinct() constructor and split the match so (SelectVariables::Star, true) produces a wildcard with Restriction::Distinct.
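The split match can be pictured with the simplified sketch below. `Restriction`, `QueryOutput::wildcard_distinct()`, and the `(SelectVariables::Star, true)` arm are named in the description; the `Projection` stand-in, the `wildcard()` constructor, the `Explicit` variant, and `lower_select_output` are hypothetical names used only to make the example self-contained.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Restriction {
    Distinct,
    One,
}

// Simplified stand-in for the real projection hierarchy.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Projection {
    Wildcard,
    Tuple(Vec<u32>),
}

// Only the Select variant is modeled here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum QueryOutput {
    Select { projection: Projection, restriction: Option<Restriction> },
}

impl QueryOutput {
    /// Hypothetical plain-wildcard constructor.
    pub fn wildcard() -> Self {
        QueryOutput::Select { projection: Projection::Wildcard, restriction: None }
    }

    /// Wildcard that keeps DISTINCT.
    pub fn wildcard_distinct() -> Self {
        QueryOutput::Select {
            projection: Projection::Wildcard,
            restriction: Some(Restriction::Distinct),
        }
    }
}

/// Parsed SELECT clause; `Explicit` is an assumed variant name.
pub enum SelectVariables {
    Star,
    Explicit(Vec<u32>),
}

pub fn lower_select_output(vars: SelectVariables, distinct: bool) -> QueryOutput {
    match (vars, distinct) {
        // Matching on the pair keeps the flag from being dropped for `*`.
        (SelectVariables::Star, true) => QueryOutput::wildcard_distinct(),
        (SelectVariables::Star, false) => QueryOutput::wildcard(),
        (SelectVariables::Explicit(vars), distinct) => QueryOutput::Select {
            projection: Projection::Tuple(vars),
            restriction: distinct.then_some(Restriction::Distinct),
        },
    }
}
```

Because the tuple is matched exhaustively, adding another SelectVariables variant forces the author to decide how it interacts with DISTINCT.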

Drive-by cleanup

Removed unused input_vars methods after confirming referenced_vars covers every consumer.
decode_lookup_error lifted from a method into a free function to break a circular dep through expression.
Several manual Debug/Clone/PartialEq impls replaced by #[derive(...)].
is_bound consolidated to a single function across two consumers.
Imports tightened: operator_tree.rs collapsed eight separate use crate::ir::* lines into one grouped import; dependency.rs test module dedup'd.
Module-level doc comments were updated wherever they still referenced the old QueryOptions/expression framing — including ir.rs, ir/query.rs, ir/reasoning.rs, and the Restriction::One doc.

Migration notes (for external consumers)

The public API of fluree-db-query shifts in several ways. None of these are subtle — the compiler will tell you about each one — but the rename targets are worth knowing in advance:

| Before | After |
| --- | --- |
| `query.options.limit` | `query.limit` |
| `query.options.offset` | `query.offset` |
| `query.options.order_by` | `query.ordering` |
| `query.options.reasoning` | `query.reasoning.modes` |
| `query.options.schema_bundle` | `query.reasoning.schema_bundle` |
| `QueryOptions` | `ReasoningConfig` |
| `QueryOptions::with_limit(n)` | set `Query.limit = Some(n)` directly |
| `QueryOptions::with_offset(n)` | set `Query.offset = Some(n)` directly |
| `QueryOptions::with_order_by(specs)` | set `Query.ordering = specs` directly |
| `QueryOptions::with_reasoning(modes)` | `ReasoningConfig::with_modes(modes)` |
| `QueryOptions::has_modifiers()` | removed (callers checked these inline) |
| `ExecutableQuery.options` | `ExecutableQuery.reasoning` |
| `build_operator_tree(query, options, …)` | `build_operator_tree(query, …)` (3 args) |
| `compute_variable_deps(query, options)` | `compute_variable_deps(query)` (1 arg) |
| `SubqueryPattern.order_by` | `SubqueryPattern.ordering` |
| `SubqueryPattern::with_order_by(specs)` | `SubqueryPattern::with_ordering(specs)` |
| `crate::expression::*` (eval runtime) | `crate::eval::*` |
| `crate::ir::options::ReasoningConfig` | `crate::ir::reasoning::ReasoningConfig` |
| `FilterValue` | `FlakeValue` (already existed) |
| `QueryOutput::Select { distinct: bool, … }` | `QueryOutput::Select { restriction: Option<Restriction>, … }` |

New IR types you'll see at use sites: Restriction, Grouping, Aggregation, AggregateFn, AggregateSpec, ReasoningConfig, ReasoningModes (renamed), BaseModifiers (SPARQL lower-internal), and fluree_db_core::NonEmpty<T>.

Query literals carry more fields than before (every modifier moved up). Test fixtures and with_patterns-style copies were updated accordingly.
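For a sense of what the renames look like at a call site, here is a small sketch using cut-down stand-ins for Query and ReasoningConfig. The field names match the tables above; the placeholder `ReasoningModes` flag, the empty `SchemaBundleFlakes`, the `paginate` helper, and the assumption that `with_modes` is a constructor (rather than a builder method) are illustrative only.

```rust
use std::sync::Arc;

#[derive(Debug, Clone, Default, PartialEq)]
pub struct ReasoningModes {
    pub rdfs: bool, // placeholder flag
}

#[derive(Debug, Default)]
pub struct SchemaBundleFlakes; // placeholder

#[derive(Debug, Clone, Default)]
pub struct ReasoningConfig {
    pub modes: ReasoningModes,
    pub schema_bundle: Option<Arc<SchemaBundleFlakes>>,
}

impl ReasoningConfig {
    // Assumed to be a constructor; the real signature may differ.
    pub fn with_modes(modes: ReasoningModes) -> Self {
        ReasoningConfig { modes, schema_bundle: None }
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct SortSpec {
    pub var: u32,
    pub descending: bool,
}

// Cut-down Query: only the fields that moved are modeled.
#[derive(Debug, Clone, Default)]
pub struct Query {
    pub ordering: Vec<SortSpec>,
    pub limit: Option<usize>,
    pub offset: Option<usize>,
    pub reasoning: ReasoningConfig,
}

/// Hypothetical caller: pagination now writes the top-level fields.
pub fn paginate(mut query: Query, page: usize, page_size: usize) -> Query {
    // Before: query.options.limit / query.options.offset
    query.limit = Some(page_size);
    query.offset = Some(page * page_size);
    query
}
```

Reads change the same way: `query.options.reasoning` becomes `query.reasoning.modes`, with no intermediate options value to thread through.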

Known limitations / follow-ups

JSON-LD ASK doesn't scrub modifiers: SPARQL ASK lowering forces limit=1, ordering=[], grouping=None to honor the boolean-output semantics. The JSON-LD lowering path (parse/lower.rs::lower_query) doesn't yet apply the same scrub — a JSON-LD ASK with groupBy or orderBy will carry those fields through to execution where they execute and produce wasted work (the format layer ignores all but "any non-empty batch?"). The semantics are preserved; only efficiency is affected. Parser-side cleanup is planned as a follow-up.
SubqueryPattern still parallel to Query: SubqueryPattern carries its own select, patterns, limit, offset, distinct: bool, ordering, grouping fields rather than embedding a Query. Convergence into a shared "QueryBody" struct was discussed but not pursued — the two surfaces have different invariants (subqueries are always row streams, never ASK; their projection is a Vec<VarId> rather than the richer QueryOutput).
SPARQL subselect HAVING is not lowered: documented in-place (lower/select.rs::lower_subselect); pre-existing limitation, not introduced by this branch.
build_operator_tree's top-of-function aggregates_vec / group_by_vec / post_binds_vec extraction: still materializes loose Vecs for downstream operator constructors that consume owned data. Acceptable trade-off for one-shot setup; flagged for a future round when the operator constructors gain Arc<[T]>-shaped APIs.
Subquery per-parent-row clones in subquery.rs::execute_subquery_for_row: still rebuilds the operator stack each parent row. Real fix requires hoisting grouping/aggregation setup into SubqueryOperator construction with shared ownership. Out of scope for this branch.

This mirrors the actual query AST and makes the invalid state of a wildcard select combined with a Some graph_select unrepresentable. It also updates nomenclature, renaming the "graph crawl" concept to "hydration" to match other databases and ORMs.


bplatz

This is great!


zonotope added 30 commits May 5, 2026 16:09

update projection type hierarchy to eliminate invalid states

094efd4

remove FilterValue enum in favor of FlakeValue

a3f2d91

rename top level expression module to eval

75f723e

make decode_lookup_error a free function

8ff3620

combine two eval modules

e2e507c

import Hash(Map|Set)

c3e9161

use existing is_bound function

06e09bc

use descriptive comments

17c95fe

eliminate duplicate function

67492c3

derive instead of hand rolling trait implementations

36ab7d2

remove dead code

af72dc7

add binary expression evaluation helper

e97875e

remove unused input_vars methods

63b738c

add Restriction type to describe query output constraints

810faaf

Merge remote-tracking branch 'origin/main' into refactor/streamline-query

d312a4f

move nonempty type to publicly accessible location

b9bc032

add first element accessor to NonEmpty

6abd949

add Grouping field to encode how results are grouped and aggregated

ef7762a

add an Aggregation type to combine aggregate and post_binds

416c94c

ir related aggregate types from aggregate to grouping ir module

a5951e1

add constructor for grouping

e88871c

cleanup

ace6a1c

update module doc

1245898

move order_by option to ordering field on top level query

f18ceef

move limit and offset to top level query; cleanup

38ca763

rename query options to reasoning

753cfd0

import cleanup

757408d

rename order_by to ordering in subquery

0674a5c

update comments

c58d1c5

zonotope added 2 commits May 8, 2026 00:15

add group_by_vars helper

884dac0

update comments

ccf5bbd

zonotope requested review from aaj3f and bplatz May 8, 2026 08:51

zonotope changed the title ~~Refactor/streamline query~~ Streamline query internal representation May 8, 2026

zonotope added 2 commits May 8, 2026 10:29

fmt

177833b

Merge remote-tracking branch 'origin/main' into refactor/streamline-query

14c821d

bplatz approved these changes May 9, 2026


Merge remote-tracking branch 'origin/main' into refactor/streamline-query

411f2a7

zonotope merged commit 7566839 into main May 10, 2026
16 of 17 checks passed

zonotope deleted the refactor/streamline-query branch May 10, 2026 03:53

zonotope merged 35 commits into main from refactor/streamline-query
