feat(query-engine): SQL Top-K via CountMinSketchWithHeap#389

Merged: milindsrivastava1997 merged 4 commits into main from 388-feat-query-engine-sql-top-k-via-countminsketchwithheap on Jun 6, 2026

@akanksha-akkihal

Overview

Adds an end-to-end SQL execution path for approximate top-k queries by event count, aligned with the existing PromQL topk(k, ...) implementation. Queries of the form:

SELECT srcip, COUNT(pkt_len) AS transfer_events
FROM netflow_table
WHERE time BETWEEN DATEADD(s, -11, NOW()) AND DATEADD(s, -10, NOW())
GROUP BY srcip
ORDER BY transfer_events DESC
LIMIT 10

are now recognized as top-k queries during SQL planning, mapped to Statistic::Topk, and executed using CountMinSketchWithHeap. Instead of materializing all group-by keys and applying ORDER BY / LIMIT as a post-processing step, the engine retrieves candidate keys directly from the sketch heap, estimates counts from the CMS, sorts them, and returns the top-k results.
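
The heap path described above can be illustrated with a toy sketch. This is a minimal illustration, not the project's CountMinSketchWithHeap: the type ToySketch, its methods, and its hashing scheme are all hypothetical, and the candidate set here is unbounded, whereas a real heap-backed sketch would cap it at the configured heap size.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical toy count-min sketch that also remembers candidate keys.
/// Assumption: every key seen is kept as a candidate (no bounded heap).
struct ToySketch {
    depth: usize,
    width: usize,
    counters: Vec<Vec<f64>>,
    candidates: HashSet<String>,
}

impl ToySketch {
    fn new(depth: usize, width: usize) -> Self {
        ToySketch {
            depth,
            width,
            counters: vec![vec![0.0; width]; depth],
            candidates: HashSet::new(),
        }
    }

    /// Column for `key` in `row`; the row index is mixed into the hash
    /// so that each row behaves like a different hash function.
    fn bucket(&self, row: usize, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher);
        key.hash(&mut hasher);
        (hasher.finish() as usize) % self.width
    }

    /// Add `weight` to one counter per row and remember the key.
    fn update(&mut self, key: &str, weight: f64) {
        for row in 0..self.depth {
            let col = self.bucket(row, key);
            self.counters[row][col] += weight;
        }
        self.candidates.insert(key.to_string());
    }

    /// Count-min estimate: the minimum counter across rows.
    /// It never undercounts, but hash collisions can inflate it.
    fn estimate(&self, key: &str) -> f64 {
        (0..self.depth)
            .map(|row| self.counters[row][self.bucket(row, key)])
            .fold(f64::INFINITY, f64::min)
    }

    /// Enumerate candidates, estimate each, sort descending, keep k.
    fn top_k(&self, k: usize) -> Vec<(String, f64)> {
        let mut out: Vec<(String, f64)> = self
            .candidates
            .iter()
            .map(|key| (key.clone(), self.estimate(key)))
            .collect();
        // Ties are broken by key so the output order is deterministic.
        out.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        out.truncate(k);
        out
    }
}
```

For the example query, each netflow row would call update(srcip, 1.0), and LIMIT 10 would correspond to top_k(10). No per-group aggregation state is materialized beyond the sketch counters and the candidate set.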

Changes by Area

Precompute (accumulator_factory.rs)

  • Introduced CmsWithHeapAccumulatorUpdater for CountMinSketchWithHeap.
  • Replaces the previous routing to CmsAccumulatorUpdater, which treated the sketch as a plain CMS and could not enumerate heap candidates.
  • Added count_events support (default true), allowing COUNT-style workloads to update the sketch with a constant weight of 1.0 per observation rather than using the sample value as the weight.
  • Added cms_heap_params() to construct heap-backed sketches from planner-generated parameters.
  • Supports both planner/Arroyo naming (depth, width, heapsize) and legacy naming (row_num, col_num, heap_size), with defaults of 3 (depth), 1024 (width), and 32 (heap size).
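
The parameter handling above could look roughly like the following sketch. It is not the code in accumulator_factory.rs: the names HeapSketchParams, lookup, resolve_heap_params, and update_weight are hypothetical, the parameter map is assumed to hold plain integers, and preferring the planner/Arroyo name when both spellings are present is an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical resolved parameters for a heap-backed sketch.
#[derive(Debug, PartialEq)]
struct HeapSketchParams {
    depth: usize,
    width: usize,
    heap_size: usize,
}

/// Return the first name present in `params`, else `default`.
fn lookup(params: &HashMap<String, usize>, names: &[&str], default: usize) -> usize {
    names
        .iter()
        .find_map(|name| params.get(*name).copied())
        .unwrap_or(default)
}

/// Accept both naming schemes; planner/Arroyo names are tried first (assumption).
fn resolve_heap_params(params: &HashMap<String, usize>) -> HeapSketchParams {
    HeapSketchParams {
        depth: lookup(params, &["depth", "row_num"], 3),
        width: lookup(params, &["width", "col_num"], 1024),
        heap_size: lookup(params, &["heapsize", "heap_size"], 32),
    }
}

/// With count_events enabled every observation weighs 1.0;
/// otherwise the sample value itself is used as the weight.
fn update_weight(count_events: bool, sample_value: f64) -> f64 {
    if count_events {
        1.0
    } else {
        sample_value
    }
}
```

With count_events enabled, a packet with pkt_len = 1500 adds 1.0 to the sketch instead of 1500.0, which is what a COUNT(pkt_len) query needs.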

SQL Engine (sql.rs)

  • Added detect_sql_topk() to identify SQL top-k patterns and promote them to Statistic::Topk.

  • Detection requires:

    • COUNT(...)
    • GROUP BY
    • ORDER BY <aggregate_alias> DESC
    • LIMIT k
  • Extracts and stores k in query_kwargs for downstream execution.

  • Context generation uses Statistic::Topk and empty grouping_labels, matching the intended sketch layout where the GROUP BY column is represented as the sketch's aggregated dimension rather than a storage partition key.

  • Updated handle_query_sql() so top-k queries execute through the sketch heap path with:

    • enable_topk_limiting = true
    • enable_topk_formatting = false
  • SQL top-k queries bypass SqlPostProcessing, since ordering and truncation are now performed directly by the query pipeline.

  • Added unit tests covering:

    • Top-k query detection, including positive and negative cases
    • End-to-end execution against a seeded sketch
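
The detection rule listed above can be sketched against a simplified query representation. This is not detect_sql_topk() itself: ParsedSelect and its companion types are hypothetical stand-ins for whatever AST the SQL planner produces, and the function only encodes the four listed conditions.

```rust
/// Hypothetical, simplified view of a parsed SELECT statement.
struct AggregateCall {
    function: String,
    alias: Option<String>,
}

struct OrderBy {
    column: String,
    descending: bool,
}

struct ParsedSelect {
    aggregates: Vec<AggregateCall>,
    group_by: Vec<String>,
    order_by: Option<OrderBy>,
    limit: Option<usize>,
}

/// Returns Some(k) when the query has COUNT(...), GROUP BY,
/// ORDER BY <aggregate_alias> DESC, and LIMIT k; otherwise None.
fn detect_topk_shape(query: &ParsedSelect) -> Option<usize> {
    if query.group_by.is_empty() {
        return None;
    }
    let order = query.order_by.as_ref()?;
    if !order.descending {
        return None;
    }
    // The ORDER BY column must be the alias of a COUNT aggregate.
    let sorts_on_count = query.aggregates.iter().any(|agg| {
        agg.function.eq_ignore_ascii_case("count")
            && agg.alias.as_deref() == Some(order.column.as_str())
    });
    if !sorts_on_count {
        return None;
    }
    // A missing LIMIT means the query is not a top-k query.
    query.limit
}
```

For the query in the overview this returns Some(10), the k that would be stored in query_kwargs. Dropping the LIMIT, sorting ascending, or ordering by a non-aggregate column returns None, so the query stays on the regular SQL path.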

Query Pipeline (mod.rs)

  • Split the previous enable_topk flag into two independent controls:

    • enable_topk_limiting
    • enable_topk_formatting
  • enable_topk_limiting sorts top-k candidates and truncates results to k.

  • enable_topk_formatting preserves the existing PromQL behavior of prepending metric names to labels.

  • Heap candidate collection no longer truncates during enumeration.

  • The full candidate set is collected first, then sorted and truncated within the pipeline, ensuring consistent ordering behavior across execution paths.
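
A rough sketch of how the two flags might act on the collected candidates follows. The Row type and finish_topk function are hypothetical and do not reflect the actual execute_query_pipeline signature; the point is only that limiting and formatting are independent steps applied after the full candidate set has been gathered.

```rust
/// Hypothetical result row: label values plus an estimated count.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    labels: Vec<String>,
    value: f64,
}

/// Apply the two independent top-k controls to an untruncated candidate set.
fn finish_topk(
    mut rows: Vec<Row>,
    k: usize,
    metric_name: &str,
    enable_topk_limiting: bool,
    enable_topk_formatting: bool,
) -> Vec<Row> {
    if enable_topk_limiting {
        // Sort by estimated count, largest first, then keep k rows.
        rows.sort_by(|a, b| b.value.total_cmp(&a.value));
        rows.truncate(k);
    }
    if enable_topk_formatting {
        // PromQL-style output: prepend the metric name to each label set.
        for row in &mut rows {
            row.labels.insert(0, metric_name.to_string());
        }
    }
    rows
}
```

Because truncation happens only after sorting the complete candidate set, the result does not depend on the order in which the heap was enumerated.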

Call Sites

  • PromQL

    • enable_topk_limiting = true
    • enable_topk_formatting = true
    • Existing behavior unchanged.
  • SQL Top-K

    • enable_topk_limiting = true
    • enable_topk_formatting = false
  • SQL (non-top-k) / Elastic

    • enable_topk_limiting = false
    • enable_topk_formatting = false
    • Existing execution and post-processing paths unchanged.
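
The call-site matrix above can be written down as a small lookup. The QueryPath enum, TopkFlags struct, and flags_for function are hypothetical names used only to summarize the listed combinations; the real call sites pass the flags directly.

```rust
/// Hypothetical labels for the four call sites listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum QueryPath {
    PromQl,
    SqlTopK,
    SqlOther,
    Elastic,
}

/// The two controls that replace the former single enable_topk flag.
#[derive(Debug, PartialEq)]
struct TopkFlags {
    enable_topk_limiting: bool,
    enable_topk_formatting: bool,
}

/// Flag combination used by each call site.
fn flags_for(path: QueryPath) -> TopkFlags {
    match path {
        QueryPath::PromQl => TopkFlags {
            enable_topk_limiting: true,
            enable_topk_formatting: true,
        },
        QueryPath::SqlTopK => TopkFlags {
            enable_topk_limiting: true,
            enable_topk_formatting: false,
        },
        QueryPath::SqlOther | QueryPath::Elastic => TopkFlags {
            enable_topk_limiting: false,
            enable_topk_formatting: false,
        },
    }
}
```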

Tests and Supporting Changes

  • Updated engine_factories.rs to accommodate the new execute_query_pipeline signature.
  • Added SQL top-k detection tests covering supported and unsupported query shapes.
  • Added pipeline tests validating context construction and heap-based execution using pre-seeded sketches.
  • Preserved existing PromQL top-k behavior while validating SQL-specific execution semantics.

akanksha-akkihal and others added 4 commits June 6, 2026 01:34
… flags (#391)

* split enable_topk into limiting vs formatting flags

* add PromQL topk pipeline and Prometheus wire format tests

* formatting
@akanksha-akkihal akanksha-akkihal force-pushed the 388-feat-query-engine-sql-top-k-via-countminsketchwithheap branch from ccbec5e to 11bc4c9 Compare June 6, 2026 01:36
@milindsrivastava1997 milindsrivastava1997 merged commit 4668d8f into main Jun 6, 2026
8 checks passed
@milindsrivastava1997 milindsrivastava1997 deleted the 388-feat-query-engine-sql-top-k-via-countminsketchwithheap branch June 6, 2026 02:11
Development

Successfully merging this pull request may close these issues.

feat(query-engine): SQL Top-K via CountMinSketchWithHeap (COUNT + ORDER BY + LIMIT)

2 participants