Skip to content

perf: DirectIndexAgg for low-cardinality integer GROUP BY#157

Open
poyrazK wants to merge 8 commits into
mainfrom
perf/group-by-optimization
Open

perf: DirectIndexAgg for low-cardinality integer GROUP BY#157
poyrazK wants to merge 8 commits into
mainfrom
perf/group-by-optimization

Conversation

@poyrazK
Copy link
Copy Markdown
Owner

@poyrazK poyrazK commented May 21, 2026

Summary

  • Add DirectIndexAgg class: O(1) direct array indexing for integer GROUP BY keys (slot = key - min_key) instead of string-based hash lookup
  • Add steal() methods to ColumnVector/NumericVector/StringVector for O(1) batch transfers via vector swap
  • Fix vectorized path activation for GROUP BY with unanalyzed tables by hoisting estimated_rows declaration to enclosing scope
  • Enable vectorized path when GROUP BY exists but ANALYZE hasn't been run (num_rows==0), allowing DirectIndexAgg optimization to activate

Performance

  • Q1 GROUP BY: ~2.6M/s → ~28-31M/s (10x improvement)
  • Q6 (no GROUP BY): unchanged at ~5.9G/s

Test plan

  • make -j4 duckdb_comparison_bench
  • ./build/duckdb_comparison_bench --benchmark_filter=BM_CloudSQL_Q[16]/100000 --benchmark_repetitions=3

Summary by CodeRabbit

  • New Features

    • Parallel table scan execution support
    • Optimized GROUP BY operations for integer grouping columns
    • Memory-mapped columns for zero-copy reads
  • Performance

    • Enhanced query execution planning with improved vectorization decisions

Review Change Stack

poyrazK added 4 commits May 18, 2026 17:12
Key changes:
- query_executor.cpp: Remove parallel_ requirement from use_vectorized gate
  (Fix 3: enable vectorized path for large scans without parallel mode)
- query_executor.cpp: Resolve input_col_idx from aggregate function arguments
  (Fix 5: enable SUM/AVG to work on specific columns, not just COUNT(*))
- vectorized_operator.hpp: Add AVG support in update_accumulators and produce_output_batch
- operator.cpp: Replace std::map with std::unordered_map + binary key encoding
  (Fix 1+2: O(log n) → O(1) hash, string concat → length-prefixed binary)

Benchmark baseline (Q1 @ 100k rows):
- Before fixes: ~152k rows/sec
- After fixes: ~166k rows/sec
- DuckDB: ~197M rows/sec

The vectorized path is now enabled but VectorizedGroupByOperator needs
further optimization (batch-oriented key construction) to close gap.
This commits the full parallel scan implementation:
- VectorizedSeqScanOperator now accepts optional ThreadPool parameter
- When enabled (>50k rows, >1 thread), splits table into range chunks
  and reads each range in parallel via thread pool
- Added steal() to ColumnVector/NumericVector/StringVector for O(1)
  batch transfer via vector swap (avoids per-element copy)
- Parallel results are collected and returned via steal() to out_batch

The slow benchmark (2.6M/s) vs fast (598M/s for Q6/10k) suggests the
GROUP BY aggregation is the bottleneck, not I/O. The parallel scan
overhead (file opens/closes per task) likely exceeds I/O benefit at
100k row scale.
Implements direct array indexing for integer GROUP BY keys instead
of string-based hashing. When GROUP BY is on a single integer column,
uses slot = key - min_key for O(1) lookup without hash computation.

Also adds steal() method to ColumnVector/NumericVector/StringVector
for O(1) batch transfers via vector swap.

Note: Benchmark still shows ~2.6M/s suggesting the optimization path
may not be triggered, or there's another bottleneck in the execution
path that needs investigation.
The estimated_rows variable was being redeclared inside a nested block,
causing the GROUP BY heuristic check to always see stale (0) values.
Also adds the condition to enable vectorized path when GROUP BY exists
but num_rows==0 (ANALYZE not run), allowing DirectIndexAgg optimization
to activate.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 21, 2026

Warning

Rate limit exceeded

@github-actions[bot] has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 58 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1bf6c4f5-97c8-4939-97ef-c27442d6fb65

📥 Commits

Reviewing files that changed from the base of the PR and between 864bbca and adff7ce.

📒 Files selected for processing (1)
  • include/executor/vectorized_operator.hpp
📝 Walkthrough

Walkthrough

This PR accelerates vectorized query execution by adding parallel scan support with column stealing, introducing memory-mapped storage for zero-copy reads, and implementing a direct-index aggregation fast path for low-cardinality integer GROUP BY operations. Plan selection now uses estimated row cardinality to drive vectorization decisions and provisions a ThreadPool for parallel scans.

Changes

Vectorized Execution and Parallel Processing

Layer / File(s) Summary
ColumnVector steal() interface and implementations
include/executor/types.hpp
ColumnVector declares a new pure-virtual steal(ColumnVector&&) method. NumericVector<T> and StringVector implement steal via dynamic-cast type-checking, then swap internal buffers (data, null bitmap, size); they throw std::runtime_error on type mismatch.
Memory-mapped column storage abstraction
include/storage/memory_mapped_column.hpp, src/storage/memory_mapped_column.cpp
New MemoryMappedColumn class enables zero-copy reads by mmap-backing data and null regions. Provides map(data_path, null_path, element_size, row_count) to open/mmap files and unmap() for RAII cleanup; inline helpers (data_at, null_at) return per-row pointers via pointer arithmetic.
Parallel vectorized scan with column stealing
include/executor/vectorized_operator.hpp
VectorizedSeqScanOperator accepts an optional ThreadPool in its constructor and detects parallel eligibility (pool present, multiple threads, table sufficiently large). next_batch_parallel() partitions remaining rows by thread, submits concurrent ColumnarTable::read_batch tasks, waits for completion, and assembles output by stealing columns from per-thread result batches.
Direct-index aggregation fast path for GROUP BY
include/executor/vectorized_operator.hpp
DirectIndexAgg helper provides slot-based accumulation for integer key ranges. VectorizedGroupByOperator detects single-column integer GROUP BY eligibility during construction, routes input via process_input_batch_direct() to directly update per-slot count/sum accumulators, and emits results via produce_output_batch_direct() with COUNT/SUM/AVG in fixed-size batches.
Hash-based aggregation AVG support
include/executor/vectorized_operator.hpp
Hash-based group-by path now treats AVG similarly to SUM by incrementing state.counts for non-null input values. Hash output generation adds an AggregateType::Avg case that computes averages from stored sums and counts, emitting NULL when count is zero.
Group-by key encoding optimization
src/executor/operator.cpp
AggregateOperator switches the grouping container from ordered std::map to std::unordered_map with 1024-entry pre-allocation. Group-by key generation replaces string concatenation with binary-style length-prefixed encoding using non-NULL markers and explicit NULL handling.
Vectorized execution planning and wiring
src/executor/query_executor.cpp
Cost-based vectorization now tracks estimated_rows across scan/filter selectivity estimation, removes the parallel_ gate, and forces vectorized GROUP BY when table stats report zero rows (pre-ANALYZE). Vectorized plan construction creates a shared ThreadPool sized by std::thread::hardware_concurrency() and passes it to VectorizedSeqScanOperator. Aggregate metadata derives input_col_idx by resolving the first aggregate argument column name in the operator output schema; defaults to -1 for non-column arguments like COUNT(*).
Benchmark storage manager wiring
benchmarks/duckdb_comparison_bench.cpp
CloudSQLContext initialization calls executor->set_storage_manager(storage.get()) to enable the executor to use the provided storage manager.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • poyrazK/cloudSQL#59: Both PRs modify VectorizedGroupByOperator aggregation logic; this PR adds direct-index fast path and AVG handling while the referenced PR introduces the operator with hash-based COUNT/SUM and MIN/MAX support.
  • poyrazK/cloudSQL#10: This PR builds on the referenced PR's Phase 8 vectorized/columnar execution and analytics test foundations by extending vectorized operators with parallel scan and direct-index aggregation capabilities.

Poem

🐰 A rabbit hops through columns swift,
Stealing buffers left and right,
Direct indexing does the lift,
Vectorized queries burn so bright!
With mmap'd regions, zero-copy flows—
Query speed: that's how it grows! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.89% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly describes the main performance optimization: DirectIndexAgg for low-cardinality integer GROUP BY, which aligns with the primary changes in vectorized_operator.hpp.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch perf/group-by-optimization

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
src/executor/operator.cpp (1)

577-605: ⚡ Quick win

Hoist key outside the loop and prefer try_emplace to avoid per-row allocations.

std::string key; is default-constructed inside the loop and then reserve(64) is called every iteration, which forces a heap allocation per input row (SSO capacity is well under 64). Hoisting the buffer above the while and using clear() to reset preserves capacity across rows and removes that per-row malloc/free pair from this hot path. While here, find followed by emplace does two hash probes on the miss path; try_emplace collapses it to one and avoids constructing the key string twice.

♻️ Proposed refactor
     Tuple tuple;
     auto child_schema = child_->output_schema();
+    std::string key;
+    key.reserve(64);  // reused across rows; capacity is preserved by clear()
     while (child_->next(tuple)) {
-        std::string key;
-        key.reserve(64);  // Pre-reserve to avoid repeated allocations
+        key.clear();
         std::vector<common::Value> gb_vals;
@@
-        auto it = groups_map.find(key);
-        if (it == groups_map.end()) {
-            it = groups_map.emplace(key, GroupState(aggregates_.size())).first;
-            it->second.group_values = std::move(gb_vals);
+        auto [it, inserted] = groups_map.try_emplace(key, GroupState(aggregates_.size()));
+        if (inserted) {
+            it->second.group_values = std::move(gb_vals);
         }

Also note the comment on Line 585 ("no string allocation") is slightly misleading — val.to_string() still allocates for non-null values; only the per-iteration key buffer can be made allocation-free.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/operator.cpp` around lines 577 - 605, Hoist the std::string key
buffer out of the per-row loop and reserve its capacity once (e.g.,
key.reserve(64) before the loop) then call key.clear() each iteration to
preserve capacity and avoid per-row heap allocations; keep gb_vals as a local
per-row vector but reuse it similarly if needed. Replace the find+emplace
pattern on groups_map with a single try_emplace call (using key as the
lookup/insert key and constructing GroupState(aggregates_.size()) only on miss)
to eliminate the double hash probe and duplicate key construction; after
insertion/update, assign it->second.group_values = std::move(gb_vals) as before.
Apply these changes around the existing symbols key, group_by_, gb_vals,
groups_map, try_emplace, GroupState, and aggregates_.
src/executor/query_executor.cpp (1)

1515-1517: ⚡ Quick win

Thread pool recreation on every vectorized query.

A new ThreadPool is created for each vectorized query execution (line 1515). Thread pool creation incurs overhead from spawning worker threads. For workloads with many short queries, this repeated construction/destruction degrades throughput. Consider caching a shared ThreadPool at the QueryExecutor instance level for reuse across queries.

♻️ Suggested approach

Add a member to QueryExecutor:

std::shared_ptr<ThreadPool> thread_pool_;

Initialize once in the constructor:

QueryExecutor::QueryExecutor(...) 
    : ..., thread_pool_(std::make_shared<ThreadPool>(std::thread::hardware_concurrency())) {}

Reuse in build_vectorized_plan:

-auto thread_pool = std::make_shared<executor::ThreadPool>(std::thread::hardware_concurrency());
 std::unique_ptr<VectorizedOperator> current_root =
-    std::make_unique<VectorizedSeqScanOperator>(base_table_name, col_table, thread_pool);
+    std::make_unique<VectorizedSeqScanOperator>(base_table_name, col_table, thread_pool_);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/query_executor.cpp` around lines 1515 - 1517, The vectorized
path currently creates a new ThreadPool per query; add a
std::shared_ptr<executor::ThreadPool> thread_pool_ member to QueryExecutor,
initialize it once in QueryExecutor's constructor (using
std::thread::hardware_concurrency() or configured size), and update
build_vectorized_plan (where ThreadPool is currently created and used to
construct VectorizedSeqScanOperator) to reuse thread_pool_ instead of making a
fresh pool; ensure lifetime is managed by the QueryExecutor instance and remove
the local std::make_shared<executor::ThreadPool> creation in
build_vectorized_plan.
include/executor/vectorized_operator.hpp (1)

80-83: 💤 Low value

Consider making parallel scan threshold configurable.

The hardcoded 50,000 row threshold for enabling parallel scan may not be optimal for all workloads. The crossover point depends on thread pool size, column count, column width, and system characteristics. Consider making this threshold configurable or using a more sophisticated cost model.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 80 - 83, The code
currently hardcodes the parallel-scan enable threshold (50000) in the block that
sets num_threads_ and parallel_enabled_; replace this magic number with a
configurable parameter (e.g., parallel_scan_threshold_) that is read from
operator configuration or a global runtime config and fall back to 50000 if
unset, expose it via the VectorizedOperator constructor or a setter so callers
can tune it, and update the check to set parallel_enabled_ = table_->row_count()
> parallel_scan_threshold_; optionally document the new parameter where operator
options are defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/executor/types.hpp`:
- Around line 247-252: The doc for virtual void steal(ColumnVector&& other)
promises that 'other' is emptied after stealing but current implementations only
swap buffers; update each steal implementation (e.g., ColumnVector::steal and
StringVector::steal) to perform the swap and then explicitly clear or reset the
moved-from 'other' (call its clear/reset method, set size/length to zero, and
release/empty any auxiliary buffers) so that after the call 'other' is empty
while 'this' holds the original data; ensure you use the class-specific
clear/reset APIs to leave 'other' in a valid empty state.

In `@include/executor/vectorized_operator.hpp`:
- Around line 385-392: The GroupSlot struct has sums_float64 sized as
MAX_AGGREGATES / 2 which can be indexed up to aggregates_.size() in
process_input_batch_direct(), causing OOB access; change sums_float64 to match
sums_int64 (size MAX_AGGREGATES) so all aggregate indexes
(0..aggregates_.size()-1) are valid, and update any related assumptions/comments
referencing half-size to reflect the new full-size array; ensure GroupSlot,
sums_float64, MAX_AGGREGATES and process_input_batch_direct usage remain
consistent.
- Around line 417-431: The find_or_insert method currently grows slots_ to cover
min_key_/max_key_ which can allocate huge memory for sparse/wide keys; modify
find_or_insert(int64_t key1, int64_t key2) to check the computed new_size
(max_key_ - min_key_ + 1) against a configured safe threshold (e.g.,
MAX_SLOT_RANGE) before resizing, and if it exceeds the threshold switch to or
fallback on a sparse container (e.g., an unordered_map or existing hash-based
path) or return an error/limit-hit indicator instead of resizing; ensure you
reference and update min_key_, max_key_, and slots_ only when within limits and
add clear handling in callers of find_or_insert for the fallback/limit
condition.

In `@include/storage/memory_mapped_column.hpp`:
- Around line 45-54: The helpers data_at() and null_at() currently do pointer
arithmetic without validating row_idx; update both functions to return nullptr
if data_region_.addr or null_region_.addr is null OR if row_idx >= row_count_,
ensuring you check bounds before computing static_cast<char*>(data_region_.addr)
+ row_idx * element_size_ (and similarly for null_region_); this prevents
out-of-bounds pointers — use the existing members row_count_, element_size_,
data_region_.addr and null_region_.addr to perform the checks and only perform
the pointer arithmetic when within range.

In `@src/executor/query_executor.cpp`:
- Around line 1709-1711: The code unsafely casts the size_t from
current_root->output_schema().find_column(col->name()) to int; first capture the
result into a size_t (e.g., auto idx =
current_root->output_schema().find_column(col->name())), check explicitly for
the not-found sentinel (idx == size_t(-1)) and handle that case (set
info.input_col_idx = -1 or raise/log an error), otherwise safely
static_cast<int>(idx) into info.input_col_idx; update references to
info.input_col_idx and use col->name() in the check to preserve current
behavior.

---

Nitpick comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 80-83: The code currently hardcodes the parallel-scan enable
threshold (50000) in the block that sets num_threads_ and parallel_enabled_;
replace this magic number with a configurable parameter (e.g.,
parallel_scan_threshold_) that is read from operator configuration or a global
runtime config and fall back to 50000 if unset, expose it via the
VectorizedOperator constructor or a setter so callers can tune it, and update
the check to set parallel_enabled_ = table_->row_count() >
parallel_scan_threshold_; optionally document the new parameter where operator
options are defined.

In `@src/executor/operator.cpp`:
- Around line 577-605: Hoist the std::string key buffer out of the per-row loop
and reserve its capacity once (e.g., key.reserve(64) before the loop) then call
key.clear() each iteration to preserve capacity and avoid per-row heap
allocations; keep gb_vals as a local per-row vector but reuse it similarly if
needed. Replace the find+emplace pattern on groups_map with a single try_emplace
call (using key as the lookup/insert key and constructing
GroupState(aggregates_.size()) only on miss) to eliminate the double hash probe
and duplicate key construction; after insertion/update, assign
it->second.group_values = std::move(gb_vals) as before. Apply these changes
around the existing symbols key, group_by_, gb_vals, groups_map, try_emplace,
GroupState, and aggregates_.

In `@src/executor/query_executor.cpp`:
- Around line 1515-1517: The vectorized path currently creates a new ThreadPool
per query; add a std::shared_ptr<executor::ThreadPool> thread_pool_ member to
QueryExecutor, initialize it once in QueryExecutor's constructor (using
std::thread::hardware_concurrency() or configured size), and update
build_vectorized_plan (where ThreadPool is currently created and used to
construct VectorizedSeqScanOperator) to reuse thread_pool_ instead of making a
fresh pool; ensure lifetime is managed by the QueryExecutor instance and remove
the local std::make_shared<executor::ThreadPool> creation in
build_vectorized_plan.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5a44b60-77fa-4086-ab79-c50bf70f0427

📥 Commits

Reviewing files that changed from the base of the PR and between fa275e2 and 864bbca.

📒 Files selected for processing (7)
  • benchmarks/duckdb_comparison_bench.cpp
  • include/executor/types.hpp
  • include/executor/vectorized_operator.hpp
  • include/storage/memory_mapped_column.hpp
  • src/executor/operator.cpp
  • src/executor/query_executor.cpp
  • src/storage/memory_mapped_column.cpp

Comment on lines +247 to +252
/**
* @brief Steals data from another column vector by swapping internal buffers.
* After steal(), 'other' is emptied and 'this' holds the original data from 'other'.
* Throws std::runtime_error if types are incompatible.
*/
virtual void steal(ColumnVector&& other) = 0;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Semantic contract mismatch between documentation and implementation.

The documentation states "After steal(), 'other' is emptied," but the implementations (lines 339-345, 411-417) only swap internal buffers without clearing other. After steal(), both vectors remain valid with swapped contents rather than other being empty. This violates the documented contract and could mislead callers who rely on other being emptied.

🔄 Proposed fix: clear 'other' after swapping

Update implementations to explicitly clear other after swapping:

 void steal(ColumnVector&& other) override {
     auto* other_num = dynamic_cast<NumericVector<T>*>(&other);
     if (!other_num) throw std::runtime_error("NumericVector::steal: type mismatch");
     data_.swap(other_num->data_);
     null_bitmap_.swap(other_num->null_bitmap_);
     std::swap(size_, other_num->size_);
+    other_num->data_.clear();
+    other_num->null_bitmap_.clear();
+    other_num->size_ = 0;
 }

Apply the same pattern to StringVector::steal at lines 411-417.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/types.hpp` around lines 247 - 252, The doc for virtual void
steal(ColumnVector&& other) promises that 'other' is emptied after stealing but
current implementations only swap buffers; update each steal implementation
(e.g., ColumnVector::steal and StringVector::steal) to perform the swap and then
explicitly clear or reset the moved-from 'other' (call its clear/reset method,
set size/length to zero, and release/empty any auxiliary buffers) so that after
the call 'other' is empty while 'this' holds the original data; ensure you use
the class-specific clear/reset APIs to leave 'other' in a valid empty state.

Comment on lines +385 to +392
struct GroupSlot {
bool valid = false;
int64_t key1 = 0;
int64_t key2 = 0;
int64_t counts[MAX_AGGREGATES] = {0};
int64_t sums_int64[MAX_AGGREGATES] = {0};
double sums_float64[MAX_AGGREGATES / 2] = {0.0};
bool has_float_value[MAX_AGGREGATES] = {false};
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: sums_float64 array is undersized for MAX_AGGREGATES.

sums_float64 is declared with size MAX_AGGREGATES / 2 (line 391) while sums_int64 has size MAX_AGGREGATES. The code at lines 596-597 in process_input_batch_direct() indexes slot.sums_float64[i] where i ranges from 0 to aggregates_.size() - 1. If aggregates_.size() exceeds 4 (half of MAX_AGGREGATES), this causes out-of-bounds array access and undefined behavior.

🐛 Proposed fix: match array sizes
-        double sums_float64[MAX_AGGREGATES / 2] = {0.0};
+        double sums_float64[MAX_AGGREGATES] = {0.0};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
struct GroupSlot {
bool valid = false;
int64_t key1 = 0;
int64_t key2 = 0;
int64_t counts[MAX_AGGREGATES] = {0};
int64_t sums_int64[MAX_AGGREGATES] = {0};
double sums_float64[MAX_AGGREGATES / 2] = {0.0};
bool has_float_value[MAX_AGGREGATES] = {false};
struct GroupSlot {
bool valid = false;
int64_t key1 = 0;
int64_t key2 = 0;
int64_t counts[MAX_AGGREGATES] = {0};
int64_t sums_int64[MAX_AGGREGATES] = {0};
double sums_float64[MAX_AGGREGATES] = {0.0};
bool has_float_value[MAX_AGGREGATES] = {false};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 385 - 392, The
GroupSlot struct has sums_float64 sized as MAX_AGGREGATES / 2 which can be
indexed up to aggregates_.size() in process_input_batch_direct(), causing OOB
access; change sums_float64 to match sums_int64 (size MAX_AGGREGATES) so all
aggregate indexes (0..aggregates_.size()-1) are valid, and update any related
assumptions/comments referencing half-size to reflect the new full-size array;
ensure GroupSlot, sums_float64, MAX_AGGREGATES and process_input_batch_direct
usage remain consistent.

Comment thread include/executor/vectorized_operator.hpp
Comment on lines +45 to +54
const void* data_at(size_t row_idx) const {
if (!data_region_.addr) return nullptr;
return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
}

// Direct pointer to null bit at row index
const uint8_t* null_at(size_t row_idx) const {
if (!null_region_.addr) return nullptr;
return static_cast<uint8_t*>(null_region_.addr) + row_idx;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add bounds validation to pointer arithmetic helpers.

data_at() and null_at() perform pointer arithmetic without validating that row_idx < row_count_. Returning out-of-bounds pointers can lead to undefined behavior when dereferenced by callers. Even if the caller is expected to validate, defensive bounds checking here prevents subtle bugs.

🛡️ Proposed fix: add bounds checking
 const void* data_at(size_t row_idx) const {
-    if (!data_region_.addr) return nullptr;
+    if (!data_region_.addr || row_idx >= row_count_) return nullptr;
     return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
 }

 const uint8_t* null_at(size_t row_idx) const {
-    if (!null_region_.addr) return nullptr;
+    if (!null_region_.addr || row_idx >= row_count_) return nullptr;
     return static_cast<uint8_t*>(null_region_.addr) + row_idx;
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const void* data_at(size_t row_idx) const {
if (!data_region_.addr) return nullptr;
return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
}
// Direct pointer to null bit at row index
const uint8_t* null_at(size_t row_idx) const {
if (!null_region_.addr) return nullptr;
return static_cast<uint8_t*>(null_region_.addr) + row_idx;
}
const void* data_at(size_t row_idx) const {
if (!data_region_.addr || row_idx >= row_count_) return nullptr;
return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
}
// Direct pointer to null bit at row index
const uint8_t* null_at(size_t row_idx) const {
if (!null_region_.addr || row_idx >= row_count_) return nullptr;
return static_cast<uint8_t*>(null_region_.addr) + row_idx;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/storage/memory_mapped_column.hpp` around lines 45 - 54, The helpers
data_at() and null_at() currently do pointer arithmetic without validating
row_idx; update both functions to return nullptr if data_region_.addr or
null_region_.addr is null OR if row_idx >= row_count_, ensuring you check bounds
before computing static_cast<char*>(data_region_.addr) + row_idx * element_size_
(and similarly for null_region_); this prevents out-of-bounds pointers — use the
existing members row_count_, element_size_, data_region_.addr and
null_region_.addr to perform the checks and only perform the pointer arithmetic
when within range.

Comment on lines +1709 to +1711
info.input_col_idx = static_cast<int>(
current_root->output_schema().find_column(col->name()));
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unsafe cast of size_t to int for column index.

find_column() returns size_t(-1) to indicate "not found" (line 1710). Casting size_t(-1) directly to int is implementation-defined and may not reliably produce -1. On platforms where sizeof(size_t) > sizeof(int), the cast can produce unexpected values. Check for "not found" explicitly before casting.

🔧 Proposed fix: explicit not-found check
 if (!func->args().empty()) {
     const auto& arg = func->args()[0];
     if (arg->type() == parser::ExprType::Column) {
         const auto* col = dynamic_cast<const parser::ColumnExpr*>(arg.get());
         if (col != nullptr) {
-            info.input_col_idx = static_cast<int>(
-                current_root->output_schema().find_column(col->name()));
+            size_t col_idx = current_root->output_schema().find_column(col->name());
+            info.input_col_idx = (col_idx != static_cast<size_t>(-1)) 
+                ? static_cast<int>(col_idx) 
+                : -1;
         }
     }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
info.input_col_idx = static_cast<int>(
current_root->output_schema().find_column(col->name()));
}
size_t col_idx = current_root->output_schema().find_column(col->name());
info.input_col_idx = (col_idx != static_cast<size_t>(-1))
? static_cast<int>(col_idx)
: -1;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/query_executor.cpp` around lines 1709 - 1711, The code unsafely
casts the size_t from current_root->output_schema().find_column(col->name()) to
int; first capture the result into a size_t (e.g., auto idx =
current_root->output_schema().find_column(col->name())), check explicitly for
the not-found sentinel (idx == size_t(-1)) and handle that case (set
info.input_col_idx = -1 or raise/log an error), otherwise safely
static_cast<int>(idx) into info.input_col_idx; update references to
info.input_col_idx and use col->name() in the check to preserve current
behavior.

poyrazK and others added 3 commits May 21, 2026 16:21
…_insert

- Remove unused valid_slot_indices_ member (produces_output_batch_direct
  iterates direct_group_keys_ instead)
- Remove orphaned valid_slots() accessor
- Fix find_or_insert to mark slot.valid=true on first insertion (was
  relying on accidental post-condition in caller)
- Open-addressing hash table with linear probing and FNV-1a 64-bit hash
- Binary key encoding: [type_tag (1B)][len (4B)][data...] - avoids string allocation
- HashBucket stores key_bytes inline for fast comparison, is_new flag for iteration
- Added to VectorizedGroupByOperator as fallback when DirectIndexAgg doesn't apply
- Fixed grow() to use key_data instead of casting key_int64

Note: process_input_batch_open_addressing() is written but not yet wired
into process_input_batch() - the old hash path is still being used.
This is intentional to allow incremental testing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant