perf: DirectIndexAgg for low-cardinality integer GROUP BY by poyrazK · Pull Request #157 · poyrazK/cloudSQL

poyrazK · 2026-05-21T12:54:38Z

Summary

Add DirectIndexAgg class: O(1) direct array indexing for integer GROUP BY keys (slot = key - min_key) instead of string-based hash lookup
Add steal() methods to ColumnVector/NumericVector/StringVector for O(1) batch transfers via vector swap
Fix vectorized path activation for GROUP BY with unanalyzed tables by hoisting estimated_rows declaration to enclosing scope
Enable vectorized path when GROUP BY exists but ANALYZE hasn't been run (num_rows==0), allowing DirectIndexAgg optimization to activate

Performance

Q1 GROUP BY: ~2.6M/s → ~28-31M/s (10x improvement)
Q6 (no GROUP BY): unchanged at ~5.9G/s

Test plan

make -j4 duckdb_comparison_bench
./build/duckdb_comparison_bench --benchmark_filter=BM_CloudSQL_Q[16]/100000 --benchmark_repetitions=3

Summary by CodeRabbit

New Features
- Parallel table scan execution support
- Optimized GROUP BY operations for integer grouping columns
- Memory-mapped columns for zero-copy reads
Performance
- Enhanced query execution planning with improved vectorization decisions

Key changes: - query_executor.cpp: Remove parallel_ requirement from use_vectorized gate (Fix 3: enable vectorized path for large scans without parallel mode) - query_executor.cpp: Resolve input_col_idx from aggregate function arguments (Fix 5: enable SUM/AVG to work on specific columns, not just COUNT(*)) - vectorized_operator.hpp: Add AVG support in update_accumulators and produce_output_batch - operator.cpp: Replace std::map with std::unordered_map + binary key encoding (Fix 1+2: O(log n) → O(1) hash, string concat → length-prefixed binary) Benchmark baseline (Q1 @ 100k rows): - Before fixes: ~152k rows/sec - After fixes: ~166k rows/sec - DuckDB: ~197M rows/sec The vectorized path is now enabled but VectorizedGroupByOperator needs further optimization (batch-oriented key construction) to close gap.

This commits the full parallel scan implementation: - VectorizedSeqScanOperator now accepts optional ThreadPool parameter - When enabled (>50k rows, >1 thread), splits table into range chunks and reads each range in parallel via thread pool - Added steal() to ColumnVector/NumericVector/StringVector for O(1) batch transfer via vector swap (avoids per-element copy) - Parallel results are collected and returned via steal() to out_batch The slow benchmark (2.6M/s) vs fast (598M/s for Q6/10k) suggests the GROUP BY aggregation is the bottleneck, not I/O. The parallel scan overhead (file opens/closes per task) likely exceeds I/O benefit at 100k row scale.

Implements direct array indexing for integer GROUP BY keys instead of string-based hashing. When GROUP BY is on a single integer column, uses slot = key - min_key for O(1) lookup without hash computation. Also adds steal() method to ColumnVector/NumericVector/StringVector for O(1) batch transfers via vector swap. Note: Benchmark still shows ~2.6M/s suggesting the optimization path may not be triggered, or there's another bottleneck in the execution path that needs investigation.

The estimated_rows variable was being redeclared inside a nested block, causing the GROUP BY heuristic check to always see stale (0) values. Also adds the condition to enable vectorized path when GROUP BY exists but num_rows==0 (ANALYZE not run), allowing DirectIndexAgg optimization to activate.

coderabbitai · 2026-05-21T12:54:47Z

Warning

Rate limit exceeded

@github-actions[bot] has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 58 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1bf6c4f5-97c8-4939-97ef-c27442d6fb65

📥 Commits

Reviewing files that changed from the base of the PR and between 864bbca and adff7ce.

📒 Files selected for processing (1)

include/executor/vectorized_operator.hpp

📝 Walkthrough

Walkthrough

This PR accelerates vectorized query execution by adding parallel scan support with column stealing, introducing memory-mapped storage for zero-copy reads, and implementing a direct-index aggregation fast path for low-cardinality integer GROUP BY operations. Plan selection now uses estimated row cardinality to drive vectorization decisions and provisions a ThreadPool for parallel scans.

Changes

Vectorized Execution and Parallel Processing

Layer / File(s)	Summary
ColumnVector steal() interface and implementations `include/executor/types.hpp`	`ColumnVector` declares a new pure-virtual `steal(ColumnVector&&)` method. `NumericVector<T>` and `StringVector` implement steal via dynamic-cast type-checking, then swap internal buffers (data, null bitmap, size); they throw `std::runtime_error` on type mismatch.
Memory-mapped column storage abstraction `include/storage/memory_mapped_column.hpp`, `src/storage/memory_mapped_column.cpp`	New `MemoryMappedColumn` class enables zero-copy reads by mmap-backing data and null regions. Provides `map(data_path, null_path, element_size, row_count)` to open/mmap files and `unmap()` for RAII cleanup; inline helpers (`data_at`, `null_at`) return per-row pointers via pointer arithmetic.
Parallel vectorized scan with column stealing `include/executor/vectorized_operator.hpp`	`VectorizedSeqScanOperator` accepts an optional `ThreadPool` in its constructor and detects parallel eligibility (pool present, multiple threads, table sufficiently large). `next_batch_parallel()` partitions remaining rows by thread, submits concurrent `ColumnarTable::read_batch` tasks, waits for completion, and assembles output by stealing columns from per-thread result batches.
Direct-index aggregation fast path for GROUP BY `include/executor/vectorized_operator.hpp`	`DirectIndexAgg` helper provides slot-based accumulation for integer key ranges. `VectorizedGroupByOperator` detects single-column integer GROUP BY eligibility during construction, routes input via `process_input_batch_direct()` to directly update per-slot count/sum accumulators, and emits results via `produce_output_batch_direct()` with COUNT/SUM/AVG in fixed-size batches.
Hash-based aggregation AVG support `include/executor/vectorized_operator.hpp`	Hash-based group-by path now treats `AVG` similarly to `SUM` by incrementing `state.counts` for non-null input values. Hash output generation adds an `AggregateType::Avg` case that computes averages from stored sums and counts, emitting NULL when count is zero.
Group-by key encoding optimization `src/executor/operator.cpp`	`AggregateOperator` switches the grouping container from ordered `std::map` to `std::unordered_map` with 1024-entry pre-allocation. Group-by key generation replaces string concatenation with binary-style length-prefixed encoding using non-NULL markers and explicit NULL handling.
Vectorized execution planning and wiring `src/executor/query_executor.cpp`	Cost-based vectorization now tracks `estimated_rows` across scan/filter selectivity estimation, removes the `parallel_` gate, and forces vectorized GROUP BY when table stats report zero rows (pre-ANALYZE). Vectorized plan construction creates a shared `ThreadPool` sized by `std::thread::hardware_concurrency()` and passes it to `VectorizedSeqScanOperator`. Aggregate metadata derives `input_col_idx` by resolving the first aggregate argument column name in the operator output schema; defaults to `-1` for non-column arguments like `COUNT(*)`.
Benchmark storage manager wiring `benchmarks/duckdb_comparison_bench.cpp`	`CloudSQLContext` initialization calls `executor->set_storage_manager(storage.get())` to enable the executor to use the provided storage manager.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

poyrazK/cloudSQL#59: Both PRs modify VectorizedGroupByOperator aggregation logic; this PR adds direct-index fast path and AVG handling while the referenced PR introduces the operator with hash-based COUNT/SUM and MIN/MAX support.
poyrazK/cloudSQL#10: This PR builds on the referenced PR's Phase 8 vectorized/columnar execution and analytics test foundations by extending vectorized operators with parallel scan and direct-index aggregation capabilities.

Poem

🐰 A rabbit hops through columns swift,
Stealing buffers left and right,
Direct indexing does the lift,
Vectorized queries burn so bright!
With mmap'd regions, zero-copy flows—
Query speed: that's how it grows! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 7.89% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the main performance optimization: DirectIndexAgg for low-cardinality integer GROUP BY, which aligns with the primary changes in vectorized_operator.hpp.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/group-by-optimization

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (3)

src/executor/operator.cpp (1)
577-605: ⚡ Quick win

Hoist key outside the loop and prefer try_emplace to avoid per-row allocations.

std::string key; is default-constructed inside the loop and then reserve(64) is called every iteration, which forces a heap allocation per input row (SSO capacity is well under 64). Hoisting the buffer above the while and using clear() to reset preserves capacity across rows and removes that per-row malloc/free pair from this hot path. While here, find followed by emplace does two hash probes on the miss path; try_emplace collapses it to one and avoids constructing the key string twice.
♻️ Proposed refactor
     Tuple tuple;
     auto child_schema = child_->output_schema();
+    std::string key;
+    key.reserve(64);  // reused across rows; capacity is preserved by clear()
     while (child_->next(tuple)) {
-        std::string key;
-        key.reserve(64);  // Pre-reserve to avoid repeated allocations
+        key.clear();
         std::vector<common::Value> gb_vals;
@@
-        auto it = groups_map.find(key);
-        if (it == groups_map.end()) {
-            it = groups_map.emplace(key, GroupState(aggregates_.size())).first;
-            it->second.group_values = std::move(gb_vals);
+        auto [it, inserted] = groups_map.try_emplace(key, GroupState(aggregates_.size()));
+        if (inserted) {
+            it->second.group_values = std::move(gb_vals);
         }
Also note the comment on Line 585 ("no string allocation") is slightly misleading — val.to_string() still allocates for non-null values; only the per-iteration key buffer can be made allocation-free.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/operator.cpp` around lines 577 - 605, Hoist the std::string key
buffer out of the per-row loop and reserve its capacity once (e.g.,
key.reserve(64) before the loop) then call key.clear() each iteration to
preserve capacity and avoid per-row heap allocations; keep gb_vals as a local
per-row vector but reuse it similarly if needed. Replace the find+emplace
pattern on groups_map with a single try_emplace call (using key as the
lookup/insert key and constructing GroupState(aggregates_.size()) only on miss)
to eliminate the double hash probe and duplicate key construction; after
insertion/update, assign it->second.group_values = std::move(gb_vals) as before.
Apply these changes around the existing symbols key, group_by_, gb_vals,
groups_map, try_emplace, GroupState, and aggregates_.
src/executor/query_executor.cpp (1)
1515-1517: ⚡ Quick win

Thread pool recreation on every vectorized query.

A new ThreadPool is created for each vectorized query execution (line 1515). Thread pool creation incurs overhead from spawning worker threads. For workloads with many short queries, this repeated construction/destruction degrades throughput. Consider caching a shared ThreadPool at the QueryExecutor instance level for reuse across queries.
♻️ Suggested approach

Add a member to QueryExecutor:
std::shared_ptr<ThreadPool> thread_pool_;
Initialize once in the constructor:
QueryExecutor::QueryExecutor(...) 
    : ..., thread_pool_(std::make_shared<ThreadPool>(std::thread::hardware_concurrency())) {}
Reuse in build_vectorized_plan:
-auto thread_pool = std::make_shared<executor::ThreadPool>(std::thread::hardware_concurrency());
 std::unique_ptr<VectorizedOperator> current_root =
-    std::make_unique<VectorizedSeqScanOperator>(base_table_name, col_table, thread_pool);
+    std::make_unique<VectorizedSeqScanOperator>(base_table_name, col_table, thread_pool_);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/executor/query_executor.cpp` around lines 1515 - 1517, The vectorized
path currently creates a new ThreadPool per query; add a
std::shared_ptr<executor::ThreadPool> thread_pool_ member to QueryExecutor,
initialize it once in QueryExecutor's constructor (using
std::thread::hardware_concurrency() or configured size), and update
build_vectorized_plan (where ThreadPool is currently created and used to
construct VectorizedSeqScanOperator) to reuse thread_pool_ instead of making a
fresh pool; ensure lifetime is managed by the QueryExecutor instance and remove
the local std::make_shared<executor::ThreadPool> creation in
build_vectorized_plan.
include/executor/vectorized_operator.hpp (1)
80-83: 💤 Low value

Consider making parallel scan threshold configurable.

The hardcoded 50,000 row threshold for enabling parallel scan may not be optimal for all workloads. The crossover point depends on thread pool size, column count, column width, and system characteristics. Consider making this threshold configurable or using a more sophisticated cost model.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/executor/vectorized_operator.hpp` around lines 80 - 83, The code
currently hardcodes the parallel-scan enable threshold (50000) in the block that
sets num_threads_ and parallel_enabled_; replace this magic number with a
configurable parameter (e.g., parallel_scan_threshold_) that is read from
operator configuration or a global runtime config and fall back to 50000 if
unset, expose it via the VectorizedOperator constructor or a setter so callers
can tune it, and update the check to set parallel_enabled_ = table_->row_count()
> parallel_scan_threshold_; optionally document the new parameter where operator
options are defined.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/executor/types.hpp`:
- Around line 247-252: The doc for virtual void steal(ColumnVector&& other)
promises that 'other' is emptied after stealing but current implementations only
swap buffers; update each steal implementation (e.g., ColumnVector::steal and
StringVector::steal) to perform the swap and then explicitly clear or reset the
moved-from 'other' (call its clear/reset method, set size/length to zero, and
release/empty any auxiliary buffers) so that after the call 'other' is empty
while 'this' holds the original data; ensure you use the class-specific
clear/reset APIs to leave 'other' in a valid empty state.

In `@include/executor/vectorized_operator.hpp`:
- Around line 385-392: The GroupSlot struct has sums_float64 sized as
MAX_AGGREGATES / 2 which can be indexed up to aggregates_.size() in
process_input_batch_direct(), causing OOB access; change sums_float64 to match
sums_int64 (size MAX_AGGREGATES) so all aggregate indexes
(0..aggregates_.size()-1) are valid, and update any related assumptions/comments
referencing half-size to reflect the new full-size array; ensure GroupSlot,
sums_float64, MAX_AGGREGATES and process_input_batch_direct usage remain
consistent.
- Around line 417-431: The find_or_insert method currently grows slots_ to cover
min_key_/max_key_ which can allocate huge memory for sparse/wide keys; modify
find_or_insert(int64_t key1, int64_t key2) to check the computed new_size
(max_key_ - min_key_ + 1) against a configured safe threshold (e.g.,
MAX_SLOT_RANGE) before resizing, and if it exceeds the threshold switch to or
fallback on a sparse container (e.g., an unordered_map or existing hash-based
path) or return an error/limit-hit indicator instead of resizing; ensure you
reference and update min_key_, max_key_, and slots_ only when within limits and
add clear handling in callers of find_or_insert for the fallback/limit
condition.

In `@include/storage/memory_mapped_column.hpp`:
- Around line 45-54: The helpers data_at() and null_at() currently do pointer
arithmetic without validating row_idx; update both functions to return nullptr
if data_region_.addr or null_region_.addr is null OR if row_idx >= row_count_,
ensuring you check bounds before computing static_cast<char*>(data_region_.addr)
+ row_idx * element_size_ (and similarly for null_region_); this prevents
out-of-bounds pointers — use the existing members row_count_, element_size_,
data_region_.addr and null_region_.addr to perform the checks and only perform
the pointer arithmetic when within range.

In `@src/executor/query_executor.cpp`:
- Around line 1709-1711: The code unsafely casts the size_t from
current_root->output_schema().find_column(col->name()) to int; first capture the
result into a size_t (e.g., auto idx =
current_root->output_schema().find_column(col->name())), check explicitly for
the not-found sentinel (idx == size_t(-1)) and handle that case (set
info.input_col_idx = -1 or raise/log an error), otherwise safely
static_cast<int>(idx) into info.input_col_idx; update references to
info.input_col_idx and use col->name() in the check to preserve current
behavior.

---

Nitpick comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 80-83: The code currently hardcodes the parallel-scan enable
threshold (50000) in the block that sets num_threads_ and parallel_enabled_;
replace this magic number with a configurable parameter (e.g.,
parallel_scan_threshold_) that is read from operator configuration or a global
runtime config and fall back to 50000 if unset, expose it via the
VectorizedOperator constructor or a setter so callers can tune it, and update
the check to set parallel_enabled_ = table_->row_count() >
parallel_scan_threshold_; optionally document the new parameter where operator
options are defined.

In `@src/executor/operator.cpp`:
- Around line 577-605: Hoist the std::string key buffer out of the per-row loop
and reserve its capacity once (e.g., key.reserve(64) before the loop) then call
key.clear() each iteration to preserve capacity and avoid per-row heap
allocations; keep gb_vals as a local per-row vector but reuse it similarly if
needed. Replace the find+emplace pattern on groups_map with a single try_emplace
call (using key as the lookup/insert key and constructing
GroupState(aggregates_.size()) only on miss) to eliminate the double hash probe
and duplicate key construction; after insertion/update, assign
it->second.group_values = std::move(gb_vals) as before. Apply these changes
around the existing symbols key, group_by_, gb_vals, groups_map, try_emplace,
GroupState, and aggregates_.

In `@src/executor/query_executor.cpp`:
- Around line 1515-1517: The vectorized path currently creates a new ThreadPool
per query; add a std::shared_ptr<executor::ThreadPool> thread_pool_ member to
QueryExecutor, initialize it once in QueryExecutor's constructor (using
std::thread::hardware_concurrency() or configured size), and update
build_vectorized_plan (where ThreadPool is currently created and used to
construct VectorizedSeqScanOperator) to reuse thread_pool_ instead of making a
fresh pool; ensure lifetime is managed by the QueryExecutor instance and remove
the local std::make_shared<executor::ThreadPool> creation in
build_vectorized_plan.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b5a44b60-77fa-4086-ab79-c50bf70f0427

📥 Commits

Reviewing files that changed from the base of the PR and between fa275e2 and 864bbca.

📒 Files selected for processing (7)

benchmarks/duckdb_comparison_bench.cpp
include/executor/types.hpp
include/executor/vectorized_operator.hpp
include/storage/memory_mapped_column.hpp
src/executor/operator.cpp
src/executor/query_executor.cpp
src/storage/memory_mapped_column.cpp

coderabbitai · 2026-05-21T13:01:05Z

+    /**
+     * @brief Steals data from another column vector by swapping internal buffers.
+     * After steal(), 'other' is emptied and 'this' holds the original data from 'other'.
+     * Throws std::runtime_error if types are incompatible.
+     */
+    virtual void steal(ColumnVector&& other) = 0;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Semantic contract mismatch between documentation and implementation.

The documentation states "After steal(), 'other' is emptied," but the implementations (lines 339-345, 411-417) only swap internal buffers without clearing other. After steal(), both vectors remain valid with swapped contents rather than other being empty. This violates the documented contract and could mislead callers who rely on other being emptied.

🔄 Proposed fix: clear 'other' after swapping

Update implementations to explicitly clear other after swapping:

void steal(ColumnVector&& other) override { auto* other_num = dynamic_cast<NumericVector<T>*>(&other); if (!other_num) throw std::runtime_error("NumericVector::steal: type mismatch"); data_.swap(other_num->data_); null_bitmap_.swap(other_num->null_bitmap_); std::swap(size_, other_num->size_); + other_num->data_.clear(); + other_num->null_bitmap_.clear(); + other_num->size_ = 0; }

Apply the same pattern to StringVector::steal at lines 411-417.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@include/executor/types.hpp` around lines 247 - 252, The doc for virtual void steal(ColumnVector&& other) promises that 'other' is emptied after stealing but current implementations only swap buffers; update each steal implementation (e.g., ColumnVector::steal and StringVector::steal) to perform the swap and then explicitly clear or reset the moved-from 'other' (call its clear/reset method, set size/length to zero, and release/empty any auxiliary buffers) so that after the call 'other' is empty while 'this' holds the original data; ensure you use the class-specific clear/reset APIs to leave 'other' in a valid empty state.

coderabbitai · 2026-05-21T13:01:05Z

+    struct GroupSlot {
+        bool valid = false;
+        int64_t key1 = 0;
+        int64_t key2 = 0;
+        int64_t counts[MAX_AGGREGATES] = {0};
+        int64_t sums_int64[MAX_AGGREGATES] = {0};
+        double sums_float64[MAX_AGGREGATES / 2] = {0.0};
+        bool has_float_value[MAX_AGGREGATES] = {false};


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: sums_float64 array is undersized for MAX_AGGREGATES.

sums_float64 is declared with size MAX_AGGREGATES / 2 (line 391) while sums_int64 has size MAX_AGGREGATES. The code at lines 596-597 in process_input_batch_direct() indexes slot.sums_float64[i] where i ranges from 0 to aggregates_.size() - 1. If aggregates_.size() exceeds 4 (half of MAX_AGGREGATES), this causes out-of-bounds array access and undefined behavior.

🐛 Proposed fix: match array sizes

- double sums_float64[MAX_AGGREGATES / 2] = {0.0}; + double sums_float64[MAX_AGGREGATES] = {0.0};

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

struct GroupSlot {

bool valid = false;

int64_t key1 = 0;

int64_t key2 = 0;

int64_t counts[MAX_AGGREGATES] = {0};

int64_t sums_int64[MAX_AGGREGATES] = {0};

double sums_float64[MAX_AGGREGATES / 2] = {0.0};

bool has_float_value[MAX_AGGREGATES] = {false};

struct GroupSlot {

bool valid = false;

int64_t key1 = 0;

int64_t key2 = 0;

int64_t counts[MAX_AGGREGATES] = {0};

int64_t sums_int64[MAX_AGGREGATES] = {0};

double sums_float64[MAX_AGGREGATES] = {0.0};

bool has_float_value[MAX_AGGREGATES] = {false};

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@include/executor/vectorized_operator.hpp` around lines 385 - 392, The GroupSlot struct has sums_float64 sized as MAX_AGGREGATES / 2 which can be indexed up to aggregates_.size() in process_input_batch_direct(), causing OOB access; change sums_float64 to match sums_int64 (size MAX_AGGREGATES) so all aggregate indexes (0..aggregates_.size()-1) are valid, and update any related assumptions/comments referencing half-size to reflect the new full-size array; ensure GroupSlot, sums_float64, MAX_AGGREGATES and process_input_batch_direct usage remain consistent.

coderabbitai · 2026-05-21T13:01:05Z

+    const void* data_at(size_t row_idx) const {
+        if (!data_region_.addr) return nullptr;
+        return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
+    }
+
+    // Direct pointer to null bit at row index
+    const uint8_t* null_at(size_t row_idx) const {
+        if (!null_region_.addr) return nullptr;
+        return static_cast<uint8_t*>(null_region_.addr) + row_idx;
+    }


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add bounds validation to pointer arithmetic helpers.

data_at() and null_at() perform pointer arithmetic without validating that row_idx < row_count_. Returning out-of-bounds pointers can lead to undefined behavior when dereferenced by callers. Even if the caller is expected to validate, defensive bounds checking here prevents subtle bugs.

🛡️ Proposed fix: add bounds checking

const void* data_at(size_t row_idx) const { - if (!data_region_.addr) return nullptr; + if (!data_region_.addr || row_idx >= row_count_) return nullptr; return static_cast<char*>(data_region_.addr) + row_idx * element_size_; } const uint8_t* null_at(size_t row_idx) const { - if (!null_region_.addr) return nullptr; + if (!null_region_.addr || row_idx >= row_count_) return nullptr; return static_cast<uint8_t*>(null_region_.addr) + row_idx; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const void* data_at(size_t row_idx) const {

if (!data_region_.addr) return nullptr;

return static_cast<char*>(data_region_.addr) + row_idx * element_size_;

}

// Direct pointer to null bit at row index

const uint8_t* null_at(size_t row_idx) const {

if (!null_region_.addr) return nullptr;

return static_cast<uint8_t*>(null_region_.addr) + row_idx;

}

const void* data_at(size_t row_idx) const {

if (!data_region_.addr || row_idx >= row_count_) return nullptr;

return static_cast<char*>(data_region_.addr) + row_idx * element_size_;

}

// Direct pointer to null bit at row index

const uint8_t* null_at(size_t row_idx) const {

if (!null_region_.addr || row_idx >= row_count_) return nullptr;

return static_cast<uint8_t*>(null_region_.addr) + row_idx;

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@include/storage/memory_mapped_column.hpp` around lines 45 - 54, The helpers data_at() and null_at() currently do pointer arithmetic without validating row_idx; update both functions to return nullptr if data_region_.addr or null_region_.addr is null OR if row_idx >= row_count_, ensuring you check bounds before computing static_cast<char*>(data_region_.addr) + row_idx * element_size_ (and similarly for null_region_); this prevents out-of-bounds pointers — use the existing members row_count_, element_size_, data_region_.addr and null_region_.addr to perform the checks and only perform the pointer arithmetic when within range.

coderabbitai · 2026-05-21T13:01:05Z

+                            info.input_col_idx = static_cast<int>(
+                                current_root->output_schema().find_column(col->name()));
+                        }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Unsafe cast of size_t to int for column index.

find_column() returns size_t(-1) to indicate "not found" (line 1710). Casting size_t(-1) directly to int is implementation-defined and may not reliably produce -1. On platforms where sizeof(size_t) > sizeof(int), the cast can produce unexpected values. Check for "not found" explicitly before casting.

🔧 Proposed fix: explicit not-found check

if (!func->args().empty()) { const auto& arg = func->args()[0]; if (arg->type() == parser::ExprType::Column) { const auto* col = dynamic_cast<const parser::ColumnExpr*>(arg.get()); if (col != nullptr) { - info.input_col_idx = static_cast<int>( - current_root->output_schema().find_column(col->name())); + size_t col_idx = current_root->output_schema().find_column(col->name()); + info.input_col_idx = (col_idx != static_cast<size_t>(-1)) + ? static_cast<int>(col_idx) + : -1; } } }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

info.input_col_idx = static_cast<int>(

current_root->output_schema().find_column(col->name()));

}

size_t col_idx = current_root->output_schema().find_column(col->name());

info.input_col_idx = (col_idx != static_cast<size_t>(-1))

? static_cast<int>(col_idx)

: -1;

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/executor/query_executor.cpp` around lines 1709 - 1711, The code unsafely casts the size_t from current_root->output_schema().find_column(col->name()) to int; first capture the result into a size_t (e.g., auto idx = current_root->output_schema().find_column(col->name())), check explicitly for the not-found sentinel (idx == size_t(-1)) and handle that case (set info.input_col_idx = -1 or raise/log an error), otherwise safely static_cast<int>(idx) into info.input_col_idx; update references to info.input_col_idx and use col->name() in the check to preserve current behavior.

…_insert - Remove unused valid_slot_indices_ member (produces_output_batch_direct iterates direct_group_keys_ instead) - Remove orphaned valid_slots() accessor - Fix find_or_insert to mark slot.valid=true on first insertion (was relying on accidental post-condition in caller)

- Open-addressing hash table with linear probing and FNV-1a 64-bit hash - Binary key encoding: [type_tag (1B)][len (4B)][data...] - avoids string allocation - HashBucket stores key_bytes inline for fast comparison, is_new flag for iteration - Added to VectorizedGroupByOperator as fallback when DirectIndexAgg doesn't apply - Fixed grow() to use key_data instead of casting key_int64 Note: process_input_batch_open_addressing() is written but not yet wired into process_input_batch() - the old hash path is still being used. This is intentional to allow incremental testing.

poyrazK added 4 commits May 18, 2026 17:12

style: automated clang-format fixes

864bbca

coderabbitai Bot reviewed May 21, 2026

View reviewed changes

poyrazK and others added 3 commits May 21, 2026 16:21

style: automated clang-format fixes

adff7ce

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: DirectIndexAgg for low-cardinality integer GROUP BY#157

perf: DirectIndexAgg for low-cardinality integer GROUP BY#157
poyrazK wants to merge 8 commits into
mainfrom
perf/group-by-optimization

poyrazK commented May 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 21, 2026 •

edited

Loading

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 21, 2026

Uh oh!

coderabbitai Bot May 21, 2026

Uh oh!

Uh oh!

coderabbitai Bot May 21, 2026

Uh oh!

coderabbitai Bot May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-                            info.input_col_idx = static_cast<int>(
-                                current_root->output_schema().find_column(col->name()));
-                        }
+                            size_t col_idx = current_root->output_schema().find_column(col->name());
+                            info.input_col_idx = (col_idx != static_cast<size_t>(-1))
+                                ? static_cast<int>(col_idx)
+                                : -1;
+                        }

Conversation

poyrazK commented May 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Performance

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 21, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

poyrazK commented May 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 21, 2026 •

edited

Loading