perf: DirectIndexAgg for low-cardinality integer GROUP BY#157
Conversation
Key changes: - query_executor.cpp: Remove parallel_ requirement from use_vectorized gate (Fix 3: enable vectorized path for large scans without parallel mode) - query_executor.cpp: Resolve input_col_idx from aggregate function arguments (Fix 5: enable SUM/AVG to work on specific columns, not just COUNT(*)) - vectorized_operator.hpp: Add AVG support in update_accumulators and produce_output_batch - operator.cpp: Replace std::map with std::unordered_map + binary key encoding (Fix 1+2: O(log n) → O(1) hash, string concat → length-prefixed binary) Benchmark baseline (Q1 @ 100k rows): - Before fixes: ~152k rows/sec - After fixes: ~166k rows/sec - DuckDB: ~197M rows/sec The vectorized path is now enabled but VectorizedGroupByOperator needs further optimization (batch-oriented key construction) to close gap.
This commits the full parallel scan implementation: - VectorizedSeqScanOperator now accepts optional ThreadPool parameter - When enabled (>50k rows, >1 thread), splits table into range chunks and reads each range in parallel via thread pool - Added steal() to ColumnVector/NumericVector/StringVector for O(1) batch transfer via vector swap (avoids per-element copy) - Parallel results are collected and returned via steal() to out_batch The slow benchmark (2.6M/s) vs fast (598M/s for Q6/10k) suggests the GROUP BY aggregation is the bottleneck, not I/O. The parallel scan overhead (file opens/closes per task) likely exceeds I/O benefit at 100k row scale.
Implements direct array indexing for integer GROUP BY keys instead of string-based hashing. When GROUP BY is on a single integer column, uses slot = key - min_key for O(1) lookup without hash computation. Also adds steal() method to ColumnVector/NumericVector/StringVector for O(1) batch transfers via vector swap. Note: Benchmark still shows ~2.6M/s suggesting the optimization path may not be triggered, or there's another bottleneck in the execution path that needs investigation.
The estimated_rows variable was being redeclared inside a nested block, causing the GROUP BY heuristic check to always see stale (0) values. Also adds the condition to enable vectorized path when GROUP BY exists but num_rows==0 (ANALYZE not run), allowing DirectIndexAgg optimization to activate.
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR accelerates vectorized query execution by adding parallel scan support with column stealing, introducing memory-mapped storage for zero-copy reads, and implementing a direct-index aggregation fast path for low-cardinality integer GROUP BY operations. Plan selection now uses estimated row cardinality to drive vectorization decisions and provisions a ThreadPool for parallel scans. ChangesVectorized Execution and Parallel Processing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (3)
src/executor/operator.cpp (1)
577-605: ⚡ Quick winHoist
keyoutside the loop and prefertry_emplaceto avoid per-row allocations.
std::string key;is default-constructed inside the loop and thenreserve(64)is called every iteration, which forces a heap allocation per input row (SSO capacity is well under 64). Hoisting the buffer above thewhileand usingclear()to reset preserves capacity across rows and removes that per-row malloc/free pair from this hot path. While here,findfollowed byemplacedoes two hash probes on the miss path;try_emplacecollapses it to one and avoids constructing the key string twice.♻️ Proposed refactor
Tuple tuple; auto child_schema = child_->output_schema(); + std::string key; + key.reserve(64); // reused across rows; capacity is preserved by clear() while (child_->next(tuple)) { - std::string key; - key.reserve(64); // Pre-reserve to avoid repeated allocations + key.clear(); std::vector<common::Value> gb_vals; @@ - auto it = groups_map.find(key); - if (it == groups_map.end()) { - it = groups_map.emplace(key, GroupState(aggregates_.size())).first; - it->second.group_values = std::move(gb_vals); + auto [it, inserted] = groups_map.try_emplace(key, GroupState(aggregates_.size())); + if (inserted) { + it->second.group_values = std::move(gb_vals); }Also note the comment on Line 585 ("no string allocation") is slightly misleading —
val.to_string()still allocates for non-null values; only the per-iterationkeybuffer can be made allocation-free.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/executor/operator.cpp` around lines 577 - 605, Hoist the std::string key buffer out of the per-row loop and reserve its capacity once (e.g., key.reserve(64) before the loop) then call key.clear() each iteration to preserve capacity and avoid per-row heap allocations; keep gb_vals as a local per-row vector but reuse it similarly if needed. Replace the find+emplace pattern on groups_map with a single try_emplace call (using key as the lookup/insert key and constructing GroupState(aggregates_.size()) only on miss) to eliminate the double hash probe and duplicate key construction; after insertion/update, assign it->second.group_values = std::move(gb_vals) as before. Apply these changes around the existing symbols key, group_by_, gb_vals, groups_map, try_emplace, GroupState, and aggregates_.src/executor/query_executor.cpp (1)
1515-1517: ⚡ Quick winThread pool recreation on every vectorized query.
A new
ThreadPoolis created for each vectorized query execution (line 1515). Thread pool creation incurs overhead from spawning worker threads. For workloads with many short queries, this repeated construction/destruction degrades throughput. Consider caching a sharedThreadPoolat theQueryExecutorinstance level for reuse across queries.♻️ Suggested approach
Add a member to
QueryExecutor:std::shared_ptr<ThreadPool> thread_pool_;Initialize once in the constructor:
QueryExecutor::QueryExecutor(...) : ..., thread_pool_(std::make_shared<ThreadPool>(std::thread::hardware_concurrency())) {}Reuse in
build_vectorized_plan:-auto thread_pool = std::make_shared<executor::ThreadPool>(std::thread::hardware_concurrency()); std::unique_ptr<VectorizedOperator> current_root = - std::make_unique<VectorizedSeqScanOperator>(base_table_name, col_table, thread_pool); + std::make_unique<VectorizedSeqScanOperator>(base_table_name, col_table, thread_pool_);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/executor/query_executor.cpp` around lines 1515 - 1517, The vectorized path currently creates a new ThreadPool per query; add a std::shared_ptr<executor::ThreadPool> thread_pool_ member to QueryExecutor, initialize it once in QueryExecutor's constructor (using std::thread::hardware_concurrency() or configured size), and update build_vectorized_plan (where ThreadPool is currently created and used to construct VectorizedSeqScanOperator) to reuse thread_pool_ instead of making a fresh pool; ensure lifetime is managed by the QueryExecutor instance and remove the local std::make_shared<executor::ThreadPool> creation in build_vectorized_plan.include/executor/vectorized_operator.hpp (1)
80-83: 💤 Low valueConsider making parallel scan threshold configurable.
The hardcoded 50,000 row threshold for enabling parallel scan may not be optimal for all workloads. The crossover point depends on thread pool size, column count, column width, and system characteristics. Consider making this threshold configurable or using a more sophisticated cost model.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@include/executor/vectorized_operator.hpp` around lines 80 - 83, The code currently hardcodes the parallel-scan enable threshold (50000) in the block that sets num_threads_ and parallel_enabled_; replace this magic number with a configurable parameter (e.g., parallel_scan_threshold_) that is read from operator configuration or a global runtime config and fall back to 50000 if unset, expose it via the VectorizedOperator constructor or a setter so callers can tune it, and update the check to set parallel_enabled_ = table_->row_count() > parallel_scan_threshold_; optionally document the new parameter where operator options are defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@include/executor/types.hpp`:
- Around line 247-252: The doc for virtual void steal(ColumnVector&& other)
promises that 'other' is emptied after stealing but current implementations only
swap buffers; update each steal implementation (e.g., ColumnVector::steal and
StringVector::steal) to perform the swap and then explicitly clear or reset the
moved-from 'other' (call its clear/reset method, set size/length to zero, and
release/empty any auxiliary buffers) so that after the call 'other' is empty
while 'this' holds the original data; ensure you use the class-specific
clear/reset APIs to leave 'other' in a valid empty state.
In `@include/executor/vectorized_operator.hpp`:
- Around line 385-392: The GroupSlot struct has sums_float64 sized as
MAX_AGGREGATES / 2 which can be indexed up to aggregates_.size() in
process_input_batch_direct(), causing OOB access; change sums_float64 to match
sums_int64 (size MAX_AGGREGATES) so all aggregate indexes
(0..aggregates_.size()-1) are valid, and update any related assumptions/comments
referencing half-size to reflect the new full-size array; ensure GroupSlot,
sums_float64, MAX_AGGREGATES and process_input_batch_direct usage remain
consistent.
- Around line 417-431: The find_or_insert method currently grows slots_ to cover
min_key_/max_key_ which can allocate huge memory for sparse/wide keys; modify
find_or_insert(int64_t key1, int64_t key2) to check the computed new_size
(max_key_ - min_key_ + 1) against a configured safe threshold (e.g.,
MAX_SLOT_RANGE) before resizing, and if it exceeds the threshold switch to or
fallback on a sparse container (e.g., an unordered_map or existing hash-based
path) or return an error/limit-hit indicator instead of resizing; ensure you
reference and update min_key_, max_key_, and slots_ only when within limits and
add clear handling in callers of find_or_insert for the fallback/limit
condition.
In `@include/storage/memory_mapped_column.hpp`:
- Around line 45-54: The helpers data_at() and null_at() currently do pointer
arithmetic without validating row_idx; update both functions to return nullptr
if data_region_.addr or null_region_.addr is null OR if row_idx >= row_count_,
ensuring you check bounds before computing static_cast<char*>(data_region_.addr)
+ row_idx * element_size_ (and similarly for null_region_); this prevents
out-of-bounds pointers — use the existing members row_count_, element_size_,
data_region_.addr and null_region_.addr to perform the checks and only perform
the pointer arithmetic when within range.
In `@src/executor/query_executor.cpp`:
- Around line 1709-1711: The code unsafely casts the size_t from
current_root->output_schema().find_column(col->name()) to int; first capture the
result into a size_t (e.g., auto idx =
current_root->output_schema().find_column(col->name())), check explicitly for
the not-found sentinel (idx == size_t(-1)) and handle that case (set
info.input_col_idx = -1 or raise/log an error), otherwise safely
static_cast<int>(idx) into info.input_col_idx; update references to
info.input_col_idx and use col->name() in the check to preserve current
behavior.
---
Nitpick comments:
In `@include/executor/vectorized_operator.hpp`:
- Around line 80-83: The code currently hardcodes the parallel-scan enable
threshold (50000) in the block that sets num_threads_ and parallel_enabled_;
replace this magic number with a configurable parameter (e.g.,
parallel_scan_threshold_) that is read from operator configuration or a global
runtime config and fall back to 50000 if unset, expose it via the
VectorizedOperator constructor or a setter so callers can tune it, and update
the check to set parallel_enabled_ = table_->row_count() >
parallel_scan_threshold_; optionally document the new parameter where operator
options are defined.
In `@src/executor/operator.cpp`:
- Around line 577-605: Hoist the std::string key buffer out of the per-row loop
and reserve its capacity once (e.g., key.reserve(64) before the loop) then call
key.clear() each iteration to preserve capacity and avoid per-row heap
allocations; keep gb_vals as a local per-row vector but reuse it similarly if
needed. Replace the find+emplace pattern on groups_map with a single try_emplace
call (using key as the lookup/insert key and constructing
GroupState(aggregates_.size()) only on miss) to eliminate the double hash probe
and duplicate key construction; after insertion/update, assign
it->second.group_values = std::move(gb_vals) as before. Apply these changes
around the existing symbols key, group_by_, gb_vals, groups_map, try_emplace,
GroupState, and aggregates_.
In `@src/executor/query_executor.cpp`:
- Around line 1515-1517: The vectorized path currently creates a new ThreadPool
per query; add a std::shared_ptr<executor::ThreadPool> thread_pool_ member to
QueryExecutor, initialize it once in QueryExecutor's constructor (using
std::thread::hardware_concurrency() or configured size), and update
build_vectorized_plan (where ThreadPool is currently created and used to
construct VectorizedSeqScanOperator) to reuse thread_pool_ instead of making a
fresh pool; ensure lifetime is managed by the QueryExecutor instance and remove
the local std::make_shared<executor::ThreadPool> creation in
build_vectorized_plan.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b5a44b60-77fa-4086-ab79-c50bf70f0427
📒 Files selected for processing (7)
benchmarks/duckdb_comparison_bench.cppinclude/executor/types.hppinclude/executor/vectorized_operator.hppinclude/storage/memory_mapped_column.hppsrc/executor/operator.cppsrc/executor/query_executor.cppsrc/storage/memory_mapped_column.cpp
| /** | ||
| * @brief Steals data from another column vector by swapping internal buffers. | ||
| * After steal(), 'other' is emptied and 'this' holds the original data from 'other'. | ||
| * Throws std::runtime_error if types are incompatible. | ||
| */ | ||
| virtual void steal(ColumnVector&& other) = 0; |
There was a problem hiding this comment.
Semantic contract mismatch between documentation and implementation.
The documentation states "After steal(), 'other' is emptied," but the implementations (lines 339-345, 411-417) only swap internal buffers without clearing other. After steal(), both vectors remain valid with swapped contents rather than other being empty. This violates the documented contract and could mislead callers who rely on other being emptied.
🔄 Proposed fix: clear 'other' after swapping
Update implementations to explicitly clear other after swapping:
void steal(ColumnVector&& other) override {
auto* other_num = dynamic_cast<NumericVector<T>*>(&other);
if (!other_num) throw std::runtime_error("NumericVector::steal: type mismatch");
data_.swap(other_num->data_);
null_bitmap_.swap(other_num->null_bitmap_);
std::swap(size_, other_num->size_);
+ other_num->data_.clear();
+ other_num->null_bitmap_.clear();
+ other_num->size_ = 0;
}Apply the same pattern to StringVector::steal at lines 411-417.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@include/executor/types.hpp` around lines 247 - 252, The doc for virtual void
steal(ColumnVector&& other) promises that 'other' is emptied after stealing but
current implementations only swap buffers; update each steal implementation
(e.g., ColumnVector::steal and StringVector::steal) to perform the swap and then
explicitly clear or reset the moved-from 'other' (call its clear/reset method,
set size/length to zero, and release/empty any auxiliary buffers) so that after
the call 'other' is empty while 'this' holds the original data; ensure you use
the class-specific clear/reset APIs to leave 'other' in a valid empty state.
| struct GroupSlot { | ||
| bool valid = false; | ||
| int64_t key1 = 0; | ||
| int64_t key2 = 0; | ||
| int64_t counts[MAX_AGGREGATES] = {0}; | ||
| int64_t sums_int64[MAX_AGGREGATES] = {0}; | ||
| double sums_float64[MAX_AGGREGATES / 2] = {0.0}; | ||
| bool has_float_value[MAX_AGGREGATES] = {false}; |
There was a problem hiding this comment.
Critical: sums_float64 array is undersized for MAX_AGGREGATES.
sums_float64 is declared with size MAX_AGGREGATES / 2 (line 391) while sums_int64 has size MAX_AGGREGATES. The code at lines 596-597 in process_input_batch_direct() indexes slot.sums_float64[i] where i ranges from 0 to aggregates_.size() - 1. If aggregates_.size() exceeds 4 (half of MAX_AGGREGATES), this causes out-of-bounds array access and undefined behavior.
🐛 Proposed fix: match array sizes
- double sums_float64[MAX_AGGREGATES / 2] = {0.0};
+ double sums_float64[MAX_AGGREGATES] = {0.0};📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| struct GroupSlot { | |
| bool valid = false; | |
| int64_t key1 = 0; | |
| int64_t key2 = 0; | |
| int64_t counts[MAX_AGGREGATES] = {0}; | |
| int64_t sums_int64[MAX_AGGREGATES] = {0}; | |
| double sums_float64[MAX_AGGREGATES / 2] = {0.0}; | |
| bool has_float_value[MAX_AGGREGATES] = {false}; | |
| struct GroupSlot { | |
| bool valid = false; | |
| int64_t key1 = 0; | |
| int64_t key2 = 0; | |
| int64_t counts[MAX_AGGREGATES] = {0}; | |
| int64_t sums_int64[MAX_AGGREGATES] = {0}; | |
| double sums_float64[MAX_AGGREGATES] = {0.0}; | |
| bool has_float_value[MAX_AGGREGATES] = {false}; |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@include/executor/vectorized_operator.hpp` around lines 385 - 392, The
GroupSlot struct has sums_float64 sized as MAX_AGGREGATES / 2 which can be
indexed up to aggregates_.size() in process_input_batch_direct(), causing OOB
access; change sums_float64 to match sums_int64 (size MAX_AGGREGATES) so all
aggregate indexes (0..aggregates_.size()-1) are valid, and update any related
assumptions/comments referencing half-size to reflect the new full-size array;
ensure GroupSlot, sums_float64, MAX_AGGREGATES and process_input_batch_direct
usage remain consistent.
| const void* data_at(size_t row_idx) const { | ||
| if (!data_region_.addr) return nullptr; | ||
| return static_cast<char*>(data_region_.addr) + row_idx * element_size_; | ||
| } | ||
|
|
||
| // Direct pointer to null bit at row index | ||
| const uint8_t* null_at(size_t row_idx) const { | ||
| if (!null_region_.addr) return nullptr; | ||
| return static_cast<uint8_t*>(null_region_.addr) + row_idx; | ||
| } |
There was a problem hiding this comment.
Add bounds validation to pointer arithmetic helpers.
data_at() and null_at() perform pointer arithmetic without validating that row_idx < row_count_. Returning out-of-bounds pointers can lead to undefined behavior when dereferenced by callers. Even if the caller is expected to validate, defensive bounds checking here prevents subtle bugs.
🛡️ Proposed fix: add bounds checking
const void* data_at(size_t row_idx) const {
- if (!data_region_.addr) return nullptr;
+ if (!data_region_.addr || row_idx >= row_count_) return nullptr;
return static_cast<char*>(data_region_.addr) + row_idx * element_size_;
}
const uint8_t* null_at(size_t row_idx) const {
- if (!null_region_.addr) return nullptr;
+ if (!null_region_.addr || row_idx >= row_count_) return nullptr;
return static_cast<uint8_t*>(null_region_.addr) + row_idx;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| const void* data_at(size_t row_idx) const { | |
| if (!data_region_.addr) return nullptr; | |
| return static_cast<char*>(data_region_.addr) + row_idx * element_size_; | |
| } | |
| // Direct pointer to null bit at row index | |
| const uint8_t* null_at(size_t row_idx) const { | |
| if (!null_region_.addr) return nullptr; | |
| return static_cast<uint8_t*>(null_region_.addr) + row_idx; | |
| } | |
| const void* data_at(size_t row_idx) const { | |
| if (!data_region_.addr || row_idx >= row_count_) return nullptr; | |
| return static_cast<char*>(data_region_.addr) + row_idx * element_size_; | |
| } | |
| // Direct pointer to null bit at row index | |
| const uint8_t* null_at(size_t row_idx) const { | |
| if (!null_region_.addr || row_idx >= row_count_) return nullptr; | |
| return static_cast<uint8_t*>(null_region_.addr) + row_idx; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@include/storage/memory_mapped_column.hpp` around lines 45 - 54, The helpers
data_at() and null_at() currently do pointer arithmetic without validating
row_idx; update both functions to return nullptr if data_region_.addr or
null_region_.addr is null OR if row_idx >= row_count_, ensuring you check bounds
before computing static_cast<char*>(data_region_.addr) + row_idx * element_size_
(and similarly for null_region_); this prevents out-of-bounds pointers — use the
existing members row_count_, element_size_, data_region_.addr and
null_region_.addr to perform the checks and only perform the pointer arithmetic
when within range.
| info.input_col_idx = static_cast<int>( | ||
| current_root->output_schema().find_column(col->name())); | ||
| } |
There was a problem hiding this comment.
Unsafe cast of size_t to int for column index.
find_column() returns size_t(-1) to indicate "not found" (line 1710). Casting size_t(-1) directly to int is implementation-defined and may not reliably produce -1. On platforms where sizeof(size_t) > sizeof(int), the cast can produce unexpected values. Check for "not found" explicitly before casting.
🔧 Proposed fix: explicit not-found check
if (!func->args().empty()) {
const auto& arg = func->args()[0];
if (arg->type() == parser::ExprType::Column) {
const auto* col = dynamic_cast<const parser::ColumnExpr*>(arg.get());
if (col != nullptr) {
- info.input_col_idx = static_cast<int>(
- current_root->output_schema().find_column(col->name()));
+ size_t col_idx = current_root->output_schema().find_column(col->name());
+ info.input_col_idx = (col_idx != static_cast<size_t>(-1))
+ ? static_cast<int>(col_idx)
+ : -1;
}
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| info.input_col_idx = static_cast<int>( | |
| current_root->output_schema().find_column(col->name())); | |
| } | |
| size_t col_idx = current_root->output_schema().find_column(col->name()); | |
| info.input_col_idx = (col_idx != static_cast<size_t>(-1)) | |
| ? static_cast<int>(col_idx) | |
| : -1; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/executor/query_executor.cpp` around lines 1709 - 1711, The code unsafely
casts the size_t from current_root->output_schema().find_column(col->name()) to
int; first capture the result into a size_t (e.g., auto idx =
current_root->output_schema().find_column(col->name())), check explicitly for
the not-found sentinel (idx == size_t(-1)) and handle that case (set
info.input_col_idx = -1 or raise/log an error), otherwise safely
static_cast<int>(idx) into info.input_col_idx; update references to
info.input_col_idx and use col->name() in the check to preserve current
behavior.
…_insert - Remove unused valid_slot_indices_ member (produces_output_batch_direct iterates direct_group_keys_ instead) - Remove orphaned valid_slots() accessor - Fix find_or_insert to mark slot.valid=true on first insertion (was relying on accidental post-condition in caller)
- Open-addressing hash table with linear probing and FNV-1a 64-bit hash - Binary key encoding: [type_tag (1B)][len (4B)][data...] - avoids string allocation - HashBucket stores key_bytes inline for fast comparison, is_new flag for iteration - Added to VectorizedGroupByOperator as fallback when DirectIndexAgg doesn't apply - Fixed grow() to use key_data instead of casting key_int64 Note: process_input_batch_open_addressing() is written but not yet wired into process_input_batch() - the old hash path is still being used. This is intentional to allow incremental testing.
Summary
Performance
Test plan
make -j4 duckdb_comparison_bench./build/duckdb_comparison_bench --benchmark_filter=BM_CloudSQL_Q[16]/100000 --benchmark_repetitions=3Summary by CodeRabbit
New Features
Performance