Partition-aware cardinality estimation for Iceberg#778
peterboncz wants to merge 4 commits into duckdb:v1.5-variegata
Conversation
NOTE: this is (obviously) a post-v1.5 PR. It was created not only to enhance Iceberg with better stats, but also to enable additional work at MotherDuck to support partition pruning in its hybrid query optimizer.

Implements IcebergGetPartitionStats for iceberg_scan, wiring it into DuckDB's partition statistics API.

For each data file the function surfaces the per-file row count together with per-column min/max bounds (read from the Iceberg manifest's lower_bounds/upper_bounds maps) via a new IcebergPartitionRowGroup. This allows the planner to prune partitions using column-level predicates and to produce accurate per-partition cardinality estimates, rather than falling back to a single table-level estimate.

For tables with delete files, the function scans the delete manifests independently to compute per-data-file net row counts. Positional delete entries that carry referenced_data_file (V3 puffin deletion vectors and optimised V2 positional deletes) are resolved exactly: the deleted count is subtracted from the data file's gross record_count and the result is reported as COUNT_EXACT. Equality deletes and V2 positional deletes without a specific target file remain COUNT_APPROXIMATE, since their impact cannot be determined per-file from manifest metadata alone.

Tests are added to verify that EXPLAIN cardinality reflects partition pruning (~5,000 full scan → ~1,000 with a single-partition filter on filtering_on_partition_bounds) and that deletion vectors produce correct net estimates (deletion_vectors: EXPLAIN ~50,000, actual count(*) = 50,000). A new data generator and test (partitioned_deletion_vectors) specifically exercises the combination of partition pruning and deletion vectors, showing the per-file net count (~500) rather than the gross count (~1,000) after applying a partition filter.

Depends on duckdb core PR: https://github.com/motherduckdb/duckdb/pull/54
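A minimal sketch of the per-file counting rule described above, using simplified stand-ins for the DuckDB types (PartitionStatistics, CountType) and manifest-entry fields; it illustrates the idea rather than the PR's actual code:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-ins; the real definitions live in DuckDB / duckdb-iceberg.
enum class CountType { COUNT_EXACT, COUNT_APPROXIMATE };

struct PartitionStatistics {
	uint64_t count = 0; // (net) row count for this unit
	CountType count_type = CountType::COUNT_APPROXIMATE;
};

struct ManifestEntry {
	std::string file_path;
	uint64_t record_count = 0;           // gross row count from the manifest
	uint64_t resolved_deletes = 0;       // deletes attributable to this file
	bool has_unresolved_deletes = false; // equality / untargeted deletes
};

// Subtract exactly-resolved deletes and mark the count EXACT; if any deletes
// cannot be attributed per-file, the count stays APPROXIMATE.
std::vector<PartitionStatistics> GetPartitionStats(const std::vector<ManifestEntry> &entries) {
	std::vector<PartitionStatistics> result;
	for (const auto &entry : entries) {
		PartitionStatistics stats;
		stats.count = entry.record_count - entry.resolved_deletes;
		stats.count_type = entry.has_unresolved_deletes ? CountType::COUNT_APPROXIMATE
		                                                : CountType::COUNT_EXACT;
		result.push_back(stats);
	}
	return result;
}
```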
Tmonster left a comment
Thanks! A couple of questions, and it looks like there is a merge conflict?
Could this also be used to optimize count(*) queries on a single partition? If we return COUNT_EXACT for the count type, I suppose it should be possible, right?
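To make the question concrete: if every partition reports COUNT_EXACT, a count(*) could in principle be answered by summing the metadata counts without scanning any data file. A sketch with stand-in types, not DuckDB's actual optimizer internals:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

enum class CountType { COUNT_EXACT, COUNT_APPROXIMATE };
struct PartitionStatistics {
	uint64_t count = 0;
	CountType count_type = CountType::COUNT_APPROXIMATE;
};

// Returns the exact total row count if every partition is EXACT,
// or nothing if any partition would force a real scan.
std::optional<uint64_t> TryMetadataCount(const std::vector<PartitionStatistics> &parts) {
	uint64_t total = 0;
	for (const auto &p : parts) {
		if (p.count_type != CountType::COUNT_EXACT) {
			return std::nullopt;
		}
		total += p.count;
	}
	return total;
}
```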
# - seq=1 has 500 rows deleted via V3 puffin deletion vectors (col%2=0)
# - seq=2 and seq=3 have no deletions
#
# Without this PR's fix:
Comments specific to this PR can be removed; they lose context once the PR is merged.
if (!entry.data_file.lower_bounds.empty() || !entry.data_file.upper_bounds.empty()) {
	stats.partition_row_group = make_shared_ptr<IcebergPartitionRowGroup>(schema, entry.data_file);
}
result.push_back(std::move(stats));
looks like we are pushing back a PartitionStatistic for every manifest_entry? There could be multiple manifest entries that belong to the same partition because they are inserted over many snapshots.
Can we add a test for this?
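A sketch of the aggregation being asked for here: fold every manifest entry that shares a partition value into a single statistic, instead of emitting one per entry. The partition_key field is a hypothetical placeholder for however the partition tuple is identified:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct ManifestEntry {
	std::string partition_key; // placeholder: serialized partition tuple
	uint64_t record_count = 0;
};

struct PartitionStatistics {
	uint64_t count = 0;
};

// One statistic per distinct partition value, even when the partition was
// written across many snapshots and therefore spans many manifest entries.
std::vector<PartitionStatistics> AggregateByPartition(const std::vector<ManifestEntry> &entries) {
	std::map<std::string, uint64_t> rows_per_partition;
	for (const auto &e : entries) {
		rows_per_partition[e.partition_key] += e.record_count;
	}
	std::vector<PartitionStatistics> result;
	for (const auto &kv : rows_per_partition) {
		result.push_back(PartitionStatistics{kv.second});
	}
	return result;
}
```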
The linked "duckdb core PR" is a motherduckdb PR, not accessible.
function.table_scan_progress = nullptr;
function.get_bind_info = IcebergBindInfo;
function.get_virtual_columns = IcebergVirtualColumns;
function.get_partition_stats = IcebergGetPartitionStats;
I think this already has an implementation on main; you'll need to merge with that.
entries.clear();
reader.Read(STANDARD_VECTOR_SIZE, entries);
for (auto &e : entries) {
	if (e.data_file.content == IcebergManifestEntryContentType::POSITION_DELETES &&
Just a note:
This condition is very optimistic, as referenced_data_file was seemingly added in V3 and "backported" to V2, given it's listed under the v3 spec changes: https://iceberg.apache.org/spec/#version-3
auto &snapshot = *file_list.GetSnapshot();

// Re-scan delete manifests independently (don't consume the shared delete_manifest_reader)
auto delete_scan =
This seems wasteful: we have already called GetTotalFileCount, which will have read all the manifests. Can't we use that data, which is already on the file_list?
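Sketched with stand-in types, the suggestion amounts to folding the delete counts from entries the file list has already materialized (as a side effect of GetTotalFileCount), instead of opening the manifests a second time; the cached-entries parameter is hypothetical:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct DeleteEntry {
	std::string referenced_data_file; // empty if not targeted at one file
	uint64_t record_count = 0;
};

// Hypothetical: entries already cached on the file list by the earlier
// GetTotalFileCount pass, so no manifest is opened a second time.
std::unordered_map<std::string, uint64_t>
DeletesFromCachedEntries(const std::vector<DeleteEntry> &cached_delete_entries) {
	std::unordered_map<std::string, uint64_t> deletes_per_file;
	for (const auto &e : cached_delete_entries) {
		if (!e.referenced_data_file.empty()) {
			deletes_per_file[e.referenced_data_file] += e.record_count;
		}
	}
	return deletes_per_file;
}
```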
for (auto &e : entries) {
	if (e.data_file.content == IcebergManifestEntryContentType::POSITION_DELETES &&
	    !e.data_file.referenced_data_file.empty()) {
		deletes_per_file[e.data_file.referenced_data_file] += e.data_file.record_count;
This doesn't take into account uncommitted data (for when we're inside a transaction).
It also doesn't account for invalidated positional-delete files, or deleted deletion-vector files. From the spec:
"Deletion vectors are maintained synchronously: Writers must merge DVs (and older position delete files) to ensure there is at most one DV per data file"
"Readers can safely ignore position delete files if there is a DV for a data file"
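Taken together, the two spec rules quoted above imply a precedence when accumulating deletes: a deletion vector's count replaces, rather than adds to, any position-delete counts for the same data file. A sketch with stand-in types; detecting a DV via file_format == "puffin" is an assumption here:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Stand-in for the manifest-entry fields used below (hypothetical).
struct DeleteEntry {
	std::string referenced_data_file; // target data file, if any
	std::string file_format;          // assumed: "puffin" marks a deletion vector
	uint64_t record_count = 0;        // number of deleted positions
};

// There is at most one DV per data file, and a DV supersedes older
// position-delete files for that file, so a DV's count replaces any
// accumulated total instead of adding to it.
std::unordered_map<std::string, uint64_t>
NetDeletesPerFile(const std::vector<DeleteEntry> &delete_entries) {
	std::unordered_map<std::string, uint64_t> deletes_per_file;
	std::unordered_set<std::string> has_dv;
	for (const auto &e : delete_entries) {
		if (e.referenced_data_file.empty()) {
			continue; // cannot be attributed to a single data file
		}
		const bool is_dv = (e.file_format == "puffin");
		if (is_dv) {
			deletes_per_file[e.referenced_data_file] = e.record_count;
			has_dv.insert(e.referenced_data_file);
		} else if (!has_dv.count(e.referenced_data_file)) {
			deletes_per_file[e.referenced_data_file] += e.record_count;
		}
	}
	return deletes_per_file;
}
```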
Tishj left a comment
This should be retargeted to main, and the following changes made:
I think we can use the file_list.positional_delete_data map to remove the logic in place that computes deletes_per_file.
We might need to create a modified version of GetEqualityDeletesForFile to efficiently check whether any equality deletes apply to the partition, rather than to a single data file.
And I think we shouldn't be creating an IcebergPartitionRowGroup per data file, but rather create it per partition-spec-id?
Note to self and other reviewers: the relevant logic in core that consumes the information created here is RowGroupReorderer::GetOffsetAfterPruning, just for future reference.
Thanks @Tishj - indeed this is intended for the new main, but I was actually planning to keep this a bit out of sight for the moment. MotherDuck at this moment still has to release v1.5, and only after that will we have a main. But I will work on your comments, thanks so much!