
Partition-aware cardinality estimation for Iceberg #778

Open

peterboncz wants to merge 4 commits into duckdb:v1.5-variegata from motherduckdb:pb/iceberg-partitioning-aware-cardinality-estimation

Conversation

@peterboncz
Contributor

@peterboncz peterboncz commented Mar 9, 2026

NOTE: this is (obviously) a post-v1.5 PR. It was created not only to enhance Iceberg with better stats, but also to enable additional work at MotherDuck to support partition pruning in its hybrid query optimizer.

Implements IcebergGetPartitionStats for iceberg_scan, wiring it into DuckDB's partition statistics API.

For each data file the function returns the per-file row count together with per-column min/max bounds (read from the Iceberg manifest's lower_bounds/upper_bounds maps) via a new IcebergPartitionRowGroup. This allows the planner to prune partitions using column-level predicates and to produce accurate per-partition cardinality estimates, rather than falling back to a single table-level estimate.
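For illustration, a minimal self-contained sketch of that per-file shape; the types below (DataFile, PartitionStatistics, IcebergPartitionRowGroup) are simplified stand-ins, not the actual DuckDB/extension classes:

```cpp
// Simplified stand-ins; names and fields are illustrative, not the real API.
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class CountType { COUNT_EXACT, COUNT_APPROXIMATE };

struct DataFile {
	uint64_t record_count = 0;
	// Per-column bounds keyed by field id, as stored in the manifest.
	std::map<int32_t, std::string> lower_bounds;
	std::map<int32_t, std::string> upper_bounds;
};

struct IcebergPartitionRowGroup {
	std::map<int32_t, std::string> lower_bounds;
	std::map<int32_t, std::string> upper_bounds;
};

struct PartitionStatistics {
	uint64_t count = 0; // per-file row count (the manifest's record_count)
	CountType count_type = CountType::COUNT_APPROXIMATE;
	std::shared_ptr<IcebergPartitionRowGroup> partition_row_group;
};

// One entry per data file: the gross record_count plus whatever column
// bounds the manifest already stores.
std::vector<PartitionStatistics> GetPartitionStats(const std::vector<DataFile> &files) {
	std::vector<PartitionStatistics> result;
	for (const auto &file : files) {
		PartitionStatistics stats;
		stats.count = file.record_count;
		stats.count_type = CountType::COUNT_EXACT;
		if (!file.lower_bounds.empty() || !file.upper_bounds.empty()) {
			auto rg = std::make_shared<IcebergPartitionRowGroup>();
			rg->lower_bounds = file.lower_bounds;
			rg->upper_bounds = file.upper_bounds;
			stats.partition_row_group = std::move(rg);
		}
		result.push_back(std::move(stats));
	}
	return result;
}
```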

For tables with delete files the function scans the delete manifests independently to compute per-data-file net row counts. Positional delete entries that carry referenced_data_file (V3 puffin deletion vectors and optimised V2 positional deletes) are resolved exactly: the deleted count is subtracted from the data file's gross record_count and the result is reported as COUNT_EXACT. Equality deletes and V2 positional deletes without a specific target file remain COUNT_APPROXIMATE, since their impact cannot be determined per-file from manifest metadata alone.
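Sketched against the stand-ins above (DeleteEntry and its fields are hypothetical; the real code reads them from the delete manifests):

```cpp
// Hypothetical delete-manifest entry, continuing the stand-ins above.
struct DeleteEntry {
	std::string referenced_data_file; // empty when no specific target file
	uint64_t record_count = 0;
	bool is_equality_delete = false;
};

// Deletes that name a referenced_data_file are subtracted exactly; equality
// deletes and untargeted positional deletes demote the counts to approximate.
void ApplyDeleteCounts(std::map<std::string, PartitionStatistics> &stats_per_file,
                       const std::vector<DeleteEntry> &delete_entries) {
	std::map<std::string, uint64_t> deletes_per_file;
	bool has_unattributed_deletes = false;
	for (const auto &del : delete_entries) {
		if (!del.is_equality_delete && !del.referenced_data_file.empty()) {
			deletes_per_file[del.referenced_data_file] += del.record_count;
		} else {
			has_unattributed_deletes = true;
		}
	}
	for (auto &entry : stats_per_file) {
		auto it = deletes_per_file.find(entry.first);
		if (it != deletes_per_file.end()) {
			entry.second.count -= it->second; // net = gross - exactly attributed
		}
		if (has_unattributed_deletes) {
			entry.second.count_type = CountType::COUNT_APPROXIMATE;
		}
	}
}
```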

Tests are added to verify that EXPLAIN cardinality reflects partition pruning (~5,000 full scan → ~1,000 with a single-partition filter on filtering_on_partition_bounds) and that deletion vectors produce correct net estimates (deletion_vectors: EXPLAIN ~50,000, actual count(*) = 50,000). A new data generator and test (partitioned_deletion_vectors) specifically exercises the combination of partition pruning and deletion vectors, showing the per-file net count (~500) rather than the gross count (~1,000) after applying a partition filter.

Depends on duckdb core PR: https://github.com/motherduckdb/duckdb/pull/54

ywelsch and others added 3 commits February 23, 2026 09:17
@peterboncz peterboncz changed the base branch from main to v1.5-variegata March 9, 2026 17:47
Member

@Tmonster Tmonster left a comment


Thanks! A couple of questions, and it looks like there is a merge conflict?

Could this also be used to optimize count(*) queries on a single partition? If we return COUNT_EXACT for the count type, I suppose it should be possible, right?
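(For illustration, a sketch of what the consumer side could look like, reusing the stand-in types from the description above; whether the planner actually short-circuits count(*) this way is exactly the open question:)

```cpp
// Hypothetical: if every surviving partition reports COUNT_EXACT, a count(*)
// over those partitions can be answered from the statistics alone.
bool TryAnswerCountStar(const std::vector<PartitionStatistics> &surviving, uint64_t &total) {
	total = 0;
	for (const auto &stats : surviving) {
		if (stats.count_type != CountType::COUNT_EXACT) {
			return false; // any approximate partition forces a real scan
		}
		total += stats.count;
	}
	return true;
}
```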

# - seq=1 has 500 rows deleted via V3 puffin deletion vectors (col%2=0)
# - seq=2 and seq=3 have no deletions
#
# Without this PR's fix:
Member


Comments specific to this PR can be removed; they lose context once the PR is merged.

if (!entry.data_file.lower_bounds.empty() || !entry.data_file.upper_bounds.empty()) {
	stats.partition_row_group = make_shared_ptr<IcebergPartitionRowGroup>(schema, entry.data_file);
}
result.push_back(std::move(stats));
Member


looks like we are pushing back a PartitionStatistic for every manifest_entry? There could be multiple manifest entries that belong to the same partition because they are inserted over many snapshots.
Can we add a test for this?
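(A sketch of the aggregation this comment asks about, with hypothetical PartitionKey and WidenBounds helpers; entries sharing a partition value would accumulate into one statistics object instead of being emitted individually:)

```cpp
// Hypothetical merge: key statistics by (spec id, partition tuple) so that
// entries written across many snapshots collapse into one PartitionStatistics.
std::map<std::string, PartitionStatistics> per_partition;
for (const auto &entry : manifest_entries) {
	auto key = PartitionKey(entry);    // hypothetical: spec id + partition values
	auto &stats = per_partition[key];
	stats.count += entry.data_file.record_count; // sum row counts across snapshots
	WidenBounds(stats, entry);         // hypothetical: min of lowers, max of uppers
}
```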

@Tishj
Member

Tishj commented Mar 11, 2026

The linked "duckdb core PR" is a motherduckdb PR, not accessible.
I don't think this should be targeting 1.5-variegata, probably should be main

function.table_scan_progress = nullptr;
function.get_bind_info = IcebergBindInfo;
function.get_virtual_columns = IcebergVirtualColumns;
function.get_partition_stats = IcebergGetPartitionStats;
Member


I think this already has an implementation on main, you'll need to merge with that

entries.clear();
reader.Read(STANDARD_VECTOR_SIZE, entries);
for (auto &e : entries) {
	if (e.data_file.content == IcebergManifestEntryContentType::POSITION_DELETES &&
Member


Just a note:
This condition is very optimistic, as referenced_data_file was seemingly added in V3 and "backported" to V2, given it's listed under the v3 spec changes: https://iceberg.apache.org/spec/#version-3
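(One possible tightening, sketched with a hypothetical format_version variable standing in for wherever the extension tracks the table's format version:)

```cpp
// Sketch: only treat referenced_data_file as a reliable exact attribution
// where the spec guarantees the field is written (format_version is a
// hypothetical stand-in).
bool can_attribute_exactly =
    e.data_file.content == IcebergManifestEntryContentType::POSITION_DELETES &&
    !e.data_file.referenced_data_file.empty() && format_version >= 3;
```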

auto &snapshot = *file_list.GetSnapshot();

// Re-scan delete manifests independently (don't consume the shared delete_manifest_reader)
auto delete_scan =
Member

@Tishj Tishj Mar 11, 2026


This seems wasteful, as we have already called GetTotalFileCount, which will have read all the manifests. We can use that data on the file_list, no?

for (auto &e : entries) {
	if (e.data_file.content == IcebergManifestEntryContentType::POSITION_DELETES &&
	    !e.data_file.referenced_data_file.empty()) {
		deletes_per_file[e.data_file.referenced_data_file] += e.data_file.record_count;
Member


This doesn't take into account uncommitted data (for when we're inside a transaction)
It also doesn't account for invalidated positional-delete files, or deleted deletion-vector files.

Deletion vectors are maintained synchronously: Writers must merge DVs (and older position delete files) to ensure there is at most one DV per data file
Readers can safely ignore position delete files if there is a DV for a data file
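(A reader-side sketch of that invariant, extending the hypothetical DeleteEntry stand-in from earlier with an is_deletion_vector flag: a DV, when present, supersedes older position delete files for its data file:)

```cpp
// At most one DV exists per data file; if one is present it supersedes the
// older position delete files, which can then be safely ignored.
std::vector<const DeleteEntry *> ApplicableDeletes(const std::vector<const DeleteEntry *> &for_file) {
	for (const auto *d : for_file) {
		if (d->is_deletion_vector) { // hypothetical flag on the stand-in struct
			return {d};
		}
	}
	return for_file; // no DV: every position delete file still applies
}
```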

Member

@Tishj Tishj left a comment


This should be retargeted to main, and the following changes made:

- I think we can use the file_list.positional_delete_data map and use that to remove the logic in place to compute deletes_per_file.
- We might need to create a modified version of GetEqualityDeletesForFile to efficiently check whether any equality deletes apply to the partition, rather than to a single data file.

And I think we shouldn't be creating an IcebergPartitionRowGroup per data file, but rather create one per partition-spec-id?

Note to self and other reviewers: the relevant logic in core that consumes the information created here is RowGroupReorderer::GetOffsetAfterPruning, for future reference.

@peterboncz
Contributor Author

Thanks @Tishj - indeed this is intended for the new main, but I was actually planning to keep this a bit out of sight for the moment. MotherDuck still has to release v1.5, and only after that will we have a main.

But, I will work on your comments, thanks so much!
