
Partition-aware cardinality estimation for Iceberg #778

Open

peterboncz wants to merge 4 commits into duckdb:v1.5-variegata from motherduckdb:pb/iceberg-partitioning-aware-cardinality-estimation

Conversation

@peterboncz
Contributor

@peterboncz peterboncz commented Mar 9, 2026

NOTE: this is (obviously) a post-v1.5 PR. It was created not only to enhance Iceberg with better stats, but also to enable additional work at MotherDuck to support partition pruning in its hybrid query optimizer.

Implements IcebergGetPartitionStats for iceberg_scan, wiring it into DuckDB's partition statistics API.

For each data file the function returns the per-file row count together with per-column min/max bounds (read from the Iceberg manifest's lower_bounds/upper_bounds maps) via a new IcebergPartitionRowGroup. This allows the planner to prune partitions using column-level predicates and to produce accurate per-partition cardinality estimates, rather than falling back to a single table-level estimate.
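For illustration, a minimal self-contained sketch of that per-file shape; the types below (DataFile, PartitionStatistics, IcebergPartitionRowGroup) are simplified stand-ins, not the actual DuckDB/extension classes:

```cpp
// Simplified stand-ins; names and fields are illustrative, not the real API.
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class CountType { COUNT_EXACT, COUNT_APPROXIMATE };

struct DataFile {
	uint64_t record_count = 0;
	// Per-column bounds keyed by field id, as stored in the manifest.
	std::map<int32_t, std::string> lower_bounds;
	std::map<int32_t, std::string> upper_bounds;
};

struct IcebergPartitionRowGroup {
	std::map<int32_t, std::string> lower_bounds;
	std::map<int32_t, std::string> upper_bounds;
};

struct PartitionStatistics {
	uint64_t count = 0; // per-file row count (the manifest's record_count)
	CountType count_type = CountType::COUNT_APPROXIMATE;
	std::shared_ptr<IcebergPartitionRowGroup> partition_row_group;
};

// One entry per data file: the gross record_count plus whatever column
// bounds the manifest already stores.
std::vector<PartitionStatistics> GetPartitionStats(const std::vector<DataFile> &files) {
	std::vector<PartitionStatistics> result;
	for (const auto &file : files) {
		PartitionStatistics stats;
		stats.count = file.record_count;
		stats.count_type = CountType::COUNT_EXACT;
		if (!file.lower_bounds.empty() || !file.upper_bounds.empty()) {
			auto rg = std::make_shared<IcebergPartitionRowGroup>();
			rg->lower_bounds = file.lower_bounds;
			rg->upper_bounds = file.upper_bounds;
			stats.partition_row_group = std::move(rg);
		}
		result.push_back(std::move(stats));
	}
	return result;
}
```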

For tables with delete files the function scans the delete manifests independently to compute per-data-file net row counts. Positional delete entries that carry referenced_data_file (V3 puffin deletion vectors and optimised V2 positional deletes) are resolved exactly: the deleted count is subtracted from the data file's gross record_count and the result is reported as COUNT_EXACT. Equality deletes and V2 positional deletes without a specific target file remain COUNT_APPROXIMATE, since their impact cannot be determined per-file from manifest metadata alone.
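Sketched against the stand-ins above (DeleteEntry and its fields are hypothetical; the real code reads them from the delete manifests):

```cpp
// Hypothetical delete-manifest entry, continuing the stand-ins above.
struct DeleteEntry {
	std::string referenced_data_file; // empty when no specific target file
	uint64_t record_count = 0;
	bool is_equality_delete = false;
};

// Deletes that name a referenced_data_file are subtracted exactly; equality
// deletes and untargeted positional deletes demote the counts to approximate.
void ApplyDeleteCounts(std::map<std::string, PartitionStatistics> &stats_per_file,
                       const std::vector<DeleteEntry> &delete_entries) {
	std::map<std::string, uint64_t> deletes_per_file;
	bool has_unattributed_deletes = false;
	for (const auto &del : delete_entries) {
		if (!del.is_equality_delete && !del.referenced_data_file.empty()) {
			deletes_per_file[del.referenced_data_file] += del.record_count;
		} else {
			has_unattributed_deletes = true;
		}
	}
	for (auto &entry : stats_per_file) {
		auto it = deletes_per_file.find(entry.first);
		if (it != deletes_per_file.end()) {
			entry.second.count -= it->second; // net = gross - exactly attributed
		}
		if (has_unattributed_deletes) {
			entry.second.count_type = CountType::COUNT_APPROXIMATE;
		}
	}
}
```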

Tests are added to verify that EXPLAIN cardinality reflects partition pruning (~5,000 full scan → ~1,000 with a single-partition filter on filtering_on_partition_bounds) and that deletion vectors produce correct net estimates (deletion_vectors: EXPLAIN ~50,000, actual count(*) = 50,000). A new data generator and test (partitioned_deletion_vectors) specifically exercises the combination of partition pruning and deletion vectors, showing the per-file net count (~500) rather than the gross count (~1,000) after applying a partition filter.

Depends on duckdb core PR: https://github.com/motherduckdb/duckdb/pull/54

ywelsch and others added 3 commits February 23, 2026 09:17
@peterboncz peterboncz changed the base branch from main to v1.5-variegata March 9, 2026 17:47
Member

@Tmonster Tmonster left a comment


Thanks! A couple of questions, and it looks like there is a merge conflict?

Could this also be used to optimize count(*) queries on a single partition? If we return COUNT_EXACT for the count type, I suppose it should be possible, right?
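(For illustration, a sketch of what the consumer side could look like, reusing the stand-in types from the description above; whether the planner actually short-circuits count(*) this way is exactly the open question:)

```cpp
// Hypothetical: if every surviving partition reports COUNT_EXACT, a count(*)
// over those partitions can be answered from the statistics alone.
bool TryAnswerCountStar(const std::vector<PartitionStatistics> &surviving, uint64_t &total) {
	total = 0;
	for (const auto &stats : surviving) {
		if (stats.count_type != CountType::COUNT_EXACT) {
			return false; // any approximate partition forces a real scan
		}
		total += stats.count;
	}
	return true;
}
```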

# - seq=1 has 500 rows deleted via V3 puffin deletion vectors (col%2=0)
# - seq=2 and seq=3 have no deletions
#
# Without this PR's fix:
Member


Comments specific to this PR can be removed; they lose context once the PR is merged.

if (!entry.data_file.lower_bounds.empty() || !entry.data_file.upper_bounds.empty()) {
	stats.partition_row_group = make_shared_ptr<IcebergPartitionRowGroup>(schema, entry.data_file);
}
result.push_back(std::move(stats));
Member


looks like we are pushing back a PartitionStatistic for every manifest_entry? There could be multiple manifest entries that belong to the same partition because they are inserted over many snapshots.
Can we add a test for this?
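(A sketch of the aggregation this comment asks about, with hypothetical PartitionKey and WidenBounds helpers; entries sharing a partition value would accumulate into one statistics object instead of being emitted individually:)

```cpp
// Hypothetical merge: key statistics by (spec id, partition tuple) so that
// entries written across many snapshots collapse into one PartitionStatistics.
std::map<std::string, PartitionStatistics> per_partition;
for (const auto &entry : manifest_entries) {
	auto key = PartitionKey(entry);    // hypothetical: spec id + partition values
	auto &stats = per_partition[key];
	stats.count += entry.data_file.record_count; // sum row counts across snapshots
	WidenBounds(stats, entry);         // hypothetical: min of lowers, max of uppers
}
```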

@Tishj
Member

Tishj commented Mar 11, 2026

The linked "duckdb core PR" is a motherduckdb PR, not accessible.
I don't think this should be targeting 1.5-variegata, probably should be main

function.table_scan_progress = nullptr;
function.get_bind_info = IcebergBindInfo;
function.get_virtual_columns = IcebergVirtualColumns;
function.get_partition_stats = IcebergGetPartitionStats;
Member


I think this already has an implementation on main, you'll need to merge with that

entries.clear();
reader.Read(STANDARD_VECTOR_SIZE, entries);
for (auto &e : entries) {
	if (e.data_file.content == IcebergManifestEntryContentType::POSITION_DELETES &&
Member


Just a note:
This condition is very optimistic, as referenced_data_file was seemingly added in V3 and "backported" to V2, given it's listed under the v3 spec changes: https://iceberg.apache.org/spec/#version-3
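(One possible tightening, sketched with a hypothetical format_version variable standing in for wherever the extension tracks the table's format version:)

```cpp
// Sketch: only treat referenced_data_file as a reliable exact attribution
// where the spec guarantees the field is written (format_version is a
// hypothetical stand-in).
bool can_attribute_exactly =
    e.data_file.content == IcebergManifestEntryContentType::POSITION_DELETES &&
    !e.data_file.referenced_data_file.empty() && format_version >= 3;
```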

auto &snapshot = *file_list.GetSnapshot();

// Re-scan delete manifests independently (don't consume the shared delete_manifest_reader)
auto delete_scan =
Member

@Tishj Tishj Mar 11, 2026


This seems wasteful, as we have already called GetTotalFileCount, which will have read all the manifests. We can use that data on the file_list, no?

for (auto &e : entries) {
	if (e.data_file.content == IcebergManifestEntryContentType::POSITION_DELETES &&
	    !e.data_file.referenced_data_file.empty()) {
		deletes_per_file[e.data_file.referenced_data_file] += e.data_file.record_count;
Member


This doesn't take into account uncommitted data (for when we're inside a transaction)
It also doesn't account for invalidated positional-delete files, or deleted deletion-vector files.

Deletion vectors are maintained synchronously: Writers must merge DVs (and older position delete files) to ensure there is at most one DV per data file
Readers can safely ignore position delete files if there is a DV for a data file
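(A reader-side sketch of that invariant, extending the hypothetical DeleteEntry stand-in from earlier with an is_deletion_vector flag: a DV, when present, supersedes older position delete files for its data file:)

```cpp
// At most one DV exists per data file; if one is present it supersedes the
// older position delete files, which can then be safely ignored.
std::vector<const DeleteEntry *> ApplicableDeletes(const std::vector<const DeleteEntry *> &for_file) {
	for (const auto *d : for_file) {
		if (d->is_deletion_vector) { // hypothetical flag on the stand-in struct
			return {d};
		}
	}
	return for_file; // no DV: every position delete file still applies
}
```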

Member

@Tishj Tishj left a comment


This should be retargeted to main, and the following changes made:

- I think we can use the file_list.positional_delete_data map and use that to remove the logic in place to compute deletes_per_file.
- We might need to create a modified version of GetEqualityDeletesForFile to efficiently check whether any equality deletes apply to the partition, rather than to a single data file.

And I think we shouldn't be creating an IcebergPartitionRowGroup per data file, but rather create one per partition-spec-id?

Note to self and other reviewers: the relevant logic in core that consumes the information created here is RowGroupReorderer::GetOffsetAfterPruning, for future reference.

@peterboncz
Contributor Author

Thanks @Tishj - indeed this is intended for the new main, but I was actually planning to keep this a bit out of sight for the moment. MotherDuck still has to release v1.5, and only after that will we have a main.

But, I will work on your comments, thanks so much!
