Enable file pruning on nested columns #820

Open
jvansanten wants to merge 7 commits into duckdb:main from jvansanten:filter-nested-columns

jvansanten (Contributor) commented Mar 19, 2026

After thinking more carefully about what I was trying to achieve in #766, I realized that most of my underlying problem was related to the way FilterCombiner treats nested columns. Given a struct_extract expression, it wraps the inner filter in a StructFilter and pushes it into the filter set as a filter on the containing column. This has two unhelpful side-effects:

  • Table filters can only address top-level columns. This probably makes sense for a row-oriented database where the containing struct needs to be deserialized to get at its members, but not for a column-oriented database. In particular, struct columns have no bounds in the manifest.
  • Filters on different fields of a struct are always combined into a single ConjunctionAndFilter.

This PR attempts to address both by inverting the combining and wrapping that FilterCombiner performs, producing a decomposed TableFilterSet whose filters refer to individual columns. The decomposed set is then used for both partition pruning and range pruning in FileMatchesFilter().
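
The decomposition can be illustrated with a small standalone sketch. This is not the code in this PR and does not use DuckDB's real classes: `Filter`, `MakeConstant`, `MakeStruct`, `MakeAnd`, and `Decompose` are hypothetical stand-ins, predicates are kept as plain strings, and `address` is an assumed column name. It only shows the idea of unwrapping struct wrappers and conjunctions until each predicate is keyed by the path of the column it constrains.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a table filter tree (not DuckDB's API).
struct Filter {
    enum class Kind { Constant, Struct, And };
    Kind kind = Kind::Constant;
    std::string predicate;   // Constant: comparison kept as text, e.g. "> 50"
    std::string child_name;  // Struct: name of the field being descended into
    std::vector<std::shared_ptr<Filter>> children; // Struct: exactly one; And: any number
};
using FilterPtr = std::shared_ptr<Filter>;

FilterPtr MakeConstant(std::string predicate) {
    auto f = std::make_shared<Filter>();
    f->kind = Filter::Kind::Constant;
    f->predicate = std::move(predicate);
    return f;
}

FilterPtr MakeStruct(std::string child_name, FilterPtr inner) {
    auto f = std::make_shared<Filter>();
    f->kind = Filter::Kind::Struct;
    f->child_name = std::move(child_name);
    f->children.push_back(std::move(inner));
    return f;
}

FilterPtr MakeAnd(std::vector<FilterPtr> parts) {
    auto f = std::make_shared<Filter>();
    f->kind = Filter::Kind::And;
    f->children = std::move(parts);
    return f;
}

// Maps a dotted column path (e.g. "address.zip") to the predicates on that column.
using DecomposedFilters = std::map<std::string, std::vector<std::string>>;

// Walks the tree rooted at a top-level column and records each leaf predicate
// under the full path of the nested column it applies to.
void Decompose(const std::string &path, const Filter &filter, DecomposedFilters &out) {
    switch (filter.kind) {
    case Filter::Kind::Constant:
        out[path].push_back(filter.predicate);
        break;
    case Filter::Kind::Struct:
        // Unwrap one level of struct nesting (the Make* helpers guarantee one child).
        Decompose(path + "." + filter.child_name, *filter.children.front(), out);
        break;
    case Filter::Kind::And:
        // Split the conjunction so each branch is attributed to its own column.
        for (const auto &child : filter.children) {
            Decompose(path, *child, out);
        }
        break;
    }
}
```

In this sketch, a conjunction on `address` that wraps a filter on `zip` and a filter on `geo.lat` yields two independent entries, `address.zip` and `address.geo.lat`, each of which could then be checked against that column's own bounds.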

I suspect it would have been cleaner to apply this transformation once to IcebergMultiFileList::table_filters rather than for every single data file, but there appears to be some tight coupling between IcebergMultiFileList::table_filters and IcebergMultiFileReader that I can't quite wrap my head around yet. Suggestions welcome!

jvansanten added a commit to jvansanten/duckdb-iceberg that referenced this pull request Mar 19, 2026
Revert "Add primitive support for conjunction filters"

This reverts commit 9584122.
This reverts commit c31e003.

Better implementation in duckdb#820

Tishj (Member) commented Mar 19, 2026

Changes look great at first glance, but this still needs some tests.

@jvansanten jvansanten force-pushed the filter-nested-columns branch from 894940f to 18df3a4 Compare March 19, 2026 14:35
----
1 Alice {'street': 123 Main St, 'city': Metropolis, 'zip': 12345} [123-456-7890, 987-654-3210] {age=30, membership=gold}

# nested-column filters prune files from scan
Member

Thanks for the tests. Can we extend them to cover:

  • multiple types
  • deeper nesting
  • lists
  • maps
  • IS NULL / IS NOT NULL on child columns
  • IS NULL / IS NOT NULL on structs/lists/maps (not elements)
  • conjunction AND on combinations of the above items

Contributor Author

I've added a few new tests in test/sql/local/irc_any_catalog/reads/test_nested_column_pruning.test. Files are not currently pruned in two cases: filters on elements of lists or maps, because PushDownFilterIntoExpr() leaves these as generic ExpressionFilters; and IS NULL or IS NOT NULL predicates, because FilterCombiner::TryPushdownExpression() does not handle them.

Note that many of these demonstrate cases where filter pushdown does _not_ work.

IcebergPredicateStats::DeserializeBounds refuses to set null bounds, making it impossible to distinguish between columns where the bounds are unset and columns that contain only nulls. Use has_not_null to disambiguate.
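
A rough sketch of that disambiguation, under stated assumptions: `ColumnStats` and `FileMayMatchGreaterThan` are hypothetical names, not the extension's actual types, and `has_not_null` is assumed to be false only when the file is known to contain nothing but nulls in the column. The sketch requires C++17 for `std::optional`.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical per-file statistics for one column (illustrative field names).
struct ColumnStats {
    std::optional<int64_t> lower; // empty when no lower bound was recorded
    std::optional<int64_t> upper; // empty when no upper bound was recorded
    bool has_not_null = true;     // assumed false only if every value is known to be NULL
};

// Returns false only when the file provably has no row with column > threshold.
bool FileMayMatchGreaterThan(const ColumnStats &stats, int64_t threshold) {
    if (!stats.has_not_null) {
        // Only NULLs: a comparison against NULL is never true, so the file can be pruned.
        return false;
    }
    if (!stats.upper) {
        // Bounds are simply unset: nothing can be proven, so the file must be kept.
        return true;
    }
    // Some value can exceed the threshold only if the upper bound does.
    return *stats.upper > threshold;
}
```

Without the `has_not_null` check, the all-null case and the unset-bounds case would both reach the second branch and the file would always be kept.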

jvansanten (Contributor, Author) commented

Ahoy, is there anything else that needs to happen before this can be merged?
