Skip to content

feat: DH-21493: pushdown support for sorted regions#7955

Open
lbooker42 wants to merge 49 commits intodeephaven:mainfrom
lbooker42:nightly/DH-21493-sorted-pushdown
Open

feat: DH-21493: pushdown support for sorted regions#7955
lbooker42 wants to merge 49 commits intodeephaven:mainfrom
lbooker42:nightly/DH-21493-sorted-pushdown

Conversation

@lbooker42
Copy link
Copy Markdown
Contributor

This adds support for match and range filters to push down into binary search over sorted regions.

…delegate to the ColumnRegion for predicate pushdown.
…nderneath Unioned table to activate Column Region optimizations.
…ation' into nightly/DH-21522-parquettablelocation

# Conflicts:
#	engine/table/src/main/java/io/deephaven/engine/table/impl/PushdownResult.java
…lumnSourceManager to store table definition instead of List<ColumnDefinition>
…ation' into nightly/DH-21493-sorted-pushdown

# Conflicts:
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/RegionedPushdownFilterLocationContext.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionByte.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionChar.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionDouble.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionFloat.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionInt.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionLong.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionObject.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionShort.java
# Conflicts:
#	engine/table/src/main/java/io/deephaven/engine/table/impl/BasePushdownFilterContext.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/BasePushdownFilterContextImpl.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/PushdownFilterMatcher.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/dataindex/DataIndexPushdownManager.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/locations/impl/AbstractTableLocation.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/UnionSourceManager.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegion.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionByte.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionChar.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionDouble.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionFloat.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionInt.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionLong.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionObject.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionShort.java
#	engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/RegionedPushdownAction.java
#	engine/table/src/test/java/io/deephaven/engine/table/impl/TestPartitionAwareSourceTableNoMocks.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetTableLocation.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionByte.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionChar.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionDouble.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionFloat.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionInt.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionLong.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionObject.java
#	extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionShort.java
#	extensions/parquet/table/src/test/java/io/deephaven/parquet/table/ParquetTableFilterTest.java
#	extensions/parquet/table/src/test/java/io/deephaven/parquet/table/ParquetTableReadWriteTest.java
@lbooker42 lbooker42 self-assigned this Apr 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 27, 2026

No docs changes detected for d8abd52


executedFilterCost = 0;
// Extract the effective filter and use it for populating the context.
final WhereFilter effectiveFilter = WhereFilterDelegating.maybeUnwrapFilter(filter);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is pretty significant, allowing wrapped/delegating filters to still be processed efficiently. Unfortunately, this means RowCapturingFilters row tracking is also often getting bypassed so a lot of tests had to be updated to show this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll may also need to update the Core+ tests once this merges; because staging/gplus may break because of the new found transparency of the RowSetCapturingFilter. It is worth running a local pTML and unit test from Core+ so we have that ready to go if necessary.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 39 out of 39 changed files in this pull request and generated 20 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@lbooker42
Copy link
Copy Markdown
Contributor Author

An earlier PR (#7666) merged a change to the Page Stores that forces pushdown to the sub-regions of the store. This is good in many scenarios (e.g. dictionary) but not ideal for this scenario. It results in each sorted sub-region being searched even though they might have been skipped if we looked at the page store overall.

Do we need a hint (maybe in the RegionedPushDownAction.Region) about whether we should apply the pushdown to sub-regions or to the top-level page-store. This might be the only case and it might not matter too much. We might someday have sorted subregions as part of a unsorted page-store so might this is over-optimizing?

WhereFilter getWrappedFilter();

/**
* Returns the wrapped filter if and only if the wrapped filter is functionally identical to this filter. Filters
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs more explanation. I think things like barriers could be considered function; so maybe something more like "if the wrapped filter accepts the same rows as this filter." Then we can say e.g. a barrier does not change the rows something accepts as an example.


@Override
public WhereFilter maybeUnwrapFilter() {
if (filter instanceof WhereFilterDelegating) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use the static method we defined on filter to avoid the duplication?

/**
* Performs a binary search on a given column region to find the positions (row keys) of specified sorted keys. The
* method returns the RowSet containing the matched row keys.
* Performs a binary search on a given column region to find the positions (row keys) of specified keys. The method
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we've changed the contract to be looser (though the existing kernel didn't take advantage of this with firstKey and lastKey being set). This makes sense, because we are sorting the unboxed array; so there is no need to have the sorted requirement in the contract.

However, we are not taking full advantage of the sorting. The findStartIndexAscending is returning -1; so we have no information to update firstKey when something isn't found. If we returned an Arrays.binarySearch style position of where the thing would be we could actually even short circuit the remaining to-find values if we are past the end of the known values.

final int end;

if (sortColumn.isAscending()) {
start = findStartIndexAscending(region, firstKey, lastKey, min, minInc);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -1 return for not found seems at odds with our match results in the binarySearchMatch case.

If I'm right, I want to make sure we have test coverage for any bugs that would be present because of that.

if (sortColumn.isAscending()) {
try (final ObjectTimsortKernel.ObjectSortKernelContext<Any> context =
ObjectTimsortKernel.createContext(searchValues.length)) {
context.sort(WritableObjectChunk.writableChunkWrap(searchValues));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to be changing searchValues. This should be documented or we should make a copy of it.

return input.copy();
}

// We will use the effective filter from the context, which may bypass row tracking but provides the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably more important to explicate why the barriers, etc. don't matter at this level rather than our testing framework.

try (final RowSet ignored = matches) {
return PushdownResult.of(
selection,
selection.subSetByKeyRange(matches.firstRowKey(), matches.lastRowKey()),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems strange we would use the subsetBykeyRange here and intersect above. I think I would prefer intersect; and we should hope intersect does a good job, like what happens if a region decided to be non-contiguous in its rowset [of course we shouldn't have that in selection]. So maybe it doesn't matter, but I think we should pick one.


if (ctx.isRangeFilter() && effectiveFilter instanceof RangeFilter
&& ((RangeFilter) effectiveFilter).getRealFilter() instanceof SingleSidedComparableRangeFilter) {
final SingleSidedComparableRangeFilter rangeFilter =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not ComparableRangeFilter?

}
missingValues.add(value);
}
return missingValues;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we hit maxFailures should we throw or be otherwise unhappy?

missingValue,
missingValue, false,
false)) {
Assert.assertTrue(valuesFound.isEmpty());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test isn't great. We should test the missing value as one side of this.

"Apple", "Carrot", "Dragonfruit"

"Banana" is missing, but you should still return something in MinMax if you had "Banana", "Eggplant"

Because you have (A, A) as your range and neither are inclusive, you could never find anything anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants