feat: DH-21493: pushdown support for sorted regions#7955
feat: DH-21493: pushdown support for sorted regions#7955lbooker42 wants to merge 49 commits intodeephaven:mainfrom
Conversation
…delegate to the ColumnRegion for predicate pushdown.
…nderneath Unioned table to activate Column Region optimizations.
…21493-sorted-pushdown
…t. Documentation.
…ation' into nightly/DH-21522-parquettablelocation # Conflicts: # engine/table/src/main/java/io/deephaven/engine/table/impl/PushdownResult.java
…lumnSourceManager to store table definition instead of List<ColumnDefinition>
…ation' into nightly/DH-21493-sorted-pushdown # Conflicts: # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/RegionedPushdownFilterLocationContext.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionByte.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionChar.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionDouble.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionFloat.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionInt.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionLong.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionObject.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionShort.java
# Conflicts: # engine/table/src/main/java/io/deephaven/engine/table/impl/BasePushdownFilterContext.java # engine/table/src/main/java/io/deephaven/engine/table/impl/BasePushdownFilterContextImpl.java # engine/table/src/main/java/io/deephaven/engine/table/impl/PushdownFilterMatcher.java # engine/table/src/main/java/io/deephaven/engine/table/impl/dataindex/DataIndexPushdownManager.java # engine/table/src/main/java/io/deephaven/engine/table/impl/locations/impl/AbstractTableLocation.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/UnionSourceManager.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegion.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionByte.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionChar.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionDouble.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionFloat.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionInt.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionLong.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionObject.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/ColumnRegionShort.java # engine/table/src/main/java/io/deephaven/engine/table/impl/sources/regioned/RegionedPushdownAction.java # engine/table/src/test/java/io/deephaven/engine/table/impl/TestPartitionAwareSourceTableNoMocks.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/location/ParquetTableLocation.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionByte.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionChar.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionDouble.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionFloat.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionInt.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionLong.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionObject.java # extensions/parquet/table/src/main/java/io/deephaven/parquet/table/region/ParquetColumnRegionShort.java # extensions/parquet/table/src/test/java/io/deephaven/parquet/table/ParquetTableFilterTest.java # extensions/parquet/table/src/test/java/io/deephaven/parquet/table/ParquetTableReadWriteTest.java
No docs changes detected for d8abd52 |
|
|
||
| executedFilterCost = 0; | ||
| // Extract the effective filter and use it for populating the context. | ||
| final WhereFilter effectiveFilter = WhereFilterDelegating.maybeUnwrapFilter(filter); |
There was a problem hiding this comment.
This change is pretty significant, allowing wrapped/delegating filters to still be processed efficiently. Unfortunately, this means RowCapturingFilters row tracking is also often getting bypassed so a lot of tests had to be updated to show this.
There was a problem hiding this comment.
We'll may also need to update the Core+ tests once this merges; because staging/gplus may break because of the new found transparency of the RowSetCapturingFilter. It is worth running a local pTML and unit test from Core+ so we have that ready to go if necessary.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 39 out of 39 changed files in this pull request and generated 20 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
An earlier PR (#7666) merged a change to the Page Stores that forces pushdown to the sub-regions of the store. This is good in many scenarios (e.g. dictionary) but not ideal for this scenario. It results in each sorted sub-region being searched even though they might have been skipped if we looked at the page store overall. Do we need a hint (maybe in the RegionedPushDownAction.Region) about whether we should apply the pushdown to sub-regions or to the top-level page-store. This might be the only case and it might not matter too much. We might someday have sorted subregions as part of a unsorted page-store so might this is over-optimizing? |
| WhereFilter getWrappedFilter(); | ||
|
|
||
| /** | ||
| * Returns the wrapped filter if and only if the wrapped filter is functionally identical to this filter. Filters |
There was a problem hiding this comment.
This needs more explanation. I think things like barriers could be considered function; so maybe something more like "if the wrapped filter accepts the same rows as this filter." Then we can say e.g. a barrier does not change the rows something accepts as an example.
|
|
||
| @Override | ||
| public WhereFilter maybeUnwrapFilter() { | ||
| if (filter instanceof WhereFilterDelegating) { |
There was a problem hiding this comment.
Can we use the static method we defined on filter to avoid the duplication?
| /** | ||
| * Performs a binary search on a given column region to find the positions (row keys) of specified sorted keys. The | ||
| * method returns the RowSet containing the matched row keys. | ||
| * Performs a binary search on a given column region to find the positions (row keys) of specified keys. The method |
There was a problem hiding this comment.
So we've changed the contract to be looser (though the existing kernel didn't take advantage of this with firstKey and lastKey being set). This makes sense, because we are sorting the unboxed array; so there is no need to have the sorted requirement in the contract.
However, we are not taking full advantage of the sorting. The findStartIndexAscending is returning -1; so we have no information to update firstKey when something isn't found. If we returned an Arrays.binarySearch style position of where the thing would be we could actually even short circuit the remaining to-find values if we are past the end of the known values.
| final int end; | ||
|
|
||
| if (sortColumn.isAscending()) { | ||
| start = findStartIndexAscending(region, firstKey, lastKey, min, minInc); |
There was a problem hiding this comment.
The -1 return for not found seems at odds with our match results in the binarySearchMatch case.
If I'm right, I want to make sure we have test coverage for any bugs that would be present because of that.
| if (sortColumn.isAscending()) { | ||
| try (final ObjectTimsortKernel.ObjectSortKernelContext<Any> context = | ||
| ObjectTimsortKernel.createContext(searchValues.length)) { | ||
| context.sort(WritableObjectChunk.writableChunkWrap(searchValues)); |
There was a problem hiding this comment.
We're going to be changing searchValues. This should be documented or we should make a copy of it.
| return input.copy(); | ||
| } | ||
|
|
||
| // We will use the effective filter from the context, which may bypass row tracking but provides the |
There was a problem hiding this comment.
Probably more important to explicate why the barriers, etc. don't matter at this level rather than our testing framework.
| try (final RowSet ignored = matches) { | ||
| return PushdownResult.of( | ||
| selection, | ||
| selection.subSetByKeyRange(matches.firstRowKey(), matches.lastRowKey()), |
There was a problem hiding this comment.
seems strange we would use the subsetBykeyRange here and intersect above. I think I would prefer intersect; and we should hope intersect does a good job, like what happens if a region decided to be non-contiguous in its rowset [of course we shouldn't have that in selection]. So maybe it doesn't matter, but I think we should pick one.
|
|
||
| if (ctx.isRangeFilter() && effectiveFilter instanceof RangeFilter | ||
| && ((RangeFilter) effectiveFilter).getRealFilter() instanceof SingleSidedComparableRangeFilter) { | ||
| final SingleSidedComparableRangeFilter rangeFilter = |
There was a problem hiding this comment.
Why not ComparableRangeFilter?
| } | ||
| missingValues.add(value); | ||
| } | ||
| return missingValues; |
There was a problem hiding this comment.
if we hit maxFailures should we throw or be otherwise unhappy?
| missingValue, | ||
| missingValue, false, | ||
| false)) { | ||
| Assert.assertTrue(valuesFound.isEmpty()); |
There was a problem hiding this comment.
This test isn't great. We should test the missing value as one side of this.
"Apple", "Carrot", "Dragonfruit"
"Banana" is missing, but you should still return something in MinMax if you had "Banana", "Eggplant"
Because you have (A, A) as your range and neither are inclusive, you could never find anything anyway.
This adds support for match and range filters to push down into binary search over sorted regions.