dynamic filtering development branch #516
Draft
jayshrivastava wants to merge 7 commits into
introduces the proto converter to the PhysicalExtensionCodec trait which helps dedupe dynamic filters
Goal
Closes #530
I would like to create a development branch to work on dynamic filtering. DataFusion 54, which is currently used in `main`, does not have this change: apache/datafusion#21055.

This PR aims to create a development branch using apache/datafusion@01bf68c, which is the latest commit on `apache/datafusion:main` as of writing.

This is almost like updating `main` to DataFusion 55. I say almost because DataFusion 55 is not released yet and there may be other changes that go in. See apache/datafusion#22393.

Once DataFusion 55 releases, we can do the upgrade on this branch and merge it into `datafusion-distributed:main`. I don't want all the work in this PR to go to waste.

Summary of Significant Changes
- `repartition_file_min_size` changed from 10 MiB to 1 MiB. The PR explicitly calls out TPC-DS SF1 dimension tables. Files may appear in multiple partitions, but each partition reads a different byte range (this is hidden by ...., but we know from the correctness tests that nothing broke). A lot of TPC-DS queries now split across `target_partitions` instead of staying under-partitioned. In the `tpcds` plan tests, we use `target_partitions=3`.

Example:
- `dynamic_rg_pruning=eligible` display changes: feat(parquet): intra-file early stopping via statistics + dynamic filters apache/datafusion#22450 adds `dynamic_rg_pruning=eligible` to leaf nodes. Lots of snapshots change because of this.
- In `_inject_network_boundaries`, we would assume that every CollectLeft HashJoin can be broadcast. So, I added some code (look for `collect_left_hash_join_requires_single_task`) to check for a `BroadcastExec` before committing to doing a broadcast join. Without this change, correctness tests fail. I honestly think we previously did not have good coverage for `LeftSemi`, which is why this bug never appeared before.
- Why can't `LeftSemi` be broadcast? A left semi join means "emit a row on the left side of the join if it matches any row on the right". You cannot broadcast the left side of a `LeftSemi` join because that duplicates the left side across tasks, so a left row that matches in more than one task may be emitted more than once.

Example: TPC-DS query 1 (note the new `LeftSemi` join)
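
The `LeftSemi` duplication problem described above can be reproduced with a small standalone sketch. This is not code from this PR or from DataFusion: `left_semi` and `broadcast_left_semi` are hypothetical helpers, and the model is an assumption (each task gets a full copy of the left side and one partition of the right side, and the per-task outputs are concatenated).

```rust
use std::collections::HashSet;

/// Left semi join on a single task: emit each left key once if it
/// matches any key on the right. (Hypothetical helper for illustration.)
fn left_semi(left: &[i32], right: &[i32]) -> Vec<i32> {
    let right_keys: HashSet<i32> = right.iter().copied().collect();
    left.iter()
        .copied()
        .filter(|k| right_keys.contains(k))
        .collect()
}

/// Assumed model of a broadcast join: every task receives the full left
/// side and one partition of the right side, and the outputs of all
/// tasks are concatenated.
fn broadcast_left_semi(left: &[i32], right_partitions: &[Vec<i32>]) -> Vec<i32> {
    right_partitions
        .iter()
        .flat_map(|part| left_semi(left, part))
        .collect()
}

fn main() {
    let left = vec![1, 2, 3];

    // Single task: key 1 matches twice on the right but is emitted once.
    let single = left_semi(&left, &[1, 1, 2, 4]);
    println!("single task: {:?}", single); // [1, 2]

    // Same right-side rows, split across two tasks. Key 1 has a match in
    // both partitions, so both tasks emit it.
    let partitions = vec![vec![1, 2], vec![1, 4]];
    let broadcast = broadcast_left_semi(&left, &partitions);
    println!("broadcast:   {:?}", broadcast); // [1, 2, 1]
}
```

In this model the broadcast result contains key `1` twice, which is the kind of duplicate row that a check like `collect_left_hash_join_requires_single_task` is meant to prevent by keeping the join on a single task.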