Add support for Equality Deletes on DeleteFileIndex #3285
rambleraptor wants to merge 3 commits into apache:main
Conversation
@rambleraptor I think we should add a regression test for schema evolution here. This pruning path assumes the current table type for an equality field is the same type that was used when the data file and equality delete were written, which is not always true after a legal type promotion. For reference, Iceberg Java had to address the same schema-evolution issue in apache/iceberg#15268, where the fix was to avoid assuming the current schema is always the right one for equality-delete field resolution.
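A dependency-free sketch of the failure mode being described. The `Field` class, the schemas, and the `int` -> `long` promotion here are illustrative models, not the PR's actual code; the point is that equality-delete fields stay stable by field id while their type can legally change:

```python
from dataclasses import dataclass

# Hypothetical minimal model of schema evolution: equality-delete fields
# must be resolved by field id, because the type seen in the current
# schema may differ from the type at write time.
@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type: str

write_schema = {1: Field(1, "id", "int")}    # schema when the delete was written
current_schema = {1: Field(1, "id", "long")} # after a legal int -> long promotion

# Resolving by field id survives the promotion; assuming the current type
# matches the written type does not.
assert write_schema[1].field_id == current_schema[1].field_id
assert write_schema[1].type != current_schema[1].type
```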
@geruh @kevinjqliu @Fokko please take a look when you can!
I've successfully tried this out with Flink (thanks @Fokko for the tip!) and it's working as I expect it to. Is it worth checking in the files created by Flink?
geruh
left a comment
Nice, thanks for opening @rambleraptor!!! Left some comments below.
Also, +1 to adding the Flink testing, and I believe there were talks about this being added to the TCK! While working on #2255, we tested all delete file combinations with Flink.
self._schema = schema
self._by_partition: dict[tuple[int, Record], PositionDeletes] = {}
self._by_path: dict[str, PositionDeletes] = {}
self._eq_by_partition: dict[tuple[int, Record], EqualityDeletes] = {}
Do we need these additional variables, or can we just add them together here? I want to be careful about adding Java-shaped work.
@@ -1693,7 +1693,12 @@ def _task_to_record_batches(

def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) -> dict[str, list[ChunkedArray]]:
from pyiceberg.types import IntegerType, NestedField

def _create_data_file(file_path: str = "s3://bucket/data.parquet", spec_id: int = 0) -> DataFile:
can we add a similar test for an unpartitioned equality delete and position delete at the same sequence number, to ensure that equality deletes apply at seq < N?
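For reference, the applicability rule from the Iceberg table spec that such a test would pin down can be sketched as follows (the function names are illustrative): position deletes apply to data files with an equal or lower sequence number, while equality deletes apply only to data files with a strictly lower sequence number.

```python
# Delete-file applicability by sequence number, per the Iceberg spec:
# - a position delete applies when data_seq <= delete_seq
# - an equality delete applies when data_seq <  delete_seq
def position_delete_applies(data_seq: int, delete_seq: int) -> bool:
    return data_seq <= delete_seq

def equality_delete_applies(data_seq: int, delete_seq: int) -> bool:
    return data_seq < delete_seq

# At the same sequence number N, only the position delete applies.
assert position_delete_applies(5, 5) is True
assert equality_delete_applies(5, 5) is False
# An equality delete does apply to data written at an earlier sequence number.
assert equality_delete_applies(4, 5) is True
```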
Part of #3270
Rationale for this change
This adds support for getting equality deletes in the DeleteFileIndex.
I'm very purposefully ignoring them in _read_all_delete_files because they will crash.
Are these changes tested?
I made some equality deletes by hand and had PyIceberg read them to see the indexes. Worked as expected. If you know a way to create equality deletes, I can test those as well.
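The indexing this PR describes, keying equality deletes by partition, can be sketched with plain dictionaries. This is a hypothetical simplification of the `_eq_by_partition` mapping shown in the diff, with illustrative names and paths, not the PR's implementation:

```python
from collections import defaultdict

# Hypothetical sketch: index equality-delete file paths by a
# (spec_id, partition) key, mirroring the _eq_by_partition idea.
eq_by_partition: defaultdict[tuple[int, tuple], list[str]] = defaultdict(list)

def index_equality_delete(spec_id: int, partition: tuple, path: str) -> None:
    # Group deletes so a scan can look up only the deletes that could
    # apply to a data file's partition.
    eq_by_partition[(spec_id, partition)].append(path)

index_equality_delete(0, ("2024-01-01",), "s3://bucket/eq-delete-1.parquet")
assert eq_by_partition[(0, ("2024-01-01",))] == ["s3://bucket/eq-delete-1.parquet"]
```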
Are there any user-facing changes?