Add support for Equality Deletes on DeleteFileIndex #3285

Open

rambleraptor wants to merge 3 commits into apache:main from rambleraptor:equality-delete-index

Conversation

@rambleraptor
Contributor

Part of #3270

Rationale for this change

This adds support for getting equality deletes in the DeleteFileIndex.

I'm very purposefully ignoring them in _read_all_delete_files because they will crash.
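A minimal sketch of that skip, using a stand-in enum and namedtuple rather than the actual `pyiceberg.manifest` types (the names mirror Iceberg's file-content values, but this is illustrative, not the PR's code):

```python
from collections import namedtuple
from enum import Enum

# Stand-in for the Iceberg file-content enum (illustrative only)
class DataFileContent(Enum):
    DATA = 0
    POSITION_DELETES = 1
    EQUALITY_DELETES = 2

DeleteFile = namedtuple("DeleteFile", ["file_path", "content"])

def deletes_to_read(delete_files):
    # Only position deletes are materialized for reading; equality deletes
    # are indexed but skipped here, since the reader would crash on them.
    return [f for f in delete_files if f.content == DataFileContent.POSITION_DELETES]

files = [
    DeleteFile("s3://bucket/pos-del.parquet", DataFileContent.POSITION_DELETES),
    DeleteFile("s3://bucket/eq-del.parquet", DataFileContent.EQUALITY_DELETES),
]
print([f.file_path for f in deletes_to_read(files)])
```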

Are these changes tested?

I made some equality deletes by hand and had PyIceberg read them to inspect the indexes. It worked as expected. If you know of a way to create equality deletes, I can test those as well.

Are there any user-facing changes?

  • Adds support for equality deletes in DeleteFileIndex

@ndrluis
Collaborator

ndrluis commented Apr 26, 2026

@rambleraptor I think we should add a regression test for schema evolution here. This pruning path assumes the current table type for an equality field is the same type that was used when the data file and equality delete were written, which is not always true after a legal promotion like int -> long. In that case, historical manifests still contain 4-byte int bounds, and decoding them with the current LongType can fail in DeleteFileIndex.for_data_file(...).
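The decode mismatch can be reproduced with plain `struct` (a sketch of the failure mode, not the PyIceberg code path): Iceberg's single-value binary serialization stores an int bound as 4 little-endian bytes and a long bound as 8, so a historical int bound fails length validation once the field has been promoted to long.

```python
import struct

# Bound written back when the equality field was still IntegerType:
# 4 little-endian bytes under Iceberg's single-value serialization.
historical_int_bound = struct.pack("<i", 42)

def decode_long_bound(buf: bytes) -> int:
    # Decoding with the *current* LongType expects exactly 8 bytes.
    if len(buf) != 8:
        raise ValueError(f"expected 8 bytes for a long bound, got {len(buf)}")
    return struct.unpack("<q", buf)[0]

try:
    decode_long_bound(historical_int_bound)
except ValueError as err:
    print(f"promotion int -> long breaks bound decoding: {err}")
```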

For reference, Iceberg Java had to address the same schema-evolution issue in apache/iceberg#15268, where the fix was to avoid assuming the current schema is always the right one for equality-delete field resolution.

@rambleraptor
Contributor Author

Thanks @ndrluis for the suggestion. I opened #3293 to handle the type promotion case for the historical manifests. It looks like we're not handling that properly and I don't want to pollute this PR too much.

@rambleraptor
Contributor Author

@geruh @kevinjqliu @Fokko please take a look when you can!

@rambleraptor
Contributor Author

I've successfully tried this out with Flink (thanks @Fokko for the tip!) and it's working as I expect it to. Is it worth checking in the files created by Flink?

Member

@geruh left a comment


Nice, thanks for opening @rambleraptor!!! Left some comments below.

Also, +1 to adding the Flink testing, and I believe there were talks about this being added to the TCK! While working on #2255, we tested all delete file combinations with Flink.

self._schema = schema
self._by_partition: dict[tuple[int, Record], PositionDeletes] = {}
self._by_path: dict[str, PositionDeletes] = {}
self._eq_by_partition: dict[tuple[int, Record], EqualityDeletes] = {}
Member


Do we need these additional variables, or can we just add the equality deletes into the existing maps here? I want to be careful about adding Java-shaped work.
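For context, the shape under discussion is roughly this (a hypothetical sketch, with tuples standing in for `Record` and plain lists standing in for the `PositionDeletes`/`EqualityDeletes` wrappers):

```python
from collections import defaultdict

class DeleteFileIndexSketch:
    """Illustrative index keyed by (spec_id, partition); not the PR's code."""

    def __init__(self):
        # Separate maps per delete type, mirroring the diff above.
        self._by_partition = defaultdict(list)     # position deletes
        self._eq_by_partition = defaultdict(list)  # equality deletes

    def add(self, spec_id, partition, delete_file, is_equality):
        target = self._eq_by_partition if is_equality else self._by_partition
        target[(spec_id, partition)].append(delete_file)

    def for_partition(self, spec_id, partition):
        key = (spec_id, partition)
        # A data file may be affected by both kinds of deletes.
        return self._by_partition.get(key, []) + self._eq_by_partition.get(key, [])

index = DeleteFileIndexSketch()
index.add(0, ("2024-01-01",), "pos-del.parquet", is_equality=False)
index.add(0, ("2024-01-01",), "eq-del.parquet", is_equality=True)
print(index.for_partition(0, ("2024-01-01",)))
```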

Comment thread pyiceberg/io/pyarrow.py
@@ -1693,7 +1693,12 @@ def _task_to_record_batches(

def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) -> dict[str, list[ChunkedArray]]:
Member


is this needed?

from pyiceberg.types import IntegerType, NestedField


def _create_data_file(file_path: str = "s3://bucket/data.parquet", spec_id: int = 0) -> DataFile:
Member


Can we add a similar test for an unpartitioned equality delete and position delete at the same sequence number, to ensure that equality deletes only apply at seq < N?
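The sequence-number semantics such a test would pin down follow the Iceberg spec (the helper names here are made up for illustration): equality deletes apply strictly before their sequence number, while position deletes also apply at the same sequence number.

```python
def equality_delete_applies(data_seq: int, delete_seq: int) -> bool:
    # An equality delete applies to data files whose sequence number
    # is strictly less than the delete's sequence number.
    return data_seq < delete_seq

def position_delete_applies(data_seq: int, delete_seq: int) -> bool:
    # A position delete also applies at the *same* sequence number,
    # so it can target rows committed in the same snapshot.
    return data_seq <= delete_seq

# At an equal sequence number, only the position delete applies:
print(equality_delete_applies(3, 3), position_delete_applies(3, 3))  # False True
```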
