Add support for Equality Deletes on DeleteFileIndex #3285
rambleraptor wants to merge 3 commits into apache:main
Conversation
@rambleraptor I think we should add a regression test for schema evolution here. This pruning path assumes the current table type for an equality field is the same type that was used when the data file and equality delete were written, which is not always true after a legal type promotion. For reference, Iceberg Java had to address the same schema-evolution issue in apache/iceberg#15268, where the fix was to avoid assuming the current schema is always the right one for equality-delete field resolution.
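A dependency-free sketch of the failure mode being described. The `Field` class, the schemas, and the `int` -> `long` promotion here are illustrative models, not the PR's actual code; the point is that equality-delete fields stay stable by field id while their type can legally change:

```python
from dataclasses import dataclass

# Hypothetical minimal model of schema evolution: equality-delete fields
# must be resolved by field id, because the type seen in the current
# schema may differ from the type at write time.
@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    type: str

write_schema = {1: Field(1, "id", "int")}    # schema when the delete was written
current_schema = {1: Field(1, "id", "long")} # after a legal int -> long promotion

# Resolving by field id survives the promotion; assuming the current type
# matches the written type does not.
assert write_schema[1].field_id == current_schema[1].field_id
assert write_schema[1].type != current_schema[1].type
```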
@geruh @kevinjqliu @Fokko please take a look when you can!
I've successfully tried this out with Flink (thanks @Fokko for the tip!) and it's working as I expect it to. Is it worth checking in the files created by Flink?
geruh
left a comment
Nice, thanks for opening @rambleraptor!!! Left some comments below.
Also, +1 to adding the Flink testing, and I believe there were talks about this being added to the TCK! While working on #2255, we tested all delete file combinations with Flink.
self._schema = schema
self._by_partition: dict[tuple[int, Record], PositionDeletes] = {}
self._by_path: dict[str, PositionDeletes] = {}
self._eq_by_partition: dict[tuple[int, Record], EqualityDeletes] = {}
Do we need these additional variables, or can we just add them together here? I want to be careful about adding Java-shaped work.
@@ -1693,7 +1693,12 @@ def _task_to_record_batches(

def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) -> dict[str, list[ChunkedArray]]:
from pyiceberg.types import IntegerType, NestedField

def _create_data_file(file_path: str = "s3://bucket/data.parquet", spec_id: int = 0) -> DataFile:
can we add a similar test for an unpartitioned equality delete and position delete at the same sequence number, to ensure that equality deletes apply at seq < N?
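For reference, the applicability rule from the Iceberg table spec that such a test would pin down can be sketched as follows (the function names are illustrative): position deletes apply to data files with an equal or lower sequence number, while equality deletes apply only to data files with a strictly lower sequence number.

```python
# Delete-file applicability by sequence number, per the Iceberg spec:
# - a position delete applies when data_seq <= delete_seq
# - an equality delete applies when data_seq <  delete_seq
def position_delete_applies(data_seq: int, delete_seq: int) -> bool:
    return data_seq <= delete_seq

def equality_delete_applies(data_seq: int, delete_seq: int) -> bool:
    return data_seq < delete_seq

# At the same sequence number N, only the position delete applies.
assert position_delete_applies(5, 5) is True
assert equality_delete_applies(5, 5) is False
# An equality delete does apply to data written at an earlier sequence number.
assert equality_delete_applies(4, 5) is True
```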
Part of #3270
Rationale for this change
This adds support for getting equality deletes in the DeleteFileIndex.
I'm very purposefully ignoring them in _read_all_delete_files because they will crash.
Are these changes tested?
I made some equality deletes by hand and had PyIceberg read them to see the indexes. Worked as expected. If you know a way to create equality deletes, I can test those as well.
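The indexing this PR describes, keying equality deletes by partition, can be sketched with plain dictionaries. This is a hypothetical simplification of the `_eq_by_partition` mapping shown in the diff, with illustrative names and paths, not the PR's implementation:

```python
from collections import defaultdict

# Hypothetical sketch: index equality-delete file paths by a
# (spec_id, partition) key, mirroring the _eq_by_partition idea.
eq_by_partition: defaultdict[tuple[int, tuple], list[str]] = defaultdict(list)

def index_equality_delete(spec_id: int, partition: tuple, path: str) -> None:
    # Group deletes so a scan can look up only the deletes that could
    # apply to a data file's partition.
    eq_by_partition[(spec_id, partition)].append(path)

index_equality_delete(0, ("2024-01-01",), "s3://bucket/eq-delete-1.parquet")
assert eq_by_partition[(0, ("2024-01-01",))] == ["s3://bucket/eq-delete-1.parquet"]
```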
Are there any user-facing changes?