Skip to content

position_deletes metadata table#1615

Closed
amitgilad3 wants to merge 9 commits intoapache:mainfrom
amitgilad3:position_deletes
Closed

position_deletes metadata table#1615
amitgilad3 wants to merge 9 commits intoapache:mainfrom
amitgilad3:position_deletes

Conversation

@amitgilad3
Copy link
Copy Markdown
Contributor

Implements position_deletes metadata table - #1053

Copy link
Copy Markdown
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally LGTM! i added a few nit comments on style.

I think its a good idea to introduce the PositionalDelete class, similar to DataFile

Comment thread pyiceberg/table/inspect.py Outdated
)
return pa.concat_tables(manifests_by_snapshots)

def position_deletes(self) -> "pa.Table":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for any metadata tables referencing the current snapshot, let's add an optional param to allow querying for any given snapshot id.

Comment thread pyiceberg/table/inspect.py Outdated

from pyiceberg.io.pyarrow import schema_to_pyarrow

partition_record = self.tbl.metadata.specs_struct()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: specs_struct() looks at all PartitionSpecs, do we just need the current partition spec here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also nit: naming, this is a structtype, not the underlying record

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch , i now take the current PartitionSpec

Comment thread pyiceberg/table/inspect.py Outdated
partition_record = self.tbl.metadata.specs_struct()
pa_partition_struct = schema_to_pyarrow(partition_record)
pa_row_struct = schema_to_pyarrow(self.tbl.schema().as_struct())
positinal_delete_schema = pa.schema(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
positinal_delete_schema = pa.schema(
positional_delete_schema = pa.schema(

Comment thread pyiceberg/table/inspect.py Outdated
pa.field("delete_file_path", pa.string(), nullable=False),
]
)
return positinal_delete_schema
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return positinal_delete_schema
return positional_delete_schema

Comment thread pyiceberg/io/pyarrow.py Outdated
Comment on lines +892 to +895
def _read_delete_file(fs: FileSystem, data_file: DataFile, schema: "pa.Schema") -> pa.Table:
delete_fragment = _construct_fragment(fs, data_file, file_format_kwargs={"pre_buffer": True, "buffer_size": ONE_MEGABYTE})
table = ds.Scanner.from_fragment(fragment=delete_fragment, schema=schema).to_table()
return table
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this might be a good place to start introducing a PositionalDelete class and we can get rid of the custom _get_positional_file_schema function

@kevinjqliu
Copy link
Copy Markdown
Contributor

also lets add this to new table to the docs as well https://py.iceberg.apache.org/api/#inspecting-tables

@amitgilad3
Copy link
Copy Markdown
Contributor Author

done :)

Copy link
Copy Markdown
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more comments :)

cc @Fokko this PR adds the PositionDelete class 🥳

Comment thread pyiceberg/manifest.py
)


class PositionDelete(Record):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread pyiceberg/io/pyarrow.py
return _get_file_format(data_file.file_format, **file_format_kwargs).make_fragment(path, fs)


def _read_delete_file(fs: FileSystem, data_file: DataFile) -> Iterator[PositionDelete]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very similar to _read_deletes, do you think there's a way we can reimplement _read_deletes to return PositionDelete and refactor its usage?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure i did it right , i converted _read_deletes to use _read_delete_file so we don't implement the reading part twice , and kept the logic of _read_deletes to return Dict[str, pa.ChunkedArray]
wdyt ?

Comment thread pyiceberg/table/metadata.py Outdated
return StructType(*nested_fields)

def spec_struct(self, spec_id: Optional[int] = None) -> StructType:
"""Produce for a spec_id a struct of PartitionSpecs.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Produce for a spec_id a struct of PartitionSpecs.
"""Produce for a spec_id a struct of PartitionSpecs.

Comment thread pyiceberg/table/inspect.py Outdated
Comment on lines +392 to +394
partition_struct = self.tbl.metadata.spec_struct()
pa_partition_struct = schema_to_pyarrow(partition_struct)
pa_row_struct = schema_to_pyarrow(self.tbl.schema().as_struct())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is using the table's current metadata, perhaps we want to pass it these based on which snapshot we're using

@amitgilad3
Copy link
Copy Markdown
Contributor Author

Ready for review :)

@github-actions
Copy link
Copy Markdown

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Mar 17, 2026
@github-actions
Copy link
Copy Markdown

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions Bot closed this Mar 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants