position_deletes metadata table#1615
Conversation
kevinjqliu
left a comment
There was a problem hiding this comment.
generally LGTM! i added a few nit comments on style.
I think its a good idea to introduce the PositionalDelete class, similar to DataFile
| ) | ||
| return pa.concat_tables(manifests_by_snapshots) | ||
|
|
||
| def position_deletes(self) -> "pa.Table": |
There was a problem hiding this comment.
nit: for any metadata tables referencing the current snapshot, let's add an optional param to allow querying for any given snapshot id.
|
|
||
| from pyiceberg.io.pyarrow import schema_to_pyarrow | ||
|
|
||
| partition_record = self.tbl.metadata.specs_struct() |
There was a problem hiding this comment.
nit: specs_struct() looks at all PartitionSpecs, do we just need the current partition spec here?
There was a problem hiding this comment.
also nit: naming, this is a structtype, not the underlying record
There was a problem hiding this comment.
good catch , i now take the current PartitionSpec
| partition_record = self.tbl.metadata.specs_struct() | ||
| pa_partition_struct = schema_to_pyarrow(partition_record) | ||
| pa_row_struct = schema_to_pyarrow(self.tbl.schema().as_struct()) | ||
| positinal_delete_schema = pa.schema( |
There was a problem hiding this comment.
| positinal_delete_schema = pa.schema( | |
| positional_delete_schema = pa.schema( |
| pa.field("delete_file_path", pa.string(), nullable=False), | ||
| ] | ||
| ) | ||
| return positinal_delete_schema |
There was a problem hiding this comment.
| return positinal_delete_schema | |
| return positional_delete_schema |
| def _read_delete_file(fs: FileSystem, data_file: DataFile, schema: "pa.Schema") -> pa.Table: | ||
| delete_fragment = _construct_fragment(fs, data_file, file_format_kwargs={"pre_buffer": True, "buffer_size": ONE_MEGABYTE}) | ||
| table = ds.Scanner.from_fragment(fragment=delete_fragment, schema=schema).to_table() | ||
| return table |
There was a problem hiding this comment.
nit: this might be a good place to start introducing a PositionalDelete class and we can get rid of the custom _get_positional_file_schema function
|
also lets add this to new table to the docs as well https://py.iceberg.apache.org/api/#inspecting-tables |
|
done :) |
kevinjqliu
left a comment
There was a problem hiding this comment.
some more comments :)
cc @Fokko this PR adds the PositionDelete class 🥳
| ) | ||
|
|
||
|
|
||
| class PositionDelete(Record): |
There was a problem hiding this comment.
| return _get_file_format(data_file.file_format, **file_format_kwargs).make_fragment(path, fs) | ||
|
|
||
|
|
||
| def _read_delete_file(fs: FileSystem, data_file: DataFile) -> Iterator[PositionDelete]: |
There was a problem hiding this comment.
this is very similar to _read_deletes, do you think there's a way we can reimplement _read_deletes to return PositionDelete and refactor its usage?
There was a problem hiding this comment.
Just to make sure i did it right , i converted _read_deletes to use _read_delete_file so we don't implement the reading part twice , and kept the logic of _read_deletes to return Dict[str, pa.ChunkedArray]
wdyt ?
| return StructType(*nested_fields) | ||
|
|
||
| def spec_struct(self, spec_id: Optional[int] = None) -> StructType: | ||
| """Produce for a spec_id a struct of PartitionSpecs. |
There was a problem hiding this comment.
| """Produce for a spec_id a struct of PartitionSpecs. | |
| """Produce for a spec_id a struct of PartitionSpecs. |
| partition_struct = self.tbl.metadata.spec_struct() | ||
| pa_partition_struct = schema_to_pyarrow(partition_struct) | ||
| pa_row_struct = schema_to_pyarrow(self.tbl.schema().as_struct()) |
There was a problem hiding this comment.
this is using the table's current metadata, perhaps we want to pass it these based on which snapshot we're using
|
Ready for review :) |
9d42fae to
463300d
Compare
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Implements position_deletes metadata table - #1053