feat: Add support for RollingManifestWriter#3009
Conversation
jayceslesar
left a comment
There was a problem hiding this comment.
matches the java it looks like to me
| class RollingManifestWriter: | ||
| """As opposed to ManifestWriter, a rolling writer could produce multiple manifest files.""" | ||
|
|
||
| _ROWS_DIVISOR = 250 |
There was a problem hiding this comment.
In Java we target at the size, controlled by commit.manifest.target-size-bytes. Any chance we can do something similar here? Or at least have some rationale behind this number.
There was a problem hiding this comment.
Yeah this is a direct port of the java implementations RollingmanifestWriter.ROW_DIVISOR.
this added divisor is basically a interval where we check the file size against target-size-bytes every 250 entries rather than on every write. But after some digging it looks like in PyIceberg tell() on both the PyArrow and fsspec OutputStream implementations is essentially free and use their own counters so maybe we can simplify this check and ignore the ROW_DIVISOR, but still doesn't hurt to have wdyt?
|
This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions. |
|
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time. |
Rationale for this change
Adds support for the
RollingManifestWriterto split large commits into multiple manifest files instead of one giant file. This PR wraps the ManifestWriter and follows the Java implemenation of checking the size every 250 entries. The bulk of this work was done in #650.Next step is wiring this into
fast_append, but it's also useful for manifest repair operations, like deduplicating entries and rewriting manifests without blowing up the output sizesAre these changes tested?
Yes
Are there any user-facing changes?
No