perf: do not copy metadata for each data file in summary by Anton-Tarazi · Pull Request #2674 · apache/iceberg-python

Anton-Tarazi · 2025-10-29T02:49:30Z

Resolves #2673

Rationale for this change

_SnapshotProducer._summary() copies the metadata for every added / deleted DataFile. This is pretty expensive. Instead we just copy it once at the beginning of the function and use the same value each DataFile.

On my data, which overwrites a few million rows at a time, I saw the time for table.overwrite go from ~20 seconds to ~6 seconds.

Are these changes tested?

Yes, existing unit / integration tests

Are there any user-facing changes?

Just faster writes :)

f Please enter the commit message for your changes. Lines starting

Fokko · 2025-11-02T16:34:39Z

@Anton-Tarazi This makes a lot of sense to me, thanks for digging into this and providing the patch 🙌

## Summary Follow-up to #2674. `Transaction.table_metadata` replays all staged updates via `model_copy(deep=True)` on every access, so reading it (or `spec()`/`schema()` derived from it) repeatedly within a single snapshot-producer method is redundant deep-copy work. \#2674 hoisted the property access in `_summary()`; this PR extends the same pattern to three more call sites in `pyiceberg/table/update/snapshot.py` that still read the property more than once per invocation. ## Changes - `_SnapshotProducer._summary`: hoist `spec()`/`schema()` out of the per-data-file loop (they are invariant across files; still called 2× per file before this change) - `_DeleteFiles._compute_deletes`: hoist `table_metadata`/`schema` once at method entry (was 3 accesses — two via `self.schema()` for the metrics evaluators and one direct for `snapshot_by_id`) - `_MergeAppendFiles.__init__`: 3 consecutive `self._transaction.table_metadata.properties` accesses → 1 All hoists are at method entry. Nothing inside these methods stages a transaction update (the `AddSnapshotUpdate` is staged by the caller after `_commit()` returns), so `table_metadata` is invariant for the duration of each method. Not touched here: the `new_manifest_writer(self.spec(id))` calls inside per-manifest loops in `_write_delete_manifest` / `_compute_deletes` / `_OverwriteFiles._existing_manifests` also trigger 2–3 property accesses per iteration via the `schema()`/`spec()`/`new_manifest_writer()` helpers. Those loops are O(partition-groups or rewritten-manifests) rather than O(files), and fixing them cleanly would mean changing the helper signatures — happy to do that in a follow-up if there's interest. ## Testing New `test_snapshot_producer_bounded_metadata_access` wraps `Transaction.table_metadata` with a call counter and asserts: - `_summary()` access count is identical for 10 vs 100 appended files (independent of N), and ≤ 2 - `_MergeAppendFiles.__init__` makes exactly 1 more access than `_FastAppendFiles.__init__` (was 3 before this change — verified the test fails with the production diff reverted) The test constructs `_FastAppendFiles` / `_MergeAppendFiles` directly rather than going through the public append path, since the public path writes manifest avro files; the property-access count it measures is the behaviour under test and doesn't require I/O. Existing `tests/table/test_snapshots.py` passing. ## Motivation For appends/deletes/overwrites touching large numbers of files or manifests, the per-iteration property access dominates wall-clock (each access replays the staged-updates list through pydantic `model_copy`). This keeps the cost constant per method call. --------- Co-authored-by: Ruiyang Wang <rynewang@users.noreply.github.com>

perf: do not copy metadata for each data file in summary

1399e83

f Please enter the commit message for your changes. Lines starting

Anton-Tarazi mentioned this pull request Oct 29, 2025

_SnapshotProducer._summary() unreasonably slow #2673

Closed

3 tasks

Fokko approved these changes Nov 2, 2025

View reviewed changes

Fokko merged commit 1f9c46b into apache:main Nov 2, 2025
8 checks passed

This was referenced Apr 30, 2026

perf: Hoist table_metadata at remaining repeat-access sites in snapshot update #3301

Merged

perf: Cache Transaction.table_metadata between staged updates #3302

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: do not copy metadata for each data file in summary#2674

perf: do not copy metadata for each data file in summary#2674
Fokko merged 1 commit intoapache:mainfrom
Anton-Tarazi:stop_copying_metadata

Anton-Tarazi commented Oct 29, 2025

Uh oh!

Fokko commented Nov 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Anton-Tarazi commented Oct 29, 2025

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Fokko commented Nov 2, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants