Skip to content

perf: do not copy metadata for each data file in summary#2674

Merged
Fokko merged 1 commit intoapache:mainfrom
Anton-Tarazi:stop_copying_metadata
Nov 2, 2025
Merged

perf: do not copy metadata for each data file in summary#2674
Fokko merged 1 commit intoapache:mainfrom
Anton-Tarazi:stop_copying_metadata

Conversation

@Anton-Tarazi
Copy link
Copy Markdown
Contributor

Resolves #2673

Rationale for this change

_SnapshotProducer._summary() copies the metadata for every added / deleted DataFile. This is pretty expensive. Instead we just copy it once at the beginning of the function and use the same value each DataFile.

On my data, which overwrites a few million rows at a time, I saw the time for table.overwrite go from ~20 seconds to ~6 seconds.

Are these changes tested?

Yes, existing unit / integration tests

Are there any user-facing changes?

Just faster writes :)

f Please enter the commit message for your changes. Lines starting
@Fokko
Copy link
Copy Markdown
Contributor

Fokko commented Nov 2, 2025

@Anton-Tarazi This makes a lot of sense to me, thanks for digging into this and providing the patch 🙌

@Fokko Fokko merged commit 1f9c46b into apache:main Nov 2, 2025
8 checks passed
Fokko pushed a commit that referenced this pull request May 4, 2026
## Summary

Follow-up to #2674. `Transaction.table_metadata` replays all staged
updates via `model_copy(deep=True)` on every access, so reading it (or
`spec()`/`schema()` derived from it) repeatedly within a single
snapshot-producer method is redundant deep-copy work.

\#2674 hoisted the property access in `_summary()`; this PR extends the
same pattern to three more call sites in
`pyiceberg/table/update/snapshot.py` that still read the property more
than once per invocation.

## Changes

- `_SnapshotProducer._summary`: hoist `spec()`/`schema()` out of the
per-data-file loop (they are invariant across files; still called 2× per
file before this change)
- `_DeleteFiles._compute_deletes`: hoist `table_metadata`/`schema` once
at method entry (was 3 accesses — two via `self.schema()` for the
metrics evaluators and one direct for `snapshot_by_id`)
- `_MergeAppendFiles.__init__`: 3 consecutive
`self._transaction.table_metadata.properties` accesses → 1

All hoists are at method entry. Nothing inside these methods stages a
transaction update (the `AddSnapshotUpdate` is staged by the caller
after `_commit()` returns), so `table_metadata` is invariant for the
duration of each method.

Not touched here: the `new_manifest_writer(self.spec(id))` calls inside
per-manifest loops in `_write_delete_manifest` / `_compute_deletes` /
`_OverwriteFiles._existing_manifests` also trigger 2–3 property accesses
per iteration via the `schema()`/`spec()`/`new_manifest_writer()`
helpers. Those loops are O(partition-groups or rewritten-manifests)
rather than O(files), and fixing them cleanly would mean changing the
helper signatures — happy to do that in a follow-up if there's interest.

## Testing

New `test_snapshot_producer_bounded_metadata_access` wraps
`Transaction.table_metadata` with a call counter and asserts:
- `_summary()` access count is identical for 10 vs 100 appended files
(independent of N), and ≤ 2
- `_MergeAppendFiles.__init__` makes exactly 1 more access than
`_FastAppendFiles.__init__` (was 3 before this change — verified the
test fails with the production diff reverted)

The test constructs `_FastAppendFiles` / `_MergeAppendFiles` directly
rather than going through the public append path, since the public path
writes manifest avro files; the property-access count it measures is the
behaviour under test and doesn't require I/O.

Existing `tests/table/test_snapshots.py` passing.

## Motivation

For appends/deletes/overwrites touching large numbers of files or
manifests, the per-iteration property access dominates wall-clock (each
access replays the staged-updates list through pydantic `model_copy`).
This keeps the cost constant per method call.

---------

Co-authored-by: Ruiyang Wang <rynewang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_SnapshotProducer._summary() unreasonably slow

2 participants