Skip to content

fix(occ): manifest list rebuild#982

Open
mzzz-zzm wants to merge 5 commits intoapache:mainfrom
mzzz-zzm:fix/occ-manifest-list-rebuild
Open

fix(occ): manifest list rebuild#982
mzzz-zzm wants to merge 5 commits intoapache:mainfrom
mzzz-zzm:fix/occ-manifest-list-rebuild

Conversation

@mzzz-zzm
Copy link
Copy Markdown

@mzzz-zzm mzzz-zzm commented May 4, 2026

fix(occ): rebuild manifest list on OCC retry to inherit concurrent writes

Fixes #976

Problem

When doCommit retries after an OCC conflict, it re-uses the stale snapshot that was built before the first attempt. That snapshot's manifest list was written against the original parent and does not include any data files committed by concurrent writers in the meantime.

On catalogs that perform server-side snapshot validation (e.g. AWS S3 Tables), this causes silent data loss: the retried commit succeeds (HTTP 200) but the stale manifest list effectively drops the concurrent writer's files from the table history.

Fix

Each snapshotProducer now records which manifests it wrote itself (ownManifests — those not inherited from the original parent). A rebuildManifestList closure is attached to the addSnapshotUpdate and called by doCommit on every retry. The closure:

  1. Loads the fresh branch head's manifest list (concurrent manifests).
  2. Concatenates ownManifests with the fresh inherited manifests.
  3. Writes a new manifest list file with a retry-attempt suffix so each attempt gets a unique path.
  4. Adjusts sequence-number relative to the fresh parent.

After a successful commit, doCommit deletes the manifest list files produced by superseded retry attempts (orphan cleanup).

Files changed

  • table/snapshot_producers.go: computeOwnManifests, rebuildFn closure
  • table/updates.go: ownManifests / rebuildManifestList fields on addSnapshotUpdate, propagated through Apply
  • table/table.go: rebuildSnapshotUpdates helper, orphan cleanup after commit
  • table/rebuild_manifest_test.go: unit tests for rebuildSnapshotUpdates

mzzz-zzm added 2 commits May 5, 2026 07:32
When doCommit retries after an OCC conflict (ErrCommitFailed / HTTP 409),
the AddSnapshotUpdate still carries the manifest list built against the
original base snapshot. That list does not include any manifests committed
by concurrent writers between the first attempt and the retry, so the
retried commit silently discards those manifests and the data they
reference is lost.

Fix: snapshotProducer records its own manifest files at commit-build time
(computeOwnManifests). doCommit calls a rebuildManifestList closure on
each retry that rewrites the AddSnapshotUpdate with a fresh manifest list
that prepends the current branch-head manifests before the producer's own
files. Orphaned manifest-list files from failed attempts are deleted after
a successful commit.

Fixes apache#976
Covers four cases:
- closure is called with the fresh branch head as freshParent
- rebuild is skipped when snapshot parent already matches the fresh head
- updates without a rebuild closure pass through unchanged
- errors returned by the closure propagate to the caller
mzzz-zzm added 3 commits May 5, 2026 08:23
Verifies that a retried snapshot correctly inherits manifests from
concurrent writers rather than reusing the stale manifest list.

The test fails on main (got 1 manifest) and passes with this fix
(got 2 manifests), serving as a direct regression guard for the
manifest-list rebuild logic in rebuildSnapshotUpdates.

Signed-off-by: mzzz-zzm <80936176+mzzz-zzm@users.noreply.github.com>
@zeroshade zeroshade changed the title Fix/occ manifest list rebuild fix(occ): manifest list rebuild May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug (table) Snapshot manifest list is not rebuilt on OCC retry — concurrent writes are lost

1 participant