restore: pipeline recovery under new accdb#10176
Conversation
Performance Measurements ⏳
|
There was a problem hiding this comment.
Pull request overview
Adds recovery support to the snapshot load pipeline for the new accdb implementation, including mechanisms to reset accdb state on full snapshot failure and to revert incremental snapshot writes.
Changes:
- Introduces accdb reset and snapshot recovery APIs (
fd_accdb_reset,fd_accdb_snapshot_{save,revert}_whead) to support retrying failed loads. - Extends snapshot load write paths (
fd_accdb_snapshot_write_{one,batch}) with afork_idto enable incremental-mode txn tracking and revert-on-failure semantics. - Updates restore tiles (
snapin,snapwr,snapct) to use the new recovery hooks, addftruncateon full snapshot failure, and fix control/data drain ordering (flush_load).
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/flamenco/accdb/fd_accdb.h | Adds recovery/reset APIs and extends snapshot write APIs with fork_id semantics. |
| src/flamenco/accdb/fd_accdb.c | Implements reset and whead save/revert; adds incremental snapshot txn tracking and cross-fork override handling; adds CLEAR_DEFERRED cmd. |
| src/flamenco/accdb/fd_accdb_shmem.c | Persists cache_min_reserved into shmem for reset-time cache sentinel reconstruction. |
| src/flamenco/accdb/fd_accdb_private.h | Adds cache_min_reserved field and new FD_ACCDB_CMD_CLEAR_DEFERRED opcode. |
| src/discof/restore/generated/fd_snapwr_tile_seccomp.h | Regenerates seccomp filter to allow ftruncate(accounts_fd, 0). |
| src/discof/restore/fd_snapwr_tile.seccomppolicy | Allows ftruncate for reclaiming space on full snapshot failure. |
| src/discof/restore/fd_snapwr_tile.c | Adds offsets recovery and truncation-on-full-failure behavior. |
| src/discof/restore/fd_snapin_tile.c | Uses new accdb reset/recovery and fork-aware snapshot write APIs; adds incremental fork creation/purge/advance_root. |
| src/discof/restore/fd_snapct_tile.c | Reintroduces flush_load gating so RESET waits for snapld to drain dcache data. |
Greptile SummaryThis PR implements pipeline recovery for the snapshot load mechanism under the new Key changes:
Confidence Score: 4/5This PR is well-implemented with thorough state machine coordination. The recovery mechanisms correctly handle both full (reset) and incremental (purge+revert) failure modes. Safe to merge with standard testing. The implementation is architecturally sound: fork-based isolation for incremental writes allows clean purge-revert, the flush_load flag properly synchronizes the snapld data drain with the snapct RESET handler, and the ctrl_ack_frag changes correctly handle stale acks during RESET states. The accdb changes (reset, save/revert_whead, cross-fork insert tracking) are consistent with the existing pool and partition accounting. Score of 4 reflects high confidence in correctness with the caveat that the fd_accdb_reset precondition (no concurrent access) is a documented invariant that must be upheld by the system architecture. src/flamenco/accdb/fd_accdb.c contains the most complex logic (reset, revert_whead, cross-fork write tracking) and should receive careful testing for concurrent access patterns. src/discof/restore/fd_snapin_tile.c orchestrates the recovery flow and should be tested with both full and incremental failure scenarios. Important Files Changed
Sequence DiagramsequenceDiagram
participant snapct
participant snapld
participant snapin
participant snapwr
participant accdb
Note over snapct,accdb: Full Snapshot Load (success path)
snapct->>snapld: INIT_FULL
snapld->>snapin: INIT_FULL
snapin->>accdb: reset() + attach_child(NULL) + load_begin()
snapin->>snapwr: INIT_FULL
snapct->>snapld: DATA frags
snapld->>snapin: account data
snapin->>accdb: snapshot_write_batch(fork=USHORT_MAX)
snapin->>snapwr: account data (pwrite)
snapct->>snapld: FINI / NEXT
snapin->>accdb: save_whead(recovery)
snapin-->>snapct: ack NEXT
Note over snapct,accdb: Incremental Snapshot Load
snapct->>snapld: INIT_INCR
snapld->>snapin: INIT_INCR
snapin->>accdb: attach_child(root) -> incr_fork_id
snapct->>snapld: DATA frags
snapld->>snapin: account data
snapin->>accdb: snapshot_write_batch(fork=incr_fork_id)
Note over snapct,accdb: Incremental Failure + Recovery
snapct->>snapct: malformed detected
snapct->>snapld: FAIL
snapld->>snapin: FAIL (forwarded)
snapin->>accdb: purge(incr_fork_id)
snapin->>accdb: revert_whead(recovery)
snapin->>snapwr: FAIL
snapwr->>snapwr: restore offsets
snapld-->>snapct: FAIL on snapld_dc (flush_load=1)
snapin-->>snapct: ack FAIL (flush_ack++)
snapwr-->>snapct: ack FAIL (flush_ack++)
snapct->>snapct: flush_ack==cnt and flush_load -> retry
Reviews (1): Last reviewed commit: "restore: snapct flush_ack and improved l..." | Re-trigger Greptile |
|
I think I changed the flush load semantics to a counter, because now we need 3, snapwr, snapin, and snapld all need to send the ack back to snapct. Seems cleaner than 3 bools? |
94f5e36 to
3d7861d
Compare
Performance Measurements ⏳
|
3d7861d to
6be198a
Compare
Performance Measurements ⏳
|
6be198a to
5c3f566
Compare
Performance Measurements ⏳
|
5c3f566 to
b9aedc0
Compare
Performance Measurements ⏳
|
b9aedc0 to
b7bd228
Compare
Performance Measurements ⏳
|
b7bd228 to
f6a2058
Compare
Performance Measurements ⏳
|
Add recovery mechanism to the snapshot load pipeline under the new accdb.