Skip to content

restore: pipeline recovery under new accdb#10176

Open
jvarela-jump wants to merge 1 commit into
mainfrom
jvarela/restore-new-accdb-main-recovery
Open

restore: pipeline recovery under new accdb#10176
jvarela-jump wants to merge 1 commit into
mainfrom
jvarela/restore-new-accdb-main-recovery

Conversation

@jvarela-jump

@jvarela-jump jvarela-jump commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Add recovery mechanism to the snapshot load pipeline under the new accdb.

Copilot AI review requested due to automatic review settings June 10, 2026 22:43
Comment thread src/discof/restore/fd_snapwr_tile.seccomppolicy Outdated
Comment thread src/flamenco/accdb/fd_accdb.c Outdated
Comment thread src/flamenco/accdb/fd_accdb.c Outdated
@github-actions

Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-424669000-perf per slot 0.046552 s 0.046656 s 0.223%
backtest mainnet-424669000-perf snapshot load 1.86 s 1.935 s 4.032%
backtest mainnet-424669000-perf total elapsed 60.331101 s 60.466271 s 0.224%
firedancer mem usage with mainnet.toml 504.41 GiB 504.41 GiB 0.000%

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds recovery support to the snapshot load pipeline for the new accdb implementation, including mechanisms to reset accdb state on full snapshot failure and to revert incremental snapshot writes.

Changes:

  • Introduces accdb reset and snapshot recovery APIs (fd_accdb_reset, fd_accdb_snapshot_{save,revert}_whead) to support retrying failed loads.
  • Extends snapshot load write paths (fd_accdb_snapshot_write_{one,batch}) with a fork_id to enable incremental-mode txn tracking and revert-on-failure semantics.
  • Updates restore tiles (snapin, snapwr, snapct) to use the new recovery hooks, add ftruncate on full snapshot failure, and fix control/data drain ordering (flush_load).

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/flamenco/accdb/fd_accdb.h Adds recovery/reset APIs and extends snapshot write APIs with fork_id semantics.
src/flamenco/accdb/fd_accdb.c Implements reset and whead save/revert; adds incremental snapshot txn tracking and cross-fork override handling; adds CLEAR_DEFERRED cmd.
src/flamenco/accdb/fd_accdb_shmem.c Persists cache_min_reserved into shmem for reset-time cache sentinel reconstruction.
src/flamenco/accdb/fd_accdb_private.h Adds cache_min_reserved field and new FD_ACCDB_CMD_CLEAR_DEFERRED opcode.
src/discof/restore/generated/fd_snapwr_tile_seccomp.h Regenerates seccomp filter to allow ftruncate(accounts_fd, 0).
src/discof/restore/fd_snapwr_tile.seccomppolicy Allows ftruncate for reclaiming space on full snapshot failure.
src/discof/restore/fd_snapwr_tile.c Adds offsets recovery and truncation-on-full-failure behavior.
src/discof/restore/fd_snapin_tile.c Uses new accdb reset/recovery and fork-aware snapshot write APIs; adds incremental fork creation/purge/advance_root.
src/discof/restore/fd_snapct_tile.c Reintroduces flush_load gating so RESET waits for snapld to drain dcache data.

Comment thread src/flamenco/accdb/fd_accdb.c Outdated
Comment thread src/flamenco/accdb/fd_accdb.c
@greptile-jt

greptile-jt Bot commented Jun 10, 2026

Copy link
Copy Markdown

Greptile Summary

This PR implements pipeline recovery for the snapshot load mechanism under the new accdb. It replaces the previous FD_LOG_ERR crash-on-failure stubs with a proper retry mechanism that can recover from both full and incremental snapshot load failures.

Key changes:

  • fd_snapct_tile.c: Adds a flush_load flag to coordinate with the snapld tile, ensuring the RESET handler waits for snapld to drain its data frags before proceeding with a retry. The ctrl_ack_frag handler is updated to gracefully ignore stale acks that arrive after a FAIL transition, instead of crashing.
  • fd_snapin_tile.c: Implements fork-based isolation for incremental writes — a child fork is created so that fd_accdb_purge can revert just the incremental changes on failure. Full snapshot failures call fd_accdb_reset to wipe and reinitialize the database. Saves and restores accdb write head metadata (whead, partition_max, metrics) for clean rollback.
  • fd_snapwr_tile.c: Saves accounts_off / flush_off recovery checkpoints on full snapshot completion (NEXT). On failure, restores these offsets (or zeros for full snapshot) and truncates the accounts file via ftruncate.
  • fd_accdb.c / fd_accdb.h: Adds fd_accdb_reset (full reinit), fd_accdb_snapshot_save_whead / fd_accdb_snapshot_revert_whead (savepoint/rollback for layer-0 write head and partitions), FD_ACCDB_CMD_CLEAR_DEFERRED (T2 stale-pointer cleanup), and extends snapshot_write_one / snapshot_write_batch with a fork_id parameter to support cross-fork insert tracking via txn records.
  • fd_accdb_private.h / fd_accdb_shmem.c: Persists cache_min_reserved in shmem so fd_accdb_reset can replicate the cache sentinel logic from fd_accdb_shmem_new.
  • seccomp: Adds ftruncate syscall allowance for snapwr tile (restricted to the accounts fd with length 0).

Confidence Score: 4/5

This PR is well-implemented with thorough state machine coordination. The recovery mechanisms correctly handle both full (reset) and incremental (purge+revert) failure modes. Safe to merge with standard testing.

The implementation is architecturally sound: fork-based isolation for incremental writes allows clean purge-revert, the flush_load flag properly synchronizes the snapld data drain with the snapct RESET handler, and the ctrl_ack_frag changes correctly handle stale acks during RESET states. The accdb changes (reset, save/revert_whead, cross-fork insert tracking) are consistent with the existing pool and partition accounting. Score of 4 reflects high confidence in correctness with the caveat that the fd_accdb_reset precondition (no concurrent access) is a documented invariant that must be upheld by the system architecture.

src/flamenco/accdb/fd_accdb.c contains the most complex logic (reset, revert_whead, cross-fork write tracking) and should receive careful testing for concurrent access patterns. src/discof/restore/fd_snapin_tile.c orchestrates the recovery flow and should be tested with both full and incremental failure scenarios.

Important Files Changed

Filename Overview
src/discof/restore/fd_snapct_tile.c Adds flush_load flag coordinating with snapld's FAIL forwarding; updated snapld_frag and ctrl_ack_frag to handle stale acks during RESET states gracefully instead of crashing.
src/discof/restore/fd_snapin_tile.c Implements fork-based isolation for incremental writes and proper FAIL handling (reset for full, purge+revert for incremental). Adds accdb_incr_fork_id tracking and recovery metadata.
src/discof/restore/fd_snapwr_tile.c Adds recovery checkpoint for accounts_off/flush_off on NEXT, restores on FAIL. Full snapshot failure truncates the accounts DB file to 0.
src/flamenco/accdb/fd_accdb.c Core implementation: fd_accdb_reset (full reinit with T2 coordination), save_whead/revert_whead (savepoint/rollback), cross-fork insert logic with txn tracking.
src/flamenco/accdb/fd_accdb.h Public API additions: fd_accdb_snapshot_recovery_t, save_whead, revert_whead, fd_accdb_reset, updated snapshot_write_one/batch signatures with fork_id parameter.

Sequence Diagram

sequenceDiagram
    participant snapct
    participant snapld
    participant snapin
    participant snapwr
    participant accdb

    Note over snapct,accdb: Full Snapshot Load (success path)
    snapct->>snapld: INIT_FULL
    snapld->>snapin: INIT_FULL
    snapin->>accdb: reset() + attach_child(NULL) + load_begin()
    snapin->>snapwr: INIT_FULL
    snapct->>snapld: DATA frags
    snapld->>snapin: account data
    snapin->>accdb: snapshot_write_batch(fork=USHORT_MAX)
    snapin->>snapwr: account data (pwrite)
    snapct->>snapld: FINI / NEXT
    snapin->>accdb: save_whead(recovery)
    snapin-->>snapct: ack NEXT

    Note over snapct,accdb: Incremental Snapshot Load
    snapct->>snapld: INIT_INCR
    snapld->>snapin: INIT_INCR
    snapin->>accdb: attach_child(root) -> incr_fork_id
    snapct->>snapld: DATA frags
    snapld->>snapin: account data
    snapin->>accdb: snapshot_write_batch(fork=incr_fork_id)

    Note over snapct,accdb: Incremental Failure + Recovery
    snapct->>snapct: malformed detected
    snapct->>snapld: FAIL
    snapld->>snapin: FAIL (forwarded)
    snapin->>accdb: purge(incr_fork_id)
    snapin->>accdb: revert_whead(recovery)
    snapin->>snapwr: FAIL
    snapwr->>snapwr: restore offsets
    snapld-->>snapct: FAIL on snapld_dc (flush_load=1)
    snapin-->>snapct: ack FAIL (flush_ack++)
    snapwr-->>snapct: ack FAIL (flush_ack++)
    snapct->>snapct: flush_ack==cnt and flush_load -> retry
Loading

Reviews (1): Last reviewed commit: "restore: snapct flush_ack and improved l..." | Re-trigger Greptile

@mmcgee-jump

Copy link
Copy Markdown
Contributor

I think I changed the flush load semantics to a counter, because now we need 3, snapwr, snapin, and snapld all need to send the ack back to snapct. Seems cleaner than 3 bools?

@jvarela-jump jvarela-jump force-pushed the jvarela/restore-new-accdb-main-recovery branch from 94f5e36 to 3d7861d Compare June 11, 2026 16:27
@jvarela-jump jvarela-jump requested a review from Copilot June 11, 2026 16:29
@github-actions

Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-424669000-perf per slot 0.050423 s 0.050427 s 0.008%
backtest mainnet-424669000-perf snapshot load 1.953 s 2.051 s 5.018% ⚠️
backtest mainnet-424669000-perf total elapsed 65.348423 s 65.353501 s 0.008%
firedancer mem usage with mainnet.toml 504.41 GiB 504.41 GiB 0.000%

@jvarela-jump jvarela-jump force-pushed the jvarela/restore-new-accdb-main-recovery branch from 3d7861d to 6be198a Compare June 11, 2026 16:35

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment thread src/flamenco/accdb/fd_accdb.c
Comment thread src/flamenco/accdb/fd_accdb.c Outdated
@github-actions

Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-424669000-perf per slot 0.05054 s 0.050237 s -0.600%
backtest mainnet-424669000-perf snapshot load 1.943 s 2.044 s 5.198% ⚠️
backtest mainnet-424669000-perf total elapsed 65.499623 s 65.106618 s -0.600%
firedancer mem usage with mainnet.toml 504.41 GiB 504.41 GiB 0.000%

Copilot AI review requested due to automatic review settings June 11, 2026 18:42
@jvarela-jump jvarela-jump force-pushed the jvarela/restore-new-accdb-main-recovery branch from 6be198a to 5c3f566 Compare June 11, 2026 18:42
@github-actions

Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-424669000-perf per slot 0.050698 s 0.050735 s 0.073%
backtest mainnet-424669000-perf snapshot load 1.564 s 1.579 s 0.959%
backtest mainnet-424669000-perf total elapsed 65.704175 s 65.753053 s 0.074%
firedancer mem usage with mainnet.toml 504.41 GiB 504.41 GiB 0.000%

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment thread src/discof/restore/fd_snapin_tile.c
Comment thread src/discof/restore/fd_snapin_tile.c
Comment thread src/discof/restore/fd_snapct_tile.c
@jvarela-jump jvarela-jump force-pushed the jvarela/restore-new-accdb-main-recovery branch from 5c3f566 to b9aedc0 Compare June 11, 2026 21:11
@github-actions

Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-424669000-perf per slot 0.050228 s 0.050366 s 0.275%
backtest mainnet-424669000-perf snapshot load 1.986 s 2.043 s 2.870%
backtest mainnet-424669000-perf total elapsed 65.095852 s 65.274731 s 0.275%
firedancer mem usage with mainnet.toml 504.41 GiB 504.41 GiB 0.000%

Copilot AI review requested due to automatic review settings June 12, 2026 17:59
@jvarela-jump jvarela-jump force-pushed the jvarela/restore-new-accdb-main-recovery branch from b9aedc0 to b7bd228 Compare June 12, 2026 17:59
@github-actions

Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-424669000-perf per slot 0.050794 s 0.050733 s -0.120%
backtest mainnet-424669000-perf snapshot load 1.566 s 1.563 s -0.192%
backtest mainnet-424669000-perf total elapsed 65.829584 s 65.749645 s -0.121%
firedancer mem usage with mainnet.toml 506.46 GiB 506.46 GiB 0.000%

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Comment thread src/flamenco/accdb/fd_accdb.c
Comment thread src/flamenco/accdb/fd_accdb.h
@jvarela-jump jvarela-jump force-pushed the jvarela/restore-new-accdb-main-recovery branch from b7bd228 to f6a2058 Compare June 12, 2026 19:26
@github-actions

Copy link
Copy Markdown

Performance Measurements ⏳

Suite Baseline New Change
backtest mainnet-424669000-perf per slot 0.050695 s 0.050614 s -0.160%
backtest mainnet-424669000-perf snapshot load 1.567 s 1.573 s 0.383%
backtest mainnet-424669000-perf total elapsed 65.700376 s 65.596259 s -0.158%
firedancer mem usage with mainnet.toml 506.46 GiB 506.46 GiB 0.000%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants