Skip to content

frank, set-identity: mitigate switch-while-leader bugs#10144

Open
jherrera-jump wants to merge 1 commit into
firedancer-io:mainfrom
jherrera-jump:jherrera/gui-relax-assertion
Open

frank, set-identity: mitigate switch-while-leader bugs#10144
jherrera-jump wants to merge 1 commit into
firedancer-io:mainfrom
jherrera-jump:jherrera/gui-relax-assertion

Conversation

@jherrera-jump

@jherrera-jump jherrera-jump commented Jun 9, 2026

Copy link
Copy Markdown
Contributor
  • wait until the leader rotation is over to switch.
  • relax assumptions about poh state machine in the gui

@jherrera-jump jherrera-jump changed the title Jherrera/gui relax assertion frank, set-identity: mitigate switch-while-leader bugs Jun 9, 2026
@jherrera-jump jherrera-jump force-pushed the jherrera/gui-relax-assertion branch 2 times, most recently from c8442f5 to 8bd15a6 Compare June 9, 2026 22:12
@jherrera-jump jherrera-jump requested a review from Copilot June 9, 2026 22:13

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Mitigates identity switching hazards around leadership transitions in the Frankendancer pipeline by deferring key switches until sufficiently far from both the previous and next leader slots, and making the GUI more resilient to slot event reordering/duplicates caused by set-identity rewinds.

Changes:

  • PoH tile: tighten set-identity application conditions with a leader-slot guard window and simplify maybe_change_identity call sites.
  • Replay tile: add a similar leader-slot guard window before applying pending identity switches.
  • GUI: handle duplicate/reordered/missed SLOT_END/SLOT_START events by finalizing or ignoring appropriately (instead of hard error).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/discoh/pohh/fd_pohh_tile.c Defers identity switching until safely away from leadership transitions; updates call sites accordingly.
src/discof/replay/fd_replay_tile.c Adds leader-slot guard window before applying a pending identity switch.
src/disco/gui/fd_gui.c Makes slot bookkeeping tolerant to stale/duplicate/reordered slot start/end events caused by identity rewind scenarios.

Comment thread src/discoh/pohh/fd_pohh_tile.c Outdated
Comment thread src/discoh/pohh/fd_pohh_tile.c Outdated
Comment thread src/discof/replay/fd_replay_tile.c Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread src/discoh/pohh/fd_pohh_tile.c Outdated
Comment thread src/discof/replay/fd_replay_tile.c Outdated
@jherrera-jump jherrera-jump marked this pull request as ready for review June 9, 2026 22:51
@jherrera-jump jherrera-jump requested a review from mmcgee-jump June 9, 2026 22:52
@jherrera-jump jherrera-jump marked this pull request as draft June 10, 2026 19:03
@jherrera-jump jherrera-jump force-pushed the jherrera/gui-relax-assertion branch from b3cfb0e to d598d4d Compare June 10, 2026 19:05
@jherrera-jump jherrera-jump requested a review from Copilot June 10, 2026 19:05

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread src/discoh/pohh/fd_pohh_tile.c Outdated
Comment thread src/discof/replay/fd_replay_tile.c Outdated
Comment thread src/discof/replay/fd_replay_tile.c Outdated

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread src/discoh/pohh/fd_pohh_tile.c
Comment thread src/discof/replay/fd_replay_tile.c

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread src/discoh/pohh/fd_pohh_tile.c
Comment thread src/discof/replay/fd_replay_tile.c
Comment thread src/disco/gui/fd_gui.c

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread src/disco/gui/fd_gui.c
Comment thread src/disco/gui/fd_gui.c
@jherrera-jump jherrera-jump marked this pull request as ready for review June 10, 2026 22:05
@greptile-jt

greptile-jt Bot commented Jun 10, 2026

Copy link
Copy Markdown

Greptile Summary

This PR hardens the identity-switch (set-identity) flow by preventing switches during or near leader rotations, and makes the GUI more resilient to state machine anomalies that can occur during identity switches.

  • Replay tile (fd_replay_tile.c): Adds keyswitch_near_leader guard that defers identity switches until the current slot is more than FD_EPOCH_SLOTS_PER_ROTATION (4) slots away from any leader slot for either the current or pending identity. Also blocks switches near epoch boundaries where the adjacent leader schedule may not be available.
  • PoH tile (fd_pohh_tile.c): Replaces the simpler definitely_not_leader parameter with the same proximity-based guard. The maybe_change_identity function no longer needs a parameter — it performs its own comprehensive leader proximity check instead of relying on callers to indicate leadership status.
  • GUI (fd_gui.c): Replaces FD_LOG_ERR (crash) with FD_LOG_WARNING + graceful recovery when slot_start/slot_end events arrive out of order (e.g., due to PoH rewind during identity switch). Moves fd_gui_handle_slot_start after fd_gui_handle_slot_end so slot_start can call slot_end to finalize a stale prior slot.

Confidence Score: 4/5

This PR is a defensive hardening change that replaces crashes with graceful recovery and adds safety windows around leader rotations; safe to merge.

All three files make logically sound changes. The GUI gracefully handles slot_start/slot_end mismatches instead of crashing. The leader proximity checks in replay and pohh tiles correctly inspect both the current and pending identity's leader schedule, with proper NULL/ULONG_MAX sentinel handling. The recursive call in fd_gui_handle_slot_end is bounded to depth 2. The only concerns are stylistic (code duplication, long lines). The byte offsets for reading the pending identity public key are correct for each tile's keyswitch layout (replay gets pubkey at offset 0 from set_identity.c:224; pohh gets full keypair at offset 0 from set_identityh.c:173, with pubkey at offset 32).

No files require special attention; the most complex change is in fd_gui.c with the recursive slot_end handling, which has been verified to be bounded and correct.

Important Files Changed

Filename Overview
src/disco/gui/fd_gui.c Replaces FD_LOG_ERR crash with FD_LOG_WARNING + graceful recovery for slot_start/slot_end mismatches; adds defensive finalization of stale open slots. Recursive call in slot_end is bounded to depth 2 and safe.
src/discof/replay/fd_replay_tile.c Adds leader proximity guard (keyswitch_near_leader + epoch boundary check) to maybe_switch_identity, deferring identity switches until safely clear of all leader rotations for both current and pending identities.
src/discoh/pohh/fd_pohh_tile.c Replaces simple is_leader check with comprehensive leader proximity guard in maybe_change_identity; removes the definitely_not_leader parameter. Both callers (no_longer_leader, during_housekeeping) now benefit from the same safety window.

Sequence Diagram

sequenceDiagram
    participant Op as Operator (set-identity)
    participant KS as Keyswitch
    participant Replay as Replay Tile
    participant PoH as PoH Tile
    participant GUI as GUI

    Op->>KS: Write keypair, set SWITCH_PENDING
    
    Note over Replay: maybe_switch_identity()
    Replay->>KS: Query state == SWITCH_PENDING?
    Replay->>Replay: Check epoch boundary safety
    Replay->>Replay: keyswitch_near_leader(current_id)?
    Replay->>Replay: keyswitch_near_leader(pending_id)?
    alt Near leader or epoch boundary
        Replay-->>Replay: Defer switch (return early)
    else Clear of leader rotation
        Replay->>KS: Copy pubkey, set COMPLETED
    end

    Note over PoH: maybe_change_identity()
    PoH->>KS: Query state == SWITCH_PENDING?
    PoH->>PoH: Check epoch boundary safety
    PoH->>PoH: keyswitch_near_leader(current_id)?
    PoH->>PoH: keyswitch_near_leader(pending_id)?
    alt Near leader or epoch boundary
        PoH-->>PoH: Defer switch (return 0)
    else Clear of leader rotation
        PoH->>PoH: fd_ext_admin_rpc_set_identity()
        PoH->>PoH: Rewind PoH to reset_slot
        PoH->>KS: Set COMPLETED
    end

    Note over GUI: handle_slot_start/end relaxed
    GUI->>GUI: slot_start with stale open slot?
    GUI->>GUI: Finalize prior slot (call slot_end)
    GUI->>GUI: slot_end mismatch?
    GUI->>GUI: LOG_WARNING (not ERR/crash)
Loading

Reviews (1): Last reviewed commit: "frank, set-identity: mitigate switch-whi..." | Re-trigger Greptile

Comment on lines +567 to +576
/* keyswitch_near_leader returns 1 if identity has any leader slots
within FD_EPOCH_SLOTS_PER_ROTATION slots of cur_slot. */
static inline int
keyswitch_near_leader( fd_multi_epoch_leaders_t const * mleaders,
ulong cur_slot,
fd_pubkey_t const * identity ) {
ulong lo = fd_ulong_if( cur_slot>=FD_EPOCH_SLOTS_PER_ROTATION, cur_slot-FD_EPOCH_SLOTS_PER_ROTATION, 0UL );
ulong next = fd_multi_epoch_leaders_get_next_slot( mleaders, lo, identity );
return next!=ULONG_MAX && next<=cur_slot+FD_EPOCH_SLOTS_PER_ROTATION;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicated helper across tiles

keyswitch_near_leader is defined identically in both fd_replay_tile.c and fd_pohh_tile.c (line 1252). Consider extracting it into a shared header (e.g., alongside the fd_multi_epoch_leaders API in fd_multi_epoch_leaders.h) to avoid drift if the logic is ever updated in only one place.

gives them a slightly wider window to safely switch. */
fd_pubkey_t const * pending_identity = fd_type_pun_const( ctx->keyswitch->bytes );
fd_epoch_leaders_t const * cur_lsched = fd_multi_epoch_leaders_get_lsched_for_slot( ctx->mleaders, ctx->reset_slot );
if( FD_UNLIKELY( cur_lsched && ( ctx->reset_slot < cur_lsched->slot0 + FD_EPOCH_SLOTS_PER_ROTATION || ctx->reset_slot >= cur_lsched->slot0 + fd_ulong_sat_sub( cur_lsched->slot_cnt, FD_EPOCH_SLOTS_PER_ROTATION ) ) ) ) return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Long compound condition could benefit from a line break

This single-line compound condition with three sub-expressions is quite dense. Consider splitting across lines for readability, similar to the keyswitch_near_leader check on lines 593-594 which already uses a two-line format.

@jherrera-jump jherrera-jump force-pushed the jherrera/gui-relax-assertion branch from 65fc59a to eb83b99 Compare June 13, 2026 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants