fix(keystone): recover task state after axon reconnect #84

Merged
shark0F0497 merged 2 commits into main from keystone-worktree2-main
May 27, 2026

Conversation

@shark0F0497
Collaborator

Pull Request Checklist

Please ensure your PR meets the following requirements:

  • Code follows the style guidelines
  • Tests pass locally
  • Code is formatted
  • Documentation updated if needed
  • Commit messages follow conventional commits
  • PR description is complete and clear

Summary

This PR fixes Axon recorder/transfer reconnect handling after stale WebSocket connections and reconciles Keystone task state when the edge process reconnects with its current recorder state.

It improves recovery for sudden process stops, robot power loss, and ping-timeout reconnect windows without introducing the larger durable upload-intent model.


Motivation

  • A sudden robot shutdown or kill -STOP can leave Keystone holding a stale recorder or transfer WebSocket.
  • When Axon reconnects, Keystone previously rejected the new connection as "already connected" until the old handler fully exited.
  • A transient recorder disconnect could also roll an active task back to pending. When Axon later reconnected in the ready or recording state, Keystone did not always reconcile the task back to the expected state.
  • Upload completion should still be able to complete a task that was temporarily rolled back during a disconnect window.
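
The reconciliation rules described above can be sketched as two pure functions. This is a minimal illustration, not the code in this PR: the type names, the function names, and the `completed` state value are assumptions made for the example. Only the `pending`, `ready`, `in_progress`, and `recording` states come from the description.

```go
package main

import "fmt"

// TaskState and RecorderState are hypothetical types for this sketch.
type TaskState string
type RecorderState string

const (
	TaskPending    TaskState = "pending"
	TaskReady      TaskState = "ready"
	TaskInProgress TaskState = "in_progress"
	TaskCompleted  TaskState = "completed" // assumed name for the terminal state

	RecorderReady     RecorderState = "ready"
	RecorderRecording RecorderState = "recording"
)

// reconcileFromRecorder returns the task state implied by the recorder state
// reported after a reconnect. It only moves a task forward and leaves every
// other combination unchanged.
func reconcileFromRecorder(task TaskState, rec RecorderState) TaskState {
	switch rec {
	case RecorderReady:
		if task == TaskPending {
			return TaskReady
		}
	case RecorderRecording:
		if task == TaskPending || task == TaskReady {
			return TaskInProgress
		}
	}
	return task
}

// completeOnUploadAck completes a task on upload ACK even if a disconnect
// temporarily rolled it back to pending or ready.
func completeOnUploadAck(task TaskState) (TaskState, bool) {
	switch task {
	case TaskPending, TaskReady, TaskInProgress:
		return TaskCompleted, true
	}
	return task, false
}

func main() {
	fmt.Println(reconcileFromRecorder(TaskPending, RecorderReady))
	fmt.Println(reconcileFromRecorder(TaskPending, RecorderRecording))
	fmt.Println(completeOnUploadAck(TaskPending))
}
```

Keeping the rules as pure functions makes each transition easy to cover with a table-driven test, independent of the WebSocket handlers.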

Changes

Modified Files

  • [internal/api/handlers/axon_rpc.go](internal/api/handlers/axon_rpc.go) - Adds recorder stale-connection takeover, ping timeout close handling, cleaner WebSocket-close logging, and task state reconciliation from recorder state_update events.
  • [internal/api/handlers/task.go](internal/api/handlers/task.go) - Allows recording start callbacks to reconcile pending or ready tasks into in_progress.
  • [internal/api/handlers/transfer.go](internal/api/handlers/transfer.go) - Adds transfer stale-connection takeover, ping timeout close handling, cleaner WebSocket-close logging, and upload ACK completion from pending, ready, or in_progress.
  • [internal/config/config.go](internal/config/config.go) - Adds recorder and transfer ping timeout and stale-threshold configuration defaults.
  • [internal/services/hub.go](internal/services/hub.go) - Adds generic stale connection replacement based on connection LastSeenAt.
  • [internal/services/recorder_hub.go](internal/services/recorder_hub.go) - Exposes recorder connection LastSeenAt and stale-threshold connect support.
  • [internal/services/transfer_hub.go](internal/services/transfer_hub.go) - Exposes transfer connection LastSeenAt and stale-threshold connect support.
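
As a rough illustration of stale-connection replacement based on LastSeenAt, the sketch below shows one way a hub could accept a new connection for a device whose existing connection has been silent for longer than a threshold. The `Hub`, `Conn`, `Connect`, and `Disconnect` names, the 30-second threshold, and the rule that a non-positive threshold disables takeover are all assumptions for the example. The actual types in internal/services/hub.go may differ.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrAlreadyConnected is a hypothetical sentinel error for this sketch.
var ErrAlreadyConnected = errors.New("device already connected")

// Conn is a hypothetical connection record.
type Conn struct {
	ID         int
	LastSeenAt time.Time
}

// Hub tracks at most one connection per device.
type Hub struct {
	mu     sync.Mutex
	conns  map[string]*Conn
	nextID int
}

func NewHub() *Hub { return &Hub{conns: make(map[string]*Conn)} }

// Connect registers a new connection. An existing connection is replaced only
// if it has been silent for at least staleThreshold; otherwise the call is
// rejected. A non-positive threshold disables takeover (an assumption).
func (h *Hub) Connect(deviceID string, now time.Time, staleThreshold time.Duration) (conn, replaced *Conn, err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if old, ok := h.conns[deviceID]; ok {
		if staleThreshold <= 0 || now.Sub(old.LastSeenAt) < staleThreshold {
			return nil, nil, ErrAlreadyConnected
		}
		replaced = old
	}
	h.nextID++
	conn = &Conn{ID: h.nextID, LastSeenAt: now}
	h.conns[deviceID] = conn
	return conn, replaced, nil
}

// Disconnect removes the entry only if it still belongs to conn, so a stale
// handler that exits late cannot evict the connection that replaced it.
func (h *Hub) Disconnect(deviceID string, conn *Conn) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.conns[deviceID] != conn {
		return false
	}
	delete(h.conns, deviceID)
	return true
}

// runScenario walks through a reject, a takeover, and a late stale disconnect.
func runScenario() (rejectedFresh, replacedStale, lateDisconnectIgnored bool) {
	h := NewHub()
	threshold := 30 * time.Second // assumed value, not the PR's default
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)

	first, _, _ := h.Connect("robot-1", t0, threshold)

	_, _, err := h.Connect("robot-1", t0.Add(5*time.Second), threshold)
	rejectedFresh = errors.Is(err, ErrAlreadyConnected)

	second, replaced, err := h.Connect("robot-1", t0.Add(45*time.Second), threshold)
	replacedStale = err == nil && replaced == first

	lateDisconnectIgnored = !h.Disconnect("robot-1", first) && h.Disconnect("robot-1", second)
	return
}

func main() {
	a, b, c := runScenario()
	fmt.Println(a, b, c)
}
```

Comparing the stored pointer in `Disconnect` is what keeps a late-exiting stale handler from removing its replacement. In a real handler the caller would also close the socket of the replaced connection.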

Added Files

  • [internal/api/handlers/task_state_recovery_test.go](internal/api/handlers/task_state_recovery_test.go) - Covers task reconciliation from recorder ready/recording/start callback paths.
  • [internal/api/handlers/websocket_log.go](internal/api/handlers/websocket_log.go) - Centralizes filtering of expected WebSocket close errors.
  • [internal/services/hub_test.go](internal/services/hub_test.go) - Covers stale-connection rejection/replacement behavior and stale handler disconnect safety.
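
To illustrate what centralizing the filtering of expected close errors can look like, here is a minimal sketch that uses only standard-library sentinel errors as stand-ins. The PR description does not say which WebSocket library or close codes Keystone checks, so the function names and the set of "expected" errors below are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
)

// isExpectedCloseError reports whether err looks like a normal connection
// teardown. The chosen sentinels are stand-ins for this sketch.
func isExpectedCloseError(err error) bool {
	if err == nil {
		return false
	}
	return errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrUnexpectedEOF) ||
		errors.Is(err, net.ErrClosed)
}

// logLevelFor picks a log level so that routine disconnects do not show up
// as errors.
func logLevelFor(err error) string {
	if isExpectedCloseError(err) {
		return "debug"
	}
	return "error"
}

// classifySamples runs the filter over a few wrapped example errors.
func classifySamples() []string {
	samples := []error{
		fmt.Errorf("read message: %w", io.EOF),
		fmt.Errorf("write ping: %w", net.ErrClosed),
		errors.New("invalid state_update payload"),
	}
	levels := make([]string, 0, len(samples))
	for _, err := range samples {
		levels = append(levels, logLevelFor(err))
	}
	return levels
}

func main() {
	fmt.Println(classifySamples())
}
```

Having a single helper means the recorder and transfer handlers apply the same definition of "expected", so a change to that definition happens in one place.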

Deleted Files

None.


Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update (documentation changes only)
  • Refactoring (code improvement without functional changes)
  • Performance improvement (code changes that improve performance)
  • Test changes (adding, modifying, or removing tests)

Impact Analysis

Breaking Changes

None.

Backward Compatibility

Fully backward compatible. Existing recorder and transfer WebSocket protocols are unchanged.


Testing

Test Environment

Local Keystone worktree with GOCACHE=/tmp/keystone-go-build-cache.

Test Cases

  • Unit tests pass locally
  • Integration tests pass locally
  • E2E tests pass (if applicable)
  • Manual testing completed

Manual Testing Steps

  • Verified reconnect behavior with stopped/resumed Axon processes and reviewed Keystone logs for stale connection replacement and task-state reconciliation.

Test Coverage

  • New tests added
  • Existing tests updated
  • Coverage maintained or improved

Commands run:

env GOCACHE=/tmp/keystone-go-build-cache go test ./internal/api/handlers -run 'TestRecorderStateUpdateReadyRestoresPendingTask|TestRecorderStateUpdateRecordingAdvancesPendingTask|TestRecordingStartCallbackAdvancesPendingTask' -count=1
env GOCACHE=/tmp/keystone-go-build-cache go test ./internal/api/handlers/...
env GOCACHE=/tmp/keystone-go-build-cache go test ./internal/services/...
env GOCACHE=/tmp/keystone-go-build-cache go test ./...
git diff --check

Screenshots / Recordings

Not applicable.


Performance Impact

  • Memory usage: No meaningful change
  • CPU usage: No meaningful change
  • Throughput: No meaningful change
  • Lock contention: No meaningful change

Documentation


Related Issues

  • Related to Axon recorder/transfer reconnect recovery after sudden process stop or robot power loss.
  • Follow-up: durable pending upload intents are still needed for full recovery during transfer upload/ACK interruption windows.

Additional Notes

  • This PR intentionally does not include the untracked pending-upload-intents design documents.
  • This PR does not implement durable upload intents. It improves stale WebSocket takeover and task-state reconciliation as a smaller, mergeable fix.
  • Known limitation: if transfer disconnects during the upload/ACK window, some edge cases still require the planned pending upload intent design.

Reviewers


Notes for Reviewers

  • Please focus on stale WebSocket takeover behavior in [internal/services/hub.go](internal/services/hub.go).
  • Please review recorder task-state reconciliation in [internal/api/handlers/axon_rpc.go](internal/api/handlers/axon_rpc.go).
  • Please review upload completion behavior in [internal/api/handlers/transfer.go](internal/api/handlers/transfer.go).

Checklist for Reviewers

  • Code changes are correct and well-implemented
  • Tests are adequate and pass
  • Documentation is updated and accurate
  • No unintended side effects
  • Performance impact is acceptable
  • Backward compatibility maintained (if applicable)

@shark0F0497 shark0F0497 merged commit d2139b9 into main May 27, 2026
5 checks passed
@shark0F0497 shark0F0497 deleted the keystone-worktree2-main branch May 27, 2026 15:44