FEATURE: SSE Resumption in execd#583
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 839e0aa277
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
ninan-nn
left a comment
There was a problem hiding this comment.
LGTM : > I think this should be described as best-effort resume rather than lossless resumption. There are still at least two event-loss windows: concurrent stdout/stderr append can drop older eids due to the buffer’s strict monotonic check, and the replay-then-attach flow can miss events generated between the buffer snapshot and live handoff. So reconnect improves recoverability, but it does not yet guarantee gap-free continuation.
839e0aa to
441e3b5
Compare
e2e4ff4 to
7ac7afd
Compare
# Conflicts: # components/execd/pkg/web/model/error.go
# Conflicts: # .github/workflows/execd-test.yml # components/execd/README.md # components/execd/README_zh.md # components/execd/pkg/web/controller/codeinterpreting.go # components/execd/pkg/web/controller/command.go # components/execd/pkg/web/model/codeinterpreting.go
Update OnExecuteStdout/OnExecuteStderr callbacks to match new (eid int64, string) signature. Remove stale cwd existence validation test — Validate() delegates to runtime for path resolution. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop resumeEnabled from RunCode (no /code resume endpoint) - Give terminal events (complete/error) eids so they enter resume buffer - Merge nextStdoutStderrEventID/NextControlEventID into NextEventID - Use atomic.Bool for resumeEnabled to avoid data race - Remove unused ErrorCodeNotImplemented constant Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Transparently retry via GET /command/{id}/resume?after_eid={last_eid}
when the SSE connection drops. Tracks eid from each event so the
server-side ring buffer can replay only missed events.
- Add eid field to EventNode
- Wrap _execute_streaming_request with retry loop (max 3 attempts)
- Handle 409 (retry after 1s), 404 (graceful exit)
- Both async and sync adapters
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Simulate mid-stream ReadError via custom httpx byte streams,
then verify the adapter transparently resumes via GET
/command/{id}/resume?after_eid=<last_eid> and delivers all
events including the completion on the retry.
Both async (_ErrorAfterAsyncStream) and sync (_SyncErrorAfterStream).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds e2e test that injects mid-stream ReadError via custom httpx transport and verifies transparent SDK resume against real execd. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ream Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Session-scoped commands use a separate storage path (bashSessionClientMap), never set resumeEnabled, and have no resume route. The SDK attempts resume on disconnect but execd returns 404. Only /command path supports resume. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement automatic SSE connection drop recovery in Go, Kotlin, C#, and
TypeScript SDKs. Each SDK now tracks the monotonic eid per event, and on
network failure retries up to 3 times via GET /command/{id}/resume.
Go: io.MultiReader + errorReader for disconnect injection in tests
Kotlin: MockWebServer Dispatcher with DISCONNECT_DURING_RESPONSE_BODY
C#: TruncatedStreamContent for partial body delivery
TypeScript: ReadableStream start/pull for mid-stream error simulation
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…lint Line exceeded 140-char max-line-length rule. Split the long appendLine call with embedded JSON string across multiple lines. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Problem
POST /commandstream output over SSE; if the client disconnects, they lose the tail of the stream and cannot catch up or reconnect to the same execution cleanly.What we changed (how it’s solved)
pkg/sbuf: bounded ring-buffer store per stream id, keyed by eid, with strict monotonic checks - storage for replay after disconnect.RunCommand/RunCodewith resumeEnabled, the primary SSE registers a hub, and writeSSE both writes to the live connection and appends stdout/stderr frames (eid > 0) to the buffer; on request end, cleanup removes hub + buffer entry.GET /command/:id/resume: EventsAfter replays buffered SSE payloads, then if the command is still running and no primary SSE holds the live slot, tryAttachResume attaches the new response writer as the sole live consumer; 409 if the primary stream is still active.GET /command/{id}/resume, after_eid, responses (200 SSE, 409 via shared Conflict), aligned with the implementation.Testing
Breaking Changes
Checklist
execdand SDKs #507