Skip to content

FEATURE: SSE Resumption in execd#583

Open
Pangjiping wants to merge 23 commits into
alibaba:mainfrom
Pangjiping:feat/sse-resumption
Open

FEATURE: SSE Resumption in execd#583
Pangjiping wants to merge 23 commits into
alibaba:mainfrom
Pangjiping:feat/sse-resumption

Conversation

@Pangjiping
Copy link
Copy Markdown
Collaborator

@Pangjiping Pangjiping commented Mar 27, 2026

Summary

Problem

  • POST /command stream output over SSE; if the client disconnects, they lose the tail of the stream and cannot catch up or reconnect to the same execution cleanly.
  • Need a stable, monotonic event id (eid) per stdout/stderr chunk so clients can request "everything after after_eid" without ambiguity.

What we changed (how it’s solved)

  • pkg/sbuf: bounded ring-buffer store per stream id, keyed by eid, with strict monotonic checks - storage for replay after disconnect.
  • command_stream + SSE path: for RunCommand / RunCode with resumeEnabled, the primary SSE registers a hub, and writeSSE both writes to the live connection and appends stdout/stderr frames (eid > 0) to the buffer; on request end, cleanup removes hub + buffer entry.
  • GET /command/:id/resume: EventsAfter replays buffered SSE payloads, then if the command is still running and no primary SSE holds the live slot, tryAttachResume attaches the new response writer as the sole live consumer; 409 if the primary stream is still active.
  • specs/execd-api.yaml: documents GET /command/{id}/resume, after_eid, responses (200 SSE, 409 via shared Conflict), aligned with the implementation.

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation. execd for SSE Resumption Between execd and SDKs #507
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 839e0aa277

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread components/execd/pkg/runtime/types.go Outdated
Comment thread components/execd/pkg/web/controller/command_stream.go Outdated
Comment thread components/execd/pkg/web/controller/command.go
ninan-nn
ninan-nn previously approved these changes Mar 27, 2026
Copy link
Copy Markdown
Collaborator

@ninan-nn ninan-nn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM : > I think this should be described as best-effort resume rather than lossless resumption. There are still at least two event-loss windows: concurrent stdout/stderr append can drop older eids due to the buffer’s strict monotonic check, and the replay-then-attach flow can miss events generated between the buffer snapshot and live handoff. So reconnect improves recoverability, but it does not yet guarantee gap-free continuation.

@Pangjiping Pangjiping force-pushed the feat/sse-resumption branch from e2e4ff4 to 7ac7afd Compare March 28, 2026 08:31
Pangjiping and others added 11 commits March 28, 2026 16:36
# Conflicts:
#	components/execd/pkg/web/model/error.go
# Conflicts:
#	.github/workflows/execd-test.yml
#	components/execd/README.md
#	components/execd/README_zh.md
#	components/execd/pkg/web/controller/codeinterpreting.go
#	components/execd/pkg/web/controller/command.go
#	components/execd/pkg/web/model/codeinterpreting.go
Update OnExecuteStdout/OnExecuteStderr callbacks to match new (eid int64, string)
signature. Remove stale cwd existence validation test — Validate() delegates to
runtime for path resolution.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop resumeEnabled from RunCode (no /code resume endpoint)
- Give terminal events (complete/error) eids so they enter resume buffer
- Merge nextStdoutStderrEventID/NextControlEventID into NextEventID
- Use atomic.Bool for resumeEnabled to avoid data race
- Remove unused ErrorCodeNotImplemented constant

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Transparently retry via GET /command/{id}/resume?after_eid={last_eid}
when the SSE connection drops. Tracks eid from each event so the
server-side ring buffer can replay only missed events.

- Add eid field to EventNode
- Wrap _execute_streaming_request with retry loop (max 3 attempts)
- Handle 409 (retry after 1s), 404 (graceful exit)
- Both async and sync adapters

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Simulate mid-stream ReadError via custom httpx byte streams,
then verify the adapter transparently resumes via GET
/command/{id}/resume?after_eid=<last_eid> and delivers all
events including the completion on the retry.

Both async (_ErrorAfterAsyncStream) and sync (_SyncErrorAfterStream).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds e2e test that injects mid-stream ReadError via custom httpx
transport and verifies transparent SDK resume against real execd.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pangjiping and others added 5 commits May 3, 2026 09:16
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ream

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Session-scoped commands use a separate storage path (bashSessionClientMap),
never set resumeEnabled, and have no resume route. The SDK attempts resume
on disconnect but execd returns 404. Only /command path supports resume.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implement automatic SSE connection drop recovery in Go, Kotlin, C#, and
TypeScript SDKs. Each SDK now tracks the monotonic eid per event, and on
network failure retries up to 3 times via GET /command/{id}/resume.

Go: io.MultiReader + errorReader for disconnect injection in tests
Kotlin: MockWebServer Dispatcher with DISCONNECT_DURING_RESPONSE_BODY
C#: TruncatedStreamContent for partial body delivery
TypeScript: ReadableStream start/pull for mid-stream error simulation

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…lint

Line exceeded 140-char max-line-length rule. Split the long appendLine
call with embedded JSON string across multiple lines.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/execd feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants