
feat: retry Iron Loop executor on API overload (529) with configurable backoff #7

Open
davidbijl wants to merge 2 commits into robotijn:main from davidbijl:feat/overload-retry-529

Conversation

@davidbijl

Fixes #6.

Summary

  • Layer 1 — executor agent (agents/iron-loop/iron-loop-executor.md): new API Overload (529) Handling section instructs the executor to distinguish pre-write overloads (safe to retry) from mid-write overloads (human gate required) and write the appropriate .status file.
  • Layer 2 — state layer (background.js, actions.js, state.js): overload-retry and overload-partial added to the status enum; cleanupStaleInProgress skips overload plans; startAgent resumes an in-progress overload-retry plan instead of picking a new todo plan, and blocks with a human-gate error when an overload-partial plan exists; getAgentStatus surfaces overload states when no lock is held.
  • Layer 3 — dashboard (menu-screens.js): AGENT section shows ⏳ retry in Xm — <plan> for scheduled retries and ⚠ partial write — review: <plan> for mid-write overloads.
  • Config (settings.js): new retry category with overloadIntervalSeconds (default 600 s / 10 min).
  • Tests (tests/overload-retry.test.js): 9 unit tests covering all three layers (icon enum, writeStatus fields, cleanup skip logic, startAgent resume/block paths, dashboard labels, config schema).
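The Layer 2 decision described above can be sketched roughly as follows. This is an illustrative sketch, not the code in actions.js: the function names `isOverloadStatus` and `chooseNextPlan`, the plan object shape, and the choice to check overload-partial before overload-retry are all assumptions made for the example.

```javascript
// Hypothetical sketch of the startAgent decision; names and shapes are assumptions.
const OVERLOAD_RETRY = 'overload-retry';
const OVERLOAD_PARTIAL = 'overload-partial';

// Stale-plan cleanup would skip any plan whose status is one of the overload states.
function isOverloadStatus(status) {
  return status === OVERLOAD_RETRY || status === OVERLOAD_PARTIAL;
}

// inProgress: array of { name, status }; todo: array of plan names.
function chooseNextPlan(inProgress, todo) {
  // A mid-write overload needs a human to review the partial write first.
  const partial = inProgress.find((p) => p.status === OVERLOAD_PARTIAL);
  if (partial) {
    return { action: 'blocked', plan: partial.name };
  }
  // A pre-write overload is safe to resume, so it wins over new todo plans.
  const retry = inProgress.find((p) => p.status === OVERLOAD_RETRY);
  if (retry) {
    return { action: 'resume', plan: retry.name };
  }
  // Otherwise fall back to the usual behaviour: pick the first todo plan.
  if (todo.length > 0) {
    return { action: 'start', plan: todo[0] };
  }
  return { action: 'idle', plan: null };
}

console.log(chooseNextPlan([{ name: 'a.md', status: OVERLOAD_RETRY }], ['b.md']));
```

Keeping the decision in a pure function like this makes the resume and block paths easy to unit test without touching the plans directory.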

Answers to the open questions in #6

  1. Preferred layer for retry logic: the executor agent writes the status (Layer 1) and exits; the state layer drives resume/block on the next startAgent call (Layer 2). No ScheduleWakeup dependency — the operator restarts via the menu when ready, or the executor can call ScheduleWakeup if it's available in its context (the agent instructions mention it as optional).
  2. Step-level resume vs full restart: full restart from the beginning of the current plan. The plan's completed [x] checkboxes are on disk, so the executor can fast-forward past already-done steps. No separate step-marker mechanism is needed for a first pass.
  3. ScheduleWakeup availability: treated as optional in the agent instructions — if available, use it; if not, exit cleanly. The dashboard indicator and the menu's Start Agent button serve as the manual resume path.
  4. Scope: all three layers are included, but the changes are minimal and additive — no existing behaviour is modified except cleanupStaleInProgress (now skips overload plans) and startAgent (now checks in-progress before picking todo).
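To illustrate answer 2, here is a minimal sketch of how completed [x] checkboxes on disk could drive the fast-forward. The helper name `firstPendingStep` and the assumption that plan steps are markdown task-list items (`- [ ]` / `- [x]`) are mine; the PR leaves the fast-forward to the executor agent rather than to code like this.

```javascript
// Hypothetical helper: find the first unchecked step in a plan's markdown.
// Assumes steps are task-list items such as "- [x] done" or "- [ ] pending".
function firstPendingStep(planMarkdown) {
  const steps = [];
  for (const line of planMarkdown.split(/\r?\n/)) {
    const match = /^\s*[-*]\s+\[( |x|X)\]\s+(.*)$/.exec(line);
    if (match) {
      steps.push({ done: match[1] !== ' ', text: match[2].trim() });
    }
  }
  const index = steps.findIndex((step) => !step.done);
  return {
    total: steps.length,
    index, // -1 when every step is already checked
    step: index === -1 ? null : steps[index].text,
  };
}

const plan = [
  '# Example plan',
  '- [x] Write failing test',
  '- [x] Implement helper',
  '- [ ] Update docs',
].join('\n');

console.log(firstPendingStep(plan));
```

A full restart that calls something like this first would skip straight to "Update docs" in the example, which is why no separate step marker is needed for a first pass.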

Test plan

  • Run node --test tests/overload-retry.test.js — 9 tests, 0 failures
  • Run node --test tests/*.test.js — existing suite passes (the one pre-existing failure in update.test.js is unrelated to this PR and was already failing on main before these changes)
  • Manually: simulate an overload-retry status file in a plan under plans/in-progress/, open /ctoc:menu, confirm the dashboard AGENT section shows ⏳ retry in Xm
  • Manually: simulate an overload-partial status file, confirm ⚠ partial write — review
  • Manually: click Start Agent with an overload-retry plan present, confirm the executor resumes that plan rather than picking a new todo plan
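For the manual steps above, a small script can create the simulated status file. This sketch assumes the status file is JSON with `status` and `retry_at` fields and is named `<plan>.status`; the PR text only says the status lives in the plan's .status file and that retry_at is preserved, so check the real format in background.js before relying on this. The helper name `writeSimulatedStatus` is hypothetical, and the demo writes to a temp directory rather than to plans/in-progress/.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: write a simulated overload status file next to a plan.
// The JSON shape and the "<plan>.status" naming are assumptions for illustration.
function writeSimulatedStatus(dir, planFile, status, retryInSeconds = 600) {
  const payload = { status };
  if (status === 'overload-retry') {
    // Schedule the retry relative to now, mirroring the 600 s default.
    payload.retry_at = new Date(Date.now() + retryInSeconds * 1000).toISOString();
  }
  fs.mkdirSync(dir, { recursive: true });
  const statusPath = path.join(dir, `${planFile}.status`);
  fs.writeFileSync(statusPath, JSON.stringify(payload, null, 2) + '\n');
  return statusPath;
}

// Demo against a throwaway directory so nothing in the repo is touched.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'overload-sim-'));
const written = writeSimulatedStatus(demoDir, 'example-plan.md', 'overload-retry');
console.log(written);
```

To use it for the manual checks, pass the project's plans/in-progress/ directory and an existing plan file name, then open /ctoc:menu.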

🤖 Generated with Claude Code

davidbijl and others added 2 commits May 25, 2026 17:09
…e backoff

Implements three-layer recovery for HTTP 529 (API overloaded) errors during
Iron Loop executor runs, resolving issue robotijn#6.

Layer 1 — iron-loop-executor.md: adds explicit instructions for the executor
agent to distinguish pre-write (safe to retry) from mid-write (human review
required) overload events and write the appropriate status to the plan's
.status file.

Layer 2 — state layer:
- background.js: adds overload-retry and overload-partial to the status enum,
  preserves retry_at timestamp in writeStatus, adds markOverloadRetry() and
  markOverloadPartial() helpers.
- actions.js: cleanupStaleInProgress now skips overload plans; startAgent
  resumes an overload-retry plan in-progress instead of picking a new todo
  plan, and blocks with a human-gate error when an overload-partial plan exists.
- state.js: getAgentStatus surfaces overload-retry / overload-partial from
  in-progress plan status files when no lock is held.

Layer 3 — menu-screens.js: dashboard AGENT section shows ⏳ retry in Xm for
scheduled retries and ⚠ partial write — review for mid-write overloads.

Config — settings.js: adds retry.overloadIntervalSeconds (default 600s / 10 min).

Tests — tests/overload-retry.test.js: 9 unit tests covering all three layers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
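
As a rough illustration of how retry.overloadIntervalSeconds, the retry_at timestamp, and the "⏳ retry in Xm" label relate, consider the sketch below. The function names `retryAtFrom` and `retryLabel`, the ISO-8601 timestamp format, and rounding up to whole minutes are assumptions, not details taken from the commit.

```javascript
// Default taken from the commit message: 600 s (10 min).
const DEFAULT_OVERLOAD_INTERVAL_SECONDS = 600;

// Hypothetical: compute the retry_at timestamp for a pre-write overload.
function retryAtFrom(now, overloadIntervalSeconds = DEFAULT_OVERLOAD_INTERVAL_SECONDS) {
  if (!Number.isFinite(overloadIntervalSeconds) || overloadIntervalSeconds <= 0) {
    throw new RangeError('overloadIntervalSeconds must be a positive number');
  }
  return new Date(now.getTime() + overloadIntervalSeconds * 1000).toISOString();
}

// Hypothetical: render the dashboard label, rounding up to whole minutes
// and clamping at zero once the retry time has passed.
function retryLabel(retryAt, now) {
  const remainingMs = Math.max(0, Date.parse(retryAt) - now.getTime());
  const minutes = Math.ceil(remainingMs / 60000);
  return `⏳ retry in ${minutes}m`;
}

const overloadedAt = new Date(0);
const retryAt = retryAtFrom(overloadedAt);
console.log(retryLabel(retryAt, overloadedAt)); // prints "⏳ retry in 10m"
```

Passing `now` explicitly keeps both functions deterministic, which suits unit tests for the dashboard labels.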
….38)

Adds a NOTES section to the dashboard (between INBOX and AGENT) and a
menu.md rule that instructs Claude to read NOTES.md at the project root
on every /ctoc:menu invocation, surfacing queued user notes to the user.

NOTES.md is the user->Claude inbox: the ctoc-remote web client appends
to it from the browser, and Claude needs a natural moment to notice
those notes — invoking the menu is that moment. Without this wiring, a
user could submit a note via web client and Claude would only see it on
the next full project context refresh.

Distinct from .ctoc/inbox/ (lib/inbox.js), which is the agent->user
direction and is "READ-ONLY at its surface — writes come from upstream
agents". NOTES.md is plain markdown so it's visible to any directory
inspection too; the menu surfacing just makes it actively read instead
of merely noticeable.

- src/lib/notes.js: getNotesCount + readNotes helpers (null-safe, do not
  throw on missing/unreadable files).
- src/lib/menu-screens.js: dashboard NOTES section mirrors INBOX style
  (empty = "○ No notes pending", non-empty = "⊙ N notes pending").
- src/commands/menu.md: new Rule 8 explaining the read-and-surface
  behaviour and the distinction from INBOX.
- tests/notes.test.js: getNotesCount/readNotes happy + edge paths.
- tests/menu-screens.test.js: NOTES section render coverage (empty +
  populated).
- README.md + tests/readme-numbers.test.js: src/lib JS-module count
  bumped 105 → 106 and notes added to the lib listing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
