feat(board-worker): persist executor failure diagnostics (close the investigation gap) by ProtocolWarden · Pull Request #345 · ProtocolWarden/OperationsCenter

ProtocolWarden · 2026-06-19T23:04:49Z

Summary

Answers "why isn't the controller investigating?". board_unblock now requeues failed tasks, but execution failures were diagnostically opaque: dispatch ran the executor with capture_output=True yet discarded proc.stdout/stderr on every failure path, team_executor persists no run artifacts, and the task recorded only a summary (N of N stages failed). A recurring failure (e.g. #264, 4/4 stages) could not be root-caused — the controller could only blind-requeue, and the evidence was thrown away.

Fix

persist_failure_diagnostics() writes the executor's stdout/stderr + result.json to a durable logs/local/failures/<role>-<short_id>.log and appends a [diagnostics: <path>] pointer + tail to result['failure_reason'], which flows into the task comment and fleet log. Wired into dispatch's failure branch; also captures the retry proc output (previously discarded too). Best-effort — never crashes dispatch.

Verification

5 new tests; 240 board_worker tests pass; ruff/ty clean; pre-push audit 0 findings; dispatch kept at 499 lines (C29). Deployed directly to the live fleet so the next execution failure is captured.

…ation board_unblock requeues failed tasks, but execution failures were diagnostically opaque: dispatch ran the executor with capture_output=True yet discarded proc.stdout/stderr on every failure path, team_executor persists no run artifacts, and the task recorded only 'N of N stages failed'. A recurring failure could not be root-caused — the controller could only blind-requeue. persist_failure_diagnostics() writes the executor's stdout/stderr + result.json to a durable logs/local/failures/<role>-<short_id>.log and appends a [diagnostics: <path>] pointer + tail to result['failure_reason'] (flows into the task comment + fleet log). Wired into dispatch's failure branch; also captures the retry proc output (previously discarded). Best-effort, never crashes dispatch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ProtocolWarden · 2026-06-19T23:12:20Z

Resolved: superseded by new push — re-review resumed

~~Self-review concerns~~ — auto-fixing (up to 6 attempts; re-queued if still unresolved):

Critical concern: Variable proc scope. The diff shows proc being captured only in the retry section (line 279), but persist_failure_diagnostics(proc, ...) is called at line 335 which appears to be outside the retry block. This would cause NameError if the initial dispatch fails without triggering a retry, unless proc is also captured from the initial dispatch (not shown in diff). Verify that: (1) the initial dispatch result also captures proc, or (2) the control flow ensures persist_failure_diagnostics is only called when proc is in scope.

Otherwise: Implementation is well-designed (bounded output, best-effort error handling, thorough tests). Function logic is correct, tests are comprehensive (5 tests covering main cases and edge cases), and integration points are reasonable. Code quality is high.

…oncern resolved

…lean (0 findings) Verified complete and proper wiring of persist-exec-diagnostics feature. Integration gate results: D12=0, DC10=0 — all acceptance criteria met. Proc variable scope concern from self-review is unfounded; all execution paths guarantee proc is in scope before persist_failure_diagnostics call at line 336. All tests passing (240+ board_worker tests), full test suite green (9,357 tests), linting clean (0 violations). Production-ready for merge to main. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

…cess calls The baseline-validation gate (.venv/bin/pytest -q on the fresh workspace clone) failed because tests/unit/test_documentation_accuracy.py shelled out with bare ['python', '-m', 'pytest', ...]. In the workspace env that 'python' has no pytest, so stdout was empty and 'assert integration in markers_output' failed. CI only passed because there bare 'python' is the pytest interpreter. That single failure exited pytest 1, failing baseline validation and blocking EVERY improve/goal task ('N of N stages failed'). Replace all 7 bare-python subprocess calls with sys.executable (the interpreter running the test, which always has pytest). Full tree now 9377 passed / 0 failed. Root-caused via the failure-diagnostics capture added in #345. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…cess calls (#346) The baseline-validation gate (.venv/bin/pytest -q on the fresh workspace clone) failed because tests/unit/test_documentation_accuracy.py shelled out with bare ['python', '-m', 'pytest', ...]. In the workspace env that 'python' has no pytest, so stdout was empty and 'assert integration in markers_output' failed. CI only passed because there bare 'python' is the pytest interpreter. That single failure exited pytest 1, failing baseline validation and blocking EVERY improve/goal task ('N of N stages failed'). Replace all 7 bare-python subprocess calls with sys.executable (the interpreter running the test, which always has pytest). Full tree now 9377 passed / 0 failed. Root-caused via the failure-diagnostics capture added in #345. Co-authored-by: ProtocolWarden <ProtocolWarden@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…fusal) (#347) With the baseline blocker fixed (#346), tasks reached the agent stages and the #345 diagnostics surfaced the next failure: the planner stage got a prose refusal 'ContextGuard requires CL_ANCHOR to be set ... run eval $(cl session start ...)' instead of a JSON plan. OC's CLAUDE.md ContextGuard requires every Claude session targeting OC to be anchored; operations-center.sh sets CL_ANCHOR on the fleet, but build_allowlist_env (#340) stripped it (not in _ENV_PASSTHROUGH), re-breaking the #311 unblock — same regression class as the #344 PATH bug. Add CL_ANCHOR/CL_HOME/CL_SESSION_ID to the passthrough (forwarded only when present) so the executor agent stays anchored and cl_dispatch_wrap hydrate/capture is not silently disabled. Co-authored-by: ProtocolWarden <ProtocolWarden@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

docs(.console): Stage 1 verification complete — proc variable scope c…

12789d0

…oncern resolved

ProtocolWarden mentioned this pull request Jun 19, 2026

fix(tests): use sys.executable in doc-accuracy subprocess calls (unblocks the whole improve/goal pipeline) #346

Merged

ProtocolWarden merged commit 4ca8c06 into main Jun 19, 2026
21 checks passed

ProtocolWarden deleted the goal/persist-exec-diagnostics branch June 19, 2026 23:34

ProtocolWarden mentioned this pull request Jun 20, 2026

fix(board-worker): forward CL_ANCHOR to the executor (unblocks agent stages) #347

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(board-worker): persist executor failure diagnostics (close the investigation gap)#345

feat(board-worker): persist executor failure diagnostics (close the investigation gap)#345
ProtocolWarden merged 3 commits into
mainfrom
goal/persist-exec-diagnostics

ProtocolWarden commented Jun 19, 2026

Uh oh!

ProtocolWarden commented Jun 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ProtocolWarden commented Jun 19, 2026

Summary

Fix

Verification

Uh oh!

ProtocolWarden commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ProtocolWarden commented Jun 19, 2026 •

edited

Loading