feat(board-worker): persist executor failure diagnostics (close the investigation gap)#345
Conversation
…ation board_unblock requeues failed tasks, but execution failures were diagnostically opaque: dispatch ran the executor with capture_output=True yet discarded proc.stdout/stderr on every failure path, team_executor persists no run artifacts, and the task recorded only 'N of N stages failed'. A recurring failure could not be root-caused — the controller could only blind-requeue. persist_failure_diagnostics() writes the executor's stdout/stderr + result.json to a durable logs/local/failures/<role>-<short_id>.log and appends a [diagnostics: <path>] pointer + tail to result['failure_reason'] (flows into the task comment + fleet log). Wired into dispatch's failure branch; also captures the retry proc output (previously discarded). Best-effort, never crashes dispatch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Critical concern: Variable Otherwise: Implementation is well-designed (bounded output, best-effort error handling, thorough tests). Function logic is correct, tests are comprehensive (5 tests covering main cases and edge cases), and integration points are reasonable. Code quality is high. |
…lean (0 findings) Verified complete and proper wiring of persist-exec-diagnostics feature. Integration gate results: D12=0, DC10=0 — all acceptance criteria met. Proc variable scope concern from self-review is unfounded; all execution paths guarantee proc is in scope before persist_failure_diagnostics call at line 336. All tests passing (240+ board_worker tests), full test suite green (9,357 tests), linting clean (0 violations). Production-ready for merge to main. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…cess calls
The baseline-validation gate (.venv/bin/pytest -q on the fresh workspace clone)
failed because tests/unit/test_documentation_accuracy.py shelled out with bare
['python', '-m', 'pytest', ...]. In the workspace env that 'python' has no pytest,
so stdout was empty and 'assert integration in markers_output' failed. CI only
passed because there bare 'python' is the pytest interpreter. That single failure
exited pytest 1, failing baseline validation and blocking EVERY improve/goal task
('N of N stages failed').
Replace all 7 bare-python subprocess calls with sys.executable (the interpreter
running the test, which always has pytest). Full tree now 9377 passed / 0 failed.
Root-caused via the failure-diagnostics capture added in #345.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cess calls (#346) The baseline-validation gate (.venv/bin/pytest -q on the fresh workspace clone) failed because tests/unit/test_documentation_accuracy.py shelled out with bare ['python', '-m', 'pytest', ...]. In the workspace env that 'python' has no pytest, so stdout was empty and 'assert integration in markers_output' failed. CI only passed because there bare 'python' is the pytest interpreter. That single failure exited pytest 1, failing baseline validation and blocking EVERY improve/goal task ('N of N stages failed'). Replace all 7 bare-python subprocess calls with sys.executable (the interpreter running the test, which always has pytest). Full tree now 9377 passed / 0 failed. Root-caused via the failure-diagnostics capture added in #345. Co-authored-by: ProtocolWarden <ProtocolWarden@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…fusal) (#347) With the baseline blocker fixed (#346), tasks reached the agent stages and the #345 diagnostics surfaced the next failure: the planner stage got a prose refusal 'ContextGuard requires CL_ANCHOR to be set ... run eval $(cl session start ...)' instead of a JSON plan. OC's CLAUDE.md ContextGuard requires every Claude session targeting OC to be anchored; operations-center.sh sets CL_ANCHOR on the fleet, but build_allowlist_env (#340) stripped it (not in _ENV_PASSTHROUGH), re-breaking the #311 unblock — same regression class as the #344 PATH bug. Add CL_ANCHOR/CL_HOME/CL_SESSION_ID to the passthrough (forwarded only when present) so the executor agent stays anchored and cl_dispatch_wrap hydrate/capture is not silently disabled. Co-authored-by: ProtocolWarden <ProtocolWarden@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Answers "why isn't the controller investigating?". board_unblock now requeues failed tasks, but execution failures were diagnostically opaque: dispatch ran the executor with
capture_output=Trueyet discardedproc.stdout/stderron every failure path,team_executorpersists no run artifacts, and the task recorded only a summary (N of N stages failed). A recurring failure (e.g. #264, 4/4 stages) could not be root-caused — the controller could only blind-requeue, and the evidence was thrown away.Fix
persist_failure_diagnostics()writes the executor's stdout/stderr + result.json to a durablelogs/local/failures/<role>-<short_id>.logand appends a[diagnostics: <path>]pointer + tail toresult['failure_reason'], which flows into the task comment and fleet log. Wired into dispatch's failure branch; also captures the retry proc output (previously discarded too). Best-effort — never crashes dispatch.Verification
5 new tests; 240 board_worker tests pass; ruff/ty clean; pre-push audit 0 findings; dispatch kept at 499 lines (C29). Deployed directly to the live fleet so the next execution failure is captured.