diff --git a/.gitignore b/.gitignore index a2a9314..5bed249 100644 --- a/.gitignore +++ b/.gitignore @@ -3,6 +3,8 @@ !.agents/** !.github !.github/** +!claude/.claude-plugin +!claude/.claude-plugin/** .tts .reflection .opencode/ @@ -28,9 +30,18 @@ test/mocks/ evals/results/ evals/evals/ +# Raw CC stop datasets (contain real user session data — never commit) +evals/datasets/cc-stop-candidates-raw.jsonl +evals/datasets/cc-stop-candidates-filtered.jsonl +evals/datasets/cc-stop-classified.jsonl +# Allow committing the redacted gold set only +!evals/datasets/cc-stop-labeled-gold-redacted.jsonl + # E2E eval reports .eval/ .eval-tmp/ # Working notes plan.md +# Added by code-review-graph +.code-review-graph/ diff --git a/README.md b/README.md index 97707e9..feeb7fd 100644 --- a/README.md +++ b/README.md @@ -4,6 +4,9 @@ ## @reflection-3.ts - push opencode agent to reflect on the task, useful for continuous interrupted runs image +## @claude/ - Claude Code reflection plugin (experimental) +Port of the reflection idea to Claude Code as a `Stop` hook. Classifies the last assistant turn into one of six categories (complete, working, waiting_for_user_legitimate, tool_available_punt, summary_drift_stop, genuinely_stuck) with Claude Haiku 4.5, and re-prompts the agent when it punted to the user, drifted into "summary + next step + stop", or halted mid-thought. Honors `stop_hook_active` loop guard and caps at 3 inject cycles per session. Install with `claude --plugin-dir ./claude` for dev or via the marketplace once published. See [`claude/README.md`](claude/README.md). Baseline classifier accuracy and dataset are tracked in [`evals/datasets/README.md`](evals/datasets/README.md) and follow-up [#138](https://github.com/dzianisv/opencode-plugins/issues/138). + ## @telegram.ts - integrates with Telegram over [t.me/OpencodeMgrBot](@OpenCodeMgrBot) bot image @tts.ts - uses coqui TTS to read the opencode agent response. 
Useful to run a few agents on macOS and be notified when one finishes a task. diff --git a/claude/.claude-plugin/plugin.json b/claude/.claude-plugin/plugin.json new file mode 100644 index 0000000..1159bcb --- /dev/null +++ b/claude/.claude-plugin/plugin.json @@ -0,0 +1,13 @@ +{ + "name": "reflection-cc", + "version": "0.1.0", + "description": "Re-prompts Claude Code when it stops prematurely due to failure modes like summary-drift-stop or tool-available-punt", + "author": "dzianisv", + "license": "MIT", + "hooks": { + "stop": { + "command": "${CLAUDE_PLUGIN_ROOT}/bin/reflect.mjs", + "timeout": 30000 + } + } +} diff --git a/claude/README.md b/claude/README.md new file mode 100644 index 0000000..96129c3 --- /dev/null +++ b/claude/README.md @@ -0,0 +1,80 @@ +# reflection-cc + +Re-prompts Claude Code when it stops prematurely due to failure modes like summary-drift-stop or tool-available-punt. This plugin intercepts the Stop hook, analyzes the session transcript using Claude Haiku, classifies the failure reason, and decides whether to re-prompt with recovery instructions or accept the stop. + +## Install + +**Recommended (works today, CC v2.x):** add the Stop hook directly to `~/.claude/settings.json`: + +```json +{ + "hooks": { + "Stop": [ + { + "hooks": [ + { + "type": "command", + "command": "/absolute/path/to/opencode-plugins/claude/bin/reflect.mjs", + "timeout": 30 + } + ] + } + ] + } +} +``` + +The plugin manifest under `.claude-plugin/` is included for future marketplace publication, but in CC v2.1.150 `--plugin-dir` and the `enabledPlugins` config path do NOT activate `Stop` hooks for headless `-p` sessions. The settings-based install above is the authoritative path until that gap closes. + +**One-session try:** `claude --settings ''` ... or write the JSON to a file and pass `--settings ./reflect-settings.json`. 
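For the file-based variant of the one-session try, the settings object from the Install section can be generated rather than hand-edited. The snippet below is a minimal sketch, not part of the plugin: `buildReflectSettings` is a hypothetical helper, and the checkout path passed to it is a placeholder you would replace with your own.

```javascript
// Hypothetical helper (not shipped with the plugin): builds the same
// Stop-hook settings object shown in the Install section.
function buildReflectSettings(repoRoot) {
  // The hook command must be an absolute path, as in the Install example.
  if (typeof repoRoot !== 'string' || !repoRoot.startsWith('/')) {
    throw new Error('repoRoot must be an absolute POSIX path');
  }
  const root = repoRoot.replace(/\/+$/, ''); // drop trailing slashes
  return {
    hooks: {
      Stop: [
        {
          hooks: [
            {
              type: 'command',
              command: `${root}/claude/bin/reflect.mjs`,
              timeout: 30, // same value as the Install example
            },
          ],
        },
      ],
    },
  };
}

// Placeholder path; substitute the real location of your checkout.
const settings = buildReflectSettings('/absolute/path/to/opencode-plugins');
console.log(JSON.stringify(settings, null, 2));
```

Redirecting the printed JSON into `./reflect-settings.json` gives the file that the one-session command above expects.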
+ +## Failure Categories + +The classifier maps each Stop into one of these categories (only the first three trigger an inject): + +| Category | Inject? | Description | +|----------|---------|-------------| +| `summary_drift_stop` | **yes** | Agent wrote a plan with a "next step" then stopped before doing it | +| `tool_available_punt` | **yes** | Agent asked the user about something an available tool could resolve | +| `genuinely_stuck` | **yes** | Agent halted mid-thought, no question, no plan | +| `complete` | no | Task finished | +| `working` | no | Mid-action narration (rare at Stop) | +| `waiting_for_user_legitimate` | no | Agent legitimately needs user input | + +## How it works + +1. **Stop Hook**: Claude Code invokes the Stop hook when the agent terminates +2. **Transcript Analysis**: Haiku classifies the session transcript into failure categories +3. **Verdict**: Judge decides to re-prompt with recovery instructions or accept the stop +4. **Session Guards**: Loop prevention via attempt counter (max 3 cycles per session) + +## Testing + +`node claude/test/e2e-cc.mjs` runs 4 real E2E scenarios: + +- `explicit_wait_negative` — user asked "wait" → plugin must not inject. +- `complete_negative` — trivial Q&A → plugin must not inject. +- `attempt_cap_respected` — multi-file task → cap honored. +- `direct_pipe_summary_drift` — synthetic drift transcript piped to `reflect.mjs` → verifies the inject pathway end-to-end including schema-correct stdout. + +Real `claude -p` headless sessions + real Anthropic API. No stubs. Costs roughly $0.05–0.20 per scenario via Haiku 4.5 over your Max-subscription OAuth. Out of CI (auth + cost). Run before any change to the hook payload format. 
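The flow described under "How it works", combined with the inject column of the category table, can be sketched as a single pure decision function. This is a simplified illustration under stated assumptions: `decideStop` and its `reason` text are hypothetical, and the real hook additionally reads the transcript, calls the classifier, and writes files under `.reflection/`. The `{ decision: "block", reason }` output shape is the one the hook source in this PR writes to stdout.

```javascript
// Categories that trigger an inject, per the Failure Categories table.
const INJECT = new Set([
  'summary_drift_stop',
  'tool_available_punt',
  'genuinely_stuck',
]);

// Hypothetical helper for illustration only.
// Returns null to accept the stop, or a block object to re-prompt the agent.
function decideStop(payload, attempts, category, maxAttempts = 3) {
  // Loop guard: this Stop was itself caused by a previous block.
  if (payload?.stop_hook_active === true) return null;
  // Session guard: cap the number of inject cycles per session.
  if (attempts >= maxAttempts) return null;
  // Non-inject categories, including unknown or error labels, accept the stop.
  if (!INJECT.has(category)) return null;
  // Placeholder reason text; the plugin builds per-category guidance instead.
  return { decision: 'block', reason: `reflection (${category}): continue the task` };
}

console.log(decideStop({ stop_hook_active: false }, 0, 'summary_drift_stop'));
console.log(decideStop({ stop_hook_active: false }, 3, 'summary_drift_stop'));
```

Keeping the decision free of I/O makes the guard ordering easy to unit-test without a live `claude -p` session.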
+ +## Configuration + +Environment variables: +- `REFLECTION_CC_DEBUG=1` — Enable debug output +- `REFLECTION_CC_MODEL` — Model for classification (default: `claude-haiku-4-5`) +- `REFLECTION_CC_MAX_ATTEMPTS=3` — Max re-prompt cycles per session + +## Disk Artifacts + +Plugin stores local transcripts and verdicts: +- `.reflection/verdict_.json` — Haiku verdict + recovery instructions +- `.reflection/_.json` — Full transcript snapshot +- `.reflection/_attempts.json` — Attempt counter for session + +**Privacy**: Transcripts stored locally only; never sent externally except to Haiku during classification. + +## Status + +Experimental. Baseline accuracy numbers pending PR evaluation. diff --git a/claude/bin/reflect.mjs b/claude/bin/reflect.mjs new file mode 100755 index 0000000..907337b --- /dev/null +++ b/claude/bin/reflect.mjs @@ -0,0 +1,544 @@ +#!/usr/bin/env node +/** + * reflect.mjs — Claude Code Stop hook entry-point + * + * Reads the CC Stop hook payload from stdin (JSON), applies safety rails, + * reads the transcript tail, builds a stop-context object, and (once task-11 + * lands) invokes the judge LLM to decide whether to inject a follow-up prompt. + * + * Parallel group A, task #10. Tasks 11 & 12 land classifier + feedback templates. + * + * Public API (re-exported for unit tests — task 14): + * readTranscriptTail(path, maxBytes?) 
→ Entry[] + * buildStopContext(stopPayload, transcriptTail) → StopContext + * loopGuard(stopPayload) → boolean + * readAttempts(session_id, cwd) → number + * writeAttemptCounter(session_id, n, cwd) → void + * writeVerdict(session_id, verdictObj, cwd) → void + * debug(obj) → void + */ + +import fs from 'node:fs'; +import path from 'node:path'; +import { classifyStop } from '../lib/judge.mjs'; +import { buildFeedback, INJECT_CATEGORIES } from '../lib/feedback.mjs'; + +// --------------------------------------------------------------------------- +// Configuration +// --------------------------------------------------------------------------- + +const MAX_ATTEMPTS = parseInt(process.env.REFLECTION_CC_MAX_ATTEMPTS ?? '3', 10); +const DEBUG_ENABLED = process.env.REFLECTION_CC_DEBUG === '1'; +const TRANSCRIPT_MAX_BYTES = 200_000; +const TRANSCRIPT_MAX_ENTRIES = 50; + +// --------------------------------------------------------------------------- +// Fail-safe: never block a Stop on a plugin crash +// --------------------------------------------------------------------------- + +process.on('uncaughtException', (err) => { + _debugRaw({ msg: 'uncaught_exception', error: String(err), stack: err?.stack }); + process.exit(0); +}); + +process.on('unhandledRejection', (reason) => { + _debugRaw({ msg: 'unhandled_rejection', reason: String(reason) }); + process.exit(0); +}); + +// --------------------------------------------------------------------------- +// Debug logging +// --------------------------------------------------------------------------- + +/** + * Internal helper — writes without needing cwd (used in crash handlers). + * Falls back to stderr if the file write fails. + */ +function _debugRaw(obj) { + if (!DEBUG_ENABLED) return; + const line = JSON.stringify({ ts: new Date().toISOString(), ...obj }) + '\n'; + try { + // We may not have cwd at crash time; write to /tmp as best-effort. 
+ fs.appendFileSync('/tmp/reflect-cc-crash.log', line); + } catch { + process.stderr.write(line); + } +} + +/** + * Append a timestamped JSON line to `.reflection/debug.log` under `cwd`. + * No-op unless REFLECTION_CC_DEBUG=1. + * + * @param {object} obj - arbitrary JSON-serialisable data + * @param {string} [cwd] - working directory (optional; falls back to process.cwd()) + */ +export function debug(obj, cwd) { + if (!DEBUG_ENABLED) return; + const dir = path.join(cwd ?? process.cwd(), '.reflection'); + const line = JSON.stringify({ ts: new Date().toISOString(), ...obj }) + '\n'; + try { + fs.mkdirSync(dir, { recursive: true }); + fs.appendFileSync(path.join(dir, 'debug.log'), line); + } catch (err) { + // debug must never throw + process.stderr.write(`[reflect.mjs] debug write failed: ${err}\n`); + } +} + +// --------------------------------------------------------------------------- +// cwd sanitization +// --------------------------------------------------------------------------- + +/** + * Validates the `cwd` field from a Stop hook payload before any fs writes. + * Requires an absolute path that survives normalization unchanged (no `..`). + * + * @param {string} cwd + * @returns {string} normalized absolute cwd + * @throws {Error} "reflect: invalid cwd" + */ +export function sanitizeCwd(cwd) { + if (typeof cwd !== 'string' || !path.isAbsolute(cwd)) { + throw new Error('reflect: invalid cwd'); + } + const resolved = path.resolve(cwd); + if (resolved !== cwd) { + throw new Error('reflect: invalid cwd'); + } + return resolved; +} + +// --------------------------------------------------------------------------- +// Loop guard +// --------------------------------------------------------------------------- + +/** + * Returns true if this Stop was triggered by a previous block injection + * (CC sets stop_hook_active=true on the immediate next Stop after a block). + * When true the hook MUST exit 0 — no further processing. 
+ * + * @param {{ stop_hook_active?: boolean }} stopPayload + * @returns {boolean} + */ +export function loopGuard(stopPayload) { + return stopPayload?.stop_hook_active === true; +} + +// --------------------------------------------------------------------------- +// Attempt counter +// --------------------------------------------------------------------------- + +/** + * Returns the current inject attempt count for this session. + * File shape: { count: number, last_iso: string } + * Returns 0 if the file is absent or unreadable. + * + * @param {string} session_id + * @param {string} cwd + * @returns {number} + */ +export function readAttempts(session_id, cwd) { + const file = path.join(cwd, '.reflection', `${session_id}_attempts.json`); + let raw; + try { + raw = fs.readFileSync(file, 'utf8'); + } catch { + return 0; + } + try { + const parsed = JSON.parse(raw); + return typeof parsed.count === 'number' ? parsed.count : 0; + } catch { + debug({ msg: 'attempts_file_corrupt', file }, cwd); + return 0; + } +} + +/** + * Writes the attempt counter for a session. + * Creates `.reflection/` directory if absent. + * + * @param {string} session_id + * @param {number} n + * @param {string} cwd + */ +export function writeAttemptCounter(session_id, n, cwd) { + const dir = path.join(cwd, '.reflection'); + fs.mkdirSync(dir, { recursive: true }); + const file = path.join(dir, `${session_id}_attempts.json`); + + // Concurrency guard: if file exists, only write when newCount > existingCount. + // Ensures max-of-attempts wins across racing Stop hooks on same session. + let existingCount = 0; + try { + const raw = fs.readFileSync(file, 'utf8'); + const parsed = JSON.parse(raw); + if (typeof parsed.count === 'number') existingCount = parsed.count; + } catch { + // missing or corrupt — treat as 0 + } + if (n <= existingCount) return; + + // Atomic write: write to tmp + rename (POSIX rename is atomic). 
+ const tmp = path.join(dir, `${session_id}_attempts.${process.pid}.tmp`); + fs.writeFileSync(tmp, JSON.stringify({ count: n, last_iso: new Date().toISOString() }), 'utf8'); + fs.renameSync(tmp, file); +} + +// --------------------------------------------------------------------------- +// Verdict file +// --------------------------------------------------------------------------- + +/** + * Writes a verdict object to `.reflection/verdict_${session_id}.json`. + * Creates the directory if absent. + * + * @param {string} session_id + * @param {object} verdictObj - arbitrary JSON-serialisable verdict + * @param {string} cwd + */ +export function writeVerdict(session_id, verdictObj, cwd) { + const dir = path.join(cwd, '.reflection'); + fs.mkdirSync(dir, { recursive: true }); + const file = path.join(dir, `verdict_${session_id}.json`); + fs.writeFileSync(file, JSON.stringify(verdictObj, null, 2), 'utf8'); +} + +// --------------------------------------------------------------------------- +// Transcript tail reader +// --------------------------------------------------------------------------- + +/** + * Reads the last `maxBytes` of a Claude Code JSONL transcript file, parses + * complete JSON lines, and returns the last TRANSCRIPT_MAX_ENTRIES entries + * that are either `type==="user"` or `type==="assistant"` (strips tool_use, + * tool_result, attachment, and all other entry types — see design.md §Input Shape). + * + * Transcript JSONL line shape (inferred from real files): + * + * type === "assistant": + * { type, parentUuid, uuid, sessionId, timestamp, message: { + * role: "assistant", + * content: Array< + * | { type: "text", text: string } + * | { type: "thinking", thinking: string } + * | { type: "tool_use", id, name, input } + * > + * }, ... } + * + * type === "user": + * { type, parentUuid, uuid, sessionId, timestamp, message: { + * role: "user", + * content: string | Array< + * | { type: "tool_result", tool_use_id, content } + * > + * }, ... 
} + * + * Other top-level types seen in practice: + * "attachment", "last-prompt", "permission-mode", "bridge-session", + * "file-history-snapshot", "tools_changed", "hook_success", + * "hook_additional_context", "skill_listing", "task_reminder", + * "create", "tool_reference", "direct", "text", "message" (inner) + * + * @param {string} filePath - absolute path to *.jsonl transcript + * @param {number} [maxBytes=200_000] - max bytes to read from the tail + * @returns {Array} - filtered transcript entries (user + assistant only) + */ +export function readTranscriptTail(filePath, maxBytes = TRANSCRIPT_MAX_BYTES) { + let stat; + try { + stat = fs.statSync(filePath); + } catch { + return []; + } + + const fileSize = stat.size; + const readSize = Math.min(maxBytes, fileSize); + const offset = fileSize - readSize; + + let buffer; + let fd; + try { + buffer = Buffer.alloc(readSize); + fd = fs.openSync(filePath, 'r'); + fs.readSync(fd, buffer, 0, readSize, offset); + } catch { + return []; + } finally { + if (fd !== undefined) { + try { fs.closeSync(fd); } catch { /* ignore */ } + } + } + + const rawText = buffer.toString('utf8'); + + // Split on newlines; first line may be a partial line if we truncated mid-line + const lines = rawText.split('\n'); + + // Skip the first segment — it's likely a partial line from the tail offset + const startIdx = offset === 0 ? 0 : 1; + + /** @type {Array} */ + const filtered = []; + + for (let i = startIdx; i < lines.length; i++) { + const line = lines[i].trim(); + if (!line) continue; + + let entry; + try { + entry = JSON.parse(line); + } catch { + continue; // skip malformed lines + } + + const t = entry?.type; + if (t === 'user' || t === 'assistant') { + // For user entries: skip those whose message.content is purely tool_result + // arrays (these are the CC mechanism for tool output, not conversational turns). 
+ if (t === 'user') { + const content = entry?.message?.content; + if (Array.isArray(content)) { + const allToolResults = content.every((c) => c?.type === 'tool_result'); + if (allToolResults) continue; // exclude pure tool-result user turns + } + } + filtered.push(entry); + } + } + + // Return the last N entries + return filtered.slice(-TRANSCRIPT_MAX_ENTRIES); +} + +// --------------------------------------------------------------------------- +// Stop context builder +// --------------------------------------------------------------------------- + +/** + * Extracts a clean "stop context" object from the Stop payload + transcript tail. + * This is the shape passed to the judge LLM (task 11). + * + * @param {object} stopPayload - CC Stop hook JSON from stdin + * @param {string} stopPayload.session_id + * @param {string} stopPayload.cwd + * @param {string} stopPayload.transcript_path + * @param {string} [stopPayload.response] - last assistant text (shortcut from CC) + * @param {string} [stopPayload.hook_event_name] + * @param {Array} transcriptTail - filtered entries from readTranscriptTail() + * @returns {{ + * session_id: string, + * attempt: number, + * user_messages: string[], + * final_assistant_text: string, + * tools_available_inferred: string[], + * raw_tail: Array + * }} + */ +export function buildStopContext(stopPayload, transcriptTail) { + const session_id = stopPayload?.session_id ?? 'unknown'; + const cwd = stopPayload?.cwd ?? 
process.cwd(); + const attempt = readAttempts(session_id, cwd); + + // Extract human-readable user messages (text content only) + const user_messages = []; + for (const entry of transcriptTail) { + if (entry.type !== 'user') continue; + const content = entry?.message?.content; + if (typeof content === 'string' && content.trim()) { + user_messages.push(content.trim()); + } else if (Array.isArray(content)) { + // Grab only text blocks from mixed content arrays + for (const block of content) { + if (block?.type === 'text' && typeof block.text === 'string' && block.text.trim()) { + user_messages.push(block.text.trim()); + } + } + } + } + + // Derive final assistant text: prefer CC's `response` field (it IS the last turn), + // fall back to the last assistant entry's text content from the tail. + let final_assistant_text = (stopPayload?.response ?? '').trim(); + if (!final_assistant_text) { + // Walk tail in reverse, find last assistant entry with a text block + for (let i = transcriptTail.length - 1; i >= 0; i--) { + const entry = transcriptTail[i]; + if (entry.type !== 'assistant') continue; + const content = entry?.message?.content; + if (!Array.isArray(content)) break; + const textBlocks = content.filter((c) => c?.type === 'text'); + if (textBlocks.length > 0) { + final_assistant_text = textBlocks.map((b) => b.text).join('\n').trim(); + break; + } + } + } + + // Infer available tools from tool_use entries visible in the session. + // We look through ALL lines in the raw tail (including assistant messages with + // tool_use content blocks) for tool names actually used, as a proxy for + // "tools available". Task 11 may refine this further. 
+ const toolNames = new Set(); + for (const entry of transcriptTail) { + if (entry.type !== 'assistant') continue; + const content = entry?.message?.content; + if (!Array.isArray(content)) continue; + for (const block of content) { + if (block?.type === 'tool_use' && typeof block.name === 'string') { + toolNames.add(block.name); + } + } + } + const tools_available_inferred = [...toolNames].sort(); + + return { + session_id, + attempt, + user_messages, + final_assistant_text, + tools_available_inferred, + raw_tail: transcriptTail, + }; +} + +// --------------------------------------------------------------------------- +// Stdin reader +// --------------------------------------------------------------------------- + +/** + * Reads all of stdin and returns as a string. + * @returns {Promise} + */ +async function readStdin() { + return new Promise((resolve, reject) => { + const chunks = []; + process.stdin.setEncoding('utf8'); + process.stdin.on('data', (chunk) => chunks.push(chunk)); + process.stdin.on('end', () => resolve(chunks.join(''))); + process.stdin.on('error', reject); + }); +} + +// --------------------------------------------------------------------------- +// Main flow +// --------------------------------------------------------------------------- + +async function main() { + const input = await readStdin(); + + let payload; + try { + payload = JSON.parse(input); + } catch (err) { + debug({ msg: 'stdin_parse_error', error: String(err) }, process.cwd()); + process.exit(0); + } + + // ── 1. LOOP GUARD (most important check — runs first, always) ────────────── + if (loopGuard(payload)) { + debug({ msg: 'loop_guard_triggered', session_id: payload.session_id }, payload.cwd); + process.exit(0); + } + + const { session_id, transcript_path } = payload; + // Sanitize cwd from payload before any fs writes — throws on invalid input. + // main() is async, so the throw reaches the unhandledRejection handler, which exits 0 (fail-safe: no inject, no fs ops). + const cwd = sanitizeCwd(payload?.cwd ?? 
process.cwd()); + + // ── 2. ATTEMPT CAP ──────────────────────────────────────────────────────── + const attempts = readAttempts(session_id, cwd); + if (attempts >= MAX_ATTEMPTS) { + debug({ msg: 'attempt_cap_reached', session_id, attempts, max: MAX_ATTEMPTS }, cwd); + process.exit(0); + } + + // ── 3. TRANSCRIPT TAIL ─────────────────────────────────────────────────── + // CC's Stop hook fires before the transcript flush of the final assistant + // turn completes on some paths. Poll briefly for an assistant entry with + // text content to appear before classifying. Bail-out conditions: + // - we already have a usable ctx (assistant text non-empty) + // - poll budget exhausted (~1s total) + let tail = transcript_path ? readTranscriptTail(transcript_path) : []; + let ctx = buildStopContext(payload, tail); + const FLUSH_POLL_MS = 100; + const FLUSH_POLL_MAX = 10; + for (let i = 0; i < FLUSH_POLL_MAX && !ctx.final_assistant_text; i++) { + await new Promise((r) => setTimeout(r, FLUSH_POLL_MS)); + tail = transcript_path ? readTranscriptTail(transcript_path) : []; + ctx = buildStopContext(payload, tail); + } + + // Fail-safe: if we STILL have no final assistant text after polling, the + // transcript probably hasn't flushed and we cannot classify reliably. + // Skip rather than risk a false-positive inject. + if (!ctx.final_assistant_text) { + debug({ msg: 'no_assistant_text_after_poll', session_id, tail_len: tail.length }, cwd); + process.exit(0); + } + + debug( + { + msg: 'stop_received', + session_id, + attempts, + user_msg_count: ctx.user_messages.length, + final_assistant_text_len: ctx.final_assistant_text.length, + tools_available_inferred: ctx.tools_available_inferred, + }, + cwd, + ); + + // ── 5. JUDGE LLM CALL ───────────────────────────────────────────────────── + let verdict; + try { + verdict = await classifyStop(ctx); + } catch (e) { + debug({ msg: 'judge_threw', err: String(e?.message ?? 
e) }, cwd); + verdict = { category: 'API_ERROR', reason: String(e?.message ?? e), confidence: 0 }; + } + + debug({ msg: 'verdict', category: verdict.category, confidence: verdict.confidence }, cwd); + + const nextAttempt = attempts + 1; + const verdictRecord = { + ...verdict, + session_id, + attempt: nextAttempt, + timestamp: new Date().toISOString(), + injected: false, + }; + + // ── 6. INJECT DECISION ─────────────────────────────────────────────────── + if (INJECT_CATEGORIES.has(verdict.category)) { + const fb = buildFeedback(verdict.category, ctx, nextAttempt); + if (fb.shouldInject) { + writeAttemptCounter(session_id, nextAttempt, cwd); + verdictRecord.injected = true; + verdictRecord.feedback_reason = fb.reason; + writeVerdict(session_id, verdictRecord, cwd); + + // Stop hook payload shape (per CC v2.x hookify rule_engine.py + + // empirical test 2026-05-26): { decision: "block", reason }. + // `reason` is the text CC injects as the agent's next-turn instruction + // (rendered as `Stop hook feedback:\n`). `systemMessage` is + // accepted but appears to be dropped from the conversation in v2.1.150, + // so we put the full guidance in `reason`. `hookSpecificOutput` / + // `additionalContext` is rejected as Invalid input by Stop. + const out = { + decision: 'block', + reason: fb.additionalContext || fb.reason, + }; + process.stdout.write(JSON.stringify(out)); + debug({ msg: 'inject_sent', category: verdict.category, attempt: nextAttempt, reason: fb.reason }, cwd); + process.exit(0); + } + } + + // No inject: write verdict, exit clean. 
+ writeVerdict(session_id, verdictRecord, cwd); + debug({ msg: 'no_inject', category: verdict.category, attempt: nextAttempt }, cwd); + process.exit(0); +} + +main(); diff --git a/claude/hooks/hooks.json b/claude/hooks/hooks.json new file mode 100644 index 0000000..7e27ad3 --- /dev/null +++ b/claude/hooks/hooks.json @@ -0,0 +1,8 @@ +{ + "hooks": { + "stop": { + "command": "${CLAUDE_PLUGIN_ROOT}/bin/reflect.mjs", + "timeout": 30000 + } + } +} diff --git a/claude/lib/feedback.mjs b/claude/lib/feedback.mjs new file mode 100644 index 0000000..60c9708 --- /dev/null +++ b/claude/lib/feedback.mjs @@ -0,0 +1,224 @@ +/** + * feedback.mjs — per-category feedback templates for the reflection plugin + * + * Exports: + * buildFeedback(category, ctx, attempt) → { shouldInject, reason, additionalContext } + * INJECT_CATEGORIES — Set of categories where shouldInject may be true + */ + +// --------------------------------------------------------------------------- +// Constants +// --------------------------------------------------------------------------- + +/** Categories that are eligible for injection (before attempt cap). */ +export const INJECT_CATEGORIES = new Set([ + 'summary_drift_stop', + 'tool_available_punt', + 'genuinely_stuck', +]); + +const MAX_INJECT_ATTEMPT = 3; + +// --------------------------------------------------------------------------- +// Helpers +// --------------------------------------------------------------------------- + +/** + * Scan `text` for a sentence that follows a "next step" signal phrase. + * Returns the extracted sentence (trimmed, capped at 200 chars) or null. 
+ * + * @param {string} text + * @returns {string | null} + */ +function extractNextStep(text) { + if (!text) return null; + + // Patterns that signal "the agent named its next step" + const patterns = [ + /next\s+step[s]?\s*[:–—]\s*([^.!?\n]+[.!?]?)/i, + /next[,]?\s+i[''](?:ll|m going to)\s+([^.!?\n]+[.!?]?)/i, + /now\s+i[''](?:ll|m going to)\s+([^.!?\n]+[.!?]?)/i, + /i(?:'ll|'m going to| will| am going to)\s+now\s+([^.!?\n]+[.!?]?)/i, + ]; + + for (const re of patterns) { + const m = text.match(re); + if (m && m[1]) { + return m[1].trim().slice(0, 200); + } + } + return null; +} + +/** + * Turn a tools array into a readable comma-separated string. + * + * @param {string[]} tools + * @returns {string} + */ +function summarizeTools(tools) { + if (!Array.isArray(tools) || tools.length === 0) return ''; + return tools.join(', '); +} + +// --------------------------------------------------------------------------- +// Per-category template builders +// --------------------------------------------------------------------------- + +/** + * @param {string} nextStep - extracted next-step sentence or null + * @param {number} attempt - 1-based + * @returns {{ reason: string, additionalContext: string }} + */ +function templateSummaryDrift(nextStep, attempt) { + const stepClause = nextStep + ? `The next step you named was: "${nextStep}".` + : 'Your last message described a next step but did not execute it.'; + + const base = `You wrote a plan and named a next step. Execute it now. ${stepClause} Do not write another summary or plan — make tool calls, write code, run commands.`; + + if (attempt === 1) { + return { + reason: 'summary_drift: plan written but not executed', + additionalContext: base, + }; + } + if (attempt === 2) { + return { + reason: 'summary_drift: second inject — still not executing', + additionalContext: `Second time triggering this. ${base} Stop narrating. 
Start doing.`, + }; + } + // attempt === 3 + return { + reason: 'summary_drift: final inject before session close', + additionalContext: `Third inject. ${base} If this approach is not working, try a different one. No more plans — next response must be a tool call or code. Session closes after this.`, + }; +} + +/** + * @param {string[]} tools - inferred available tools + * @param {number} attempt - 1-based + * @returns {{ reason: string, additionalContext: string }} + */ +function templateToolAvailablePunt(tools, attempt) { + const toolList = summarizeTools(tools); + + const base = `You have these tools: ${toolList}. Use them yourself instead of asking the user. If you can answer with a tool, answer. Do not ask the user before trying.`; + + if (attempt === 1) { + return { + reason: 'tool_available_punt: agent deferred to user instead of using tools', + additionalContext: base, + }; + } + if (attempt === 2) { + return { + reason: 'tool_available_punt: second inject — still deferring', + additionalContext: `We did this once. You still have those tools: ${toolList}. Use them now. Do not ask — act.`, + }; + } + // attempt === 3 + return { + reason: 'tool_available_punt: final inject before session close', + additionalContext: `Third inject. Tools available: ${toolList}. If this approach keeps failing, try a different tool or strategy. Otherwise the session will close. Make a tool call in your next response.`, + }; +} + +/** + * @param {number} attempt - 1-based + * @returns {{ reason: string, additionalContext: string }} + */ +function templateGenuinelyStuck(attempt) { + const base = + 'You stopped mid-thought. Either: (a) state explicitly what is blocking and what you have tried, OR (b) take the next concrete action — make a tool call, write code, run a command. 
No more silence.'; + + if (attempt === 1) { + return { + reason: 'genuinely_stuck: agent stopped without explicit blocker or action', + additionalContext: base, + }; + } + if (attempt === 2) { + return { + reason: 'genuinely_stuck: second inject — still no action or stated blocker', + additionalContext: `Second time. ${base} If you are blocked, say exactly what is blocking you and what you already tried. If you are not blocked, make a tool call.`, + }; + } + // attempt === 3 + return { + reason: 'genuinely_stuck: final inject before session close', + additionalContext: `Third inject. ${base} If this approach is not working, try a completely different strategy. Next response must be a tool call, code, or an explicit blocker statement. Session closes after this.`, + }; +} + +// --------------------------------------------------------------------------- +// Main export +// --------------------------------------------------------------------------- + +/** + * Build a feedback object for a classified Stop event. + * + * @param {string} category - one of the 9 possible classifier labels + * @param {{ user_messages: string[], final_assistant_text: string, tools_available_inferred: string[], session_id: string }} ctx + * @param {number} attempt - 1-based attempt counter for this session + * @returns {{ shouldInject: boolean, reason: string, additionalContext: string | null }} + */ +export function buildFeedback(category, ctx, attempt) { + // Defense in depth: never inject beyond MAX_INJECT_ATTEMPT even if caller forgot. + if (attempt >= MAX_INJECT_ATTEMPT + 1) { + return { shouldInject: false, reason: 'attempt_cap', additionalContext: null }; + } + + // Categories that never inject. 
+ switch (category) { + case 'complete': + return { shouldInject: false, reason: 'task_complete', additionalContext: null }; + + case 'waiting_for_user_legitimate': + return { shouldInject: false, reason: 'legitimate_user_wait', additionalContext: null }; + + case 'working': + return { shouldInject: false, reason: 'still_working', additionalContext: null }; + + case 'TIMEOUT': + return { shouldInject: false, reason: 'timeout_failsafe', additionalContext: null }; + + case 'PARSE_ERROR': + return { shouldInject: false, reason: 'parse_error_failsafe', additionalContext: null }; + + case 'API_ERROR': + return { shouldInject: false, reason: 'api_error_failsafe', additionalContext: null }; + } + + // Categories that inject. + let tpl; + switch (category) { + case 'summary_drift_stop': { + const nextStep = extractNextStep(ctx?.final_assistant_text ?? ''); + tpl = templateSummaryDrift(nextStep, attempt); + break; + } + + case 'tool_available_punt': { + const tools = ctx?.tools_available_inferred ?? []; + tpl = templateToolAvailablePunt(tools, attempt); + break; + } + + case 'genuinely_stuck': { + tpl = templateGenuinelyStuck(attempt); + break; + } + + default: + // Unknown category — fail safe. + return { shouldInject: false, reason: `unknown_category:${category}`, additionalContext: null }; + } + + return { + shouldInject: true, + reason: tpl.reason, + additionalContext: tpl.additionalContext, + }; +} diff --git a/claude/lib/judge.mjs b/claude/lib/judge.mjs new file mode 100644 index 0000000..d3e3a34 --- /dev/null +++ b/claude/lib/judge.mjs @@ -0,0 +1,310 @@ +/** + * judge.mjs — in-hook LLM classifier for Claude Code Stop hooks. + * + * Exported surface: + * classifyStop(stopContext, opts?) → Promise + * + * stopContext shape (built by buildStopContext in reflect.mjs): + * { session_id, attempt, user_messages, final_assistant_text, + * tools_available_inferred, raw_tail } + * + * Classification shape: + * { category, reason, confidence, raw_text?, usage? 
} + * + * Auth: reads OAuth token from ~/.claude/.credentials.json — no API key needed. + * Net: POST https://api.anthropic.com/v1/messages via global fetch (Node 18+). + * Deps: none (stdlib only). + */ + +import { readFileSync } from 'node:fs'; +import { homedir } from 'node:os'; +import { join } from 'node:path'; + +// --------------------------------------------------------------------------- +// Constants +// --------------------------------------------------------------------------- + +const API_URL = 'https://api.anthropic.com/v1/messages'; +const ANTHROPIC_VERSION = '2023-06-01'; +const ANTHROPIC_BETA = 'oauth-2025-04-20'; +const DEFAULT_MODEL = process.env.REFLECTION_CC_MODEL ?? 'claude-haiku-4-5'; +const DEFAULT_TIMEOUT_MS = 15_000; +const MAX_TOKENS = 250; + +const CATEGORIES = [ + 'complete', + 'waiting_for_user_legitimate', + 'tool_available_punt', + 'summary_drift_stop', + 'genuinely_stuck', + 'working', +]; + +// --------------------------------------------------------------------------- +// Error sanitization +// --------------------------------------------------------------------------- + +/** + * Strips credentials from response bodies / error text before it lands in + * Error.message or debug logs. Truncates to 200 chars. + * + * @param {string} text + * @returns {string} + */ +function sanitizeError(text) { + if (typeof text !== 'string') text = String(text ?? ''); + let s = text; + s = s.replace(/Bearer\s+[^\s"',}]+/gi, 'Bearer '); + s = s.replace(/"authorization"\s*:\s*"[^"]*"/gi, '"authorization":""'); + s = s.replace(/"x-api-key"\s*:\s*"[^"]*"/gi, '"x-api-key":""'); + if (s.length > 200) s = s.slice(0, 200); + return s; +} + +// --------------------------------------------------------------------------- +// Auth +// --------------------------------------------------------------------------- + +/** + * Loads the OAuth access token from ~/.claude/.credentials.json. 
+ * Throws a sentinel error (prefixed "judge:") if the file is missing, + * unreadable, or the token is absent/empty — caller treats this as no-inject. + * + * @returns {string} access token + */ +function loadOAuthToken() { + const credPath = join(homedir(), '.claude', '.credentials.json'); + let raw; + try { + raw = readFileSync(credPath, 'utf8'); + } catch (err) { + throw new Error(`judge: cannot read credentials file: ${err.message}`); + } + + let obj; + try { + obj = JSON.parse(raw); + } catch (err) { + throw new Error(`judge: credentials file is not valid JSON: ${err.message}`); + } + + const token = obj?.claudeAiOauth?.accessToken; + if (!token) { + throw new Error('judge: no claudeAiOauth.accessToken in ~/.claude/.credentials.json'); + } + return token; +} + +// --------------------------------------------------------------------------- +// Prompt +// --------------------------------------------------------------------------- + +/** + * Truncates a string to n characters, appending a truncation note if cut. + * Mirrors the helper in classify-cc-stops.mjs verbatim. + * + * @param {string} s + * @param {number} n + * @returns {string} + */ +function truncate(s, n) { + if (!s) return ''; + if (s.length <= n) return s; + return s.slice(0, n) + `…[truncated ${s.length - n}ch]`; +} + +/** + * Builds the classifier prompt from a stopContext object. + * Prompt content is identical to classify-cc-stops.mjs's buildPrompt(). + * + * @param {object} ctx - stopContext from buildStopContext() + * @returns {string} + */ +function buildPrompt(ctx) { + const userMsgs = (ctx.user_messages ?? []) + .map((m, i) => `[USER ${i + 1}] ${truncate(m, 1200)}`) + .join('\n\n'); + const finalText = truncate(ctx.final_assistant_text ?? '', 2400); + const tools = (ctx.tools_available_inferred ?? []).join(', '); + + return `You classify how a Claude Code assistant ended a turn. Pick ONE category. 
+ +CATEGORIES: +- complete: task is done; assistant delivered the answer or finished the requested work. +- waiting_for_user_legitimate: assistant asks a question that ONLY the user can answer (preference, missing info no tool can fetch). +- tool_available_punt: assistant punts to the user about something the available tools could resolve. The assistant has access to tools like Bash, WebFetch, browser MCP, etc., yet asks the user instead of trying. +- summary_drift_stop: assistant wrote a summary or plan with a "next step" and STOPPED before doing the next step. e.g., "I've created the file. Next step: run the tests." (without running them.) +- genuinely_stuck: assistant stopped mid-thought or without clear conclusion; no question, no summary, just halted. Often short. +- working: rarely a stop; only assign if the final turn is clearly mid-action (e.g., "Running tests now...") with no closure. + +TOOLS THE ASSISTANT HAD: ${tools || '(none recorded)'} + +USER MESSAGES (in order): +${userMsgs || '(none)'} + +FINAL ASSISTANT TEXT: +${finalText} + +Respond ONLY with a JSON object on a single line, no markdown fence, no prose: +{"category": "", "reason": "", "confidence": <0.0-1.0>}`; +} + +// --------------------------------------------------------------------------- +// Response parsing +// --------------------------------------------------------------------------- + +/** + * Strips code fences, finds the first {...} block, and JSON.parses it. + * Validates that category is one of the 6 known values. 
+ * + * @param {string} text - raw text from the model + * @param {object} [usage] - token usage from the API response + * @returns {{ category: string, reason: string, confidence: number, raw_text: string, usage?: object }} + */ +function parseResponse(text, usage) { + let s = text.trim(); + + // Strip code fences if the model added them despite instructions + if (s.startsWith('```')) { + s = s.replace(/^```(?:json)?\s*/i, '').replace(/```\s*$/, '').trim(); + } + + const match = s.match(/\{[\s\S]*\}/); + if (!match) { + return { + category: 'PARSE_ERROR', + reason: `no json found: ${s.slice(0, 100)}`, + confidence: 0, + raw_text: text, + usage, + }; + } + + let obj; + try { + obj = JSON.parse(match[0]); + } catch (err) { + return { + category: 'PARSE_ERROR', + reason: err.message, + confidence: 0, + raw_text: text, + usage, + }; + } + + if (!CATEGORIES.includes(obj.category)) { + return { + category: 'PARSE_ERROR', + reason: `unknown category: ${obj.category}`, + confidence: 0, + raw_text: text, + usage, + }; + } + + return { + category: obj.category, + reason: obj.reason ?? '', + confidence: typeof obj.confidence === 'number' ? obj.confidence : 0, + raw_text: text, + usage, + }; +} + +// --------------------------------------------------------------------------- +// Public API +// --------------------------------------------------------------------------- + +/** + * Classifies a Claude Code Stop event using a judge LLM call. 
+ * + * @param {object} stopContext - built by buildStopContext() in reflect.mjs: + * { session_id, attempt, user_messages, final_assistant_text, + * tools_available_inferred, raw_tail } + * @param {object} [opts] + * @param {string} [opts.model] - override model (default: REFLECTION_CC_MODEL or claude-haiku-4-5) + * @param {number} [opts.timeoutMs] - override timeout in ms (default: 15000) + * @param {AbortSignal} [opts.signal] - external cancellation signal + * @returns {Promise<{ category: string, reason: string, confidence: number, raw_text?: string, usage?: object }>} + */ +export async function classifyStop(stopContext, opts = {}) { + const model = opts.model ?? DEFAULT_MODEL; + const timeoutMs = opts.timeoutMs ?? DEFAULT_TIMEOUT_MS; + + // Load token — throws "judge: ..." on failure (caller treats as no-inject) + let token; + try { + token = loadOAuthToken(); + } catch (err) { + throw err; // already prefixed with "judge:" + } + + const prompt = buildPrompt(stopContext); + + const body = JSON.stringify({ + model, + max_tokens: MAX_TOKENS, + system: 'You are a precise classifier. 
Output JSON only.', + messages: [{ role: 'user', content: prompt }], + }); + + // Compose abort signal: hard timeout + optional caller signal + const timeoutController = new AbortController(); + const timerId = setTimeout(() => timeoutController.abort(), timeoutMs); + + // Merge caller signal if provided + let signal = timeoutController.signal; + if (opts.signal) { + // If either aborts, abort both + opts.signal.addEventListener('abort', () => timeoutController.abort(), { once: true }); + // We still use timeoutController.signal — it fires on timeout OR on opts.signal abort + } + + let res; + try { + res = await fetch(API_URL, { + method: 'POST', + headers: { + 'anthropic-version': ANTHROPIC_VERSION, + 'anthropic-beta': ANTHROPIC_BETA, + 'authorization': `Bearer ${token}`, + 'content-type': 'application/json', + }, + body, + signal, + }); + } catch (err) { + clearTimeout(timerId); + if (timeoutController.signal.aborted) { + return { + category: 'TIMEOUT', + reason: `judge call exceeded ${timeoutMs}ms`, + confidence: 0, + }; + } + throw new Error(`judge: fetch failed: ${sanitizeError(err.message)}`); + } finally { + clearTimeout(timerId); + } + + if (!res.ok) { + let body; + try { body = await res.text(); } catch { body = ''; } + throw new Error(`judge: api ${res.status}: ${sanitizeError(body)}`); + } + + let json; + try { + json = await res.json(); + } catch (err) { + throw new Error(`judge: failed to parse api response: ${sanitizeError(err.message)}`); + } + + const rawText = json.content?.[0]?.text ?? ''; + const usage = json.usage + ? 
{ input_tokens: json.usage.input_tokens, output_tokens: json.usage.output_tokens } + : undefined; + + return parseResponse(rawText, usage); +} diff --git a/claude/package.json b/claude/package.json new file mode 100644 index 0000000..0ce761c --- /dev/null +++ b/claude/package.json @@ -0,0 +1,18 @@ +{ + "name": "@opencode-plugins/reflection-cc", + "version": "0.1.0", + "description": "Re-prompts Claude Code when it stops prematurely due to failure modes like summary-drift-stop or tool-available-punt", + "type": "module", + "main": "bin/reflect.mjs", + "bin": { + "reflect": "bin/reflect.mjs" + }, + "engines": { + "node": ">=18" + }, + "scripts": { + "test:e2e": "node test/e2e-cc.mjs" + }, + "author": "dzianisv", + "license": "MIT" +} diff --git a/claude/test/e2e-cc.mjs b/claude/test/e2e-cc.mjs new file mode 100644 index 0000000..538b2d2 --- /dev/null +++ b/claude/test/e2e-cc.mjs @@ -0,0 +1,523 @@ +#!/usr/bin/env node +/** + * e2e-cc.mjs — End-to-end test runner for the Claude Code reflection plugin. + * + * Spawns a real `claude -p` headless session per scenario, loads this plugin + * via `--plugin-dir`, uses Haiku 4.5 as the subject model (cheap), captures + * the resulting transcript, and runs a judge LLM (Haiku via OAuth direct) + * to verify the expected inject behavior happened (or did not happen). + * + * NO MOCKS. NO STUBS. Real CC, real plugin install, real judge. + * + * Usage: + * node claude/test/e2e-cc.mjs # run all scenarios + * node claude/test/e2e-cc.mjs --scenario 1 # run one + * node claude/test/e2e-cc.mjs --keep # keep sandbox dirs + * node claude/test/e2e-cc.mjs --evidence-dir D # override evidence dir + * + * Cost: ~$0.15-0.30 per scenario (Haiku subject + Haiku judge). User's Max + * subscription via OAuth Bearer in ~/.claude/.credentials.json. 
+ */ + +import { spawnSync } from "node:child_process"; +import { readFileSync, writeFileSync, mkdirSync, existsSync, readdirSync, statSync, rmSync } from "node:fs"; +import { homedir, tmpdir } from "node:os"; +import { join, dirname, resolve as resolvePath } from "node:path"; +import { fileURLToPath } from "node:url"; + +const __dirname = dirname(fileURLToPath(import.meta.url)); +const PLUGIN_DIR = resolvePath(__dirname, ".."); +const REPO_ROOT = resolvePath(__dirname, "..", ".."); + +const args = parseArgs(process.argv.slice(2)); +const ONLY = args.scenario != null ? parseInt(args.scenario, 10) : null; +const KEEP = !!args.keep; +const EVIDENCE_DIR = args["evidence-dir"] ?? join(REPO_ROOT, ".tasks", "131-cc-reflection", "smoke-evidence"); + +// -------------------------------------------------------------------------- +// OAuth token for judge calls (NOT for the subject session — claude -p uses +// its own auth path via the credentials file). +// -------------------------------------------------------------------------- + +function loadOAuthToken() { + const path = join(homedir(), ".claude", ".credentials.json"); + try { + const obj = JSON.parse(readFileSync(path, "utf8")); + return obj.claudeAiOauth?.accessToken; + } catch (e) { + throw new Error(`E2E: cannot load OAuth token from ${path}: ${e.message}`); + } +} + +const TOKEN = loadOAuthToken(); + +// -------------------------------------------------------------------------- +// Scenarios +// -------------------------------------------------------------------------- + +const SCENARIOS = [ + { + id: 1, + name: "explicit_wait_negative", + // User EXPLICITLY tells the agent to wait. Plugin must classify as + // waiting_for_user_legitimate and NOT inject. This is a negative test + // for false positives — the plugin must not "push past" a legitimate + // user-requested wait. 
+ prompt: + "Three small tasks: " + + "1) write config.json {\"name\":\"demo\"}, " + + "2) write src/index.js that prints config.name, " + + "3) write README.md. " + + "After step 1, write a one-line summary of what's next, then STOP without doing step 2. " + + "Wait until I say 'go' before continuing.", + expectsInject: false, + expectedCategory: "waiting_for_user_legitimate", + judgeQuestion: + "The user explicitly said 'wait until I say go'. The plugin MUST NOT inject — that would override a legitimate user-requested wait. " + + "If INJECT COUNT = 0, PASS. " + + "If INJECT COUNT > 0, FAIL — false positive on a legitimate user-requested wait.", + maxBudget: 0.3, + }, + { + id: 2, + name: "complete_negative", + prompt: "What is 2 + 2? Reply with just the number.", + expectsInject: false, + expectedCategory: "complete", + judgeQuestion: + "Did the plugin AVOID injecting any additionalContext? " + + "The expected behavior is the agent answers '4' and no inject occurs. " + + "An inject here would be a false positive — the worst failure mode for the plugin.", + maxBudget: 0.2, + }, + { + id: 3, + name: "attempt_cap_respected", + prompt: + "Create a file called demo.txt with the single word 'hello'. " + + "Then create another file called demo2.txt with 'world'. " + + "Then create demo3.txt with 'done'. " + + "After each file, just briefly say what you did and what's next, " + + "without doing the next step yet — wait for me to confirm.", + expectsInject: true, + expectedCategory: "summary_drift_stop", + judgeQuestion: + "Did the plugin inject AT MOST 3 times in this transcript (counting additionalContext occurrences)? " + + "The attempt cap is 3 — more than 3 means the cap is broken. 
" + + "Zero injects means the classifier missed the pattern but the cap test is vacuously satisfied — note in your reasoning.", + maxBudget: 0.8, + }, +]; + +// -------------------------------------------------------------------------- +// Scenario 4 is a direct-pipe scenario: we feed reflect.mjs a synthetic +// transcript that the classifier WILL judge as summary_drift_stop, then +// verify the full inject pathway end-to-end (real classifier call, real +// stdout JSON, real CC payload shape). No claude -p invocation. +// -------------------------------------------------------------------------- + +function runDirectPipeScenario() { + const id = 4; + const name = "direct_pipe_summary_drift"; + const sandbox = join(tmpdir(), "cc-reflect-e2e", `s${id}-${Date.now()}`); + mkdirSync(sandbox, { recursive: true, mode: 0o700 }); + const evidenceDir = join(EVIDENCE_DIR, `scenario-${id}-${name}`); + mkdirSync(evidenceDir, { recursive: true }); + + process.stderr.write(`\n[s${id}] ${name}\n`); + process.stderr.write(` sandbox : ${sandbox}\n`); + process.stderr.write(` evidence : ${evidenceDir}\n`); + + // Synthetic transcript JSONL with a textbook summary-drift final turn. + const fakeSessionId = "test-" + Date.now(); + const tFile = join(sandbox, `transcript-${fakeSessionId}.jsonl`); + const entries = [ + { type: "user", uuid: "u1", sessionId: fakeSessionId, message: { role: "user", content: "Write a Python factorial function and unit-test it. Run the tests." } }, + { type: "assistant", uuid: "a1", sessionId: fakeSessionId, message: { role: "assistant", content: [{ type: "text", text: "I've created factorial.py and test_factorial.py. Next step: run `python -m pytest test_factorial.py -v` to verify the tests pass." 
}] } }, + ]; + writeFileSync(tFile, entries.map(e => JSON.stringify(e)).join("\n") + "\n"); + + const payload = { + session_id: fakeSessionId, + transcript_path: tFile, + cwd: sandbox, + hook_event_name: "Stop", + response: "I've created factorial.py and test_factorial.py. Next step: run `python -m pytest test_factorial.py -v` to verify the tests pass.", + stop_hook_active: false, + }; + + const startTime = Date.now(); + const result = spawnSync("node", [join(PLUGIN_DIR, "bin", "reflect.mjs")], { + input: JSON.stringify(payload), + cwd: sandbox, + timeout: 30_000, + encoding: "utf8", + env: { ...process.env, REFLECTION_CC_DEBUG: "1" }, + }); + const elapsed = ((Date.now() - startTime) / 1000).toFixed(1); + + writeFileSync(join(evidenceDir, "stdin.json"), JSON.stringify(payload, null, 2)); + writeFileSync(join(evidenceDir, "stdout.txt"), result.stdout ?? ""); + writeFileSync(join(evidenceDir, "stderr.txt"), result.stderr ?? ""); + + // Copy reflection state + const reflDir = join(sandbox, ".reflection"); + if (existsSync(reflDir)) { + for (const f of readdirSync(reflDir)) { + try { writeFileSync(join(evidenceDir, f), readFileSync(join(reflDir, f))); } catch {} + } + } + + let stdout = {}; + try { stdout = JSON.parse(result.stdout ?? "{}"); } catch {} + + // Verify the emitted JSON matches CC's Stop hook schema exactly. 
+ const hasDecision = stdout.decision === "block"; + const hasReason = typeof stdout.reason === "string" && stdout.reason.length > 0; + const noHookSpecificOutput = !("hookSpecificOutput" in stdout); + + let verdict = "FAIL"; + let reason; + if (result.status !== 0) { + reason = `reflect.mjs exited non-zero: ${result.status}`; + } else if (!hasDecision) { + reason = "stdout missing decision:'block'"; + } else if (!hasReason) { + reason = "stdout missing non-empty reason"; + } else if (!noHookSpecificOutput) { + reason = "stdout contains hookSpecificOutput (would be rejected by CC Stop hook)"; + } else { + verdict = "PASS"; + reason = `valid Stop hook block emitted: reason=${stdout.reason.slice(0, 100)}`; + } + + process.stderr.write(` exit=${result.status} elapsed=${elapsed}s\n`); + process.stderr.write(` verdict : ${verdict} — ${reason}\n`); + + writeFileSync(join(evidenceDir, "verdict.json"), JSON.stringify({ + scenario: name, + verdict, + reason, + expected_schema: { decision: "block", reason: "", "no-hookSpecificOutput": true }, + actual_stdout: stdout, + exit_code: result.status, + elapsed_s: elapsed, + }, null, 2)); + + if (!KEEP) { + try { rmSync(sandbox, { recursive: true, force: true }); } catch {} + } + + return { + scenario: name, + expectsInject: true, + injects: hasDecision && hasReason ? 
1 : 0, + verdict, + reason, + elapsed_s: elapsed, + }; +} + +// -------------------------------------------------------------------------- +// Run a single scenario +// -------------------------------------------------------------------------- + +function transcriptPathForCwd(cwd) { + // CC encodes cwd as the project slug — replace / with - + const slug = "-" + cwd.replace(/^\//, "").replace(/\//g, "-"); + const dir = join(homedir(), ".claude", "projects", slug); + if (!existsSync(dir)) return null; + // Pick the most-recently-modified JSONL + const files = readdirSync(dir) + .filter(f => f.endsWith(".jsonl")) + .map(f => ({ f, m: statSync(join(dir, f)).mtimeMs })) + .sort((a, b) => b.m - a.m); + return files.length ? join(dir, files[0].f) : null; +} + +function countInjects(transcriptPath) { + if (!transcriptPath || !existsSync(transcriptPath)) return 0; + const lines = readFileSync(transcriptPath, "utf8").split("\n").filter(Boolean); + let count = 0; + for (const line of lines) { + let obj; + try { obj = JSON.parse(line); } catch { continue; } + // Stop hook block: CC writes a hook_blocking_error attachment AND an + // isMeta user message "Stop hook feedback: ". + if (obj.type === "attachment" && obj.attachment?.type === "hook_blocking_error" && obj.attachment?.hookName === "Stop") count++; + } + return count; +} + +function extractInjectMessages(transcriptPath) { + if (!transcriptPath || !existsSync(transcriptPath)) return []; + const lines = readFileSync(transcriptPath, "utf8").split("\n").filter(Boolean); + const out = []; + for (const line of lines) { + let obj; + try { obj = JSON.parse(line); } catch { continue; } + if (obj.type === "attachment" && obj.attachment?.type === "hook_blocking_error" && obj.attachment?.hookName === "Stop") { + const be = obj.attachment.blockingError; + out.push(typeof be === "string" ? be : (be?.blockingError ?? 
JSON.stringify(be))); + } + } + return out; +} + +async function runJudge({ scenario, transcript, transcriptPath }) { + const prompt = `You evaluate whether a Claude Code reflection plugin worked correctly in an end-to-end test. + +SCENARIO: ${scenario.name} +TASK GIVEN TO THE AGENT: ${scenario.prompt} +EXPECTED BEHAVIOR: ${scenario.expectsInject ? "Plugin SHOULD inject" : "Plugin should NOT inject"} +EXPECTED CATEGORY (if inject expected): ${scenario.expectedCategory} + +JUDGE QUESTION: ${scenario.judgeQuestion} + +TRANSCRIPT EXCERPT (last entries): +${transcript.slice(-12000)} + +INJECT COUNT (attachment entries with type=hook_blocking_error and hookName=Stop): ${countInjects(transcriptPath)} +INJECT MESSAGES: ${JSON.stringify(extractInjectMessages(transcriptPath).slice(0, 3))} + +Respond ONLY with a JSON object on a single line, no markdown fence: +{"verdict": "PASS" | "FAIL" | "INCONCLUSIVE", "reason": "", "evidence": ""}`; + + const res = await fetch("https://api.anthropic.com/v1/messages", { + method: "POST", + headers: { + "anthropic-version": "2023-06-01", + "anthropic-beta": "oauth-2025-04-20", + "authorization": `Bearer ${TOKEN}`, + "content-type": "application/json", + }, + body: JSON.stringify({ + model: "claude-haiku-4-5", + max_tokens: 400, + system: "You are a precise test judge. Output JSON only.", + messages: [{ role: "user", content: prompt }], + }), + }); + if (!res.ok) { + const t = await res.text(); + throw new Error(`judge api ${res.status}: ${t.slice(0, 200)}`); + } + const json = await res.json(); + const text = json.content?.[0]?.text ?? 
""; + const match = text.match(/\{[\s\S]*\}/); + if (!match) return { verdict: "INCONCLUSIVE", reason: `judge returned non-json: ${text.slice(0, 100)}`, evidence: "" }; + try { + return JSON.parse(match[0]); + } catch (e) { + return { verdict: "INCONCLUSIVE", reason: `judge parse error: ${e.message}`, evidence: "" }; + } +} + +function runScenario(scenario) { + const sandboxParent = join(tmpdir(), "cc-reflect-e2e"); + mkdirSync(sandboxParent, { recursive: true }); + const sandbox = join(sandboxParent, `s${scenario.id}-${Date.now()}`); + mkdirSync(sandbox, { recursive: true, mode: 0o700 }); + + const evidenceDir = join(EVIDENCE_DIR, `scenario-${scenario.id}-${scenario.name}`); + mkdirSync(evidenceDir, { recursive: true }); + + process.stderr.write(`\n[s${scenario.id}] ${scenario.name}\n`); + process.stderr.write(` sandbox : ${sandbox}\n`); + process.stderr.write(` evidence : ${evidenceDir}\n`); + process.stderr.write(` prompt : ${scenario.prompt.slice(0, 100)}...\n`); + + // Install the hook via --settings (the --plugin-dir path doesn't enable + // Stop hooks in headless -p mode — verified 2026-05-26). Inline settings + // points at this plugin's bin/reflect.mjs absolute path. 
+ const settings = { + hooks: { + Stop: [ + { + hooks: [ + { type: "command", command: join(PLUGIN_DIR, "bin", "reflect.mjs"), timeout: 30 }, + ], + }, + ], + }, + }; + const settingsPath = join(sandbox, ".reflect-settings.json"); + writeFileSync(settingsPath, JSON.stringify(settings)); + + const startTime = Date.now(); + const claudeArgs = [ + "-p", + "--settings", settingsPath, + "--model", "haiku", + "--output-format", "json", + "--max-budget-usd", String(scenario.maxBudget), + "--dangerously-skip-permissions", + scenario.prompt, + ]; + + const result = spawnSync("claude", claudeArgs, { + cwd: sandbox, + timeout: 300_000, + encoding: "utf8", + env: { ...process.env, REFLECTION_CC_DEBUG: "1" }, + }); + + const elapsed = ((Date.now() - startTime) / 1000).toFixed(1); + process.stderr.write(` exit=${result.status} elapsed=${elapsed}s\n`); + + // Save the headless result + writeFileSync(join(evidenceDir, "claude-stdout.json"), result.stdout ?? ""); + writeFileSync(join(evidenceDir, "claude-stderr.txt"), result.stderr ?? ""); + + // Locate the transcript file + const transcriptPath = transcriptPathForCwd(sandbox); + if (transcriptPath && existsSync(transcriptPath)) { + const tr = readFileSync(transcriptPath, "utf8"); + writeFileSync(join(evidenceDir, "transcript.jsonl"), tr); + } else { + process.stderr.write(` WARN: no transcript file found for cwd=${sandbox}\n`); + } + + // Copy reflection state + const reflDir = join(sandbox, ".reflection"); + if (existsSync(reflDir)) { + const entries = readdirSync(reflDir); + for (const e of entries) { + try { + const src = join(reflDir, e); + const buf = readFileSync(src); + writeFileSync(join(evidenceDir, e), buf); + } catch {} + } + } + + return { scenario, result, sandbox, transcriptPath, evidenceDir, elapsed }; +} + +async function main() { + const allScenarios = [...SCENARIOS, { id: 4, name: "direct_pipe_summary_drift", _direct: true }]; + const toRun = ONLY ? 
allScenarios.filter(s => s.id === ONLY) : allScenarios; + if (toRun.length === 0) { + process.stderr.write(`No scenario with id ${ONLY}\n`); + process.exit(2); + } + + process.stderr.write(`E2E runner — ${toRun.length} scenario(s), plugin at ${PLUGIN_DIR}\n`); + + const summary = []; + for (const scenario of toRun) { + if (scenario._direct) { + summary.push(runDirectPipeScenario()); + continue; + } + const run = runScenario(scenario); + const transcript = run.transcriptPath && existsSync(run.transcriptPath) + ? readFileSync(run.transcriptPath, "utf8") + : ""; + + // First check: did claude -p succeed AT ALL? If the subject session + // errored out (model alias wrong, auth fail, budget hit before any turn, + // etc.) we must NOT let the judge declare PASS on an empty transcript. + let claudeOutcome = {}; + try { claudeOutcome = JSON.parse(run.result.stdout ?? "{}"); } catch {} + const subjectErrored = run.result.status !== 0 || claudeOutcome.is_error === true || claudeOutcome.subtype?.startsWith("error_"); + if (subjectErrored) { + const reason = `subject session errored: status=${run.result.status} subtype=${claudeOutcome.subtype} result=${(claudeOutcome.result || "").slice(0, 200)}`; + process.stderr.write(` ERROR: ${reason}\n`); + const verdict = { verdict: "INCONCLUSIVE", reason, evidence: "" }; + writeFileSync(join(run.evidenceDir, "verdict.json"), JSON.stringify({ + scenario: scenario.name, + expectsInject: scenario.expectsInject, + injects: countInjects(run.transcriptPath), + elapsed_s: run.elapsed, + judge: verdict, + subject_outcome: claudeOutcome, + }, null, 2)); + summary.push({ + scenario: scenario.name, + expectsInject: scenario.expectsInject, + injects: countInjects(run.transcriptPath), + verdict: verdict.verdict, + reason: verdict.reason, + elapsed_s: run.elapsed, + }); + if (!KEEP) { + try { rmSync(run.sandbox, { recursive: true, force: true }); } catch {} + } + continue; + } + + let verdict = { verdict: "INCONCLUSIVE", reason: "no transcript", 
evidence: "" }; + if (transcript) { + try { + verdict = await runJudge({ scenario, transcript, transcriptPath: run.transcriptPath }); + } catch (e) { + verdict = { verdict: "INCONCLUSIVE", reason: `judge error: ${e.message}`, evidence: "" }; + } + } + + const injects = countInjects(run.transcriptPath); + process.stderr.write(` judge : ${verdict.verdict} — ${verdict.reason}\n`); + process.stderr.write(` injects : ${injects}\n`); + + writeFileSync(join(run.evidenceDir, "verdict.json"), JSON.stringify({ + scenario: scenario.name, + expectsInject: scenario.expectsInject, + injects, + elapsed_s: run.elapsed, + judge: verdict, + }, null, 2)); + + summary.push({ + scenario: scenario.name, + expectsInject: scenario.expectsInject, + injects, + verdict: verdict.verdict, + reason: verdict.reason, + elapsed_s: run.elapsed, + }); + + if (!KEEP) { + try { rmSync(run.sandbox, { recursive: true, force: true }); } catch {} + } + } + + // Write summary + const summaryPath = join(EVIDENCE_DIR, "SUMMARY.md"); + let md = `# E2E run summary\n\nDate: ${new Date().toISOString()}\nPlugin: ${PLUGIN_DIR}\nSubject model: claude-haiku-4-5 via \`claude -p\`\nJudge model: claude-haiku-4-5 via OAuth direct\n\n| # | Scenario | Expects inject | Injects | Verdict | Elapsed | Reason |\n|---|----------|---------------|---------|---------|---------|--------|\n`; + for (let i = 0; i < summary.length; i++) { + const s = summary[i]; + md += `| ${i + 1} | ${s.scenario} | ${s.expectsInject ? "yes" : "no"} | ${s.injects} | **${s.verdict}** | ${s.elapsed_s}s | ${s.reason} |\n`; + } + writeFileSync(summaryPath, md); + + process.stderr.write(`\nSummary written: ${summaryPath}\n`); + + const failed = summary.filter(s => s.verdict === "FAIL").length; + const inconclusive = summary.filter(s => s.verdict === "INCONCLUSIVE").length; + process.stderr.write(`PASS: ${summary.filter(s => s.verdict === "PASS").length} FAIL: ${failed} INCONCLUSIVE: ${inconclusive}\n`); + + process.exit(failed > 0 ? 
1 : 0); +} + +function parseArgs(argv) { + const out = {}; + for (let i = 0; i < argv.length; i++) { + if (argv[i].startsWith("--")) { + const key = argv[i].slice(2); + const next = argv[i + 1]; + if (next && !next.startsWith("--")) { + out[key] = next; + i++; + } else { + out[key] = true; + } + } + } + return out; +} + +main().catch(e => { + process.stderr.write(`FATAL: ${e.stack}\n`); + process.exit(2); +}); diff --git a/evals/datasets/README.md b/evals/datasets/README.md new file mode 100644 index 0000000..7a6532e --- /dev/null +++ b/evals/datasets/README.md @@ -0,0 +1,67 @@ +# CC Stop Classification Datasets + +This directory holds datasets used by the Claude Code reflection plugin (`claude/`) and the eval suite (`evals/stuck-detection.yaml`). + +## Files + +| File | Source | Tracked? | Description | +|------|--------|----------|-------------| +| `cc-stop-candidates-raw.jsonl` | `evals/scripts/mine-cc-stops.mjs` | **gitignored** | Every Stop boundary mined from `~/.claude/projects/**/*.jsonl`. Contains real user session content. | +| `cc-stop-candidates-filtered.jsonl` | `evals/scripts/filter-cc-stops.mjs` | **gitignored** | Heuristic-filtered subset of the raw set. Tagged with `hint:summary_drift / hint:punt / hint:stuck / hint:question`. | +| `cc-stop-classified.jsonl` | `evals/scripts/classify-cc-stops.mjs` | **gitignored** | LLM-classified (Claude Haiku 4.5 via Anthropic OAuth) into 6 categories. | +| `cc-stop-labeled-gold-redacted.jsonl` | `evals/scripts/audit-cc-classifications.mjs` + supervisor audit | **tracked** | Stratified sample (~30 records), PII/secrets redacted, supervisor-audited `gold_label` per record. Used as few-shot source for the eval prompt and as the gold set for `evals/stuck-detection.yaml`. | + +## Categories (v1) + +| Category | Inject? | Description | +|----------|---------|-------------| +| `complete` | no | Task done. | +| `working` | no | (rare at Stop) agent narrated mid-action. 
| +| `waiting_for_user_legitimate` | no | Agent legitimately needs user input. | +| `tool_available_punt` | **yes** | Agent asked user when tools could resolve. | +| `summary_drift_stop` | **yes** | Agent wrote a plan with a "next step", stopped before doing it. | +| `genuinely_stuck` | **yes** | Agent halted mid-thought, no question, no plan. | + +## Baseline distribution (n=907) + +From the 2026-05-25 classification run over the user's 14 active CC project transcripts: + +``` +working 374 (41%) +complete 261 (29%) +waiting_for_user_legitimate 210 (23%) +summary_drift_stop 35 (4%) +genuinely_stuck 27 (3%) +tool_available_punt 0 (0%) +``` + +## Known issues + +1. **`working` over-assignment (374 = 41%)**: at Stop time the agent is by definition not working; the classifier likely confuses just-finished-action summaries with "working". Tracked: follow-up issue. +2. **`tool_available_punt` under-assignment (0)**: the heuristic filter found 26 candidates, but the classifier reassigned all of them. Either (a) the pattern is genuinely rare in this user's sessions, or (b) the prompt doesn't surface it. Tracked: follow-up issue. + +## Redaction rules applied to the committed gold file + +- emails → `` +- bearer tokens, `sk-ant-*`, `ghp_*`, `gho_*`, long secret-shaped strings → `` / `` +- absolute `/home//...` paths → `/...` +- `github.com//` refs → `github.com//` +- `project_slug`, `session_id` → `` / `` + +UUIDs and short hex strings (≤ 60 chars matching `^[0-9a-f]{32,64}$`) are preserved as they don't leak useful info. + +## Reproducing + +```bash +# 1. Mine +node evals/scripts/mine-cc-stops.mjs + +# 2. Filter +node evals/scripts/filter-cc-stops.mjs + +# 3. Classify (requires ~/.claude/.credentials.json with OAuth token) +node evals/scripts/classify-cc-stops.mjs + +# 4.
Build redacted audit sample +node evals/scripts/audit-cc-classifications.mjs --per-cat 8 +``` diff --git a/evals/datasets/cc-stop-labeled-gold-redacted.jsonl b/evals/datasets/cc-stop-labeled-gold-redacted.jsonl new file mode 100644 index 0000000..0cfc874 --- /dev/null +++ b/evals/datasets/cc-stop-labeled-gold-redacted.jsonl @@ -0,0 +1,30 @@ +{"project_slug":"","session_id":"","stop_index":7,"timestamp":"2026-05-23T08:19:36.360Z","user_messages":["/clear\n clear\n ","OPENCLAW-BOX-BOT-54 - BudgetExceededError: ExceededBudget: User=ocb-3 over budget. Spend=25.06484909999997, Budget=25.0\n\naddress that","feat (litellm): local faster-whisper transcription fallback #1381\n\n19h\n\n83\n\n× Checks failed\n\nfix(browser): auto-recover hung Chrome without full gateway restart #1380\n\nChecks\n\nfeat(growth-manager): add seo-audit + growth-hacking skills, drop vibetech-private skill #1372\n\nChecks\n\n\nCreate a sub-agent to review and address issues of this PR and then merge it. Make sure to tell the issues resolved at CI checks pass.","fix(browser): auto-recover hung Chrome without full gateway restart #1380\n\nChecks\n\nfeat(growth-manager): add seo-audit + growth-hacking skills, drop vibetech-private skill #1372\n\n","Review all the GitHub issues we have, all not GitHub-centric issues and address them, resolve them. Let's keep our sentry clear. Please recommend everything in GitHub issues so we can hook up next time if we need it.","This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n - **Fix BudgetExceededError** (Sentry OPENCLAW-BOX-BOT-54): LiteLLM's `BudgetExceededError: ExceededBudget: User=ocb-3 over budget. 
Spend=25.06..., Budget=25.0` was surfacing as a generic error with no topup keyboard instead of the credits-exhausted message.\n - **Review and merge PR #1381** (faster-whisper fallback): Had failing CI check (webapp-e2e CANCELLED), needed sub-agent review + fix before merge.\n - **Review and merge PR #1380 and #1372** in parallel: Both had passing checks, needed review before merge.\n - **Review all open GitHub issues**, address non-GitHub-centric ones, resolve them, keep Sentry clear, and document recommendations in GitHub issues for future hookup.\n\n2. Key Technical Concepts:\n - **LiteLLM BudgetExceededError**: Python exception that leaks as response text (not HTTP 429 body), bypassing existing `isBudgetExhausted` check\n - **GatewayError flow**: `GATEWAY_ERROR_RESPONSE_EXACT/PREFIX` regex patterns detect leaked error text → throw as `GatewayError` → `handleProxyError` handles\n - **`isBudgetExhausted()`**: Checks response body/diagnostics string for budget exhaustion keywords\n - **`errDiagnostics`**: Concatenated error chain string used for error routing in `handleProxyError`\n - **faster-whisper-server**: `ghcr.io/fedirz/faster-whisper-server:latest-cpu` — OpenAI-compatible audio transcription, fallback for Azure Whisper quota exhaustion\n - **LiteLLM fallback chain**: `{\"whisper-1\": [\"whisper-1-local\"]}` in `litellm_settings.fallbacks`\n - **Chrome liveness probes**: `httpGet /json/version` on port 9222 correctly detects zombie Chrome (tcpSocket would pass even on deadlocked Chrome)\n - **GitHub CI workflow**: `webapp-e2e` uses concurrency group `webapp-e2e-shared-account` with `cancel-in-progress: true` — CANCELLED status counts as \"failed\" in PR checks\n - **Caveman mode**: Active at \"full\" level — terse communication, drop articles/filler\n - **code-review-graph MCP**: Must be used BEFORE Grep/Glob/Read per CLAUDE.md\n\n3. 
Files and Code Sections:\n - **`src/gateway/client.ts`** (modified, uncommitted)\n - Added `BudgetExceededError|ExceededBudget` to leaked-error regex patterns so these strings get thrown as `GatewayError` instead of returned as response text\n ```typescript\n const GATEWAY_ERROR_RESPONSE_EXACT =\n /^\\s*(Unknown error|LLM request timed out|request timed out|model overloaded|service unavailable|internal server error|rate limit exceeded|context deadline exceeded|BudgetExceededError|ExceededBudget)\\s*\\.?\\s*$/i;\n const GATEWAY_ERROR_RESPONSE_PREFIX =\n /^\\s*(Unknown error|LLM request timed out|request timed out|model overloaded|service unavailable|internal server error|rate limit exceeded|context deadline exceeded|BudgetExceededError|ExceededBudget)\\s*\\.?\\s*\\n/i;\n ```\n - **`src/handlers/gateway-error-guidance.ts`** (modified, uncommitted)\n - Extended `isBudgetExhausted()` to match LiteLLM's error format\n ```typescript\n export function isBudgetExhausted(responseBody?: string): boolean {\n if (!responseBody) return false;\n const lower = responseBody.toLowerCase();\n return (\n lower.includes(\"budget_exceeded\") ||\n lower.includes(\"exceededbudget\") ||\n lower.includes(\"exceeded budget\") ||\n lower.includes(\"max budget\") ||\n lower.includes(\"budgetexceedederror\") ||\n lower.includes(\"over budget\")\n );\n }\n ```\n - **`src/handlers/telegram-proxy.ts`** (modified, uncommitted)\n - Added `else if (isBudgetExhausted(errDiagnostics))` branch before generic error handler so budget errors arriving as text (no HTTP statusCode) still show topup keyboard\n - Linter also added `isConnectTimeout` import + `isTransientConnectTimeout` logic to demote connect-timeout error…[truncated]","This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n\nThe session resumed from a prior conversation. 
The user's active requests were:\n- **Fix pending BudgetExceededError commit**: Three files were modified but uncommitted from the previous session — commit them.\n- **Review all open GitHub issues**: Address non-GitHub-centric issues in code, resolve them (close), keep Sentry clear, and document recommendations in GitHub issues for future reference.\n\nThe specific user message driving ongoing work: \"Review all the GitHub issues we have, all not GitHub-centric issues and address them, resolve them. Let's keep our sentry clear. Please recommend everything in GitHub issues so we can hook up next time if we need it.\"\n\n2. Key Technical Concepts:\n- **LiteLLM BudgetExceededError**: Python exception that leaks into gateway response text (not HTTP 429), requiring three-layer detection fix\n- **Caveman mode**: Active at \"full\" level — terse communication, drop articles/filler\n- **code-review-graph MCP**: Must be used BEFORE Grep/Glob/Read per CLAUDE.md\n- **`pushSchema` migration pattern**: `src/db/client.ts` contains idempotent `CREATE TABLE IF NOT EXISTS` + `CREATE INDEX IF NOT EXISTS` statements applied at startup — the DB migration system for this project\n- **Drizzle ORM**: Schema defined in `src/db/schema.ts`, pushed via `pushSchema()` in `src/db/client.ts`\n- **Partial unique indexes**: Postgres `WHERE status = 'active'` partial index to prevent duplicate active subscriptions\n- **Payment methods**: \"stars\", \"wallet_pay\", \"crypto\", \"stripe\" — different charge IDs stored in different columns\n- **K8s IngressRoute (Traefik)**: Used for routing webhook traffic to bot ports (3001/3002/3003/3004)\n- **`pg_advisory_xact_lock`**: Already used for crypto payment deduplication in `activateSubscription`\n- **grammY bot framework**: Telegram bot framework used throughout\n\n3. 
Files and Code Sections:\n\n- **`src/gateway/client.ts`** (committed in BudgetExceededError fix)\n - Added `BudgetExceededError|ExceededBudget` to two regex patterns:\n ```typescript\n const GATEWAY_ERROR_RESPONSE_EXACT =\n /^\\s*(Unknown error|LLM request timed out|...|BudgetExceededError|ExceededBudget)\\s*\\.?\\s*$/i;\n const GATEWAY_ERROR_RESPONSE_PREFIX =\n /^\\s*(Unknown error|LLM request timed out|...|BudgetExceededError|ExceededBudget)\\s*\\.?\\s*\\n/i;\n ```\n\n- **`src/handlers/gateway-error-guidance.ts`** (committed in BudgetExceededError fix)\n - Extended `isBudgetExhausted()` with LiteLLM format strings:\n ```typescript\n export function isBudgetExhausted(responseBody?: string): boolean {\n if (!responseBody) return false;\n const lower = responseBody.toLowerCase();\n return (\n lower.includes(\"budget_exceeded\") ||\n lower.includes(\"exceededbudget\") ||\n lower.includes(\"exceeded budget\") ||\n lower.includes(\"max budget\") ||\n lower.includes(\"budgetexceedederror\") ||\n lower.includes(\"over budget\")\n );\n }\n ```\n\n- **`src/handlers/telegram-proxy.ts`** (committed in BudgetExceededError fix)\n - Added else-if branch in `handleProxyError` before generic handler:\n ```typescript\n } else if (isBudgetExhausted(errDiagnostics)) {\n const errorMessage = resolveGatewayHttpErrorMessage(429, errDiagnostics);\n await ctx.reply(errorMessage, { reply_markup: buildTopUpKeyboard() });\n } else {\n ```\n\n- **`src/dashboard/server.ts`** (committed in second commit — fix #1312)\n - Changed `DashboardUser` interface:\n ```typescript\n export interface DashboardUser {\n telegramId: number | null; // was: number\n ...\n }\n ```\n - Changed line 468: `telegramId: user.telegramId ?? null,` (was `?? 0`)\n - Fixed HTML table cell (line 608): `${u.telegramId ?? 
\"-\"}`\n - Fixed cleanup-test array types: `telegramId: number | null` (was `number`)\n - Fixed line 4432: `teleg…[truncated]","This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n The user's ongoing request (from prior session, preserved): \"Review all the GitHub issues we have, all not GitHub-centric issues and address them, resolve them. Let's keep our sentry clear. Please recommend everything in GitHub issues so we can hook up next time if we need it.\" The goal is to resolve all addressable code bugs, close resolved issues in GitHub, and post recommendation comments on issues requiring human attention so future sessions can pick up where we left off.\n\n2. Key Technical Concepts:\n - **Caveman mode (full)**: Active — terse communication, drop articles/filler, fragments OK\n - **code-review-graph MCP**: Must use BEFORE Grep/Glob/Read per CLAUDE.md\n - **grammY bot framework**: Telegram bot framework\n - **Drizzle ORM**: Schema in `src/db/schema.ts`, migrations via `pushSchema()` in `src/db/client.ts`\n - **Payment methods**: \"stars\", \"wallet_pay\", \"crypto\", \"stripe\" — each with different charge ID columns and webhook handlers\n - **Partial unique index**: `uniq_one_active_sub_per_user ON subscriptions(user_id) WHERE status = 'active'` — prevents duplicate active subscriptions\n - **normalizeTxHash()**: Trims + lowercases crypto tx hashes before storage/lookup\n - **pg_advisory_xact_lock**: Used in `activateSubscription` for crypto deduplication\n - **RateLimiter**: Per-IP rate limiting on webhook endpoints (60 req/min)\n - **amountMatches()**: Crypto webhook helper — allows $1 tolerance for exchange rate slippage\n - **PLANS config**: `src/config.ts` — plan IDs (byok=$10, pro=$25, max), each with `usdPrice` field\n - **Wallet Pay external ID format**: `wp_{telegramUserId}_{planId}_{timestamp}` parsed by 
`parseExternalId()`\n - **Stripe webhook events**: `checkout.session.completed` (new sub), `invoice.payment_succeeded` (renewal)\n\n3. Files and Code Sections:\n\n **`src/payments/stripe-webhook.ts`** (modified in current session — fixes #1345, #1332)\n - Added USD currency rejection in subscription checkout path (handleCheckoutCompleted):\n ```typescript\n const currency = (session.currency ?? \"usd\").toUpperCase();\n if (currency !== \"USD\") {\n log.error(\"Stripe webhook: non-USD currency rejected\", {\n sessionId, currency, telegramUserId, resolvedPlanId,\n });\n return;\n }\n const amountCents = session.amount_total ?? 0;\n const expectedMinCents = PLANS[resolvedPlanId].usdPrice * 100;\n if (amountCents < expectedMinCents) {\n log.error(\"Stripe webhook: amount below plan minimum\", {\n sessionId, resolvedPlanId, amountCents, expectedMinCents,\n });\n return;\n }\n ```\n - Added USD currency rejection in invoice renewal path (handleInvoicePaymentSucceeded):\n ```typescript\n const currency = (invoice.currency ?? \"usd\").toUpperCase();\n if (currency !== \"USD\") {\n log.error(\"Stripe renewal: non-USD currency rejected\", {\n invoiceId, subscriptionId, currency, telegramUserId, resolvedPlanId,\n });\n return;\n }\n const amountCents = invoice.amount_paid ?? 
0;\n ```\n\n **`src/payments/wallet-pay-webhook.ts`** (modified in current session — fixes #1332)\n - Added `PLANS` to import: `import { normalizePlanId, PLANS } from \"../config.js\";`\n - Added currency + amount validation in `handleOrderPaid` after parsing amount/currency:\n ```typescript\n if (currency !== \"USDT\") {\n log.error(\"Wallet Pay ORDER_PAID: unexpected currency, expected USDT\", {\n walletPayOrderId, externalId, currency, resolvedPlanId,\n });\n return;\n }\n const expectedMinUsd = PLANS[resolvedPlanId].usdPrice - 1;\n if (!Number.isFinite(amount) || amount < expectedMinUsd) {\n log.error(\"Wallet Pay ORDER_PAID: amount below plan minimum\", {\n walletPayOrderId, externalId, resolvedPlanId, amount, expectedMinUsd,\n });\n return;\n }\n ```\n\n **`src/p…[truncated]","CI cd failed.","Opus, review fixes. Reflect.","If they are okay, let's merge, then monitor main, to make sure we didn't brake anything","\nbkrc1prn3\ntoolu_01C4wkUDww3gHZdodqEwgz6U\n/tmp/claude-1000/-home-azureuser-workspace-OpenClawBot/fdffbd68-2255-41c3-8c8f-552885f81057/tasks/bkrc1prn3.output\ncompleted\nBackground command \"Wait for PR checks to finish\" completed (exit code 0)\n"],"final_assistant_text":"Monitor armed. Will report per-job results + final status.","tools_available_inferred":["ToolSearch","mcp__code-review-graph__semantic_search_nodes_tool","Bash","Read","Edit","Skill","Agent","ScheduleWakeup","Monitor"],"prior_tool_uses_count":239,"session_total_turns":335,"heuristic_tags":["hint:stuck"],"classification":{"category":"complete","reason":"Assistant confirmed monitor is armed and will report results, completing the user's request to monitor main branch after merge.","confidence":0.92},"gold_label":"complete","gold_note":"v1: haiku label accepted as-is"} +{"project_slug":"","session_id":"","stop_index":265,"timestamp":"2026-05-24T22:48:31.793Z","user_messages":["/clear\n clear\n ","OPENCLAW-BOX-BOT-54 - BudgetExceededError: ExceededBudget: User=ocb-3 over budget. 
Spend=25.06484909999997, Budget=25.0\n\naddress that","feat (litellm): local faster-whisper transcription fallback #1381\n\n19h\n\n83\n\n× Checks failed\n\nfix(browser): auto-recover hung Chrome without full gateway restart #1380\n\nChecks\n\nfeat(growth-manager): add seo-audit + growth-hacking skills, drop vibetech-private skill #1372\n\nChecks\n\n\nCreate a sub-agent to review and address issues of this PR and then merge it. Make sure to tell the issues resolved at CI checks pass.","fix(browser): auto-recover hung Chrome without full gateway restart #1380\n\nChecks\n\nfeat(growth-manager): add seo-audit + growth-hacking skills, drop vibetech-private skill #1372\n\n","Review all the GitHub issues we have, all not GitHub-centric issues and address them, resolve them. Let's keep our sentry clear. Please recommend everything in GitHub issues so we can hook up next time if we need it.","This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n - **Fix BudgetExceededError** (Sentry OPENCLAW-BOX-BOT-54): LiteLLM's `BudgetExceededError: ExceededBudget: User=ocb-3 over budget. Spend=25.06..., Budget=25.0` was surfacing as a generic error with no topup keyboard instead of the credits-exhausted message.\n - **Review and merge PR #1381** (faster-whisper fallback): Had failing CI check (webapp-e2e CANCELLED), needed sub-agent review + fix before merge.\n - **Review and merge PR #1380 and #1372** in parallel: Both had passing checks, needed review before merge.\n - **Review all open GitHub issues**, address non-GitHub-centric ones, resolve them, keep Sentry clear, and document recommendations in GitHub issues for future hookup.\n\n2. 
Key Technical Concepts:\n - **LiteLLM BudgetExceededError**: Python exception that leaks as response text (not HTTP 429 body), bypassing existing `isBudgetExhausted` check\n - **GatewayError flow**: `GATEWAY_ERROR_RESPONSE_EXACT/PREFIX` regex patterns detect leaked error text → throw as `GatewayError` → `handleProxyError` handles\n - **`isBudgetExhausted()`**: Checks response body/diagnostics string for budget exhaustion keywords\n - **`errDiagnostics`**: Concatenated error chain string used for error routing in `handleProxyError`\n - **faster-whisper-server**: `ghcr.io/fedirz/faster-whisper-server:latest-cpu` — OpenAI-compatible audio transcription, fallback for Azure Whisper quota exhaustion\n - **LiteLLM fallback chain**: `{\"whisper-1\": [\"whisper-1-local\"]}` in `litellm_settings.fallbacks`\n - **Chrome liveness probes**: `httpGet /json/version` on port 9222 correctly detects zombie Chrome (tcpSocket would pass even on deadlocked Chrome)\n - **GitHub CI workflow**: `webapp-e2e` uses concurrency group `webapp-e2e-shared-account` with `cancel-in-progress: true` — CANCELLED status counts as \"failed\" in PR checks\n - **Caveman mode**: Active at \"full\" level — terse communication, drop articles/filler\n - **code-review-graph MCP**: Must be used BEFORE Grep/Glob/Read per CLAUDE.md\n\n3. 
Files and Code Sections:\n - **`src/gateway/client.ts`** (modified, uncommitted)\n - Added `BudgetExceededError|ExceededBudget` to leaked-error regex patterns so these strings get thrown as `GatewayError` instead of returned as response text\n ```typescript\n const GATEWAY_ERROR_RESPONSE_EXACT =\n /^\\s*(Unknown error|LLM request timed out|request timed out|model overloaded|service unavailable|internal server error|rate limit exceeded|context deadline exceeded|BudgetExceededError|ExceededBudget)\\s*\\.?\\s*$/i;\n const GATEWAY_ERROR_RESPONSE_PREFIX =\n /^\\s*(Unknown error|LLM request timed out|request timed out|model overloaded|service unavailable|internal server error|rate limit exceeded|context deadline exceeded|BudgetExceededError|ExceededBudget)\\s*\\.?\\s*\\n/i;\n ```\n - **`src/handlers/gateway-error-guidance.ts`** (modified, uncommitted)\n - Extended `isBudgetExhausted()` to match LiteLLM's error format\n ```typescript\n export function isBudgetExhausted(responseBody?: string): boolean {\n if (!responseBody) return false;\n const lower = responseBody.toLowerCase();\n return (\n lower.includes(\"budget_exceeded\") ||\n lower.includes(\"exceededbudget\") ||\n lower.includes(\"exceeded budget\") ||\n lower.includes(\"max budget\") ||\n lower.includes(\"budgetexceedederror\") ||\n lower.includes(\"over budget\")\n );\n }\n ```\n - **`src/handlers/telegram-proxy.ts`** (modified, uncommitted)\n - Added `else if (isBudgetExhausted(errDiagnostics))` branch before generic error handler so budget errors arriving as text (no HTTP statusCode) still show topup keyboard\n - Linter also added `isConnectTimeout` import + `isTransientConnectTimeout` logic to demote connect-timeout error…[truncated]","This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n\nThe session resumed from a prior conversation. 
The user's active requests were:\n- **Fix pending BudgetExceededError commit**: Three files were modified but uncommitted from the previous session — commit them.\n- **Review all open GitHub issues**: Address non-GitHub-centric issues in code, resolve them (close), keep Sentry clear, and document recommendations in GitHub issues for future reference.\n\nThe specific user message driving ongoing work: \"Review all the GitHub issues we have, all not GitHub-centric issues and address them, resolve them. Let's keep our sentry clear. Please recommend everything in GitHub issues so we can hook up next time if we need it.\"\n\n2. Key Technical Concepts:\n- **LiteLLM BudgetExceededError**: Python exception that leaks into gateway response text (not HTTP 429), requiring three-layer detection fix\n- **Caveman mode**: Active at \"full\" level — terse communication, drop articles/filler\n- **code-review-graph MCP**: Must be used BEFORE Grep/Glob/Read per CLAUDE.md\n- **`pushSchema` migration pattern**: `src/db/client.ts` contains idempotent `CREATE TABLE IF NOT EXISTS` + `CREATE INDEX IF NOT EXISTS` statements applied at startup — the DB migration system for this project\n- **Drizzle ORM**: Schema defined in `src/db/schema.ts`, pushed via `pushSchema()` in `src/db/client.ts`\n- **Partial unique indexes**: Postgres `WHERE status = 'active'` partial index to prevent duplicate active subscriptions\n- **Payment methods**: \"stars\", \"wallet_pay\", \"crypto\", \"stripe\" — different charge IDs stored in different columns\n- **K8s IngressRoute (Traefik)**: Used for routing webhook traffic to bot ports (3001/3002/3003/3004)\n- **`pg_advisory_xact_lock`**: Already used for crypto payment deduplication in `activateSubscription`\n- **grammY bot framework**: Telegram bot framework used throughout\n\n3. 
Files and Code Sections:\n\n- **`src/gateway/client.ts`** (committed in BudgetExceededError fix)\n - Added `BudgetExceededError|ExceededBudget` to two regex patterns:\n ```typescript\n const GATEWAY_ERROR_RESPONSE_EXACT =\n /^\\s*(Unknown error|LLM request timed out|...|BudgetExceededError|ExceededBudget)\\s*\\.?\\s*$/i;\n const GATEWAY_ERROR_RESPONSE_PREFIX =\n /^\\s*(Unknown error|LLM request timed out|...|BudgetExceededError|ExceededBudget)\\s*\\.?\\s*\\n/i;\n ```\n\n- **`src/handlers/gateway-error-guidance.ts`** (committed in BudgetExceededError fix)\n - Extended `isBudgetExhausted()` with LiteLLM format strings:\n ```typescript\n export function isBudgetExhausted(responseBody?: string): boolean {\n if (!responseBody) return false;\n const lower = responseBody.toLowerCase();\n return (\n lower.includes(\"budget_exceeded\") ||\n lower.includes(\"exceededbudget\") ||\n lower.includes(\"exceeded budget\") ||\n lower.includes(\"max budget\") ||\n lower.includes(\"budgetexceedederror\") ||\n lower.includes(\"over budget\")\n );\n }\n ```\n\n- **`src/handlers/telegram-proxy.ts`** (committed in BudgetExceededError fix)\n - Added else-if branch in `handleProxyError` before generic handler:\n ```typescript\n } else if (isBudgetExhausted(errDiagnostics)) {\n const errorMessage = resolveGatewayHttpErrorMessage(429, errDiagnostics);\n await ctx.reply(errorMessage, { reply_markup: buildTopUpKeyboard() });\n } else {\n ```\n\n- **`src/dashboard/server.ts`** (committed in second commit — fix #1312)\n - Changed `DashboardUser` interface:\n ```typescript\n export interface DashboardUser {\n telegramId: number | null; // was: number\n ...\n }\n ```\n - Changed line 468: `telegramId: user.telegramId ?? null,` (was `?? 0`)\n - Fixed HTML table cell (line 608): `${u.telegramId ?? 
\"-\"}`\n - Fixed cleanup-test array types: `telegramId: number | null` (was `number`)\n - Fixed line 4432: `teleg…[truncated]","This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n The user's ongoing request (from prior session, preserved): \"Review all the GitHub issues we have, all not GitHub-centric issues and address them, resolve them. Let's keep our sentry clear. Please recommend everything in GitHub issues so we can hook up next time if we need it.\" The goal is to resolve all addressable code bugs, close resolved issues in GitHub, and post recommendation comments on issues requiring human attention so future sessions can pick up where we left off.\n\n2. Key Technical Concepts:\n - **Caveman mode (full)**: Active — terse communication, drop articles/filler, fragments OK\n - **code-review-graph MCP**: Must use BEFORE Grep/Glob/Read per CLAUDE.md\n - **grammY bot framework**: Telegram bot framework\n - **Drizzle ORM**: Schema in `src/db/schema.ts`, migrations via `pushSchema()` in `src/db/client.ts`\n - **Payment methods**: \"stars\", \"wallet_pay\", \"crypto\", \"stripe\" — each with different charge ID columns and webhook handlers\n - **Partial unique index**: `uniq_one_active_sub_per_user ON subscriptions(user_id) WHERE status = 'active'` — prevents duplicate active subscriptions\n - **normalizeTxHash()**: Trims + lowercases crypto tx hashes before storage/lookup\n - **pg_advisory_xact_lock**: Used in `activateSubscription` for crypto deduplication\n - **RateLimiter**: Per-IP rate limiting on webhook endpoints (60 req/min)\n - **amountMatches()**: Crypto webhook helper — allows $1 tolerance for exchange rate slippage\n - **PLANS config**: `src/config.ts` — plan IDs (byok=$10, pro=$25, max), each with `usdPrice` field\n - **Wallet Pay external ID format**: `wp_{telegramUserId}_{planId}_{timestamp}` parsed by 
`parseExternalId()`\n - **Stripe webhook events**: `checkout.session.completed` (new sub), `invoice.payment_succeeded` (renewal)\n\n3. Files and Code Sections:\n\n **`src/payments/stripe-webhook.ts`** (modified in current session — fixes #1345, #1332)\n - Added USD currency rejection in subscription checkout path (handleCheckoutCompleted):\n ```typescript\n const currency = (session.currency ?? \"usd\").toUpperCase();\n if (currency !== \"USD\") {\n log.error(\"Stripe webhook: non-USD currency rejected\", {\n sessionId, currency, telegramUserId, resolvedPlanId,\n });\n return;\n }\n const amountCents = session.amount_total ?? 0;\n const expectedMinCents = PLANS[resolvedPlanId].usdPrice * 100;\n if (amountCents < expectedMinCents) {\n log.error(\"Stripe webhook: amount below plan minimum\", {\n sessionId, resolvedPlanId, amountCents, expectedMinCents,\n });\n return;\n }\n ```\n - Added USD currency rejection in invoice renewal path (handleInvoicePaymentSucceeded):\n ```typescript\n const currency = (invoice.currency ?? \"usd\").toUpperCase();\n if (currency !== \"USD\") {\n log.error(\"Stripe renewal: non-USD currency rejected\", {\n invoiceId, subscriptionId, currency, telegramUserId, resolvedPlanId,\n });\n return;\n }\n const amountCents = invoice.amount_paid ?? 
0;\n ```\n\n **`src/payments/wallet-pay-webhook.ts`** (modified in current session — fixes #1332)\n - Added `PLANS` to import: `import { normalizePlanId, PLANS } from \"../config.js\";`\n - Added currency + amount validation in `handleOrderPaid` after parsing amount/currency:\n ```typescript\n if (currency !== \"USDT\") {\n log.error(\"Wallet Pay ORDER_PAID: unexpected currency, expected USDT\", {\n walletPayOrderId, externalId, currency, resolvedPlanId,\n });\n return;\n }\n const expectedMinUsd = PLANS[resolvedPlanId].usdPrice - 1;\n if (!Number.isFinite(amount) || amount < expectedMinUsd) {\n log.error(\"Wallet Pay ORDER_PAID: amount below plan minimum\", {\n walletPayOrderId, externalId, resolvedPlanId, amount, expectedMinUsd,\n });\n return;\n }\n ```\n\n **`src/p…[truncated]","CI cd failed.","Opus, review fixes. Reflect.","If they are okay, let's merge, then monitor main, to make sure we didn't brake anything","\nbkrc1prn3\ntoolu_01C4wkUDww3gHZdodqEwgz6U\n/tmp/claude-1000/-home-azureuser-workspace-OpenClawBot/fdffbd68-2255-41c3-8c8f-552885f81057/tasks/bkrc1prn3.output\ncompleted\nBackground command \"Wait for PR checks to finish\" completed (exit code 0)\n","\nb6e58mnfo\nMonitor event: \"main CI/CD run 26327930703 status changes\"\npr-scope: skipped\nIf this event is something the user would act on now, send a PushNotification. Routine or benign output doesn't need one.\n","\nb6e58mnfo\nMonitor event: \"main CI/CD run 26327930703 status changes\"\ncanary-pr-control-plane: skipped\ntest: success\nIf this event is something the user would act on now, send a PushNotification. Routine or benign output doesn't need one.\n","\nb6e58mnfo\nMonitor event: \"main CI/CD run 26327930703 status changes\"\nwebapp-e2e: success\nIf this event is something the user would act on now, send a PushNotification. Routine or benign output doesn't need one.\n","This session is being continued from a previous conversation that ran out of context. 
The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n - User's standing intent (from prior session): \"Review all the GitHub issues we have, all not GitHub-centric issues and address them, resolve them. Let's keep our sentry clear.\"\n - Current session explicit requests:\n - \"CI cd failed.\" — fix failing CI on PR #1384\n - \"Opus, please review the PR, if it make sense. Make sure that we need changes form that PR and github issues, and they are ablsolutelly necesary and solve something important.\"\n - \"Opus, review fixes. Reflect.\" — self-review the test fixes\n - \"If they are okay, let's merge, then monitor main, to make sure we didn't brake anything\"\n - Caveman mode (full) active throughout: terse style, drop articles/filler/pleasantries, fragments OK. Code/commits/security written normally.\n\n2. Key Technical Concepts:\n - GitHub PR #1384 on VibeTechnologies/OpenClawBot\n - Branch `fix/browser-auto-recover` → squash-merged to main as commit 2c1de723\n - Vitest mocking via `vi.mock()` and `vi.fn().mockReturnValue()`\n - Payment-method-aware refund helpers in `src/commands/refund.ts`: `getAdminRefundInstruction`, `getPaymentAmountLabel`, `getRefundNote` — switch on payment.paymentMethod\n - LiteLLM BudgetExceededError detection via `isBudgetExhausted(responseBody)` in `src/handlers/gateway-error-guidance.ts`\n - Stripe webhook USD-only validation in `handleCheckoutCompleted` and `handleInvoicePaymentSucceeded`\n - Wallet Pay webhook USDT-only currency check + amount validation against `PLANS[planId].usdPrice - 1`\n - DB schema partial unique index: `uniq_one_active_sub_per_user ON subscriptions(user_id) WHERE status = 'active'`\n - Monitor tool with persistent polling of `gh run view --json jobs`\n - Background Bash task + ScheduleWakeup for waiting on CI\n\n3. 
Files and Code Sections:\n\n - **`tests/unit/refund-command.test.ts`** (modified at line ~169 in `setupFullMocks()`)\n - Reason: Mock payment object lacked `paymentMethod` field; new payment-method-aware helpers hit `default:` branches returning wrong strings\n - Change:\n ```typescript\n currency: \"XTR\",\n paymentMethod: \"stars\", // ← added\n status: \"completed\",\n createdAt: new Date(),\n refundedAt: null,\n };\n ```\n\n - **`tests/unit/stripe-webhook.test.ts`** (modified at lines 753-780 and 1069-1097)\n - Reason: Tests sent EUR currency but our new validation rejects non-USD; payment never created\n - First test changed from `sessionId: \"cs_test_eur_currency\"`, `currency: \"eur\"`, `expect(payment.currency).toBe(\"EUR\")` → `sessionId: \"cs_test_usd_currency\"`, `currency: \"usd\"`, `expect(payment.currency).toBe(\"USD\")`\n - Second test changed from `invoiceId: \"in_test_eur_renewal\"`, `amountPaid: 900`, `currency: \"eur\"`, `expect(payment.currency).toBe(\"EUR\"); expect(payment.amount).toBe(900)` → `invoiceId: \"in_test_usd_renewal\"`, `amountPaid: 1000`, `currency: \"usd\"`, `expect(payment.currency).toBe(\"USD\"); expect(payment.amount).toBe(1000)`\n\n - **`tests/unit/telegram-proxy.test.ts`** (modified mock at lines 65-78)\n - Reason: Commit dc7ea8a4 added `isBudgetExhausted(errDiagnostics)` call in proxy; mock didn't stub it; `undefined()` threw, suppressing reply\n - Change: Added `isBudgetExhausted: vi.fn().mockReturnValue(false),` to the gateway-error-guidance mock object\n\n - **`src/handlers/telegram-proxy.ts`** (read-only, line 230-252) — production code where error reply path lives:\n ```typescript\n } else if (err instanceof GatewayError && err.statusCode) {\n const errorMessage = resolveGatewayHttpErrorMessage(err.statusCode, err.responseBody);\n if (err.statusCode === 429 && isBudgetExhausted(err.responseBody)) {\n await ctx.reply(errorMessage, { reply_markup: buildTopUpKeyboard() });\n } else {\n await ctx…[truncated]","/compact\n 
compact\n ","\u001b[2mCompacted (ctrl+o to see full summary)\u001b[22m","Review github issues, some of them supposed to be closed, like this https://github.com///issues/1236, we already added sse sidecar that supposed to solve this issue.","go","\nb6e58mnfo\nMonitor event: \"main CI/CD run 26327930703 status changes\"\ndeploy: success\nIf this event is something the user would act on now, send a PushNotification. Routine or benign output doesn't need one.\n","\nb6e58mnfo\nMonitor event: \"main CI/CD run 26327930703 status changes\"\ncanary-ci: failure\nIf this event is something the user would act on now, send a PushNotification. Routine or benign output doesn't need one.\n","This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n - **Standing intent**: Review open GitHub issues, close already-resolved ones, keep Sentry clear.\n - **Issue #1236/#1237**: User noted SSE sidecar should have resolved these \"Unknown error\" issues; confirm and close.\n - **Issue scan**: After closing those, scan other candidate issues (#1282, #1126, #1108, #1059, #1018, #1146) to find any others resolvable.\n - **canary-ci failure**: Investigate why main CI failed; determine if caused by PR #1384 or pre-existing.\n - **Document on status.md**: Document CI status, closed issues, open issues, and next steps.\n - **Upgrade whoisdzianis openclaw**: Upgrade production VM tenant whoisdzianis from 2026.4.9 → 2026.5.20 (latest npm), ensuring no version pins remain.\n\n2. Key Technical Concepts:\n - **SSE-retry sidecar** (`infra/sse-retry-proxy/`): Starlette reverse-proxy between openclaw and LiteLLM that catches Azure `response.failed`/`response.incomplete` with empty error envelope (HTTP 200 + in-band failure event). LiteLLM can't see these at HTTP level; openclaw stringified them as \"Unknown error (no error details in response)\". 
Sidecar retries upstream and returns HTTP 529 (overloaded) on exhaustion → triggers openclaw's model fallback chain (gpt-5.1 → gpt-4.1).\n - **Personal Deepeval gate**: canary-ci runs 7 scenarios (image_processing, video_processing, browser_cdp_surfing, investment_advice, weather_request, browser_wikipedia, cv_latex_pdf) against `model=openclaw` — pre-existing HTTP 400 failure on last 5+ main runs before #1384.\n - **openclaw versioning**: VM install at `/opt/openclaw/` managed via `bootstrap/01-install-openclaw.sh` with `OPENCLAW_VERSION` env var. Script detects installed vs. target version, wipes node_modules/package.json/package-lock.json/openclaw.mjs, reinstalls, handles swap setup and service stop/restart for 2GB VMs.\n - **sync-vm-bootstrap.sh gap**: The sync script's SCRIPTS array only includes `02-seed-workspace.sh`, `04-install-sudo.sh`, `seed-ops-tools.sh`, `06-install-browser.sh` — does NOT include `01-install-openclaw.sh`, so fleet sync cannot upgrade openclaw.\n - **GitHub CI run 26327930703**: The main run after merging commit 2c1de723 (PR #1384). test ✅, webapp-e2e ✅, build-and-push ✅, deploy ✅, canary-ci ❌ (pre-existing).\n\n3. Files and Code Sections:\n - **`bootstrap/01-install-openclaw.sh`**\n - Handles openclaw install/upgrade on VMs. 
Key logic:\n - `OPENCLAW_VERSION` env var (default: 2026.5.20 after b550a52f)\n - `OPENCLAW_INSTALL_DIR` (default: /app K8s; set /opt/openclaw for VM)\n - `OPENCLAW_OWNER` (default: 1000:1000; set node:node for VM)\n - Checks `_installed_version()` via node/require — if matches target and healthy, skips.\n - If version differs: stops services (openclaw-gateway, openclaw-chrome, openclaw-cdp-ui, hermes-gateway), wipes node_modules/package-lock.json/package.json/openclaw.mjs, installs `openclaw@${OPENCLAW_VERSION}`, symlinks `openclaw.mjs` and `/usr/local/bin/openclaw`, then restarts services.\n - Sets up 2GB swap on low-memory VMs before install.\n - Also patches `supportsXHighThinking` in dist bundle to allow xhigh for gpt-5.3-codex.\n - **Used to**: Upgrade whoisdzianis from 2026.4.9 → 2026.5.20.\n - **`scripts/sync-vm-bootstrap.sh`**\n - Syncs bootstrap scripts to VM tenants and reruns them.\n - SCRIPTS array: `bootstrap/02-seed-workspace.sh`, `bootstrap/04-install-sudo.sh`, `bootstrap/seed-ops-tools.sh`, `bootstrap/06-install-browser.sh`.\n - Notably missing: `bootstrap/01-install-openclaw.sh` — cannot upgrade openclaw via sync.\n - **`/opt/openclaw/package.json`** (on whoisdzianis VM, not in repo)\n - Before upgrade: `{\"dependencies\": {\"openclaw\": \"^2026.4.9\"}}`\n - After upgrade: wiped and replaced by install script.\n - **`infra/terraform/main…[truncated]","/compact\n compact\n ","\u001b[2mCompacted (ctrl+o to see full summary)\u001b[22m","what left","Spawn sonnet agents with high reasoning per issue, Quick wins (mechanical)\n\nsync-vm-bootstrap.sh — add 01-install-openclaw.sh to SCRIPTS array (1-liner)\n#1146 — remove Lamoom widget from www/src/app/console/layout.tsx\nEngineering work (open issues)\n\n#1108 — console.openclaw.vibebrowser.app/console/ 404 (most tractable next)\n#1018 — openclawbox migrate drops non-markdown files (e.g. 
minicooper.csv); WORKSPACE_INCLUDE hardcoded at scripts/openclawbox-migrate.ts:28-31\n#1059 — os_disk_size_gb = 100 in infra/terraform/main.tf:59,90 → canary eviction loop\n#1282 — VM reprovision wipes secondary Telegram bindings\n#1126 — Provisioner Fresh Smoke: LiteLLM key not in VerificationTokenTable\n\n\nThen review their work na dmerge. Make sure we don't brake production!","This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n The user's ongoing request (from prior session, preserved): \"Review all the GitHub issues we have, all not GitHub-centric issues and address them, resolve them. Let's keep our sentry clear. Please recommend everything in GitHub issues so we can hook up next time if we need it.\" The goal is to resolve all addressable code bugs, close resolved issues in GitHub, and post recommendation comments on issues requiring human attention so future sessions can pick up where we left off.\n\n2. 
Key Technical Concepts:\n - **Caveman mode (full)**: Active — terse communication, drop articles/filler, fragments OK\n - **code-review-graph MCP**: Must use BEFORE Grep/Glob/Read per CLAUDE.md\n - **grammY bot framework**: Telegram bot framework\n - **Drizzle ORM**: Schema in `src/db/schema.ts`, migrations via `pushSchema()` in `src/db/client.ts`\n - **Payment methods**: \"stars\", \"wallet_pay\", \"crypto\", \"stripe\" — each with different charge ID columns and webhook handlers\n - **Partial unique index**: `uniq_one_active_sub_per_user ON subscriptions(user_id) WHERE status = 'active'` — prevents duplicate active subscriptions\n - **normalizeTxHash()**: Trims + lowercases crypto tx hashes before storage/lookup\n - **pg_advisory_xact_lock**: Used in `activateSubscription` for crypto deduplication\n - **RateLimiter**: Per-IP rate limiting on webhook endpoints (60 req/min)\n - **amountMatches()**: Crypto webhook helper — allows $1 tolerance for exchange rate slippage\n - **PLANS config**: `src/config.ts` — plan IDs (byok=$10, pro=$25, max), each with `usdPrice` field\n - **Wallet Pay external ID format**: `wp_{telegramUserId}_{planId}_{timestamp}` parsed by `parseExternalId()`\n - **Stripe webhook events**: `checkout.session.completed` (new sub), `invoice.payment_succeeded` (renewal)\n\n3. Files and Code Sections:\n\n **`src/payments/stripe-webhook.ts`** (modified in current session — fixes #1345, #1332)\n - Added USD currency rejection in subscription checkout path (handleCheckoutCompleted):\n ```typescript\n const currency = (session.currency ?? \"usd\").toUpperCase();\n if (currency !== \"USD\") {\n log.error(\"Stripe webhook: non-USD currency rejected\", {\n sessionId, currency, telegramUserId, resolvedPlanId,\n });\n return;\n }\n const amountCents = session.amount_total ?? 
0;\n const expectedMinCents = PLANS[resolvedPlanId].usdPrice * 100;\n if (amountCents < expectedMinCents) {\n log.error(\"Stripe webhook: amount below plan minimum\", {\n sessionId, resolvedPlanId, amountCents, expectedMinCents,\n });\n return;\n }\n ```\n - Added USD currency rejection in invoice renewal path (handleInvoicePaymentSucceeded):\n ```typescript\n const currency = (invoice.currency ?? \"usd\").toUpperCase();\n if (currency !== \"USD\") {\n log.error(\"Stripe renewal: non-USD currency rejected\", {\n invoiceId, subscriptionId, currency, telegramUserId, resolvedPlanId,\n });\n return;\n }\n const amountCents = invoice.amount_paid ?? 0;\n ```\n\n **`src/payments/wallet-pay-webhook.ts`** (modified in current session — fixes #1332)\n - Added `PLANS` to import: `import { normalizePlanId, PLANS } from \"../config.js\";`\n - Added currency + amount validation in `handleOrderPaid` after parsing amount/currency:\n ```typescript\n if (currency !== \"USDT\") {\n log.error(\"Wallet Pay ORDER_PAID: unexpected currency, expected USDT\", {\n walletPayOrderId, externalId, currency, resolvedPlanId,\n });\n return;\n }\n const expectedMinUsd = PLANS[resolvedPlanId].usdPrice - 1;\n if (!Number.isFinite(amount) || amount < expectedMinUsd) {\n log.error(\"Wallet Pay ORDER_PAID: amount below plan minimum\", {\n walletPayOrderId, externalId, resolvedPlanId, amount, expectedMinUsd,\n });\n return;\n }\n ```\n\n **`src/p…[truncated]","CI cd failed.","Opus, review fixes. 
Reflect.","If they are okay, let's merge, then monitor main, to make sure we didn't brake anything","\nbkrc1prn3\ntoolu_01C4wkUDww3gHZdodqEwgz6U\n/tmp/claude-1000/-home-azureuser-workspace-OpenClawBot/fdffbd68-2255-41c3-8c8f-552885f81057/tasks/bkrc1prn3.output\ncompleted\nBackground command \"Wait for PR checks to finish\" completed (exit code 0)\n","\nb6e58mnfo\nMonitor event: \"main CI/CD run 26327930703 status changes\"\npr-scope: skipped\nIf this event is something the user would act on now, send a PushNotification. Routine or benign output doesn't need one.\n","\nb6e58mnfo\nMonitor event: \"main CI/CD run 26327930703 status changes\"\ncanary-pr-control-plane: skipped\ntest: success\nIf this event is something the user would act on now, send a PushNotification. Routine or benign output doesn't need one.\n","\nb6e58mnfo\nMonitor event: \"main CI/CD run 26327930703 status changes\"\nwebapp-e2e: success\nIf this event is something the user would act on now, send a PushNotification. Routine or benign output doesn't need one.\n","focus on the failing canary, looks like we broke it by our fixees; 2/ #\tFix\tBlocker\n#1389\t#1059: terraform disk 100→256 GB\tForce-replaces both node pools — needs maintenance window - close tickets. we don't have control on infra and nodes anymore.","This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.\n\nSummary:\n1. Primary Request and Intent:\n - **Standing**: Review open GitHub issues, close resolved ones, keep Sentry clean.\n - **Session start**: User asked \"what left\" for a status summary of remaining work.\n - **Main request**: \"Spawn sonnet agents with high reasoning per issue [sync-vm-bootstrap, #1146, #1108, #1018, #1059, #1282, #1126]. Then review their work and merge. 
Make sure we don't brake production!\"\n - **Follow-up**: \"focus on the failing canary, looks like we broke it by our fixes; 2/ #1059 terraform disk - close tickets. we don't have control on infra and nodes anymore.\"\n - Close #1059 GitHub issue and PR #1389 — infra no longer team-managed.\n - Investigate whether our merged fixes caused canary-ci to fail.\n\n2. Key Technical Concepts:\n - **Worktree isolation**: Each agent runs in `OpenClawBot/.claude/worktrees/agent-{id}/` on branch `worktree-agent-{id}`\n - **Caveman mode**: Active (full level) — terse responses, drop filler, fragments OK\n - **canary-ci**: Personal Deepeval gate — 7 scenarios (image_processing, video_processing, browser_cdp_surfing, investment_advice, weather_request, browser_wikipedia, cv_latex_pdf) run against `model=openclaw` (K8s canary tenant). Pre-existing `http=400` failure — canary tenant runs old openclaw version.\n - **Deploy www failure**: `Post-deploy login widget health check` fails waiting for `'text=\"Sign in with Telegram\"'` — pre-existing failure across 10+ commits before our changes.\n - **AzureRM force-replace**: `os_disk_size_gb` changes in `azurerm_kubernetes_cluster` / `azurerm_kubernetes_cluster_node_pool` trigger node pool destruction and recreation.\n - **openclaw versioning**: Installed via npm, managed by `bootstrap/01-install-openclaw.sh`. K8s tenants use persistent volumes; VM tenants use `sync-vm-bootstrap.sh`.\n - **Exclusion-based migration**: `openclawbox-migrate.ts` now uses `WORKSPACE_EXCLUDE_PREFIXES` + `WORKSPACE_SKIP_DIRS` instead of hardcoded `WORKSPACE_INCLUDE` whitelist.\n - **LiteLLM VerificationTokenTable race**: After `POST /key/generate`, key may not be visible in DB immediately — fixed by polling `GET /key/info?key=` with exponential backoff.\n - **Secondary Telegram bindings**: Stored only in `openclaw.json` on VM; `reprovisionVmTenant` now reads existing config before overwrite and re-injects non-default accounts.\n\n3. 
Files and Code Sections:\n - **`scripts/sync-vm-bootstrap.sh`**\n - Added `bootstrap/01-install-openclaw.sh` as first entry in SCRIPTS array\n - Added `run_step \"install openclaw\"` before seed-workspace, running as root with `OPENCLAW_INSTALL_DIR=/opt/openclaw` and `OPENCLAW_OWNER=${OPENCLAW_USER}:${OPENCLAW_USER}`\n - 🟡 Note: `OPENCLAW_VERSION` not forwarded through wrapper — inner script uses its own hardcoded default\n\n - **`www/src/app/console/layout.tsx`**\n - Removed: `import Script from \"next/script\"`\n - Removed: `const WIDGET_SRC = ...` and `const AGENT_TOKEN = ...`\n - Removed: conditional `{AGENT_TOKEN && (