Abilityai · vybe · May 23, 2026 · May 14, 2026 · May 14, 2026 · May 17, 2026
diff --git a/docs/KNOWN_ISSUES.md b/docs/KNOWN_ISSUES.md
@@ -98,6 +98,53 @@ Claude Code has a hardcoded 60-second timeout for all MCP HTTP tool calls. This
 
 ---
 
+### 🟢 Stop Hooks That Spawn Network Processes Can Hold the Agent Stdout Pipe Open
+
+**Status**: Mitigated platform-side by the orphan-killer (#620); operator-side defense-in-depth recommended
+**Priority**: LOW (mitigated)
+**Affects**: Any agent whose Stop hook spawns processes that may call `setsid()` — notably `git push` spawning `ssh`
+
+**Symptoms:**
+- Agent execution logs show: `Reader thread(s) still busy after process exit ... killing process group`
+- Followed by: `force-closing pipes; some buffered data may be lost`
+- Then: `Execution completed without a result message` (502)
+- Pattern repeats on every execution for the affected agent
+
+**Cause:**
+The hook's grandchild (e.g. `ssh` spawned by `git push`) calls `setsid()` and escapes claude's process group. `terminate_process_group(claude_pgid)` doesn't reach it, so it keeps the stdout pipe write-end open during network I/O. The reader's `readline()` never sees EOF, the drain times out, and the force-close fallback discards the final `{"type":"result"}` JSON.
+
+**Platform fix:**
+`_kill_orphan_pipe_writers` (`docker/base-image/agent_server/utils/subprocess_pgroup.py`) enumerates `/proc/*/fd` for processes outside our pgid holding the same pipe inode and SIGKILLs them. Shipped via #620. Agents only inherit the fix after `./scripts/deploy/build-base-image.sh` and container recreation — older base images won't have it.
+
+**Operator-side defense-in-depth (bash/sh hooks):**
+Redirect inherited fds to a **log file** (not `/dev/null` — failures must stay debuggable) **before any command that could spawn or duplicate a file descriptor**:
+
+```bash
+#!/bin/bash
+# The `exec` redirect below MUST run before anything long-lived can inherit or
+# duplicate fd 1/2: `set -x`, `exec 3>&1`, command substitutions, background jobs.
+# (`mkdir` exits immediately, so it is safe.) `set +e` only affects exit codes.
+mkdir -p ~/.trinity/logs
+exec >> ~/.trinity/logs/stop-hook.log 2>&1
+set +e
+echo "=== Stop hook fired at $(date -Iseconds) ==="
+git push origin HEAD
+```
+
+Log to a file instead of `/dev/null`: a hook that silently swallows `git push` failures is its own outage class — branches stop syncing with no operator-visible signal.
+
+**Other shells / languages:**
+- **Python hooks**: pass `stdout=open(log_path, 'a'), stderr=subprocess.STDOUT` to every `subprocess.Popen`/`subprocess.run` that calls external processes.
+- **Node hooks**: pass `{ stdio: ['ignore', logFd, logFd] }` to `child_process.spawn`.
+- **fish**: fish has no bare `exec >> file` form for redirecting the current shell, and `set -gx` only exports variables, so redirect each external call instead: `git push origin HEAD >> log_path 2>&1`.
+
+**Related Files:**
+- `docker/base-image/agent_server/utils/subprocess_pgroup.py` — platform fix: `_kill_orphan_pipe_writers`
+- `docker/base-image/agent_server/services/headless_executor.py` — drain + result recovery
+- Resolved by: #620 (closes #618); regression-tested via `tests/unit/test_subprocess_pgroup.py::TestDrainReaderThreads::test_setsid_escapee_drained_via_orphan_killer_preserves_result_line` (#586).
+
+---
+
 ## Resolved Issues
 
 _No resolved issues yet_

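For the Python-hook bullet in the `KNOWN_ISSUES.md` change above, a minimal sketch of the redirect might look like the following. `run_logged` and the log path are hypothetical names chosen for illustration. The only assumption carried over from the doc is that every child must receive a log-file fd rather than the inherited stdout pipe.

```python
import subprocess
from pathlib import Path


def run_logged(cmd, log_path):
    """Run cmd with stdout/stderr appended to log_path (hypothetical helper).

    The child, and anything it spawns (e.g. ssh under git push), inherits the
    log file's fd instead of the agent's stdout pipe, so a setsid() escapee
    cannot keep that pipe's write end open.
    """
    log_path = Path(log_path).expanduser()
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a") as log:
        completed = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # do not share the hook's stdin either
            stdout=log,                 # log file, not /dev/null: failures stay visible
            stderr=subprocess.STDOUT,   # fold stderr into the same log
            check=False,                # report the exit code instead of raising
        )
    return completed.returncode
```

In a Stop hook the call would be something like `run_logged(["git", "push", "origin", "HEAD"], "~/.trinity/logs/stop-hook.log")`. To mirror the `set +e` behaviour of the bash example, the hook can ignore the return code and exit 0.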
diff --git a/docs/memory/feature-flows.md b/docs/memory/feature-flows.md
@@ -13,8 +13,10 @@
 |------|-----|---------|------|
 | 2026-05-18 | #887 | fix(read-only): guard moved to base image (`/opt/trinity/hooks/`, root-owned 0555); MultiEdit bypass fixed; fail-closed via `run_hook()`; lifecycle always syncs config on start (stale-volume fix); config file protected by `path_deny` + `bash_deny` in guardrails-baseline.json; 18 unit tests | [read-only-mode.md](feature-flows/read-only-mode.md) |
 | 2026-05-18 | #888 | write_user_memory MCP tool — per-user memory write with server-side email resolution, fixing PII cross-user memory leak | [write-user-memory.md](feature-flows/write-user-memory.md) |
+| 2026-05-17 | #35d4e78 | fix(credentials): map agent-server connect errors to 503 on `import_credentials` and `export_credentials` — `httpx.RequestError` (ConnectError/TimeoutException/ReadError) now surfaces 503 instead of 500 when the agent container is up but its FastAPI server isn't reachable yet. Mirrors the inject/agent-files pattern. | [credential-injection.md](feature-flows/credential-injection.md) |
 | 2026-05-17 | #862 | fix(cleanup): execution retention sweeps were no-ops — `prune_execution_logs`/`prune_execution_rows` queried `status IN ('completed','failed','terminated')` but `TaskExecutionStatus` uses `'success'/'failed'/'cancelled'/'skipped'`; only `'failed'` rows ever pruned; fixed SQL predicates + `idx_executions_completed_terminal` partial index + migration to drop/recreate existing wrong index on live installs | [cleanup-service.md](feature-flows/cleanup-service.md) |
 | 2026-05-13 | #586 | obs(agent-runtime): `[METRIC] drain_outcome` emissions on the slow path of `drain_reader_threads` — two reachable sites surface `outcome=natural`/`force_close`/`leaked`, `stuck_initial`, `drain_elapsed_ms`, optional `leaked_count`, plus vestigial `orphan_kill_count=0` (since the #817 cgroup-sweep refactor, actual orphan counts are logged separately as `Cgroup sweep killed N orphan(s)`). Fast path stays silent. New Stop-hook authoring guidance in `TRINITY_COMPATIBLE_AGENT_GUIDE.md` shows how to release the inherited stdout FD before blocking I/O so hooks avoid the slow path entirely. Fleet audit at `scripts/586-fleet-check.sh` gates close-out by scanning Vector agent logs for residual "still stuck after Ns" / "no result message after" events. 2 new unit tests. | [execution-termination.md](feature-flows/execution-termination.md) |
+| 2026-05-13 | #602/#830 | sec: drop SYS_PTRACE / MKNOD / NET_RAW / FSETID from `FULL_CAPABILITIES` (Phase 3c). Dropping SYS_PTRACE closes the AISEC-C2 heap-read OAuth-exfil path. FULL set is now 9 caps (was 13). Constants extracted to stdlib-only `services/agent_service/capabilities.py`; `lifecycle.py` re-exports. | [container-capabilities.md](feature-flows/container-capabilities.md), [agent-lifecycle.md](feature-flows/agent-lifecycle.md) |
 | 2026-05-13 | #831 | feat: platform default model — admin sets `platform_default_model` in Settings General tab; `task_execution_service.execute_task()` resolves `model=None` → platform default (TTL-cached, write-through invalidation); `GET /api/settings/feature-flags` exposes value for frontend; SchedulesPanel shows "platform default (X)" when no model set; PRESET_MODELS updated to canonical Anthropic list (Opus 4.7 / Sonnet 4.6 / Haiku 4.5) | [model-selection.md](feature-flows/model-selection.md), [platform-settings.md](feature-flows/platform-settings.md), [task-execution-service.md](feature-flows/task-execution-service.md) |
 | 2026-05-12 | #808 | fix(orphan-killer): `_set_idle_priority()` (SCHED_IDLE/nice) + `_scan_deadline` 8s per-iteration budget — prevents orphan-killer daemon thread from starving uvicorn health probes on 1-CPU containers and triggering circuit breaker | [parallel-headless-execution.md](feature-flows/parallel-headless-execution.md) |
 | 2026-05-12 | #474 | fix(circuit-breaker): only TCP unreachability counts toward the circuit — `agent_client._request()` now classifies via shared `is_circuit_failure()` helper backed by `CIRCUIT_FAILURE_EXCEPTIONS` (`ConnectError`, `ConnectTimeout`). `TRANSIENT_TRANSPORT_EXCEPTIONS` (`ReadTimeout`/`WriteTimeout`/`PoolTimeout`/`WriteError`/`ReadError`/`RemoteProtocolError`) still raise `AgentNotReachableError` but no longer increment the 3-failure threshold; raw `OSError` subclasses (`BrokenPipeError`/`ConnectionResetError`) propagate uncaught; `asyncio.CancelledError` is re-raised explicitly. `monitoring_service.check_network_health()` lazy-imports the same tuples so the /health probe and inline `/api/*` agree on what "unreachable" means; any HTTP response (200..599) records success — symmetric with `_request()` so stale counters clear. `aggregate_health()` adds explicit `status_code >= 500 → UNHEALTHY` branch so a wedged-but-listening agent isn't silently HEALTHY under the new rule. 12 unit + 13 integration tests on the classifier + 1 new monitoring-service integration suite. | [agent-monitoring.md](feature-flows/agent-monitoring.md), [execution-queue.md](feature-flows/execution-queue.md), [scheduling.md](feature-flows/scheduling.md) |

diff --git a/docs/memory/feature-flows/agent-lifecycle.md b/docs/memory/feature-flows/agent-lifecycle.md
@@ -693,10 +693,12 @@ Auto-generated on agent creation with `scope='agent'`, `agent_name=<this agent>`
 | `trinity.created` | Creation timestamp (ISO format) |
 | `trinity.template` | Template used (empty string if none) |
 
-### Container Security Constants (`src/backend/services/agent_service/lifecycle.py:30-65`)
+### Container Security Constants (`src/backend/services/agent_service/capabilities.py`, re-exported from `lifecycle.py`)
 
 **2026-01-14 Security Fix**: All container creation paths now use centralized capability constants for consistent security.
 
+**2026-05-13 (Issue #602 Phase 3c, PR #830)**: `SYS_PTRACE` / `MKNOD` / `NET_RAW` / `FSETID` dropped from the FULL set (each was a documented escalation primitive; dropping SYS_PTRACE in particular closes the AISEC-C2 heap-read OAuth-exfil path). Constants moved to a stdlib-only `capabilities.py` sibling so `tests/unit/test_capability_set.py` can import them without dragging in the docker / fastapi / database transitive imports of `lifecycle.py`. `lifecycle.py` re-exports the names so runtime callers (`crud.py`, `system_agent_service.py`) are unchanged.
+
 ```python
 # Restricted mode capabilities - minimum for agent operation (default)
 RESTRICTED_CAPABILITIES = [
@@ -709,13 +711,9 @@ RESTRICTED_CAPABILITIES = [
 
 # Full capabilities mode - adds package installation support
 FULL_CAPABILITIES = RESTRICTED_CAPABILITIES + [
-    'DAC_OVERRIDE',      # Bypass file permission checks (needed for apt)
+    'DAC_OVERRIDE',      # Bypass file permission checks (needed for sudo apt)
     'FOWNER',            # Bypass permission checks on file owner
-    'FSETID',            # Don't clear setuid/setgid bits
     'KILL',              # Send signals to processes
-    'MKNOD',             # Create special files
-    'NET_RAW',           # Use raw sockets (ping, etc.)
-    'SYS_PTRACE',        # Trace processes (debugging)
 ]
 ```
 
@@ -738,9 +736,11 @@ tmpfs={'/tmp': 'noexec,nosuid,size=100m'}
 **Files Using These Constants**:
 | File | Line | Usage |
 |------|------|-------|
-| `services/agent_service/crud.py` | 477 | Agent creation |
-| `services/agent_service/lifecycle.py` | 393 | Container recreation |
-| `services/system_agent_service.py` | 250 | System agent creation (FULL_CAPABILITIES only) |
+| `services/agent_service/capabilities.py` | — | Definitions (stdlib-only, test-importable) |
+| `services/agent_service/lifecycle.py` | 94 | Re-exports `RESTRICTED_CAPABILITIES` / `FULL_CAPABILITIES` / `PROHIBITED_CAPABILITIES` |
+| `services/agent_service/crud.py` | 615 | Agent creation |
+| `services/agent_service/lifecycle.py` | 535 | Container recreation |
+| `services/system_agent_service.py` | 251 | System agent creation (FULL_CAPABILITIES only) |
 
 ### Network Isolation (line 645)
 - Network: `trinity-agent-network` (Docker network)

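The agent-lifecycle change above moves the constants into a stdlib-only module so a unit test can pin them. The actual contents of `tests/unit/test_capability_set.py` are not shown in this diff, so the following is only a sketch of what such a pin could check. `DROPPED_IN_PHASE_3C` and `capability_set_problems` are hypothetical names, and the ordering of the restricted list is illustrative.

```python
# Restricted set, as listed in container-capabilities.md (order is illustrative).
RESTRICTED_CAPABILITIES = [
    "NET_BIND_SERVICE", "SETGID", "SETUID", "CHOWN", "SYS_CHROOT", "AUDIT_WRITE",
]

# Full mode adds back only what package installation needs.
FULL_CAPABILITIES = RESTRICTED_CAPABILITIES + ["DAC_OVERRIDE", "FOWNER", "KILL"]

# Caps removed in Phase 3c (hypothetical constant, used only by the check below).
DROPPED_IN_PHASE_3C = frozenset({"SYS_PTRACE", "MKNOD", "NET_RAW", "FSETID"})


def capability_set_problems(full=FULL_CAPABILITIES, restricted=RESTRICTED_CAPABILITIES):
    """Return a list of violations; an empty list means the set is as documented."""
    problems = []
    if len(set(full)) != len(full):
        problems.append("duplicate entries in FULL_CAPABILITIES")
    if not set(restricted) <= set(full):
        problems.append("FULL_CAPABILITIES must include the restricted set")
    leaked = DROPPED_IN_PHASE_3C & set(full)
    if leaked:
        problems.append(f"dropped caps reintroduced: {sorted(leaked)}")
    if len(full) != 9:
        problems.append(f"expected 9 caps, found {len(full)}")
    return problems
```

Because the module needs nothing beyond the standard library, a check like this runs without importing docker, fastapi, or the database layer.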
diff --git a/docs/memory/feature-flows/container-capabilities.md b/docs/memory/feature-flows/container-capabilities.md
@@ -16,18 +16,19 @@ Controls whether agent containers run with full Docker capabilities (allowing pa
 
 ## What "Full Capabilities" Means
 
+Both modes apply the same baseline security (`cap_drop=['ALL']`, AppArmor, noexec tmpfs) and differ only in which caps are added back. Constants live in `src/backend/services/agent_service/capabilities.py` and are re-exported from `lifecycle.py`.
+
 ### Full Capabilities Mode (`full_capabilities=true`)
-- Container runs with **Docker default capabilities**
-- `cap_drop=[]` (no capabilities dropped)
-- `cap_add=[]` (defaults to Docker defaults)
-- `security_opt=[]` (no additional AppArmor restrictions)
-- `tmpfs={'/tmp': 'size=100m'}` (writable tmp without noexec)
-- **Allows**: `apt-get install`, `sudo`, and system-level operations
+- `cap_drop=['ALL']` (baseline — always)
+- `cap_add=FULL_CAPABILITIES` (9 caps: restricted set + `DAC_OVERRIDE`, `FOWNER`, `KILL`)
+- `security_opt=['apparmor:docker-default']`
+- `tmpfs={'/tmp': 'noexec,nosuid,size=100m'}`
+- **Allows**: `sudo apt-get install` and similar package-installation flows
+- **Still prevents** (Issue #602 / Phase 3c, 2026-05-13): `SYS_PTRACE` (heap-read escalation), `MKNOD` (device-node escape), `NET_RAW` (raw-packet crafting), `FSETID` (setuid-preserve on chmod)
 
 ### Restricted Mode (`full_capabilities=false`, secure default)
-- Container runs with **minimal capabilities**
-- `cap_drop=['ALL']` (all capabilities dropped)
-- `cap_add=['NET_BIND_SERVICE', 'SETGID', 'SETUID', 'CHOWN', 'SYS_CHROOT', 'AUDIT_WRITE']`
+- `cap_drop=['ALL']` (baseline — always)
+- `cap_add=RESTRICTED_CAPABILITIES` (6 caps: `NET_BIND_SERVICE`, `SETGID`, `SETUID`, `CHOWN`, `SYS_CHROOT`, `AUDIT_WRITE`)
 - `security_opt=['apparmor:docker-default']`
 - `tmpfs={'/tmp': 'noexec,nosuid,size=100m'}`
 - **Prevents**: Package installation, most privileged operations
@@ -348,4 +349,5 @@ To enable true per-agent capability control:
 | Date | Change |
 |------|--------|
 | 2026-01-14 | **Security Consistency (HIGH)**: Added `RESTRICTED_CAPABILITIES` and `FULL_CAPABILITIES` constants in `lifecycle.py:31-49`. All container creation paths now ALWAYS apply baseline security (`cap_drop=['ALL']`, AppArmor, noexec tmpfs) before adding back needed capabilities. Previously some paths had inconsistent security settings. See [agent-lifecycle.md](agent-lifecycle.md) for full security constant documentation. |
+| 2026-05-13 | **Cap tightening (Issue #602 Phase 3c, PR #830)**: Dropped `SYS_PTRACE` / `MKNOD` / `NET_RAW` / `FSETID` from `FULL_CAPABILITIES`; each was a documented escalation primitive with no defensible agent use case (dropping SYS_PTRACE closes the AISEC-C2 heap-read OAuth-exfil path). FULL set is now 9 caps (was 13). Constants extracted into `services/agent_service/capabilities.py` so `tests/unit/test_capability_set.py` can pin them stdlib-only; `lifecycle.py` re-exports for runtime callers. Existing containers keep old caps until restart. |
 | 2026-01-13 | Initial documentation - CFG-004 feature flow |
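
The container-capabilities change above states that both modes share one baseline and differ only in `cap_add`. A sketch of that selection, assuming a hypothetical helper named `container_security_kwargs` (the real call sites in `crud.py` and `lifecycle.py` are not shown in this diff):

```python
RESTRICTED = ("NET_BIND_SERVICE", "SETGID", "SETUID", "CHOWN", "SYS_CHROOT", "AUDIT_WRITE")
FULL = RESTRICTED + ("DAC_OVERRIDE", "FOWNER", "KILL")


def container_security_kwargs(full_capabilities: bool) -> dict:
    """Build the security-related container options for either mode.

    Only cap_add depends on the flag; every other key is the shared baseline
    described in the doc (drop ALL, AppArmor profile, noexec/nosuid tmpfs).
    """
    return {
        "cap_drop": ["ALL"],
        "cap_add": list(FULL if full_capabilities else RESTRICTED),
        "security_opt": ["apparmor:docker-default"],
        "tmpfs": {"/tmp": "noexec,nosuid,size=100m"},
    }
```

Keeping the baseline in one place makes it hard for a creation path to skip `cap_drop=['ALL']`, which is the inconsistency the 2026-01-14 entry describes.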
diff --git a/docs/memory/feature-flows/credential-injection.md b/docs/memory/feature-flows/credential-injection.md
@@ -457,7 +457,7 @@ async def decrypt_and_inject(request: InternalDecryptInjectRequest):
 | Agent not running | 400 | "Agent is not running" |
 | No encrypted file | 404 | "No .credentials.enc file found" |
 | Decryption failed | 400 | "Failed to decrypt credentials" |
-| Agent unreachable | 503 | "Failed to connect to agent" |
+| Agent unreachable | 503 | "Failed to connect to agent" — applied symmetrically across `inject_credentials`, `import_credentials` (#35d4e78), and `export_credentials` (#35d4e78). Triggered when the agent container is running but its internal FastAPI server isn't bound to port 8000 yet (`httpx.ConnectError` / `TimeoutException` / `ReadError`). Mirrors the pattern in `routers/agent_files.py:82`. |
 
 ---
 
@@ -505,6 +505,7 @@ import_credentials("my-agent")
 
 | Date | Changes |
 |------|---------|
+| 2026-05-17 | **503 mapping on `import_credentials` / `export_credentials`** (commit 35d4e78e): both endpoints previously surfaced transient agent-server connectivity failures as 500. Now catch `httpx.RequestError` and map to 503 with a warning log, matching the pre-existing pattern in `inject_credentials` and `routers/agent_files.py`. `CredentialsFileNotFoundError(ValueError)` is unaffected — when the agent server is reachable but `.credentials.enc` is missing, the 400 path still fires. |
 | 2026-02-16 | **Security Fix (Credential Sanitization Cache Refresh)**: After credential injection, the agent-side credential sanitizer cache is now refreshed via `refresh_credential_values()` (routers/credentials.py:96, 298). This ensures newly injected credentials are immediately added to the sanitization pattern list, preventing them from appearing in subsequent execution logs. See `docker/base-image/agent_server/utils/credential_sanitizer.py`. |
 | 2026-02-15 | **Claude Max subscription support**: Added documentation about OAuth session authentication as an alternative to API key injection. When "Authenticate in Terminal" is enabled, user can log in via `/login` in web terminal. The OAuth session stored in `~/.claude.json` is then used for all Claude Code executions (including headless), eliminating the need for `ANTHROPIC_API_KEY`. |
 | 2026-02-05 | **Bug fix**: Removed orphaned credential injection loop in `crud.py:312-332` that referenced undefined `agent_credentials` variable. Added comment explaining that credentials are injected post-creation per CRED-002 design. |