Merged
23 commits
85c049b
fix(tests): restore services.agent_client in sys.modules baseline (#762)
AndriiPasternak31 May 14, 2026
da1084f
fix(tests): narrow except in services.agent_client preload to ImportE…
AndriiPasternak31 May 14, 2026
f586532
fix(tests): evict polluted services.task_execution_service stub in CB…
AndriiPasternak31 May 17, 2026
35d4e78
fix(credentials): map agent-server connect errors to 503 on import/ex…
AndriiPasternak31 May 17, 2026
dbdb808
test(env): auto-load TRINITY_TEST_PASSWORD + REDIS_BACKEND_PASSWORD f…
AndriiPasternak31 May 17, 2026
a647c4c
fix(tests): override placeholder REDIS_BACKEND_PASSWORD with .env value
AndriiPasternak31 May 17, 2026
2e69e0c
test(lint): skip .venv/__pycache__ in sys_modules linter; regen baseline
AndriiPasternak31 May 17, 2026
6d9af39
docs(tests): post-recovery report (May 2026 test-suite audit)
AndriiPasternak31 May 17, 2026
0919bde
test(subprocess-pgroup): regression test for #586 setsid pipe-holder
AndriiPasternak31 May 17, 2026
7bead9a
docs(feature-flows): backfill notes for #602/#830, 35d4e78, #759/#779
AndriiPasternak31 May 17, 2026
c21eb66
docs(security): CSO daily audit 2026-05-17 + diff report 2026-05-13
AndriiPasternak31 May 17, 2026
36be733
chore(.claude): bump submodule (DEVELOPMENT_WORKFLOW.md)
AndriiPasternak31 May 17, 2026
e2f4dd5
Merge remote-tracking branch 'origin/dev' into AndriiPasternak31/issu…
AndriiPasternak31 May 17, 2026
fc04003
fix(tests): resolve -e src/cli path from repo root (CI green)
AndriiPasternak31 May 17, 2026
0966243
fix(tests): restore sys.modules["config"] after test_config_fail_fast…
AndriiPasternak31 May 17, 2026
3f7d38a
fix(tests): use project-standard sys.modules restore helper in config…
AndriiPasternak31 May 17, 2026
077d92c
Merge remote-tracking branch 'origin/dev' into AndriiPasternak31/issu…
AndriiPasternak31 May 17, 2026
459b535
chore(tests): baseline 6 pre-existing sys.modules mutations in test_s…
AndriiPasternak31 May 17, 2026
4541916
Revert "chore(tests): baseline 6 pre-existing sys.modules mutations i…
AndriiPasternak31 May 17, 2026
c9008e4
Merge remote-tracking branch 'origin/dev' into AndriiPasternak31/issu…
AndriiPasternak31 May 20, 2026
51bb594
Merge remote-tracking branch 'origin/dev' into AndriiPasternak31/issu…
AndriiPasternak31 May 20, 2026
55a481c
Merge remote-tracking branch 'origin/dev' into AndriiPasternak31/issu…
May 22, 2026
86d31b4
chore(.claude): revert submodule regression to match dev
May 23, 2026
47 changes: 47 additions & 0 deletions docs/KNOWN_ISSUES.md
@@ -98,6 +98,53 @@ Claude Code has a hardcoded 60-second timeout for all MCP HTTP tool calls. This

---

### 🟢 Stop Hooks That Spawn Network Processes Can Hold the Agent Stdout Pipe Open

**Status**: Mitigated platform-side by the orphan-killer (#620); operator-side defense-in-depth recommended
**Priority**: LOW (mitigated)
**Affects**: Any agent whose Stop hook spawns processes that may call `setsid()` — notably `git push` spawning `ssh`

**Symptoms:**
- Agent execution logs show: `Reader thread(s) still busy after process exit ... killing process group`
- Followed by: `force-closing pipes; some buffered data may be lost`
- Then: `Execution completed without a result message` (502)
- Pattern repeats on every execution for the affected agent

**Cause:**
The hook's grandchild (e.g. `ssh` spawned by `git push`) calls `setsid()` and escapes claude's process group. `terminate_process_group(claude_pgid)` doesn't reach it, so it keeps the stdout pipe write-end open during network I/O. The reader's `readline()` never sees EOF, the drain times out, and the force-close fallback discards the final `{"type":"result"}` JSON.
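The failure mode can be reproduced in isolation. The following is a minimal sketch, assuming a POSIX host; `HOOK_CODE` and `reproduce_pipe_holder` are hypothetical names for illustration and are not part of the agent server. A stand-in hook starts a grandchild in a new session (the `setsid()` escape), prints a final line, and exits, after which the reader still cannot see EOF.

```python
import os
import select
import signal
import subprocess
import sys

# Hypothetical stand-in for a Stop hook: it starts a grandchild in a NEW session
# (the setsid() escape), prints a final line, and exits immediately.
HOOK_CODE = (
    "import subprocess, sys\n"
    "subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(1.5)'],"
    " start_new_session=True)\n"
    "print('{\"type\":\"result\"}', flush=True)\n"
)


def reproduce_pipe_holder() -> dict:
    read_fd, write_fd = os.pipe()
    hook = subprocess.Popen(
        [sys.executable, "-c", HOOK_CODE],
        stdout=write_fd,
        start_new_session=True,  # the hook leads its own process group
    )
    os.close(write_fd)  # the parent keeps only the read end
    hook.wait()

    # Group-wide kill: the grandchild left this group via setsid(), so it survives.
    try:
        os.killpg(hook.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already empty

    payload = os.read(read_fd, 4096)
    # EOF would make the fd readable; nothing ready after 0.3 s means some
    # process still holds the write end open.
    ready, _, _ = select.select([read_fd], [], [], 0.3)
    eof_seen_early = bool(ready)

    # Once the grandchild exits, the last write end closes and read() returns b"".
    tail = os.read(read_fd, 4096)
    os.close(read_fd)
    return {"payload": payload, "eof_seen_early": eof_seen_early, "tail": tail}
```

In this sketch the final line arrives intact because the reader simply waits for the grandchild to finish; a reader with a drain timeout would instead hit the force-close path described above.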

**Platform fix:**
`_kill_orphan_pipe_writers` (`docker/base-image/agent_server/utils/subprocess_pgroup.py`) enumerates `/proc/*/fd` for processes outside our pgid holding the same pipe inode and SIGKILLs them. Shipped via #620. Agents only inherit the fix after `./scripts/deploy/build-base-image.sh` and container recreation — older base images won't have it.
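The `/proc/*/fd` enumeration can be sketched roughly as follows. This is not the code in `subprocess_pgroup.py`; `find_pipe_holders` and its parameters are hypothetical, it only lists candidate PIDs rather than killing them, and it assumes a Linux procfs where pipe fds resolve to `pipe:[<inode>]`.

```python
import os
from typing import List, Optional


def find_pipe_holders(pipe_fd: int, skip_pgid: Optional[int] = None) -> List[int]:
    """Return PIDs holding an fd on the same pipe inode as `pipe_fd` (Linux only)."""
    target = f"pipe:[{os.fstat(pipe_fd).st_ino}]"
    holders: List[int] = []
    if not os.path.isdir("/proc"):
        return holders  # no procfs on this host: nothing to scan
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        fd_dir = f"/proc/{pid}/fd"
        try:
            if skip_pgid is not None and os.getpgid(pid) == skip_pgid:
                continue  # our own group is covered by the process-group kill
            fd_names = os.listdir(fd_dir)
        except OSError:
            continue  # process exited mid-scan, or we lack permission
        for fd_name in fd_names:
            try:
                link = os.readlink(f"{fd_dir}/{fd_name}")
            except OSError:
                continue  # fd was closed between listdir and readlink
            if link == target:
                holders.append(pid)
                break
    return holders
```

A caller that wanted the behaviour described above would then send `signal.SIGKILL` to each returned PID with `os.kill`; that step is left out here so the sketch stays side-effect free.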

**Operator-side defense-in-depth (bash/sh hooks):**
Redirect inherited fds to a **log file** (not `/dev/null` — failures must stay debuggable) **before any command that could spawn or duplicate a file descriptor**:

```bash
#!/bin/bash
# MUST come before any command that spawns a child or duplicates fd 1/2 —
# including `set -x`, `exec 3>&1`, command substitutions, background jobs.
# `set +e` only affects exit-code propagation; this line is about fd inheritance.
mkdir -p ~/.trinity/logs
exec >> ~/.trinity/logs/stop-hook.log 2>&1
set +e
echo "=== Stop hook fired at $(date -Iseconds) ==="
git push origin HEAD
```

Log to a file instead of `/dev/null`: a hook that silently swallows `git push` failures is its own outage class — branches stop syncing with no operator-visible signal.

**Other shells / languages:**
- **Python hooks**: pass `stdout=open(log_path, 'a'), stderr=subprocess.STDOUT` to every `subprocess.Popen`/`subprocess.run` that calls external processes.
- **Node hooks**: pass `{ stdio: ['ignore', logFd, logFd] }` to `child_process.spawn`.
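- **fish**: `exec` in fish expects a command rather than a bare redirect, so the bash pattern above does not carry over. Redirect each external call explicitly (`git push origin HEAD >> $log_path 2>&1`) or wrap the hook body in `begin; ...; end >> $log_path 2>&1`.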
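For the Python case, a minimal sketch of the pattern might look like this. `run_logged` is a hypothetical helper name; the only calls it relies on are `subprocess.run` and the standard file API.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def run_logged(cmd: Sequence[str], log_path: Path) -> int:
    """Run `cmd` with stdout and stderr appended to `log_path`; return its exit code."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a") as log:
        completed = subprocess.run(
            list(cmd),
            stdout=log,                # the child never inherits the agent's stdout pipe
            stderr=subprocess.STDOUT,  # stderr goes to the same log file
            check=False,               # failures are logged, not raised
        )
    return completed.returncode
```

A hook would call it as `run_logged(["git", "push", "origin", "HEAD"], Path.home() / ".trinity/logs/stop-hook.log")`, which keeps push failures visible in the log file.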

**Related Files:**
- `docker/base-image/agent_server/utils/subprocess_pgroup.py` — platform fix: `_kill_orphan_pipe_writers`
- `docker/base-image/agent_server/services/headless_executor.py` — drain + result recovery
- Resolved by: #620 (closes #618); regression-tested via `tests/unit/test_subprocess_pgroup.py::TestDrainReaderThreads::test_setsid_escapee_drained_via_orphan_killer_preserves_result_line` (#586).

---

## Resolved Issues

_No resolved issues yet_
2 changes: 2 additions & 0 deletions docs/memory/feature-flows.md
@@ -13,8 +13,10 @@
|------|-----|---------|------|
| 2026-05-18 | #887 | fix(read-only): guard moved to base image (`/opt/trinity/hooks/`, root-owned 0555); MultiEdit bypass fixed; fail-closed via `run_hook()`; lifecycle always syncs config on start (stale-volume fix); config file protected by `path_deny` + `bash_deny` in guardrails-baseline.json; 18 unit tests | [read-only-mode.md](feature-flows/read-only-mode.md) |
| 2026-05-18 | #888 | write_user_memory MCP tool — per-user memory write with server-side email resolution, fixing PII cross-user memory leak | [write-user-memory.md](feature-flows/write-user-memory.md) |
| 2026-05-17 | #35d4e78 | fix(credentials): map agent-server connect errors to 503 on `import_credentials` and `export_credentials` — `httpx.RequestError` (ConnectError/TimeoutException/ReadError) now surfaces 503 instead of 500 when the agent container is up but its FastAPI server isn't reachable yet. Mirrors the inject/agent-files pattern. | [credential-injection.md](feature-flows/credential-injection.md) |
| 2026-05-17 | #862 | fix(cleanup): execution retention sweeps were no-ops — `prune_execution_logs`/`prune_execution_rows` queried `status IN ('completed','failed','terminated')` but `TaskExecutionStatus` uses `'success'/'failed'/'cancelled'/'skipped'`; only `'failed'` rows ever pruned; fixed SQL predicates + `idx_executions_completed_terminal` partial index + migration to drop/recreate existing wrong index on live installs | [cleanup-service.md](feature-flows/cleanup-service.md) |
| 2026-05-13 | #586 | obs(agent-runtime): `[METRIC] drain_outcome` emissions on the slow path of `drain_reader_threads` — two reachable sites surface `outcome=natural`/`force_close`/`leaked`, `stuck_initial`, `drain_elapsed_ms`, optional `leaked_count`, plus vestigial `orphan_kill_count=0` (since the #817 cgroup-sweep refactor, actual orphan counts are logged separately as `Cgroup sweep killed N orphan(s)`). Fast path stays silent. New Stop-hook authoring guidance in `TRINITY_COMPATIBLE_AGENT_GUIDE.md` shows how to release the inherited stdout FD before blocking I/O so hooks avoid the slow path entirely. Fleet audit at `scripts/586-fleet-check.sh` gates close-out by scanning Vector agent logs for residual "still stuck after Ns" / "no result message after" events. 2 new unit tests. | [execution-termination.md](feature-flows/execution-termination.md) |
| 2026-05-13 | #602/#830 | sec: drop SYS_PTRACE / MKNOD / NET_RAW / FSETID from `FULL_CAPABILITIES` (Phase 3c). SYS_PTRACE closes the AISEC-C2 heap-read OAuth-exfil path. FULL set is now 9 caps (was 13). Constants extracted to stdlib-only `services/agent_service/capabilities.py`; `lifecycle.py` re-exports. | [container-capabilities.md](feature-flows/container-capabilities.md), [agent-lifecycle.md](feature-flows/agent-lifecycle.md) |
| 2026-05-13 | #831 | feat: platform default model — admin sets `platform_default_model` in Settings General tab; `task_execution_service.execute_task()` resolves `model=None` → platform default (TTL-cached, write-through invalidation); `GET /api/settings/feature-flags` exposes value for frontend; SchedulesPanel shows "platform default (X)" when no model set; PRESET_MODELS updated to canonical Anthropic list (Opus 4.7 / Sonnet 4.6 / Haiku 4.5) | [model-selection.md](feature-flows/model-selection.md), [platform-settings.md](feature-flows/platform-settings.md), [task-execution-service.md](feature-flows/task-execution-service.md) |
| 2026-05-12 | #808 | fix(orphan-killer): `_set_idle_priority()` (SCHED_IDLE/nice) + `_scan_deadline` 8s per-iteration budget — prevents orphan-killer daemon thread from starving uvicorn health probes on 1-CPU containers and triggering circuit breaker | [parallel-headless-execution.md](feature-flows/parallel-headless-execution.md) |
| 2026-05-12 | #474 | fix(circuit-breaker): only TCP unreachability counts toward the circuit — `agent_client._request()` now classifies via shared `is_circuit_failure()` helper backed by `CIRCUIT_FAILURE_EXCEPTIONS` (`ConnectError`, `ConnectTimeout`). `TRANSIENT_TRANSPORT_EXCEPTIONS` (`ReadTimeout`/`WriteTimeout`/`PoolTimeout`/`WriteError`/`ReadError`/`RemoteProtocolError`) still raise `AgentNotReachableError` but no longer increment the 3-failure threshold; raw `OSError` subclasses (`BrokenPipeError`/`ConnectionResetError`) propagate uncaught; `asyncio.CancelledError` is re-raised explicitly. `monitoring_service.check_network_health()` lazy-imports the same tuples so the /health probe and inline `/api/*` agree on what "unreachable" means; any HTTP response (200..599) records success — symmetric with `_request()` so stale counters clear. `aggregate_health()` adds explicit `status_code >= 500 → UNHEALTHY` branch so a wedged-but-listening agent isn't silently HEALTHY under the new rule. 12 unit + 13 integration tests on the classifier + 1 new monitoring-service integration suite. | [agent-monitoring.md](feature-flows/agent-monitoring.md), [execution-queue.md](feature-flows/execution-queue.md), [scheduling.md](feature-flows/scheduling.md) |
18 changes: 9 additions & 9 deletions docs/memory/feature-flows/agent-lifecycle.md
@@ -693,10 +693,12 @@ Auto-generated on agent creation with `scope='agent'`, `agent_name=<this agent>`
| `trinity.created` | Creation timestamp (ISO format) |
| `trinity.template` | Template used (empty string if none) |

-### Container Security Constants (`src/backend/services/agent_service/lifecycle.py:30-65`)
### Container Security Constants (`src/backend/services/agent_service/capabilities.py`, re-exported from `lifecycle.py`)

**2026-01-14 Security Fix**: All container creation paths now use centralized capability constants for consistent security.

**2026-05-13 (Issue #602 Phase 3c, PR #830)**: `SYS_PTRACE` / `MKNOD` / `NET_RAW` / `FSETID` dropped from FULL set (each was a documented escalation primitive — SYS_PTRACE in particular closes the AISEC-C2 heap-read OAuth-exfil path). Constants moved to a stdlib-only `capabilities.py` sibling so `tests/unit/test_capability_set.py` can import them without dragging the docker / fastapi / database transitive imports of `lifecycle.py`. `lifecycle.py` re-exports the names so runtime callers (`crud.py`, `system_agent_service.py`) are unchanged.

```python
# Restricted mode capabilities - minimum for agent operation (default)
RESTRICTED_CAPABILITIES = [
@@ -709,13 +711,9 @@ RESTRICTED_CAPABILITIES = [

# Full capabilities mode - adds package installation support
FULL_CAPABILITIES = RESTRICTED_CAPABILITIES + [
-'DAC_OVERRIDE', # Bypass file permission checks (needed for apt)
'DAC_OVERRIDE', # Bypass file permission checks (needed for sudo apt)
'FOWNER', # Bypass permission checks on file owner
-'FSETID', # Don't clear setuid/setgid bits
'KILL', # Send signals to processes
-'MKNOD', # Create special files
-'NET_RAW', # Use raw sockets (ping, etc.)
-'SYS_PTRACE', # Trace processes (debugging)
]
```

@@ -738,9 +736,11 @@ tmpfs={'/tmp': 'noexec,nosuid,size=100m'}
**Files Using These Constants**:
| File | Line | Usage |
|------|------|-------|
-| `services/agent_service/crud.py` | 477 | Agent creation |
-| `services/agent_service/lifecycle.py` | 393 | Container recreation |
-| `services/system_agent_service.py` | 250 | System agent creation (FULL_CAPABILITIES only) |
| `services/agent_service/capabilities.py` | — | Definitions (stdlib-only, test-importable) |
| `services/agent_service/lifecycle.py` | 94 | Re-exports `RESTRICTED_CAPABILITIES` / `FULL_CAPABILITIES` / `PROHIBITED_CAPABILITIES` |
| `services/agent_service/crud.py` | 615 | Agent creation |
| `services/agent_service/lifecycle.py` | 535 | Container recreation |
| `services/system_agent_service.py` | 251 | System agent creation (FULL_CAPABILITIES only) |

### Network Isolation (line 645)
- Network: `trinity-agent-network` (Docker network)
20 changes: 11 additions & 9 deletions docs/memory/feature-flows/container-capabilities.md
@@ -16,18 +16,19 @@ Controls whether agent containers run with full Docker capabilities (allowing pa

## What "Full Capabilities" Means

Both modes apply the same baseline security (`cap_drop=['ALL']`, AppArmor, noexec tmpfs) and differ only in which caps are added back. Constants live in `src/backend/services/agent_service/capabilities.py` and are re-exported from `lifecycle.py`.

### Full Capabilities Mode (`full_capabilities=true`)
-- Container runs with **Docker default capabilities**
-- `cap_drop=[]` (no capabilities dropped)
-- `cap_add=[]` (defaults to Docker defaults)
-- `security_opt=[]` (no additional AppArmor restrictions)
-- `tmpfs={'/tmp': 'size=100m'}` (writable tmp without noexec)
-- **Allows**: `apt-get install`, `sudo`, and system-level operations
- `cap_drop=['ALL']` (baseline — always)
- `cap_add=FULL_CAPABILITIES` (9 caps: restricted set + `DAC_OVERRIDE`, `FOWNER`, `KILL`)
- `security_opt=['apparmor:docker-default']`
- `tmpfs={'/tmp': 'noexec,nosuid,size=100m'}`
- **Allows**: `sudo apt-get install` and similar package-installation flows
- **Still prevents** (Issue #602 / Phase 3c, 2026-05-13): `SYS_PTRACE` (heap-read escalation), `MKNOD` (device-node escape), `NET_RAW` (raw-packet crafting), `FSETID` (setuid-preserve on chmod)

### Restricted Mode (`full_capabilities=false`, secure default)
-- Container runs with **minimal capabilities**
-- `cap_drop=['ALL']` (all capabilities dropped)
-- `cap_add=['NET_BIND_SERVICE', 'SETGID', 'SETUID', 'CHOWN', 'SYS_CHROOT', 'AUDIT_WRITE']`
- `cap_drop=['ALL']` (baseline — always)
- `cap_add=RESTRICTED_CAPABILITIES` (6 caps: `NET_BIND_SERVICE`, `SETGID`, `SETUID`, `CHOWN`, `SYS_CHROOT`, `AUDIT_WRITE`)
- `security_opt=['apparmor:docker-default']`
- `tmpfs={'/tmp': 'noexec,nosuid,size=100m'}`
- **Prevents**: Package installation, most privileged operations
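The relationship between the two modes can be sketched as below. The capability names and security settings are taken from the lists above; `DROPPED_IN_PHASE_3C` and `security_kwargs` are hypothetical names used only for illustration, not the contents of `capabilities.py`.

```python
from typing import Any, Dict, List

RESTRICTED_CAPABILITIES: List[str] = [
    "NET_BIND_SERVICE", "SETGID", "SETUID", "CHOWN", "SYS_CHROOT", "AUDIT_WRITE",
]
FULL_CAPABILITIES: List[str] = RESTRICTED_CAPABILITIES + ["DAC_OVERRIDE", "FOWNER", "KILL"]

# Hypothetical constant: the four caps removed from the FULL set in Phase 3c.
DROPPED_IN_PHASE_3C = {"SYS_PTRACE", "MKNOD", "NET_RAW", "FSETID"}


def security_kwargs(full_capabilities: bool) -> Dict[str, Any]:
    """Hypothetical helper: both modes share the baseline and differ only in cap_add."""
    caps = FULL_CAPABILITIES if full_capabilities else RESTRICTED_CAPABILITIES
    return {
        "cap_drop": ["ALL"],
        "cap_add": list(caps),
        "security_opt": ["apparmor:docker-default"],
        "tmpfs": {"/tmp": "noexec,nosuid,size=100m"},
    }
```

Deriving the FULL list from the restricted one means a pinning test only has to check the three extra caps and the absence of the four dropped ones.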
@@ -348,4 +349,5 @@ To enable true per-agent capability control:
| Date | Change |
|------|--------|
| 2026-01-14 | **Security Consistency (HIGH)**: Added `RESTRICTED_CAPABILITIES` and `FULL_CAPABILITIES` constants in `lifecycle.py:31-49`. All container creation paths now ALWAYS apply baseline security (`cap_drop=['ALL']`, AppArmor, noexec tmpfs) before adding back needed capabilities. Previously some paths had inconsistent security settings. See [agent-lifecycle.md](agent-lifecycle.md) for full security constant documentation. |
| 2026-05-13 | **Cap tightening (Issue #602 Phase 3c, PR #830)**: Dropped `SYS_PTRACE` / `MKNOD` / `NET_RAW` / `FSETID` from `FULL_CAPABILITIES` — each was a documented escalation primitive with no defensible agent use case (SYS_PTRACE closes the AISEC-C2 heap-read OAuth-exfil path). FULL set is now 9 caps (was 13). Constants extracted into `services/agent_service/capabilities.py` so `tests/unit/test_capability_set.py` can pin them stdlib-only; `lifecycle.py` re-exports for runtime callers. Existing containers keep old caps until restart. |
| 2026-01-13 | Initial documentation - CFG-004 feature flow |
3 changes: 2 additions & 1 deletion docs/memory/feature-flows/credential-injection.md
@@ -457,7 +457,7 @@ async def decrypt_and_inject(request: InternalDecryptInjectRequest):
| Agent not running | 400 | "Agent is not running" |
| No encrypted file | 404 | "No .credentials.enc file found" |
| Decryption failed | 400 | "Failed to decrypt credentials" |
-| Agent unreachable | 503 | "Failed to connect to agent" |
| Agent unreachable | 503 | "Failed to connect to agent" — applied symmetrically across `inject_credentials`, `import_credentials` (#35d4e78), and `export_credentials` (#35d4e78). Triggered when the agent container is running but its internal FastAPI server isn't bound to port 8000 yet (`httpx.ConnectError` / `TimeoutException` / `ReadError`). Mirrors the pattern in `routers/agent_files.py:82`. |

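The 503 mapping for an unreachable agent can be illustrated with a small sketch. To stay runnable without third-party packages, it uses local stand-in classes in place of `httpx.RequestError` and FastAPI's `HTTPException`; `call_agent_or_503` is a hypothetical helper, not the router code.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger(__name__)


class RequestError(Exception):
    """Stand-in for httpx.RequestError (transport-level failure)."""


class HTTPException(Exception):
    """Stand-in for fastapi.HTTPException."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


async def call_agent_or_503(call: Callable[[], Awaitable[dict]]) -> dict:
    """Run an agent-server call; map transport failures to 503 instead of 500."""
    try:
        return await call()
    except RequestError as exc:
        logger.warning("agent server unreachable: %s", exc)
        raise HTTPException(503, "Failed to connect to agent") from exc
```

Only the transport error is caught, so other exceptions raised by the call, such as a `ValueError` subclass, pass through to whatever handling already exists for them.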
---

@@ -505,6 +505,7 @@ import_credentials("my-agent")

| Date | Changes |
|------|---------|
| 2026-05-17 | **503 mapping on `import_credentials` / `export_credentials`** (commit 35d4e78e): both endpoints previously surfaced transient agent-server connectivity failures as 500. Now catch `httpx.RequestError` and map to 503 with a warning log, matching the pre-existing pattern in `inject_credentials` and `routers/agent_files.py`. `CredentialsFileNotFoundError(ValueError)` is unaffected — when the agent server is reachable but `.credentials.enc` is missing, the 400 path still fires. |
| 2026-02-16 | **Security Fix (Credential Sanitization Cache Refresh)**: After credential injection, the agent-side credential sanitizer cache is now refreshed via `refresh_credential_values()` (routers/credentials.py:96, 298). This ensures newly injected credentials are immediately added to the sanitization pattern list, preventing them from appearing in subsequent execution logs. See `docker/base-image/agent_server/utils/credential_sanitizer.py`. |
| 2026-02-15 | **Claude Max subscription support**: Added documentation about OAuth session authentication as an alternative to API key injection. When "Authenticate in Terminal" is enabled, user can log in via `/login` in web terminal. The OAuth session stored in `~/.claude.json` is then used for all Claude Code executions (including headless), eliminating the need for `ANTHROPIC_API_KEY`. |
| 2026-02-05 | **Bug fix**: Removed orphaned credential injection loop in `crud.py:312-332` that referenced undefined `agent_credentials` variable. Added comment explaining that credentials are injected post-creation per CRED-002 design. |