feat(agent): build-scoped tool isolation, typed errors, run reliability#211

Merged: vmelikyan merged 2 commits into main from feat/agent-tool-isolation-and-typed-errors on Jun 2, 2026

@vmelikyan vmelikyan commented Jun 1, 2026

Security & hardening

  • Build-scoped tenant isolation across all agent diagnostic tools: DB queries get a mandatory per-build WHERE clause and fail closed when no scope is available; GitHub repos are allow-listed; k8s namespaces are locked. Scope is resolved from the session's Build (databaseClient/githubClient/k8sClient + capabilitySessionContext.ts).
  • Tool-level enforcement in k8s + github tools: defaults to the build's own namespace/repo and rejects mismatches with coded errors (NAMESPACE_NOT_ALLOWED, FILE_ACCESS_DENIED, BUILD_NOT_ALLOWED).
  • Removed the destructive delete op from patchK8sResource (no more pod/job deletion); enum now patch/scale/restart. k8s secret listing returns name/type/keys only — never values.
  • Least-privilege Debug agent: dropped external_mcp_write and read_context; build-context chats strip workspace/sandbox tools by source kind (fixes the workspace-provision leak).
  • Admin-only authz on org-wide config mutators (mcp-servers, agent-session, runtime, repos).
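
The fail-closed scoping and tool-level checks in the list above could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: `BuildScope`, `ScopeError`, `scopeQuery`, `resolveNamespace`, `summarizeSecret`, and the `build_id` column are hypothetical names. Only the three error codes come from the description, and pairing BUILD_NOT_ALLOWED with a missing scope is a guess.

```typescript
// Hypothetical shapes: the PR description does not show the real types.
interface BuildScope {
  buildId: string;
  namespace: string;
}

// Error codes named in the PR description.
type ScopeErrorCode = "NAMESPACE_NOT_ALLOWED" | "FILE_ACCESS_DENIED" | "BUILD_NOT_ALLOWED";

class ScopeError extends Error {
  readonly code: ScopeErrorCode;
  constructor(code: ScopeErrorCode, message: string) {
    super(message);
    this.name = "ScopeError";
    this.code = code;
  }
}

interface ScopedQuery {
  sql: string;
  params: unknown[];
}

// Wrap a query so it only sees rows belonging to the session's build.
// With no scope, throw instead of running unscoped (fail closed).
// Assumption: rows carry a `build_id` column.
function scopeQuery(baseSql: string, params: unknown[], scope?: BuildScope): ScopedQuery {
  if (!scope || scope.buildId === "") {
    throw new ScopeError("BUILD_NOT_ALLOWED", "no build scope on this session");
  }
  const placeholder = "$" + (params.length + 1);
  return {
    sql: "SELECT * FROM (" + baseSql + ") AS scoped WHERE scoped.build_id = " + placeholder,
    params: [...params, scope.buildId],
  };
}

// Default to the build's own namespace; reject any other with a coded error.
function resolveNamespace(scope: BuildScope, requested?: string): string {
  if (requested === undefined) {
    return scope.namespace;
  }
  if (requested !== scope.namespace) {
    throw new ScopeError("NAMESPACE_NOT_ALLOWED", 'namespace "' + requested + '" is outside this build');
  }
  return requested;
}

// Secret listing exposes name, type, and key names only, never values.
interface K8sSecret {
  name: string;
  type: string;
  data: Record<string, string>;
}

function summarizeSecret(secret: K8sSecret): { name: string; type: string; keys: string[] } {
  return { name: secret.name, type: secret.type, keys: Object.keys(secret.data) };
}
```

Throwing on a missing scope, rather than falling back to an unscoped query, means a bug in scope resolution shows up as a coded error instead of a cross-tenant read.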

Features

  • Unified typed error contract: new AppError (Unauthorized/Forbidden/NotFound/Conflict/BadRequest); handlers honor httpStatus (409/410/422/429 instead of always 500) and emit code/details/nextAction, documented in openApiSpec.ts.
  • Provider/run error classifier — maps SDK/OAuth/ownership failures to stable codes (provider_rate_limited, provider_quota_exhausted, model_unavailable, mcp_oauth_required, …) with retryable + recovery action.
  • Sites hosting config API — admin GET/PUT, sitesConfig service (defaults + validation), OpenAPI schemas.
  • Reasoning/thinking provider options — Gemini 2.x budget vs 3+ level, Anthropic budget tokens, wired into the streaming path.
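
As a rough sketch of the typed error contract described above: a handler maps an AppError to its own httpStatus and emits code/details/nextAction, while anything else still becomes a 500. The field names follow the PR text. The constructor signature, the factory helpers, the code strings (`not_found`, `conflict`, `internal_error`), and `toErrorResponse` are assumptions made for illustration.

```typescript
// Hypothetical reconstruction: field names follow the PR text, the rest is assumed.
class AppError extends Error {
  readonly httpStatus: number;
  readonly code: string;
  readonly details: unknown;
  readonly nextAction: string | undefined;

  constructor(httpStatus: number, code: string, message: string, details?: unknown, nextAction?: string) {
    super(message);
    this.name = "AppError";
    this.httpStatus = httpStatus;
    this.code = code;
    this.details = details;
    this.nextAction = nextAction;
  }
}

// Illustrative factories; the PR names the variants but not their codes.
const notFound = (message: string): AppError => new AppError(404, "not_found", message);
const conflict = (message: string, details?: unknown): AppError =>
  new AppError(409, "conflict", message, details, "refresh_and_retry");

interface ErrorResponse {
  status: number;
  body: { code: string; message: string; details?: unknown; nextAction?: string | undefined };
}

// Duck-typed guard, so the check does not depend on instanceof behaviour.
function isAppError(err: unknown): err is AppError {
  return (
    typeof err === "object" &&
    err !== null &&
    typeof (err as AppError).httpStatus === "number" &&
    typeof (err as AppError).code === "string"
  );
}

// Honor the error's own status; unknown errors stay 500 and hide their message.
function toErrorResponse(err: unknown): ErrorResponse {
  if (isAppError(err)) {
    return {
      status: err.httpStatus,
      body: { code: err.code, message: err.message, details: err.details, nextAction: err.nextAction },
    };
  }
  return { status: 500, body: { code: "internal_error", message: "Internal server error" } };
}
```

A caller would throw `conflict("run already cancelled", { runId })` and let a single handler turn it into a 409 with a machine-readable body.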

Reliability fixes

  • Cross-process run cancellation via Postgres LISTEN/NOTIFY — a cancel on one replica aborts the worker on another; status + terminal event written atomically (no more hung SSE streams).
  • Heartbeat-staleness recovery reclaims stuck starting/running runs; graceful Debug repair synthesis when the step budget exhausts with no commit (instead of a blank failure).
  • Codefresh empty/failed log fetch returns retryable LOGS_UNAVAILABLE instead of reporting "clean"; CrashLoopBackOff diagnosis via getPodLogs previous=true + container waiting/terminated reason+exitCode.
  • Editor proxy hardening (ws-server.ts): connection caps, connect/idle timeouts, coded failures → branded HTML error page, dead/suspended-pod detection.
  • Resume never destroys the persisted PVC; stranded-provisioning reaper flips dead PROVISIONING chats to retryable FAILED; SIGINT/SIGTERM handler leak fixed across hot reloads.
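
The heartbeat-staleness check mentioned above can be sketched as a pure selection step. Only the starting/running statuses come from the description. `RunRecord`, the other status names, the `lastHeartbeatAt` field, and the threshold parameter are hypothetical, and what "reclaim" then does to a selected run is not specified in the PR, so the sketch stops at picking candidates.

```typescript
// Hypothetical run shape; only "starting" and "running" are named in the PR.
type RunStatus = "queued" | "starting" | "running" | "completed" | "failed" | "cancelled";

interface RunRecord {
  id: string;
  status: RunStatus;
  lastHeartbeatAt: number; // epoch milliseconds of the last heartbeat write
}

// Statuses in which a run is expected to keep heartbeating.
const RECLAIMABLE: RunStatus[] = ["starting", "running"];

// Return runs that should still be alive but whose heartbeat is older
// than the allowed staleness window.
function findStaleRuns(runs: RunRecord[], nowMs: number, staleAfterMs: number): RunRecord[] {
  return runs.filter(
    (run) => RECLAIMABLE.indexOf(run.status) !== -1 && nowMs - run.lastHeartbeatAt > staleAfterMs
  );
}
```

Keeping the selection free of I/O makes the staleness rule easy to test separately from whatever database update performs the actual recovery.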

Also adds support for Google Artifact Registry (GAR) auth, configured under the buildDefaults key of global_config, e.g.:

"registryAuth": [
  {
    "type": "gar",
    "registry": "<location>-docker.pkg.dev"
  }
]

@vmelikyan vmelikyan requested a review from a team as a code owner June 1, 2026 06:18
@vmelikyan vmelikyan merged commit 186aa9c into main Jun 2, 2026
1 check passed