Fix CV platform audit findings: authz, job lifecycle, secrets, storage, frontend, infra#21

Open
whittenator wants to merge 2 commits into main from claude/cv-platform-audit-8god0a

Conversation

@whittenator

Owner

Summary

Fixes all bugs and production-readiness issues identified in the platform audit (AUDIT.md §2–§4 and §6). Net-new product features from §5 (per-project MinIO/S3 selection, duplicate-detection UI, training checkpoint/resume, the active-learning retraining loop) are intentionally left as roadmap work and called out below.

The first commit (AUDIT.md) is the audit itself; this PR's changes implement the remediation. A status matrix is at the top of AUDIT.md (§0).

Backend — security & correctness

  • Object-level authorization (IDOR fixes C1–C5): new shared services/authz.py (require_workspace_access / require_project_access / require_dataset_access) applied across projects, workspaces, datasets, assets, annotations, experiments, artifacts, evaluations, active-learning, and ops routers. The previously unauthenticated GET /api/jobs/{id} and the job SSE stream now authenticate and check the job's project access. List endpoints scope to the caller's workspaces.
  • Training dispatch leak (C6): when the broker can't be reached, the job is marked failed, the reserved cluster is released, and the request returns HTTP 502, instead of returning a phantom queued job that hangs forever. Mirrors the existing ONNX behavior; covered by a new test.
  • Worker-death recovery (C7): Celery now uses task_acks_late, task_reject_on_worker_lost, and time limits, plus a beat-scheduled sweep_stale_jobs maintenance task that fails stuck jobs and frees their clusters.
  • Fail-fast secrets (H1/H2): settings.require_secure_setting refuses to boot with default SECRET_KEY / DB password / MinIO credentials when APP_ENV=production (dev still works out of the box).
  • Auth hardening: /auth/refresh re-checks the user still exists (H3); password length bounds (M3); CORS restricted to explicit methods/headers (M2); Redis-backed auth rate limiting with a bounded in-memory fallback (H4).
  • Migrations: startup alembic upgrade serialized with a Postgres advisory lock so concurrent replicas don't deadlock (M7).
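
To illustrate the C6 fix, here is a minimal sketch of the fail-and-release pattern. It is not the PR's code: `Job`, `Cluster`, `dispatch_training`, and `unreachable_broker` are hypothetical names, `ConnectionError` stands in for whatever error the broker client raises, and the 202 success code is an assumption (the PR only specifies the 502 path).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Job:
    id: int
    status: str = "queued"
    error: Optional[str] = None


@dataclass
class Cluster:
    name: str
    reserved_by: Optional[int] = None  # id of the job holding the reservation


def dispatch_training(
    job: Job, cluster: Cluster, send_task: Callable[[int], None]
) -> Tuple[int, Job]:
    """Reserve the cluster and enqueue the job; undo both if the broker is down."""
    cluster.reserved_by = job.id
    try:
        send_task(job.id)
    except ConnectionError as exc:  # stand-in for the broker client's error type
        job.status = "failed"
        job.error = f"dispatch failed: {exc}"
        cluster.reserved_by = None  # release so other jobs can use the cluster
        return 502, job
    return 202, job  # assumed success code; the PR only specifies the 502 path


def unreachable_broker(job_id: int) -> None:
    """Hypothetical sender that simulates a broker outage."""
    raise ConnectionError("broker unreachable")


status, job = dispatch_training(Job(id=1), Cluster(name="gpu-a"), unreachable_broker)
print(status, job.status)  # prints: 502 failed
```

The point of the pattern is that the reservation and the job record are rolled back in the same code path that detects the failure, so no cleanup depends on a task that was never enqueued.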

Backend — data model & lifecycle

  • Snapshot counts (H5): a locked version now records the asset count of the working version being frozen, not the whole dataset.
  • Embeddings / pgvector (H6): image embeddings move from a JSON blob in meta_data to a real Asset.embedding vector(512) column with an ANN index (migration 0009), unblocking future similarity search / dedup. Degrades to text storage on SQLite.
  • AL resolved_at is set on resolve (H8); label_status spelling normalized (L1); is_superuser column used consistently (L2).
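
The H5 change can be pictured with a small in-memory sketch. The schema here is an assumption made for illustration: `AssetRow` and its `version_id` field are hypothetical, and the real models may link assets to versions differently.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AssetRow:
    asset_id: int
    dataset_id: int
    version_id: int  # hypothetical link from an asset to a dataset version


def count_whole_dataset(rows: Iterable[AssetRow], dataset_id: int) -> int:
    """Old behavior: counts every asset in the dataset, across all versions."""
    return sum(1 for r in rows if r.dataset_id == dataset_id)


def count_for_snapshot(rows: Iterable[AssetRow], working_version_id: int) -> int:
    """New behavior: counts only assets in the working version being frozen."""
    return sum(1 for r in rows if r.version_id == working_version_id)


rows = [
    AssetRow(1, dataset_id=10, version_id=1),
    AssetRow(2, dataset_id=10, version_id=1),
    AssetRow(3, dataset_id=10, version_id=2),
    AssetRow(4, dataset_id=10, version_id=2),
    AssetRow(5, dataset_id=10, version_id=2),
]

print(count_whole_dataset(rows, dataset_id=10))        # prints: 5
print(count_for_snapshot(rows, working_version_id=2))  # prints: 3
```

With the old count, a snapshot of version 2 would have claimed five assets although it froze only three.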

Agent & infra

  • Constant-time agent token comparison (hmac.compare_digest); heartbeat auth accepted via Authorization header (H9/M9); supervisor respawns crashed children with backoff (M8).
  • docker-compose: per-service resource limits, Prometheus/Grafana persistent volumes, embedded Celery beat; nginx.conf no longer proxies /metrics, /docs, /redoc, /openapi.json (L3/L4); Grafana datasource password via env interpolation.
  • .env.example: APP_ENV and optional METRICS_BEARER_TOKEN.
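
As a rough sketch of the agent token check, the snippet below reads a token from an Authorization header and compares it with `hmac.compare_digest`. The helper names are hypothetical, and the `Bearer` scheme is an assumption: the PR states only that heartbeat auth is accepted via the Authorization header.

```python
import hmac
from typing import Optional


def extract_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, or None."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def agent_token_matches(presented: Optional[str], expected: str) -> bool:
    """Compare tokens in constant time so timing does not leak a matching prefix."""
    if presented is None:
        return False
    # Encode to bytes: compare_digest rejects non-ASCII str arguments.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


header = "Bearer s3cret-agent-token"
print(agent_token_matches(extract_bearer_token(header), "s3cret-agent-token"))  # prints: True
```

A plain `==` comparison can return as soon as the first differing character is found, which is why a constant-time comparison is preferred for secrets.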

Frontend

  • Single-flight 401 refresh-and-retry in api.ts (no more hard logout mid-session) (H7); polling stops after a run/eval reaches a terminal state (M4); load failures surface as visible errors instead of silent .catch(() => {}) (M5); annotations are kept on a failed server delete (M6); top-level error boundary replaces white-screens (L5).
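
The single-flight idea behind H7 is language-neutral, so it is sketched here in Python with asyncio rather than TypeScript. This is not the code in api.ts: `SingleFlightRefresher` and `fake_refresh` are hypothetical names, and the sketch covers only the shared refresh, not the retry of the original request.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class SingleFlightRefresher:
    """Ensure concurrent callers share one in-flight token refresh."""

    def __init__(self, refresh: Callable[[], Awaitable[str]]) -> None:
        self._refresh = refresh
        self._inflight: Optional["asyncio.Task[str]"] = None

    async def get_token(self) -> str:
        # The first caller starts the refresh; later callers await the same task.
        if self._inflight is None:
            self._inflight = asyncio.create_task(self._run())
        return await self._inflight

    async def _run(self) -> str:
        try:
            return await self._refresh()
        finally:
            # Clear the slot so a later 401 can trigger a fresh refresh.
            self._inflight = None


refresh_calls: List[int] = []


async def fake_refresh() -> str:
    """Hypothetical refresh call; records how many times it actually ran."""
    refresh_calls.append(1)
    await asyncio.sleep(0.01)
    return f"token-{len(refresh_calls)}"


async def demo() -> List[str]:
    refresher = SingleFlightRefresher(fake_refresh)
    # Five requests hit a 401 at the same time and all ask for a new token.
    return await asyncio.gather(*(refresher.get_token() for _ in range(5)))


tokens = asyncio.run(demo())
print(len(refresh_calls), tokens[0])  # prints: 1 token-1
```

Without the shared task, each of the five callers would issue its own refresh request, and a failure in any one of them could log the user out mid-session.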

Deferred (roadmap, not bugs)

Per-project MinIO/S3 storage selection, duplicate-detection endpoint + UI, training checkpoint/resume, and the AL retraining feedback loop. The embedding-storage groundwork for similarity/dedup is in place; the search endpoints and UI are not.

Test plan

  • Backend: 180 unit tests plus the new dispatch-failure test pass; ruff and black are clean; the app imports without errors; migration 0009 is a clean single head.
  • Agent: 20 tests pass.
  • Frontend: production build succeeds; ESLint clean; 17 vitest tests pass; tsc error set byte-identical to base (no new type errors).
  • docker compose config validates.

Note: some test files needed updating to reflect intended new behavior — endpoints now require referenced projects to exist/be accessible, and unit tests mock the broker rather than relying on dispatch failures being silently swallowed.

https://claude.ai/code/session_019z4CBrSBy1zDWkJb7KhjKr


Generated by Claude Code

claude added 2 commits June 9, 2026 22:36
Documents verified security, job-lifecycle, and data-integrity bugs with
file references, a feature-gap matrix against the full CV lifecycle
(curation, versioning, analysis, storage selection, training, evaluation,
model registry, ONNX export), and a prioritized remediation roadmap.

https://claude.ai/code/session_019z4CBrSBy1zDWkJb7KhjKr
… infra

Backend security & correctness:
- Add object-level authorization (services/authz.py) across projects,
  workspaces, datasets, assets, annotations, experiments, artifacts,
  evaluations, AL, and ops routers; authenticate the job + SSE-stream
  endpoints (fixes IDOR/missing-auth findings C1-C5).
- Fail the job and release the reserved cluster when a training task can't
  be dispatched, returning 502 instead of a phantom queued job (C6).
- Harden Celery against worker death: task_acks_late,
  task_reject_on_worker_lost, time limits, plus a beat-scheduled
  sweep_stale_jobs maintenance task that fails stuck jobs and frees
  clusters (C7).
- Fail fast on default SECRET_KEY / DB / MinIO secrets when
  APP_ENV=production (settings.require_secure_setting).
- Re-check user existence in /auth/refresh; add password length bounds;
  tighten CORS to explicit methods/headers.
- Back auth rate limiting with Redis (bounded in-memory fallback).
- Serialize startup migrations with a Postgres advisory lock.

Data model & lifecycle:
- Snapshot versions now count the version being frozen, not the whole
  dataset (H5).
- Store image embeddings in a pgvector Asset.embedding column with an ANN
  index (migration 0009) instead of JSON in meta_data (H6).
- Set ALItem.resolved_at on resolve (H8); normalize label_status spelling.
- Use the is_superuser column consistently.

Agent & infra:
- Constant-time agent token comparison; heartbeat auth via Authorization
  header; supervisor respawns crashed children with backoff.
- docker-compose resource limits, prometheus/grafana volumes, embedded
  celery beat; nginx no longer exposes /metrics, /docs, /openapi.json.

Frontend:
- Single-flight 401 refresh-and-retry in api.ts; stop polling after
  terminal run/eval state; surface load errors instead of swallowing them;
  keep annotations on failed delete; top-level error boundary.

Tests: 180 backend unit tests + new dispatch-failure test, 20 agent tests,
17 frontend vitest tests all pass; ruff/black clean; frontend builds.

https://claude.ai/code/session_019z4CBrSBy1zDWkJb7KhjKr
2 participants