Fix CV platform audit findings: authz, job lifecycle, secrets, storage, frontend, infra#21
Open
whittenator wants to merge 2 commits into
Documents verified security, job-lifecycle, and data-integrity bugs with file references, a feature-gap matrix against the full CV lifecycle (curation, versioning, analysis, storage selection, training, evaluation, model registry, ONNX export), and a prioritized remediation roadmap. https://claude.ai/code/session_019z4CBrSBy1zDWkJb7KhjKr
… infra

Backend security & correctness:
- Add object-level authorization (services/authz.py) across projects, workspaces, datasets, assets, annotations, experiments, artifacts, evaluations, AL, and ops routers; authenticate the job + SSE-stream endpoints (fixes IDOR/missing-auth findings C1-C5).
- Fail the job and release the reserved cluster when a training task can't be dispatched, returning 502 instead of a phantom queued job (C6).
- Harden Celery against worker death: task_acks_late, task_reject_on_worker_lost, time limits, plus a beat-scheduled sweep_stale_jobs maintenance task that fails stuck jobs and frees clusters (C7).
- Fail fast on default SECRET_KEY / DB / MinIO secrets when APP_ENV=production (settings.require_secure_setting).
- Re-check user existence in /auth/refresh; add password length bounds; tighten CORS to explicit methods/headers.
- Back auth rate limiting with Redis (bounded in-memory fallback).
- Serialize startup migrations with a Postgres advisory lock.

Data model & lifecycle:
- Snapshot versions now count the version being frozen, not the whole dataset (H5).
- Store image embeddings in a pgvector Asset.embedding column with an ANN index (migration 0009) instead of JSON in meta_data (H6).
- Set ALItem.resolved_at on resolve (H8); normalize label_status spelling.
- Use the is_superuser column consistently.

Agent & infra:
- Constant-time agent token comparison; heartbeat auth via Authorization header; supervisor respawns crashed children with backoff.
- docker-compose resource limits, prometheus/grafana volumes, embedded celery beat; nginx no longer exposes /metrics, /docs, /openapi.json.

Frontend:
- Single-flight 401 refresh-and-retry in api.ts; stop polling after terminal run/eval state; surface load errors instead of swallowing them; keep annotations on failed delete; top-level error boundary.

Tests: 180 backend unit tests + new dispatch-failure test, 20 agent tests, 17 frontend vitest tests all pass; ruff/black clean; frontend builds.
https://claude.ai/code/session_019z4CBrSBy1zDWkJb7KhjKr
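For reviewers who want a mental model of the fail-fast secrets check named in the commit message, here is a minimal sketch. It is not the code from this PR: the function signature, the `INSECURE_DEFAULTS` table, the placeholder default values, and the `InsecureSettingError` type are all assumptions made for illustration. Only the name `require_secure_setting` and the `APP_ENV=production` condition come from the commit message.

```python
# Illustrative sketch only; the real settings.require_secure_setting may differ.

# Hypothetical shipped defaults (placeholder values, not taken from the repo).
INSECURE_DEFAULTS = {
    "SECRET_KEY": "change-me",
    "MINIO_SECRET_KEY": "minioadmin",
}


class InsecureSettingError(RuntimeError):
    """Raised when production is started with a shipped default secret."""


def require_secure_setting(name: str, value: str, app_env: str) -> str:
    """Return the value unchanged, or raise if production uses a known default."""
    if app_env == "production" and value == INSECURE_DEFAULTS.get(name):
        raise InsecureSettingError(
            f"{name} must be overridden when APP_ENV=production"
        )
    return value
```

Under these assumptions, development keeps working with the defaults, while a production boot with an untouched `SECRET_KEY` raises before the app serves any request.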
Summary
Fixes all bugs and production-readiness issues identified in the platform audit (AUDIT.md §2–§4 and §6). Net-new product features from §5 (per-project MinIO/S3 selection, duplicate-detection UI, training checkpoint/resume, the active-learning retraining loop) are intentionally left as roadmap work and called out below.

The first commit (AUDIT.md) is the audit itself; this PR's changes implement the remediation. A status matrix is at the top of AUDIT.md (§0).

Backend: security & correctness
- Object-level authorization (C1-C5): services/authz.py (require_workspace_access / require_project_access / require_dataset_access) applied across projects, workspaces, datasets, assets, annotations, experiments, artifacts, evaluations, active-learning, and ops routers. The previously unauthenticated GET /api/jobs/{id} and the job SSE stream now authenticate and check the job's project access. List endpoints scope to the caller's workspaces.
- Dispatch failure (C6): when a training task can't be dispatched, the job is failed and the reserved cluster released, returning 502 instead of leaving a phantom queued job that hangs forever. Mirrors the existing ONNX behavior; covered by a new test.
- Celery hardened against worker death (C7): task_acks_late, task_reject_on_worker_lost, and time limits, plus a beat-scheduled sweep_stale_jobs maintenance task that fails stuck jobs and frees their clusters.
- Secrets: settings.require_secure_setting refuses to boot with default SECRET_KEY / DB password / MinIO credentials when APP_ENV=production (dev still works out of the box).
- Auth: /auth/refresh re-checks the user still exists (H3); password length bounds (M3); CORS restricted to explicit methods/headers (M2); Redis-backed auth rate limiting with a bounded in-memory fallback (H4).
- Startup migrations: alembic upgrade serialized with a Postgres advisory lock so concurrent replicas don't deadlock (M7).

Backend: data model & lifecycle
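- Snapshot versions now count the version being frozen, not the whole dataset (H5).
- Image embeddings moved from JSON in meta_data to a real Asset.embedding vector(512) column with an ANN index (migration 0009), unblocking future similarity search / dedup (H6). Degrades to text storage on SQLite.
- ALItem.resolved_at is set on resolve (H8); label_status spelling normalized (L1); is_superuser column used consistently (L2).

Agent & infra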
- Constant-time agent token comparison (hmac.compare_digest); heartbeat auth accepted via Authorization header (H9/M9); supervisor respawns crashed children with backoff (M8).
- docker-compose: per-service resource limits, Prometheus/Grafana persistent volumes, embedded Celery beat; nginx.conf no longer proxies /metrics, /docs, /redoc, /openapi.json (L3/L4); Grafana datasource password via env interpolation.
- .env.example: APP_ENV and optional METRICS_BEARER_TOKEN.

Frontend
- Single-flight 401 refresh-and-retry in api.ts (no more hard logout mid-session) (H7).
- Polling stops after a run/eval reaches a terminal state (M4).
- Load failures surface as visible errors instead of silent .catch(() => {}) (M5).
- Annotations are kept on a failed server delete (M6).
- Top-level error boundary replaces white-screens (L5).

Deferred (roadmap, not bugs)
Per-project MinIO/S3 storage selection, duplicate-detection endpoint + UI, training checkpoint/resume, and the AL retraining feedback loop. The embedding-storage groundwork for similarity/dedup is in place; the search endpoints and UI are not.
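To make the object-level authorization item from the backend section concrete, the following is a rough sketch of how helpers named like those in services/authz.py could compose. It is an assumption-laden illustration, not the PR's implementation: the `User` and `Project` dataclasses, the `Forbidden` exception, the membership model, and the superuser bypass are all hypothetical.

```python
from dataclasses import dataclass, field


class Forbidden(Exception):
    """Stand-in for whatever HTTP error the real routers return."""


@dataclass
class User:
    id: int
    is_superuser: bool = False
    # Hypothetical membership model: ids of workspaces the user belongs to.
    workspace_ids: set = field(default_factory=set)


@dataclass
class Project:
    id: int
    workspace_id: int


def require_workspace_access(user: User, workspace_id: int) -> None:
    """Allow superusers and workspace members; reject everyone else."""
    if user.is_superuser or workspace_id in user.workspace_ids:
        return
    raise Forbidden(f"user {user.id} cannot access workspace {workspace_id}")


def require_project_access(user: User, project: Project) -> Project:
    """Project access is derived from access to the owning workspace."""
    require_workspace_access(user, project.workspace_id)
    return project
```

The point of the pattern is that every object-scoped route resolves the object first and then checks the caller against its owning workspace, so guessing an id is no longer enough to read someone else's data.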
Test plan
- ruff/black clean; app imports; migration 0009 is a clean single head.
- tsc error set byte-identical to base (no new type errors).
- docker compose config validates.

https://claude.ai/code/session_019z4CBrSBy1zDWkJb7KhjKr
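As a rough illustration of the behavior the new dispatch-failure test is meant to pin down (C6), here is a self-contained sketch. The `Job` and `Cluster` dataclasses, the `submit_training` function, and the 202 success status are hypothetical stand-ins; only the outcome (job failed, cluster released, 502 returned) is taken from the commit message.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Job:
    status: str = "queued"
    error: Optional[str] = None


@dataclass
class Cluster:
    reserved: bool = True


def submit_training(
    job: Job, cluster: Cluster, dispatch: Callable[[Job], None]
) -> Tuple[int, Job]:
    """Try to dispatch; on failure, fail the job and free the cluster."""
    try:
        dispatch(job)
    except Exception as exc:  # e.g. the broker is unreachable
        job.status = "failed"
        job.error = f"dispatch failed: {exc}"
        cluster.reserved = False  # release the reservation
        return 502, job
    return 202, job  # assumed success status for this sketch


def broken_dispatch(job: Job) -> None:
    """Simulates a task queue that cannot accept the task."""
    raise ConnectionError("broker unavailable")


status, failed_job = submit_training(Job(), Cluster(), broken_dispatch)
print(status, failed_job.status)
```

A test along these lines would assert on the returned status, the job's terminal state, and the released cluster, rather than on how the dispatch call itself is made.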
Generated by Claude Code