Chore/comment cleanup#13
Open
johnnybabs wants to merge 102 commits into
- Added comprehensive .gitignore covering Terraform state, k8s secrets, build artifacts, Python cache, Node modules, and IDE files - Untracked 6 secret.yaml files that should never be in git history - Created directory structure for terraform/, monitoring/, docs/, src/frontend/, .github/workflows/ - Added terraform.tfvars.example template - Added CLAUDE.md and VIDCAST_UPGRADE_PLAN.md project context files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- VPC module: VPC, 2 public subnets (eu-west-2a/b), IGW, route table - IAM module: EKS cluster role + node role with correct policy attachments - EKS module: cluster v1.31, managed node group, OIDC provider for IRSA - Validation block rejects T-type instances (blocked by account SCP) - Security groups module: NodePort rules for ports 30002-30008 - Dev environment: root module wiring all child modules + S3/DynamoDB backend - All resources tagged: Project=vidcast, ManagedBy=terraform, Environment=dev Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g + Trivy) - ci.yml: matrix build for 4 services — ruff lint, Trivy CRITICAL/HIGH scan, Docker build + push tagged with short git SHA (never :latest) - cd.yml: EKS deployment triggered by workflow_run on CI success - Jenkinsfile: parallel builds, Trivy scan, Docker Hub push, Swarm staging deploy, smoke test via /healthz, manual approval gate, EKS production deploy with automatic rollback on pipeline failure - docker-compose.swarm.yml: overlay network, named volumes, rollback on failure for all services — mirrors EKS deployment for staging parity - GITHUB_SECRETS_REQUIRED.md: documents all secrets needed for CI/CD Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…port
Auth service:
- Added /healthz endpoint testing PostgreSQL connectivity (200 ok / 503 error)
Gateway service:
- Added /healthz endpoint testing MongoDB + RabbitMQ connectivity
- Added flask-cors to requirements.txt; CORS(server) for frontend support
Converter + Notification services:
- Added pathlib.Path('/tmp/healthy').touch() after each successful message
All 4 deployment manifests:
- Liveness + readiness probes (HTTP for auth/gateway, exec for converter/notification)
- Resource requests/limits: auth 50m/200m 64Mi/128Mi, gateway 100m/300m 128Mi/256Mi,
converter 250m/500m 256Mi/512Mi, notification 50m/100m 64Mi/128Mi
- securityContext: runAsNonRoot, runAsUser=1000, readOnlyRootFilesystem,
allowPrivilegeEscalation=false, capabilities.drop ALL
- Converter + notification: emptyDir volume mounted at /tmp for temp file writes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… alerts - monitoring/values.yaml: kube-prometheus-stack config — Grafana NodePort 30007 (admin/vidcast-demo), Alertmanager NodePort 30008, 7d retention, 10Gi storage, etcd/scheduler/controller-manager disabled (EKS manages these) - monitoring/dashboards/vidcast-operations.json: custom Grafana dashboard with pod status, restart counts, node CPU/memory gauges, RabbitMQ queue depth timeseries, per-pod CPU and memory usage - monitoring/alerts/vidcast-alerts.yaml: PrometheusRule CRD with 5 alerts: PodCrashLoopBackOff (critical), HighNodeMemory >85% (warning), HighNodeCPU >85% (warning), RabbitMQQueueBacklog >10 msgs (warning), RabbitMQUnavailable (critical) - monitoring/README.md: install, access, and uninstall instructions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rchitecture - React 18 + Vite + Tailwind CSS single-page application - Pages: Login (JWT auth), Upload (drag-and-drop MP4), Download (file ID input), Dashboard (Grafana iframe + links), Architecture (interactive service diagram) - src/api.js: axios wrapper for login, uploadVideo, downloadMp3 - Dockerfile: multi-stage — Node 18 build, nginx 1.25 serve as non-root (uid 1001) - nginx.conf: proxy /api/ to gateway service, SPA routing, security headers - manifest/: Deployment (NodePort 30006), Service, ConfigMap Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…notes - README.md: rewritten for public GitHub — product overview, architecture diagram, quick-start deploy guide, CI/CD overview, security summary, teardown - docs/architecture.md: full service inventory, data flow walkthrough (13-step upload path), port map, security architecture (implemented vs discussed-but-not-built) - docs/deployment-guide.md: step-by-step guide for Terraform, Helm, PostgreSQL init, RabbitMQ queues, secret creation, microservice deploy, E2E test, monitoring install, operational commands, cost management, full teardown - docs/presentation-notes.md: 12-15 min timing guide, opening script, architecture analogies (restaurant/post office/security badge), platform engineering walkthrough, what-I'd-do-next talking points, 7 common interview questions with full model answers Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This edit triggers the CI process for Docker image builds.
Removed a line indicating an edit to trigger CI.
Split all multi-import lines (E401) across 7 files. Additional fixes: - auth/server.py: bare except → except Exception (E722) - auth/validate.py: not "x" in → "x" not in (E713) - gateway/server.py: remove unused DispatcherMiddleware import (F401) - converter/consumer.py: remove unused time import (F401) - converter/to_mp3.py: remove unused err variable in except clause (F841) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
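The rule codes in the commit above correspond to small mechanical rewrites. The sketch below shows the fixed form of each; the function names are hypothetical and not taken from the services.

```python
import json  # E401: one import per line, instead of "import json, os"
import os


def missing_auth(headers: dict) -> bool:
    # E713: write the membership test as `"x" not in y`, not `not "x" in y`.
    return "Authorization" not in headers


def parse_payload(raw: str) -> dict:
    # E722: catch Exception instead of a bare `except:`, so that
    # KeyboardInterrupt and SystemExit still propagate.
    try:
        return json.loads(raw)
    except Exception:
        return {}


def env_flag(name: str) -> bool:
    # Uses `os`, so the import above is not flagged as unused (F401).
    return os.environ.get(name, "") == "1"
```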
python:3.10-slim-bullseye (Debian 11) has CRITICAL/HIGH CVEs with fixes available, causing Trivy to fail CI. python:3.10-slim-bookworm (Debian 12, current stable) resolves these. Applied to all 4 service Dockerfiles. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
prometheus-client was declared in requirements.txt but never imported or initialised. The only intended consumer was the unauth_count counter, whose call sites (unauth_count.inc()) were already removed as a NameError crash fix. Dropping it shrinks the image and removes a dead dependency. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The notification service only reads the mp3 queue and sends email via smtplib. It has no media-processing code path, so the ffmpeg install (~100MB) was pure waste copied from the converter Dockerfile. Removing it shrinks the image and reduces the CVE surface Trivy has to scan. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
None of the four Python service Dockerfiles dropped privileges; the final image ran as root. Added USER 1000 before CMD in each, matching the Kubernetes securityContext (runAsNonRoot: true, runAsUser: 1000) already enforced on the deployments. This makes the images non-root by default even outside k8s (e.g. the Docker Swarm staging environment). All listen ports are >1024 and the only runtime writes target /tmp (1777), so no privileged access is required. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No service had a .dockerignore, so docker build sent the entire context (including manifest/, secret.yaml files, __pycache__, .git, and docs) to the daemon. The new files exclude that cruft, keeping build contexts small and ensuring Kubernetes secrets can never be baked into an image layer by accident. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The MongoDB connection strings (with embedded username/password) lived in gateway-configmap and converter-configmap. ConfigMaps are not treated as sensitive — they are trivially dumped via `kubectl get configmap -o yaml` and were committed in plaintext. Moved them to the gateway-secret / converter-secret Secret objects. Env var names are unchanged and the deployments already mount both configMapRef and secretRef via envFrom, so this is transparent to the apps. Also in this change: - Removed unused VIDEO_QUEUE from notification-configmap (consumer only reads MP3_QUEUE; the video queue is the converter's). - Added secret.yaml.example templates for all four services (committed) so operators have the key structure without any real secret entering git. - Added imagePullPolicy: IfNotPresent to the four backend deployments, which CD re-tags with immutable git-SHA images. Left the frontend on the default (Always) since it still uses a mutable :latest tag. - Updated the deployment guide's secret-creation step for the moved keys. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ning Comment-only changes documenting known issues that cannot be safely fixed in a surgical pass without coordinated schema/data work: - auth-service/server.py + Postgres/init.sql: flag plaintext password storage and comparison; recommend bcrypt/argon2 + constant-time verify for production. - MongoDB pvc.yaml: flag that the 1Gi claim binds a 10Gi PV, leaving ~9Gi unused. No behaviour changes; these guide the next engineer toward the proper fixes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Trivy (CRITICAL,HIGH, ignore-unfixed) was failing on vulnerabilities that the bookworm base-image bump alone did not clear, at two layers below the app deps: - OS packages: added `apt-get upgrade -y` to pull patched libgnutls30 (CRITICAL CVE-2026-33845, CVE-2026-42010) and the libkrb5* family (HIGH). - Build toolchain: added `pip install --upgrade setuptools wheel` so the image ships patched wheel (CVE-2026-24049) and setuptools-vendored jaraco.context (CVE-2026-23949), neither of which the app imports but Trivy still scans. Also: dropped the unused build-essential/libpq-dev/python3-dev from the notification image (its deps are pure-Python wheels), and added apt-cache cleanup (`rm -rf /var/lib/apt/lists/*`) to keep the images slim. Verified the debian target reports 0 vulnerabilities on all four images locally. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rewrote all four requirements.txt as minimal >= floors so pip resolves patched transitive deps (Jinja2, MarkupSafe, idna, charset-normalizer, etc.) instead of the old fully-frozen 2022 pins. Dropped dev-only tooling (pylint/astroid/jedi/ isort) that was never imported at runtime, and auth's cryptography (the service signs JWTs with HS256 = stdlib hmac; cryptography is only needed for RS256). Key version floors (each clears a Trivy-flagged fixable CVE): - Flask >=3.0.3 / Werkzeug >=3.0.3 — CVE-2024-34069 (debugger RCE) is only fixed in Werkzeug 3.0.3, which requires Flask 3. gateway's flask-pymongo bumped to >=3.0.1 for Flask-3 compatibility (the .db API it uses is unchanged). - Flask-Cors >=4.0.2 — CVE-2024-6221 (CORS bypass). - requests >=2.31.0 — CVE-2023-32681. - certifi >=2023.7.22 — CVE-2023-37920. - urllib3 >=2.6.0 — the latest 1.26.x still has 4 fixable HIGH CVEs (e.g. CVE-2025-66418) patched only in the 2.x line; safe because requests supports urllib3 2.x and no app code uses urllib3 directly. - converter: numpy <2.0 (moviepy 1.0.3 compat) + Pillow >=10.3.0 (CVE-2023-44271 / CVE-2023-50447, CRITICAL). Verified locally: all four images pass `trivy image --severity CRITICAL,HIGH --ignore-unfixed --exit-code 1` (0 findings), and Flask-3/Flask-PyMongo-3 and moviepy imports were smoke-tested in-container. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…aform Replaces static AWS access keys in the CD pipeline with short-lived, OIDC-issued credentials — no long-lived secrets stored in GitHub. Terraform: - New module terraform/modules/github-oidc: creates the GitHub Actions OIDC identity provider and a deploy IAM role whose trust policy is scoped to repo:johnnybabs/microservices-python-app:* (aud sts.amazonaws.com). The role grants only eks:DescribeCluster (for `aws eks update-kubeconfig`). - eks module: set access_config.authentication_mode = API_AND_CONFIG_MAP so EKS access entries work alongside aws-auth. - root module: wire the github-oidc module and add an aws_eks_access_entry + access_policy_association granting the deploy role AmazonEKSEditPolicy at cluster scope — this is what lets `kubectl set image` actually run. Added github_org/github_repo variables and a github_actions_role_arn output. Workflow: - cd.yml now uses aws-actions/configure-aws-credentials@v4 with role-to-assume and adds `permissions: id-token: write` to request the OIDC token. Drops the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY inputs. - GITHUB_SECRETS_REQUIRED.md: CD secrets section rewritten for OIDC (AWS_DEPLOY_ROLE_ARN from `terraform output github_actions_role_arn`). Validated with `terraform fmt` + `terraform validate` (backend=false). Not yet applied — cluster provisioning runs next. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
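For illustration, the shape of the trust policy described in the commit above can be sketched as a Python dict. This is not the Terraform module itself: the account ID is a placeholder and the helper name is hypothetical. The condition keys follow the standard GitHub Actions OIDC pattern for AWS.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_trust_policy(account_id: str, org: str, repo: str) -> dict:
    """Build an IAM trust policy scoped to one GitHub repository."""
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # The token must be issued for AWS STS ...
                    "StringEquals": {f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com"},
                    # ... and must come from a workflow in this repository.
                    "StringLike": {f"{GITHUB_OIDC_HOST}:sub": f"repo:{org}/{repo}:*"},
                },
            }
        ],
    }


# "123456789012" is a placeholder account ID, not a real value.
policy = github_trust_policy("123456789012", "johnnybabs", "microservices-python-app")
print(json.dumps(policy, indent=2))
```

The trailing `:*` in the `sub` pattern admits any branch, tag, or environment in the repository. Narrowing it, for example to a single branch, would tighten the scope further.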
Both StatefulSets referenced a Secret (mongodb-secret, rabbitmq-secret)
that no chart template produced. Fresh helm installs hung in
ContainerCreating (Mongo: FailedMount) or CreateContainerConfigError
(RabbitMQ: secret not found) until the secrets were created manually.
- MongoDB: 5 keys (MONGO_ROOT_USERNAME/PASSWORD, MONGO_USERNAME/PASSWORD,
MONGO_USERS_LIST) sourced from values.yaml.secret.*
- RabbitMQ: 2 keys (RABBITMQ_DEFAULT_USER/PASS) sourced from
values.yaml.secret.* (new section - values.yaml had no secret config)
Postgres chart intentionally untouched: it has no referenced-but-missing
secret; it injects POSTGRES_USER/PASSWORD/DB directly as env vars from
values.yaml, so it renders and runs cleanly as-is.
.gitignore: the blanket **/secret.yaml rule (meant for real app-manifest
secrets) was also hiding these chart templates. Added scoped negations so
the templates are tracked; they hold no literal credentials, only
{{ .Values.secret.* }} references.
Manual secrets remain in place for the current deployment to avoid Helm
ownership conflicts. Charts are now self-contained for the next clean
install.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Without bootstrap_cluster_creator_admin_permissions=true, the principal that runs terraform apply has no kubectl access to the resulting cluster and must manually create their own access entry. This locked out johnadmin today after the first terraform apply. Fix makes the access grant automatic on cluster creation, preventing recurrence on rebuild. NOT applied to the live cluster: this attribute is creation-only (ForceNew in the AWS provider), so applying against the existing vidcast-cluster would force-replace it. The fix takes effect on the next greenfield rebuild. terraform CLI is also not present in this operator environment, so fmt/validate/plan were not re-run here; the edit is a single aligned attribute addition matching terraform fmt style. Also gitignore the local 'tfplan'/'*.tfplan' binary plan artifacts. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Previously the pika connection was constructed with no credentials, which silently defaulted to guest:guest. With the RabbitMQ Helm chart now configuring rabbituser as the only user, connections failed with ACCESS_REFUSED. This change reads RABBITMQ_DEFAULT_USER and RABBITMQ_DEFAULT_PASS from the container environment, with a guest:guest fallback so local development without a secret still works. The env vars are injected in production via envFrom: secretRef: rabbitmq-secret in each deployment manifest. Gateway has two connection sites (module-level publish channel and the /healthz probe); both now use a shared PlainCredentials object. Resolves the credential mismatch between the chart and the running application code. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
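A minimal sketch of the environment lookup with a guest:guest fallback, as described in the commit above. The helper name is hypothetical, and the sketch stops short of importing pika so it stays self-contained.

```python
import os
from typing import Mapping, Tuple


def rabbitmq_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read broker credentials, falling back to guest:guest for local development."""
    user = env.get("RABBITMQ_DEFAULT_USER", "guest")
    password = env.get("RABBITMQ_DEFAULT_PASS", "guest")
    return user, password


# In a service, the pair would typically be wrapped once, e.g.
# pika.PlainCredentials(user, password), and reused by every connection site.
user, password = rabbitmq_credentials(
    {"RABBITMQ_DEFAULT_USER": "rabbituser", "RABBITMQ_DEFAULT_PASS": "example-only"}
)
print(user)
```

Passing the environment as a parameter keeps the lookup testable without mutating `os.environ`.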
- Image references updated from nasi101/* (upstream tutorial) to johnbaabalola/*-service (this fork's CI-built images), pinned to commit SHA c91216a for deterministic deploys. Image names match the CI matrix (auth-service, gateway-service, etc.), not the short nasi101 names. - Gateway, converter, and notification deployments now load RabbitMQ credentials from rabbitmq-secret via an additional envFrom: secretRef (appended to existing envFrom blocks, not replacing them). - Auth service image bumped but no RabbitMQ secret added (it does not connect to RabbitMQ). Works with the prior commit that reads RABBITMQ_DEFAULT_USER/PASS from the environment. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The CVE dependency bump (5c224a3) upgraded PyMongo to a release that requires MongoDB >= 4.2 (wire version 8). The chart pinned mongo:4.0.8 (wire version 7), so gateway and converter failed at runtime with: 'Server at mongodb:27017 reports wire version 7, but this version of PyMongo requires at least 8 (MongoDB 4.2).' This surfaced as gateway /healthz 503 (mongodb check) and would have broken all GridFS upload/download. mongo:4.2 is the minimum compatible version and the supported single-step upgrade from 4.0 (a direct jump to 4.4+ refuses to start against a 4.0 feature-compatibility-version data dir). Live cluster already bumped via 'kubectl set image statefulset/mongodb' (no app data existed, so the in-place upgrade was non-destructive). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The converter and notification deployments use an exec liveness probe (test -f /tmp/healthy), but the file was only created AFTER a message was successfully processed. An idle consumer with no traffic therefore never created the file and was killed by the probe (~45s), crash-looping forever. For notification this was unrecoverable: with a placeholder Gmail password, email.notification() always errors -> basic_nack -> the per-message touch never runs, so the pod could never become healthy. Now each consumer touches /tmp/healthy once immediately after connecting to RabbitMQ and being ready to consume (a meaningful 'connected and consuming' signal), and still refreshes it after each processed message. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
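The startup-touch fix described in the commit above can be sketched as follows. This is an illustrative consumer loop over an in-memory list, not the real pika callback. `run_consumer` and `handle` are hypothetical names, and a temporary directory stands in for /tmp.

```python
import pathlib
import tempfile
from typing import Callable, Iterable


def mark_healthy(health_file: pathlib.Path) -> None:
    """Create or refresh the file that the exec liveness probe tests for."""
    health_file.touch()


def run_consumer(
    health_file: pathlib.Path,
    messages: Iterable[str],
    handle: Callable[[str], None],
) -> None:
    # Touch once as soon as the consumer is connected and ready,
    # so an idle consumer with no traffic still passes the probe.
    mark_healthy(health_file)
    for message in messages:
        try:
            handle(message)
        except Exception:
            continue  # the real consumer would basic_nack here; no refresh
        mark_healthy(health_file)  # refresh after each processed message


with tempfile.TemporaryDirectory() as tmp:
    health = pathlib.Path(tmp) / "healthy"
    run_consumer(health, [], lambda m: None)  # no messages at all
    idle_consumer_is_healthy = health.exists()

print(idle_consumer_is_healthy)
```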
… to 16f49a0 Three deploy-time fixes found during the live rollout to vidcast-cluster: - gateway: add an emptyDir volume mounted at /tmp. With readOnlyRootFilesystem=true and no writable temp dir, Werkzeug's multipart upload buffering failed -> POST /upload returned 500 ('No usable temporary directory found'). Other consumers already had this volume; gateway was missing it. - converter: 4 -> 2 replicas (and maxSurge 8 -> 1). The single m7i-flex.large node (2 vCPU) could not schedule 4 converters @ 250m CPU request alongside the rest; the extra pods sat Pending with 'Insufficient cpu'. 2 replicas comfortably handle demo throughput. - all four services pinned to johnbaabalola/<svc>:16f49a0 (the SHA that includes the RabbitMQ-credential and /tmp/healthy startup fixes). End-to-end verified: login -> upload -> convert (MoviePy) -> mp3 queue -> notification consume. Email itself fails by design (placeholder Gmail App Password). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Uploads through the frontend /api proxy failed with 413 Request Entity Too Large: nginx defaults client_max_body_size to 1m, but VidCast uploads MP4s (the bundled test asset alone is 2.8MB). Direct gateway uploads (NodePort 30002) were unaffected because they bypass nginx; only the frontend path (30006 -> /api/) hit the limit. Raised to 256m. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI does not build the frontend (matrix covers only the 4 backend services), so johnbaabalola/frontend:latest never existed on Docker Hub. Built locally and pushed to this account's ECR (501562869470.dkr.ecr.eu-west-2.amazonaws.com/vidcast-frontend); the EKS node IAM role can pull from ECR in-account, so no registry credentials or imagePullSecret are needed. Pinned to commit fd35335 (includes the nginx client_max_body_size upload fix). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…redis Service env collision Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s never built) Argo CD dev auto-sync rolled outbox-relay to johnbaabalola/outbox-relay:e4d2669, a placeholder SHA that was never built and pushed, causing ImagePullBackOff. 65f2f57 is the real image in the registry (the version that was running before the sync). This lets vidcast-dev self-heal to a working image. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…st local values, Argo ignoreDifferences for KEDA-managed replicas Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Feature/phase up sprint 1 4
Only notification was missed in the image-tag bump; the 16f49a0 image predates the B4 /metrics instrumentation, so its PodMonitor target was down and the consumer lacked the metrics endpoint. 65f2f57 exists in the registry and matches the other backends. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…2) under default-deny The allow-monitoring policy predated B4: converter/notification now expose a :9000 metrics endpoint (scraped by PodMonitors) and rabbitmq is scraped on the :15692 prometheus-plugin port. Without these allows, applying default-deny drops those scrapes and the targets go DOWN. Verified: all targets stay UP post-policy. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Dashboard page bakes VITE_GRAFANA_URL at build time; without it the iframe falls back to localhost:30007 (broken for remote browsers). Expose it as a build arg so the image points at the live Grafana NodePort. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…erification verify-images (B5) pulls the vidcast-frontend image from PRIVATE ECR; the AWS SDK fetches node-role creds from IMDS (169.254.169.254:80). The DNS+443-only egress blocked :80, so the ECR call hung past the 10s webhook deadline -> context canceled -> failurePolicy:Fail rejected every ECR-image admission (e.g. frontend rollout). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Renamed from the (gitignored) handover into a tracked deployment guide. Uses only placeholders (<AWS_ACCOUNT_ID>, <YOUR_DOCKERHUB_USER>, <YOUR_GITHUB_ORG>, <NODE_IP>, …) — no personal data or secrets. Adds prerequisites, a customisation table with how-to-get-each-value, a cost warning, and pointers to ./customise.sh + ./deploy.sh. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deploy.sh automates the full bring-up (datastores → secrets → app → KEDA/Argo/ Kyverno/monitoring/Kubecost → NetworkPolicies → smoke test) and injects DB passwords via helm --set from env vars. customise.sh (now tracked, env-driven, auto-detects current identity — no hardcoded operator values) repoints the repo to your Docker Hub / AWS / GitHub. Helm values carry CHANGEME placeholders instead of real passwords; init.sql holds no admin hash — the admin's bcrypt hash is generated in-DB via pgcrypto at deploy time. No secret lives in a tracked file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace operator-specific identity (AWS account id, Docker Hub user, GitHub org, name) with placeholders across tracked docs/READMEs so the public repo carries no personal data. Functional GitOps config (k8s overlays/argocd/kyverno image+repo refs) is intentionally left intact — Argo CD/AWS need real values and those are inherently public. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Switch Postgres from POSTGRES_HOST_AUTH_METHOD=trust (any password accepted; access was network-only) to scram-sha-256 so the credential is actually enforced. The auth method is now a chart value (default scram-sha-256); deploy.sh already --sets a non-empty password and seeds the admin in-DB, so a fresh deploy is consistent. Also fix a latent bug that trust auth had masked: the auth service reads the password from env DATABASE_PASSWORD, but the ExternalSecret emitted the key as PSQL_PASSWORD (injected via envFrom). Under trust the missing DATABASE_PASSWORD (None) was accepted; under scram it failed. Rename the ExternalSecret key to DATABASE_PASSWORD (dev + prod) so the app gets the password it reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… runbook - terraform: aws-ebs-csi-driver addon + IRSA role (vidcast-cluster-ebs-csi-irsa) - terraform: new storage module — vidcast-backups S3 bucket (private, versioned, AES256, 30-day lifecycle) + backup IRSA role (PutObject/ListBucket scoped) - helm/postgres: StorageClass (gp3/Retain) + 2Gi RWO PVC; deployment mounts PVC at PGDATA subdir; strategy Recreate; gated by persistence.enabled flag - k8s: vidcast-backup ServiceAccount (IRSA-annotated) + nightly mongodump and pg_dump CronJobs (initContainer + aws-cli upload); wired into dev+prod overlays - docs: DISASTER_RECOVERY.md — restore procedures, RTO ≤2h / RPO ≤24h, drill runbook; Last restore test: NOT YET TESTED (close after first drill) Closes A11, I4, P5. No application code touched. Not applied to AWS. Terraform apply order: CSI addon → helm upgrade postgres → CronJob deploy.
…rability-and-backup feat(durability): EBS CSI addon, Postgres PVC, S3 backup CronJobs, DR…
Apply-time fixes discovered during Sprint 1 deployment: 1. NetworkPolicy — default-deny blocked backup pod egress entirely; both CronJobs failed on first run. Added allow-backup-egress.yaml granting backup SA egress to MongoDB (27017), Postgres (5432), and AWS S3 (443). Also added ingress-from-backup rules on the datastore policies. This confirms default-deny is genuinely enforced, not just declared. 2. MongoDB backup credentials — mongodb-secret root user fails SCRAM-SHA-256 auth (stale password under new auth method). Rewrote CronJob to use gateway-secret mongouser URIs instead; dumps both videos and mp3s databases successfully (38MB + 7MB confirmed in S3). DR runbook updated: last restore test 2026-06-10, Postgres drilled, Mongo restore noted as outstanding.
Closes P1, I7, I2. Terraform: - modules/lbc/: LBC IRSA role + official AWS LBC IAM policy (v2.8.1) - dev/main.tf: wire lbc module; lbc_irsa_role_arn output added Ingress: - k8s/ingress/vidcast-ingress.yaml: ALB Ingress, internet-facing, IP target mode, HTTP→HTTPS redirect; routes / to frontend (nginx proxies /api internally) - k8s/ingress/alb-controller-values.yaml: LBC Helm values (placeholder ARN/VPC) - k8s/ingress/cert-manager/cluster-issuer.yaml: Let's Encrypt ClusterIssuer (alternative to ACM; documented in runbook) Perimeter (I2) — all five services NodePort→ClusterIP: - Helm_charts/MongoDB, Postgres, RabbitMQ: nodePort fields removed - k8s/base/gateway/service.yaml, k8s/base/frontend/service.yaml: ClusterIP Deviations from prompt (all correctness-driven, documented in INGRESS_DEPLOY.md): - Routing via frontend/nginx, not ALB prefix-strip (ALB can't strip /api prefix) - TLS via ACM annotation path, not cert-manager secret (ALB incompatibility) - No new NetworkPolicy (existing gateway/frontend rules already allow :8080) - LBC IRSA in modules/lbc/ to avoid iam↔eks dependency cycle - Grafana routing deferred (needs subpath config, out of scope) Not applied to AWS. Deploy via docs/INGRESS_DEPLOY.md after sign-off. Cost delta: +~£22/mo (ALB, within approved envelope).
Feature/phase up sprint 1 4
…ad audit, rate limiting
Closes I8, P3, A12, A10.
Structured logging:
- jsonlog.py JSON logger inlined into all 5 services (no Dockerfile change;
COPY . /app includes it in each build context)
- All print() calls replaced with one-JSON-object-per-line structured logging
- Fields: timestamp, level, service, correlation_id, message + kwargs
Correlation IDs (I8/P3):
- Gateway mints UUID4 per request, stamps into RabbitMQ message body
- Converter and notification read and propagate correlation_id on every log line
- Outbox relay republishes payload verbatim, preserving correlation_id
- Backwards compatible: messages missing correlation_id default to 'legacy'
- Single correlation_id greppable from upload through to email notification
Download audit (A12):
- Structured 'File downloaded' log line on every successful GET /download
- Captures: correlation_id, fid, user (from JWT), file_size_bytes
Rate limiting (A10):
- flask-limiter on /login (10/min) + /upload (20/hr), Redis-backed
- XFF-aware client key (gateway sits behind nginx/ALB)
- in_memory_fallback_enabled=True — app never breaks if Redis is unreachable
- Redis port hardcoded 6379 (REDIS_PORT env unsafe: K8s injects tcp:// URI)
- Gateway→redis:6379 NetworkPolicy egress rule needed for cross-worker limit
sharing — documented in OBSERVABILITY.md; degrades gracefully to per-process
in-memory until that follow-on infra PR lands
Docs:
- docs/OBSERVABILITY.md: grep examples, field reference, rate limit values,
known limitations (XFF spoofability), log shipping TODO
- docs/IMPROVEMENT_ASSESSMENT.md: A10/A12/I8/P3 marked IMPLEMENTED
Not applied to cluster. Rebuild images via CI on push.
Deploy note: add gateway→redis NetworkPolicy egress rule before relying on
shared rate limiting across gunicorn workers.
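A rough sketch of the one-JSON-object-per-line logging and the 'legacy' correlation-ID default described in the commit above. It is not the contents of jsonlog.py. The helper names and the example message body are assumptions made for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def log_line(service: str, level: str, message: str,
             correlation_id: str = "legacy", **fields) -> str:
    """Render one structured log record as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "correlation_id": correlation_id,
        "message": message,
        **fields,  # extra keyword arguments become additional fields
    }
    return json.dumps(record)


def correlation_id_from(message_body: dict) -> str:
    """Messages published before this change carry no ID; default to 'legacy'."""
    return message_body.get("correlation_id", "legacy")


# The gateway mints the ID; downstream consumers read it back from the message.
cid = str(uuid.uuid4())
body = {"video_fid": "abc123", "correlation_id": cid}  # illustrative message body
print(log_line("gateway", "INFO", "upload queued", cid, fid=body["video_fid"]))
print(log_line("converter", "INFO", "conversion started", correlation_id_from(body)))
```

Because every line carries the same `correlation_id`, a single grep on that value would follow one request across services.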
…l.py logging These two files were left out of 3397772. util.py is where correlation_id is added to the RabbitMQ message body, so without it the converter/notification correlation tracing read "legacy" for every request. email.py switches the notification logging from print() to structured JSON. Completes I8/P3. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…talogue
Closes UX1–UX9.
UX1 — Display name:
- display_name derived from email prefix (custom name deferred — needs
Postgres schema migration + init.sql change, out of scope this sprint)
- JWT carries display_name claim; frontend userFromToken reads it for nav
UX2 — Improved email:
- Subject names the original file; body includes filename, correlation_id
as reference number, link to conversions page via VIDCAST_URL env var
- mp3 fid removed from email body (closes A8 leakage incidentally)
- VIDCAST_URL defaults to localhost:30002 in code; documented in email.py
UX3 — Badge clears on conversions page visit:
- Frontend-only: unseen count is a client-side since timestamp; no server
state to clear, no mark-seen endpoint needed
UX4 — Three-state status (Queued → Processing → Ready):
- New additive job_status collection in MongoDB
- Gateway writes 'queued' on upload
- Converter writes 'processing' on start, 'ready'/'failed' on completion
(notification has no Mongo connection — 'ready' stays with converter)
- /status/<fid> endpoint added to gateway
- /my-files extended to merge job_status; pre-Sprint-4 files default 'ready'
- Frontend polls every 10s while any file is queued/processing; stops on
all ready/failed; status pills rendered per file
UX5/6/7 — Upload area improvements:
- Post-upload confirmation shows filename + size + 'email on ready' message
- Single-file guidance text always visible below upload area
- Format list + real 256MB size limit (nginx client_max_body_size binding
cap, not 5GB — corrected from prompt assumption)
UX8 — Downloads page:
- Original filename, friendly date, status badge, audio size, ready-only
download button; all from extended /my-files response
UX9 — Empty state:
- 'No conversions yet' + upload CTA on downloads page and dashboard
Docs:
- IMPROVEMENT_ASSESSMENT.md: Sprint 4 table (UX1-UX9 IMPLEMENTED with
deviations noted), Sprint 5 planned table (B1-B5), £0 cost row, updated
overall assessment paragraph
Deviations from prompt all documented in assessment table.
Not applied to cluster. CI will rebuild all modified service images on push.
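Three of the behaviours in the commit above lend themselves to a short sketch: deriving display_name from the email prefix, defaulting files without a job_status record to 'ready', and deciding when polling should continue. The polling runs in the frontend in the real project; it is expressed in Python here only to show the rule. Function names and data shapes are hypothetical.

```python
from typing import Dict, List


def display_name_from_email(email: str) -> str:
    """Use everything before the first '@' as the display name."""
    return email.split("@", 1)[0]


def merge_status(files: List[dict], job_status: Dict[str, str]) -> List[dict]:
    """Attach a status to each file; files with no record default to 'ready'."""
    return [{**f, "status": job_status.get(f["fid"], "ready")} for f in files]


def should_poll(files: List[dict]) -> bool:
    """Keep polling only while at least one file is still in flight."""
    return any(f["status"] in ("queued", "processing") for f in files)


files = merge_status(
    [{"fid": "old-file"}, {"fid": "new-file"}],
    {"new-file": "processing"},  # "old-file" predates status tracking
)
print(display_name_from_email("jane.doe@example.com"))
print(files, should_poll(files))
```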
…x-and-notifications Feature/improvement sprint 4 ux and notifications
… batch UI
Closes B1–B5. Completes five-sprint improvement programme.
B1 — Multi-file upload:
- /upload accepts N files via getlist('file'), MAX_BATCH_SIZE=20
- Per-file loop: one bad file doesn't abort the batch
- Returns JSON 202 {batch_id, results[], queued, failed} (was '200 success')
- util.upload refactored to return (video_fid, err)
- Single-file upload = N=1 case; path unchanged in behaviour
B2 — Batch status tracking:
- batch_id (UUID per request) + batch_size added to RabbitMQ message
and job_status documents; None/1 for single-file uploads
- New /batch/<id> endpoint returning per-file status + completion summary
- /my-files extended with batch_id and batch_size fields
- Converter unchanged — writes status by video_fid as before;
KEDA scales on queue depth automatically
B3 — Batch summary email:
- Implemented entirely in email.py (consumer untouched)
- Last-file detection via job_status query with atomic per-batch claim
to prevent duplicate summary emails from concurrent consumers
- Graceful fallback to per-file emails if MongoDB unreachable
- Deploy prerequisites (documented, not blocking app):
(a) notification→mongodb:27017 NetworkPolicy egress rule needed
(b) credentialed MONGODB_URI in notification-secret (not configmap)
Until both land, falls back to one email per file — same pattern
as Sprint 1 backup egress and Sprint 3 gateway→redis
B4 — Multi-file drop zone UI:
- Multi-select input with removable file list preview before upload
- 'Upload N files' button label; batch confirmation on success
- Downloads page groups batched files under batch header with
per-file status pills; single uploads unchanged
- No new npm packages
B5 — Rate limiter per-file cost:
- @limiter.limit('20 per hour', cost=lambda: len(request.files.getlist('file')))
- Each file in a batch consumes one token (flask-limiter >= 3.5 cost param)
- Cleaner than limiter.check() loop; keeps the decorator pattern
Docs:
- IMPROVEMENT_ASSESSMENT.md: B1–B5 IMPLEMENTED with evidence, £0 cost
row, closing assessment paragraph; now tracked in repo
- VIDCAST_PRODUCTION_NARRATIVE.md: now tracked in repo
Two deferred platform gaps remain documented:
- P4 workspace isolation (schema migration required)
- P2 S3 file storage (staged migration, documented upgrade path)
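A minimal sketch of the B1 per-file loop described in the commit above, under stated assumptions. `upload_one` is a hypothetical stand-in for util.upload returning (video_fid, err). The per-file result fields and the over-limit behaviour (raising ValueError) are assumptions, as the commit does not specify them.

```python
import uuid
from typing import Callable, Iterable, Optional, Tuple

MAX_BATCH_SIZE = 20

UploadFn = Callable[[str], Tuple[Optional[str], Optional[str]]]


def upload_batch(filenames: Iterable[str], upload_one: UploadFn) -> Tuple[dict, int]:
    """Upload each file independently and return a (payload, 202) pair."""
    names = list(filenames)
    if len(names) > MAX_BATCH_SIZE:
        raise ValueError("batch too large")  # assumed behaviour
    # Single-file uploads carry no batch_id, matching the None/1 convention.
    batch_id = str(uuid.uuid4()) if len(names) > 1 else None
    results = []
    for name in names:
        video_fid, err = upload_one(name)
        if err is not None:
            # One bad file is recorded and skipped; the batch continues.
            results.append({"filename": name, "status": "failed", "error": err})
        else:
            results.append({"filename": name, "status": "queued", "video_fid": video_fid})
    queued = sum(1 for r in results if r["status"] == "queued")
    payload = {
        "batch_id": batch_id,
        "results": results,
        "queued": queued,
        "failed": len(results) - queued,
    }
    return payload, 202


def fake_upload(name: str) -> Tuple[Optional[str], Optional[str]]:
    """Test double: rejects one file, accepts the rest."""
    if name == "bad.mp4":
        return None, "internal server error"
    return f"fid-{name}", None


payload, status = upload_batch(["a.mp4", "bad.mp4", "c.mp4"], fake_upload)
print(status, payload["queued"], payload["failed"])
```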
…atch-conversion feat(batch): multi-file upload, batch status tracking, summary email,…
…x-and-notifications Feature/improvement sprint 4 ux and notifications
…ngress-tls Feature/improvement sprint 2 ingress tls
…bservability-and-abuse-protection Feature/improvement sprint 3 observability and abuse protection
Comment/docstring-only across the Sprint 1–5 files: drop tutorial narration, step-by-step annotations, obvious restatements, and sprint-ID tags; keep the decision rationale, traps, backward-compat notes, and deploy prerequisites. No logic changes (ruff + frontend build + terraform fmt all clean). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment-only cleanup across all files modified in Sprints 1-5.
No logic changes. ruff passes. Frontend builds.