
Chore/comment cleanup #13

Open
johnnybabs wants to merge 102 commits into N4si:main from johnnybabs:chore/comment-cleanup

Conversation

@johnnybabs


Comment-only cleanup across all files modified in Sprints 1-5.
No logic changes. ruff passes. Frontend builds.

johnnybabs and others added 30 commits June 1, 2026 09:11
- Added comprehensive .gitignore covering Terraform state, k8s secrets,
  build artifacts, Python cache, Node modules, and IDE files
- Untracked 6 secret.yaml files that should never be in git history
- Created directory structure for terraform/, monitoring/, docs/,
  src/frontend/, .github/workflows/
- Added terraform.tfvars.example template
- Added CLAUDE.md and VIDCAST_UPGRADE_PLAN.md project context files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- VPC module: VPC, 2 public subnets (eu-west-2a/b), IGW, route table
- IAM module: EKS cluster role + node role with correct policy attachments
- EKS module: cluster v1.31, managed node group, OIDC provider for IRSA
  - Validation block rejects T-type instances (blocked by account SCP)
- Security groups module: NodePort rules for ports 30002-30008
- Dev environment: root module wiring all child modules + S3/DynamoDB backend
- All resources tagged: Project=vidcast, ManagedBy=terraform, Environment=dev

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g + Trivy)

- ci.yml: matrix build for 4 services — ruff lint, Trivy CRITICAL/HIGH scan,
  Docker build + push tagged with short git SHA (never :latest)
- cd.yml: EKS deployment triggered by workflow_run on CI success
- Jenkinsfile: parallel builds, Trivy scan, Docker Hub push, Swarm staging
  deploy, smoke test via /healthz, manual approval gate, EKS production
  deploy with automatic rollback on pipeline failure
- docker-compose.swarm.yml: overlay network, named volumes, rollback on
  failure for all services — mirrors EKS deployment for staging parity
- GITHUB_SECRETS_REQUIRED.md: documents all secrets needed for CI/CD

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…port

Auth service:
- Added /healthz endpoint testing PostgreSQL connectivity (200 ok / 503 error)

Gateway service:
- Added /healthz endpoint testing MongoDB + RabbitMQ connectivity
- Added flask-cors to requirements.txt; CORS(server) for frontend support

Converter + Notification services:
- Added pathlib.Path('/tmp/healthy').touch() after each successful message

All 4 deployment manifests:
- Liveness + readiness probes (HTTP for auth/gateway, exec for converter/notification)
- Resource requests/limits: auth 50m/200m 64Mi/128Mi, gateway 100m/300m 128Mi/256Mi,
  converter 250m/500m 256Mi/512Mi, notification 50m/100m 64Mi/128Mi
- securityContext: runAsNonRoot, runAsUser=1000, readOnlyRootFilesystem,
  allowPrivilegeEscalation=false, capabilities.drop ALL
- Converter + notification: emptyDir volume mounted at /tmp for temp file writes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
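
The 200 ok / 503 error contract of the `/healthz` endpoints can be sketched with the standard library alone, so it runs without Flask or a live database. `run_health_checks` and the check callables are hypothetical names for illustration, not the services' actual code.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[dict, int]:
    """Run each dependency check and return (body, HTTP status).

    A check signals failure by raising any exception.
    """
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:  # report the failure instead of crashing the probe
            results[name] = f"error: {exc}"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "ok" if healthy else "error", "checks": results}
    return body, (200 if healthy else 503)


def failing_check() -> None:
    """Stand-in for a dependency that is down (hypothetical)."""
    raise ConnectionError("connection refused")


if __name__ == "__main__":
    print(run_health_checks({"mongodb": lambda: None, "rabbitmq": failing_check}))
```

Returning 503 when any single dependency fails is what lets a readiness probe take the pod out of rotation until the dependency recovers.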
… alerts

- monitoring/values.yaml: kube-prometheus-stack config — Grafana NodePort 30007
  (admin/vidcast-demo), Alertmanager NodePort 30008, 7d retention, 10Gi storage,
  etcd/scheduler/controller-manager disabled (EKS manages these)
- monitoring/dashboards/vidcast-operations.json: custom Grafana dashboard with
  pod status, restart counts, node CPU/memory gauges, RabbitMQ queue depth
  timeseries, per-pod CPU and memory usage
- monitoring/alerts/vidcast-alerts.yaml: PrometheusRule CRD with 5 alerts:
  PodCrashLoopBackOff (critical), HighNodeMemory >85% (warning),
  HighNodeCPU >85% (warning), RabbitMQQueueBacklog >10 msgs (warning),
  RabbitMQUnavailable (critical)
- monitoring/README.md: install, access, and uninstall instructions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rchitecture

- React 18 + Vite + Tailwind CSS single-page application
- Pages: Login (JWT auth), Upload (drag-and-drop MP4), Download (file ID input),
  Dashboard (Grafana iframe + links), Architecture (interactive service diagram)
- src/api.js: axios wrapper for login, uploadVideo, downloadMp3
- Dockerfile: multi-stage — Node 18 build, nginx 1.25 serve as non-root (uid 1001)
- nginx.conf: proxy /api/ to gateway service, SPA routing, security headers
- manifest/: Deployment, Service (NodePort 30006), ConfigMap

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…notes

- README.md: rewritten for public GitHub — product overview, architecture
  diagram, quick-start deploy guide, CI/CD overview, security summary, teardown
- docs/architecture.md: full service inventory, data flow walkthrough
  (13-step upload path), port map, security architecture (implemented vs
  discussed-but-not-built)
- docs/deployment-guide.md: step-by-step guide for Terraform, Helm, PostgreSQL
  init, RabbitMQ queues, secret creation, microservice deploy, E2E test,
  monitoring install, operational commands, cost management, full teardown
- docs/presentation-notes.md: 12-15 min timing guide, opening script,
  architecture analogies (restaurant/post office/security badge), platform
  engineering walkthrough, what-I'd-do-next talking points, 7 common
  interview questions with full model answers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This edit triggers the CI process for Docker image builds.
Removed a line indicating an edit to trigger CI.
Split all multi-import lines (E401) across 7 files. Additional fixes:
- auth/server.py: bare except → except Exception (E722)
- auth/validate.py: not "x" in → "x" not in (E713)
- gateway/server.py: remove unused DispatcherMiddleware import (F401)
- converter/consumer.py: remove unused time import (F401)
- converter/to_mp3.py: remove unused err variable in except clause (F841)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
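
For readers unfamiliar with the rule codes, the lint fixes above follow these patterns. The functions are hypothetical illustrations, not code from the services.

```python
# E401: one import per line instead of `import os, sys`.
import os
import sys


def parse_port(raw: str) -> int:
    """E722: catch Exception rather than using a bare `except:`."""
    try:
        return int(raw)
    except Exception:  # a bare `except:` would also swallow KeyboardInterrupt
        return 0


def missing_auth_header(headers: dict) -> bool:
    """E713: write `x not in y` instead of `not x in y`."""
    return "Authorization" not in headers


def safe_divide(a: float, b: float) -> float:
    """F841: drop the unused `as err` binding in the except clause."""
    try:
        return a / b
    except ZeroDivisionError:  # was `except ZeroDivisionError as err:` with err unused
        return 0.0


if __name__ == "__main__":
    # F401 would flag os and sys if they were imported but never used.
    print(os.name, sys.platform, parse_port("8080"))
```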
python:3.10-slim-bullseye (Debian 11) has CRITICAL/HIGH CVEs with fixes
available, causing Trivy to fail CI. python:3.10-slim-bookworm (Debian 12,
current stable) resolves these. Applied to all 4 service Dockerfiles.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
prometheus-client was declared in requirements.txt but never imported or
initialised. The only intended consumer was the unauth_count counter, whose
call sites (unauth_count.inc()) were already removed as a NameError crash fix.
Dropping the dependency shrinks the image and removes a dead transitive.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The notification service only reads the mp3 queue and sends email via smtplib.
It has no media-processing code path, so the ffmpeg install (~100MB) was pure
waste copied from the converter Dockerfile. Removing it shrinks the image and
reduces the CVE surface Trivy has to scan.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
None of the four Python service Dockerfiles dropped privileges; the final image
ran as root. Added USER 1000 before CMD in each, matching the Kubernetes
securityContext (runAsNonRoot: true, runAsUser: 1000) already enforced on the
deployments. This makes the images non-root by default even outside k8s (e.g.
the Docker Swarm staging environment). All listen ports are >1024 and the only
runtime writes target /tmp (1777), so no privileged access is required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No service had a .dockerignore, so docker build sent the entire context
(including manifest/, secret.yaml files, __pycache__, .git, and docs) to the
daemon. The new files exclude that cruft, keeping build contexts small and
ensuring Kubernetes secrets can never be baked into an image layer by accident.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The MongoDB connection strings (with embedded username/password) lived in
gateway-configmap and converter-configmap. ConfigMaps are not treated as
sensitive — they are trivially dumped via `kubectl get configmap -o yaml` and
were committed in plaintext. Moved them to the gateway-secret / converter-secret
Secret objects. Env var names are unchanged and the deployments already mount
both configMapRef and secretRef via envFrom, so this is transparent to the apps.

Also in this change:
- Removed unused VIDEO_QUEUE from notification-configmap (consumer only reads
  MP3_QUEUE; the video queue is the converter's).
- Added secret.yaml.example templates for all four services (committed) so
  operators have the key structure without any real secret entering git.
- Added imagePullPolicy: IfNotPresent to the four backend deployments, which CD
  re-tags with immutable git-SHA images. Left the frontend on the default
  (Always) since it still uses a mutable :latest tag.
- Updated the deployment guide's secret-creation step for the moved keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ning

Comment-only changes documenting known issues that cannot be safely fixed in a
surgical pass without coordinated schema/data work:
- auth-service/server.py + Postgres/init.sql: flag plaintext password storage
  and comparison; recommend bcrypt/argon2 + constant-time verify for production.
- MongoDB pvc.yaml: flag that the 1Gi claim binds a 10Gi PV, leaving ~9Gi unused.

No behaviour changes; these guide the next engineer toward the proper fixes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
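
A sketch of the recommended direction, salted hashing plus constant-time verification. The comments above recommend bcrypt or argon2; this example swaps in the standard library's PBKDF2 so that it runs without extra packages. The function names and the iteration count are assumptions, not part of the auth service.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple


def hash_password(
    password: str, salt: Optional[bytes] = None, iterations: int = 600_000
) -> Tuple[bytes, bytes]:
    """Return (salt, digest) using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest


def verify_password(
    password: str, salt: bytes, expected: bytes, iterations: int = 600_000
) -> bool:
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)


if __name__ == "__main__":
    salt, stored = hash_password("correct horse", iterations=1_000)
    print(verify_password("correct horse", salt, stored, iterations=1_000))
```

`hmac.compare_digest` avoids the timing side channel that a plain `==` comparison on the stored value can expose.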
Trivy (CRITICAL,HIGH, ignore-unfixed) was failing on vulnerabilities that the
bookworm base-image bump alone did not clear, at two layers below the app deps:

- OS packages: added `apt-get upgrade -y` to pull patched libgnutls30
  (CRITICAL CVE-2026-33845, CVE-2026-42010) and the libkrb5* family (HIGH).
- Build toolchain: added `pip install --upgrade setuptools wheel` so the image
  ships patched wheel (CVE-2026-24049) and setuptools-vendored jaraco.context
  (CVE-2026-23949), neither of which the app imports but Trivy still scans.

Also: dropped the unused build-essential/libpq-dev/python3-dev from the
notification image (its deps are pure-Python wheels), and added apt-cache
cleanup (`rm -rf /var/lib/apt/lists/*`) to keep the images slim. Verified the
debian target reports 0 vulnerabilities on all four images locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rewrote all four requirements.txt as minimal >= floors so pip resolves patched
transitive deps (Jinja2, MarkupSafe, idna, charset-normalizer, etc.) instead of
the old fully-frozen 2022 pins. Dropped dev-only tooling (pylint/astroid/jedi/
isort) that was never imported at runtime, and auth's cryptography (the service
signs JWTs with HS256 = stdlib hmac; cryptography is only needed for RS256).

Key version floors (each clears a Trivy-flagged fixable CVE):
- Flask >=3.0.3 / Werkzeug >=3.0.3 — CVE-2024-34069 (debugger RCE) is only
  fixed in Werkzeug 3.0.3, which requires Flask 3. gateway's flask-pymongo
  bumped to >=3.0.1 for Flask-3 compatibility (the .db API it uses is unchanged).
- Flask-Cors >=4.0.2 — CVE-2024-6221 (CORS bypass).
- requests >=2.31.0 — CVE-2023-32681.
- certifi >=2023.7.22 — CVE-2023-37920.
- urllib3 >=2.6.0 — the latest 1.26.x still has 4 fixable HIGH CVEs
  (e.g. CVE-2025-66418) patched only in the 2.x line; safe because requests
  supports urllib3 2.x and no app code uses urllib3 directly.
- converter: numpy <2.0 (moviepy 1.0.3 compat) + Pillow >=10.3.0
  (CVE-2023-44271 / CVE-2023-50447, CRITICAL).

Verified locally: all four images pass `trivy image --severity CRITICAL,HIGH
--ignore-unfixed --exit-code 1` (0 findings), and Flask-3/Flask-PyMongo-3 and
moviepy imports were smoke-tested in-container.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…aform

Replaces static AWS access keys in the CD pipeline with short-lived,
OIDC-issued credentials — no long-lived secrets stored in GitHub.

Terraform:
- New module terraform/modules/github-oidc: creates the GitHub Actions OIDC
  identity provider and a deploy IAM role whose trust policy is scoped to
  repo:johnnybabs/microservices-python-app:* (aud sts.amazonaws.com). The role
  grants only eks:DescribeCluster (for `aws eks update-kubeconfig`).
- eks module: set access_config.authentication_mode = API_AND_CONFIG_MAP so
  EKS access entries work alongside aws-auth.
- root module: wire the github-oidc module and add an aws_eks_access_entry +
  access_policy_association granting the deploy role AmazonEKSEditPolicy at
  cluster scope — this is what lets `kubectl set image` actually run. Added
  github_org/github_repo variables and a github_actions_role_arn output.

Workflow:
- cd.yml now uses aws-actions/configure-aws-credentials@v4 with role-to-assume
  and adds `permissions: id-token: write` to request the OIDC token. Drops the
  AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY inputs.
- GITHUB_SECRETS_REQUIRED.md: CD secrets section rewritten for OIDC
  (AWS_DEPLOY_ROLE_ARN from `terraform output github_actions_role_arn`).

Validated with `terraform fmt` + `terraform validate` (backend=false). Not yet
applied — cluster provisioning runs next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
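
The trust policy scoping described above has roughly the following shape. The Terraform module's actual HCL is not shown in this PR, so this is a sketch of the resulting IAM policy document built as a Python dict; the function name and the account ID are placeholders.

```python
import json

OIDC_PROVIDER = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, org: str, repo: str) -> dict:
    """Build a trust policy that lets one repository assume the deploy role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{OIDC_PROVIDER}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # Audience must be STS, as stated in the commit message.
                    "StringEquals": {f"{OIDC_PROVIDER}:aud": "sts.amazonaws.com"},
                    # Any ref or environment of this one repository.
                    "StringLike": {f"{OIDC_PROVIDER}:sub": f"repo:{org}/{repo}:*"},
                },
            }
        ],
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID.
    policy = github_oidc_trust_policy("123456789012", "johnnybabs", "microservices-python-app")
    print(json.dumps(policy, indent=2))
```

The trailing `:*` in the `sub` condition accepts every branch, tag, and environment of the repository; narrowing it to a single branch would tighten the scope further.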
Both StatefulSets referenced a Secret (mongodb-secret, rabbitmq-secret)
that no chart template produced. Fresh helm installs hung in
ContainerCreating (Mongo: FailedMount) or CreateContainerConfigError
(RabbitMQ: secret not found) until the secrets were created manually.

- MongoDB: 5 keys (MONGO_ROOT_USERNAME/PASSWORD, MONGO_USERNAME/PASSWORD,
  MONGO_USERS_LIST) sourced from values.yaml.secret.*
- RabbitMQ: 2 keys (RABBITMQ_DEFAULT_USER/PASS) sourced from
  values.yaml.secret.* (new section - values.yaml had no secret config)

Postgres chart intentionally untouched: it has no referenced-but-missing
secret; it injects POSTGRES_USER/PASSWORD/DB directly as env vars from
values.yaml, so it renders and runs cleanly as-is.

.gitignore: the blanket **/secret.yaml rule (meant for real app-manifest
secrets) was also hiding these chart templates. Added scoped negations so
the templates are tracked; they hold no literal credentials, only
{{ .Values.secret.* }} references.

Manual secrets remain in place for the current deployment to avoid Helm
ownership conflicts. Charts are now self-contained for the next clean
install.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Without bootstrap_cluster_creator_admin_permissions=true, the principal
that runs terraform apply has no kubectl access to the resulting cluster
and must manually create their own access entry. This locked out
johnadmin today after the first terraform apply. Fix makes the access
grant automatic on cluster creation, preventing recurrence on rebuild.

NOT applied to the live cluster: this attribute is creation-only
(ForceNew in the AWS provider), so applying against the existing
vidcast-cluster would force-replace it. The fix takes effect on the next
greenfield rebuild. terraform CLI is also not present in this operator
environment, so fmt/validate/plan were not re-run here; the edit is a
single aligned attribute addition matching terraform fmt style.

Also gitignore the local 'tfplan'/'*.tfplan' binary plan artifacts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Previously the pika connection was constructed with no credentials,
which silently defaulted to guest:guest. With the RabbitMQ Helm chart
now configuring rabbituser as the only user, connections failed with
ACCESS_REFUSED.

This change reads RABBITMQ_DEFAULT_USER and RABBITMQ_DEFAULT_PASS from
the container environment, with a guest:guest fallback so local
development without a secret still works. The env vars are injected in
production via envFrom: secretRef: rabbitmq-secret in each deployment
manifest.

Gateway has two connection sites (module-level publish channel and the
/healthz probe); both now use a shared PlainCredentials object.

Resolves the credential mismatch between the chart and the running
application code.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
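
A minimal sketch of the environment lookup with the guest:guest fallback. `rabbitmq_credentials` is a hypothetical helper name; in the services the resulting pair would be passed to `pika.PlainCredentials(user, password)`, which is omitted here so the example runs without pika installed.

```python
import os
from typing import Mapping, Tuple


def rabbitmq_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read RabbitMQ credentials, falling back to guest:guest for local dev."""
    user = env.get("RABBITMQ_DEFAULT_USER", "guest")
    password = env.get("RABBITMQ_DEFAULT_PASS", "guest")
    return user, password


if __name__ == "__main__":
    user, _ = rabbitmq_credentials()
    print(f"connecting as {user}")
```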
- Image references updated from nasi101/* (upstream tutorial) to
  johnbaabalola/*-service (this fork's CI-built images), pinned to commit
  SHA c91216a for deterministic deploys. Image names match the CI matrix
  (auth-service, gateway-service, etc.), not the short nasi101 names.
- Gateway, converter, and notification deployments now load RabbitMQ
  credentials from rabbitmq-secret via an additional envFrom: secretRef
  (appended to existing envFrom blocks, not replacing them).
- Auth service image bumped but no RabbitMQ secret added (it does not
  connect to RabbitMQ).

Works with the prior commit that reads RABBITMQ_DEFAULT_USER/PASS from
the environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The CVE dependency bump (5c224a3) upgraded PyMongo to a release that
requires MongoDB >= 4.2 (wire version 8). The chart pinned mongo:4.0.8
(wire version 7), so gateway and converter failed at runtime with:
  'Server at mongodb:27017 reports wire version 7, but this version of
   PyMongo requires at least 8 (MongoDB 4.2).'

This surfaced as gateway /healthz 503 (mongodb check) and would have
broken all GridFS upload/download. mongo:4.2 is the minimum compatible
version and the supported single-step upgrade from 4.0 (a direct jump to
4.4+ refuses to start against a 4.0 feature-compatibility-version data
dir).

Live cluster already bumped via 'kubectl set image statefulset/mongodb'
(no app data existed, so the in-place upgrade was non-destructive).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
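
The compatibility check behind that error reduces to a comparison of wire versions. The sketch below uses only the two values quoted in the commit message; the names are illustrative.

```python
# Wire versions quoted in the commit message above.
MAX_WIRE_VERSION = {"4.0": 7, "4.2": 8}

REQUIRED_WIRE_VERSION = 8  # minimum demanded by the upgraded PyMongo


def is_compatible(server_release: str, required: int = REQUIRED_WIRE_VERSION) -> bool:
    """Return True if the server release speaks a new enough wire protocol."""
    return MAX_WIRE_VERSION[server_release] >= required


if __name__ == "__main__":
    for release in MAX_WIRE_VERSION:
        print(release, is_compatible(release))
```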
The converter and notification deployments use an exec liveness probe
(test -f /tmp/healthy), but the file was only created AFTER a message was
successfully processed. An idle consumer with no traffic therefore never
created the file and was killed by the probe (~45s), crash-looping
forever.

For notification this was unrecoverable: with a placeholder Gmail
password, email.notification() always errors -> basic_nack -> the
per-message touch never runs, so the pod could never become healthy.

Now each consumer touches /tmp/healthy once immediately after connecting
to RabbitMQ and being ready to consume (a meaningful 'connected and
consuming' signal), and still refreshes it after each processed message.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
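
The fix can be sketched as follows: touch the health file once at startup, then again after each successfully processed message. The `consume` loop is a simplified stand-in for the pika consumers, and its signature is an assumption made so the example runs without RabbitMQ.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable

HEALTH_FILE = Path("/tmp/healthy")  # path checked by `test -f /tmp/healthy`


def mark_healthy(path: Path = HEALTH_FILE) -> None:
    """Create the file, or refresh its mtime if it already exists."""
    path.touch()


def consume(
    messages: Iterable[bytes],
    handle: Callable[[bytes], None],
    health_file: Path = HEALTH_FILE,
) -> int:
    """Process messages and return how many succeeded."""
    mark_healthy(health_file)  # startup touch: an idle consumer now passes the probe
    processed = 0
    for body in messages:
        try:
            handle(body)
        except Exception:
            continue  # the real consumers basic_nack here; no touch on failure
        mark_healthy(health_file)
        processed += 1
    return processed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "healthy"
        consume([], handle=lambda body: None, health_file=marker)
        print(marker.exists())
```

With the startup touch in place, a consumer whose handler always fails (the notification case described above) still has a health file, because the file no longer depends on a message succeeding.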
… to 16f49a0

Three deploy-time fixes found during the live rollout to vidcast-cluster:

- gateway: add an emptyDir volume mounted at /tmp. With
  readOnlyRootFilesystem=true and no writable temp dir, Werkzeug's
  multipart upload buffering failed -> POST /upload returned 500
  ('No usable temporary directory found'). The converter and notification
  deployments already had this volume; gateway was missing it.
- converter: 4 -> 2 replicas (and maxSurge 8 -> 1). The single
  m7i-flex.large node (2 vCPU) could not schedule 4 converters @ 250m
  CPU request alongside the rest; the extra pods sat Pending with
  'Insufficient cpu'. 2 replicas comfortably handle demo throughput.
- all four services pinned to johnbaabalola/<svc>:16f49a0 (the SHA that
  includes the RabbitMQ-credential and /tmp/healthy startup fixes).

End-to-end verified: login -> upload -> convert (MoviePy) -> mp3 queue ->
notification consume. Email itself fails by design (placeholder Gmail
App Password).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Uploads through the frontend /api proxy failed with 413 Request Entity
Too Large: nginx defaults client_max_body_size to 1m, but VidCast
uploads MP4s (the bundled test asset alone is 2.8MB). Direct gateway
uploads (NodePort 30002) were unaffected because they bypass nginx; only
the frontend path (30006 -> /api/) hit the limit. Raised to 256m.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI does not build the frontend (matrix covers only the 4 backend
services), so johnbaabalola/frontend:latest never existed on Docker Hub.
Built locally and pushed to this account's ECR
(501562869470.dkr.ecr.eu-west-2.amazonaws.com/vidcast-frontend); the EKS
node IAM role can pull from ECR in-account, so no registry credentials
or imagePullSecret are needed. Pinned to commit fd35335 (includes the
nginx client_max_body_size upload fix).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
johnnybabs and others added 30 commits June 9, 2026 17:03
…redis Service env collision

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s never built)

Argo CD dev auto-sync rolled outbox-relay to johnbaabalola/outbox-relay:e4d2669,
a placeholder SHA that was never built and pushed, causing ImagePullBackOff.
65f2f57 is the real image in the registry (the version that was running before
the sync). This lets vidcast-dev self-heal to a working image.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…st local values, Argo ignoreDifferences for KEDA-managed replicas

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Only notification was missed in the image-tag bump; the 16f49a0 image
predates the B4 /metrics instrumentation, so its PodMonitor target was down
and the consumer lacked the metrics endpoint. 65f2f57 exists in the registry
and matches the other backends.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…2) under default-deny

The allow-monitoring policy predated B4: converter/notification now expose a
:9000 metrics endpoint (scraped by PodMonitors) and rabbitmq is scraped on the
:15692 prometheus-plugin port. Without these allows, applying default-deny drops
those scrapes and the targets go DOWN. Verified: all targets stay UP post-policy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Dashboard page bakes VITE_GRAFANA_URL at build time; without it the iframe
falls back to localhost:30007 (broken for remote browsers). Expose it as a build
arg so the image points at the live Grafana NodePort.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…erification

verify-images (B5) pulls the vidcast-frontend image from PRIVATE ECR; the AWS SDK
fetches node-role creds from IMDS (169.254.169.254:80). The DNS+443-only egress
blocked :80, so the ECR call hung past the 10s webhook deadline -> context canceled
-> failurePolicy:Fail rejected every ECR-image admission (e.g. frontend rollout).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Renamed from the (gitignored) handover into a tracked deployment guide. Uses only
placeholders (<AWS_ACCOUNT_ID>, <YOUR_DOCKERHUB_USER>, <YOUR_GITHUB_ORG>, <NODE_IP>,
…) — no personal data or secrets. Adds prerequisites, a customisation table with
how-to-get-each-value, a cost warning, and pointers to ./customise.sh + ./deploy.sh.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
deploy.sh automates the full bring-up (datastores → secrets → app → KEDA/Argo/
Kyverno/monitoring/Kubecost → NetworkPolicies → smoke test) and injects DB
passwords via helm --set from env vars. customise.sh (now tracked, env-driven,
auto-detects current identity — no hardcoded operator values) repoints the repo
to your Docker Hub / AWS / GitHub. Helm values carry CHANGEME placeholders instead
of real passwords; init.sql holds no admin hash — the admin's bcrypt hash is
generated in-DB via pgcrypto at deploy time. No secret lives in a tracked file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace operator-specific identity (AWS account id, Docker Hub user, GitHub org,
name) with placeholders across tracked docs/READMEs so the public repo carries no
personal data. Functional GitOps config (k8s overlays/argocd/kyverno image+repo
refs) is intentionally left intact — Argo CD/AWS need real values and those are
inherently public.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Switch Postgres from POSTGRES_HOST_AUTH_METHOD=trust (any password accepted;
access was network-only) to scram-sha-256 so the credential is actually enforced.
The auth method is now a chart value (default scram-sha-256); deploy.sh already
--sets a non-empty password and seeds the admin in-DB, so a fresh deploy is
consistent.

Also fix a latent bug that trust auth had masked: the auth service reads the
password from env DATABASE_PASSWORD, but the ExternalSecret emitted the key as
PSQL_PASSWORD (injected via envFrom). Under trust the missing DATABASE_PASSWORD
(None) was accepted; under scram it failed. Rename the ExternalSecret key to
DATABASE_PASSWORD (dev + prod) so the app gets the password it reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
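
The key mismatch is easy to reproduce in isolation. The lookup below mirrors the description above (the auth service reads `DATABASE_PASSWORD`); the helper name and the example values are hypothetical.

```python
from typing import Mapping, Optional


def database_password(env: Mapping[str, str]) -> Optional[str]:
    """Read the password under the name the auth service is described as using."""
    return env.get("DATABASE_PASSWORD")


# Before the fix: the ExternalSecret emitted PSQL_PASSWORD, so the lookup found nothing.
env_before = {"PSQL_PASSWORD": "example-password"}

# After the fix: the key is renamed to DATABASE_PASSWORD.
env_after = {"DATABASE_PASSWORD": "example-password"}

if __name__ == "__main__":
    print(database_password(env_before), database_password(env_after))
```

Under trust auth the `None` password was accepted by Postgres, which is why the mismatch stayed hidden until scram-sha-256 was enabled.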
… runbook

- terraform: aws-ebs-csi-driver addon + IRSA role (vidcast-cluster-ebs-csi-irsa)
- terraform: new storage module — vidcast-backups S3 bucket (private, versioned,
  AES256, 30-day lifecycle) + backup IRSA role (PutObject/ListBucket scoped)
- helm/postgres: StorageClass (gp3/Retain) + 2Gi RWO PVC; deployment mounts PVC
  at PGDATA subdir; strategy Recreate; gated by persistence.enabled flag
- k8s: vidcast-backup ServiceAccount (IRSA-annotated) + nightly mongodump and
  pg_dump CronJobs (initContainer + aws-cli upload); wired into dev+prod overlays
- docs: DISASTER_RECOVERY.md — restore procedures, RTO ≤2h / RPO ≤24h, drill
  runbook; Last restore test: NOT YET TESTED (close after first drill)

Closes A11, I4, P5. No application code touched. Not applied to AWS.
Terraform apply order: CSI addon → helm upgrade postgres → CronJob deploy.
…rability-and-backup

feat(durability): EBS CSI addon, Postgres PVC, S3 backup CronJobs, DR…
Apply-time fixes discovered during Sprint 1 deployment:

1. NetworkPolicy — default-deny blocked backup pod egress entirely; both
   CronJobs failed on first run. Added allow-backup-egress.yaml granting
   backup SA egress to MongoDB (27017), Postgres (5432), and AWS S3 (443).
   Also added ingress-from-backup rules on the datastore policies. This
   confirms default-deny is genuinely enforced, not just declared.

2. MongoDB backup credentials — mongodb-secret root user fails SCRAM-SHA-256
   auth (stale password under new auth method). Rewrote CronJob to use
   gateway-secret mongouser URIs instead; dumps both videos and mp3s
   databases successfully (38MB + 7MB confirmed in S3).

DR runbook updated: last restore test 2026-06-10, Postgres drilled,
Mongo restore noted as outstanding.
Closes P1, I7, I2.

Terraform:
- modules/lbc/: LBC IRSA role + official AWS LBC IAM policy (v2.8.1)
- dev/main.tf: wire lbc module; lbc_irsa_role_arn output added

Ingress:
- k8s/ingress/vidcast-ingress.yaml: ALB Ingress, internet-facing, IP target
  mode, HTTP→HTTPS redirect; routes / to frontend (nginx proxies /api internally)
- k8s/ingress/alb-controller-values.yaml: LBC Helm values (placeholder ARN/VPC)
- k8s/ingress/cert-manager/cluster-issuer.yaml: Let's Encrypt ClusterIssuer
  (alternative to ACM; documented in runbook)

Perimeter (I2) — all five services NodePort→ClusterIP:
- Helm_charts/MongoDB, Postgres, RabbitMQ: nodePort fields removed
- k8s/base/gateway/service.yaml, k8s/base/frontend/service.yaml: ClusterIP

Deviations from prompt (all correctness-driven, documented in INGRESS_DEPLOY.md):
- Routing via frontend/nginx, not ALB prefix-strip (ALB can't strip /api prefix)
- TLS via ACM annotation path, not cert-manager secret (ALB incompatibility)
- No new NetworkPolicy (existing gateway/frontend rules already allow :8080)
- LBC IRSA in modules/lbc/ to avoid iam↔eks dependency cycle
- Grafana routing deferred (needs subpath config, out of scope)

Not applied to AWS. Deploy via docs/INGRESS_DEPLOY.md after sign-off.
Cost delta: +~£22/mo (ALB, within approved envelope).
…ad audit, rate limiting

Closes I8, P3, A12, A10.

Structured logging:
- jsonlog.py JSON logger inlined into all 5 services (no Dockerfile change;
  COPY . /app includes it in each build context)
- All print() calls replaced with one-JSON-object-per-line structured logging
- Fields: timestamp, level, service, correlation_id, message + kwargs
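
A minimal sketch of a one-JSON-object-per-line logger with the fields listed above, built on the standard `logging` module. The real `jsonlog.py` is not shown in this PR, so the class and function names here are assumptions.

```python
import io
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        # Extra key/value pairs passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def get_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(f"sketch.{service}")
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    log = get_logger("gateway", buffer)
    log.info("File downloaded", extra={"correlation_id": "abc-123", "fields": {"fid": "42"}})
    print(buffer.getvalue(), end="")
```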

Correlation IDs (I8/P3):
- Gateway mints UUID4 per request, stamps into RabbitMQ message body
- Converter and notification read and propagate correlation_id on every log line
- Outbox relay republishes payload verbatim, preserving correlation_id
- Backwards compatible: messages missing correlation_id default to 'legacy'
- Single correlation_id greppable from upload through to email notification
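
The mint-and-propagate flow, including the 'legacy' default, can be sketched as below. The message field names other than `correlation_id` are assumptions about the payload shape.

```python
import json
import uuid


def build_message(video_fid: str, username: str) -> bytes:
    """Gateway side: mint a UUID4 and stamp it into the message body."""
    message = {
        "video_fid": video_fid,  # hypothetical field name
        "username": username,  # hypothetical field name
        "correlation_id": str(uuid.uuid4()),
    }
    return json.dumps(message).encode()


def correlation_id_of(body: bytes) -> str:
    """Consumer side: messages without the field default to 'legacy'."""
    return json.loads(body).get("correlation_id", "legacy")


if __name__ == "__main__":
    print(correlation_id_of(build_message("fid-1", "user@example.com")))
    print(correlation_id_of(b'{"video_fid": "old"}'))
```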

Download audit (A12):
- Structured 'File downloaded' log line on every successful GET /download
- Captures: correlation_id, fid, user (from JWT), file_size_bytes

Rate limiting (A10):
- flask-limiter on /login (10/min) + /upload (20/hr), Redis-backed
- XFF-aware client key (gateway sits behind nginx/ALB)
- in_memory_fallback_enabled=True — app never breaks if Redis is unreachable
- Redis port hardcoded 6379 (REDIS_PORT env unsafe: K8s injects tcp:// URI)
- Gateway→redis:6379 NetworkPolicy egress rule needed for cross-worker
  limit sharing — documented in OBSERVABILITY.md; degrades gracefully
  to per-process in-memory until that follow-on infra PR lands
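
The `REDIS_PORT` trap comes from the Docker-link style variables Kubernetes injects for a Service named `redis`: the value is a `tcp://` URI rather than a bare port number. A sketch of what a naive lookup would do (the cluster IP is made up for illustration):

```python
from typing import Mapping

REDIS_PORT = 6379  # hardcoded, as in the commit


def naive_redis_port(env: Mapping[str, str]) -> int:
    """What reading REDIS_PORT from the environment would do.

    Raises ValueError when the value is a tcp:// URI.
    """
    return int(env.get("REDIS_PORT", "6379"))


# Variable injected for a Service named `redis` (illustrative IP).
injected_env = {"REDIS_PORT": "tcp://10.100.0.12:6379"}

if __name__ == "__main__":
    try:
        naive_redis_port(injected_env)
    except ValueError as exc:
        print(f"lookup failed: {exc}")
    print(f"using hardcoded port {REDIS_PORT}")
```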

Docs:
- docs/OBSERVABILITY.md: grep examples, field reference, rate limit values,
  known limitations (XFF spoofability), log shipping TODO
- docs/IMPROVEMENT_ASSESSMENT.md: A10/A12/I8/P3 marked IMPLEMENTED

Not applied to cluster. Rebuild images via CI on push.
Deploy note: add gateway→redis NetworkPolicy egress rule before relying
on shared rate limiting across gunicorn workers.
…l.py logging

These two files were left out of 3397772. util.py is where correlation_id is
added to the RabbitMQ message body, so without it the converter/notification
correlation tracing read "legacy" for every request. email.py switches the
notification logging from print() to structured JSON. Completes I8/P3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…talogue

Closes UX1–UX9.

UX1 — Display name:
- display_name derived from email prefix (custom name deferred — needs
  Postgres schema migration + init.sql change, out of scope this sprint)
- JWT carries display_name claim; frontend userFromToken reads it for nav

UX2 — Improved email:
- Subject names the original file; body includes filename, correlation_id
  as reference number, link to conversions page via VIDCAST_URL env var
- mp3 fid removed from email body (closes A8 leakage incidentally)
- VIDCAST_URL defaults to localhost:30002 in code; documented in email.py

UX3 — Badge clears on conversions page visit:
- Frontend-only: unseen count is a client-side since timestamp; no server
  state to clear, no mark-seen endpoint needed

UX4 — Three-state status (Queued → Processing → Ready):
- New additive job_status collection in MongoDB
- Gateway writes 'queued' on upload
- Converter writes 'processing' on start, 'ready'/'failed' on completion
  (notification has no Mongo connection — 'ready' stays with converter)
- /status/<fid> endpoint added to gateway
- /my-files extended to merge job_status; pre-Sprint-4 files default 'ready'
- Frontend polls every 10s while any file is queued/processing; stops on all
  ready/failed; status pills rendered per file
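
The status merge and the polling stop condition can be sketched in a few lines. The frontend itself is JavaScript; this Python version only illustrates the logic, and the `status` field name is an assumption about the job_status document.

```python
from typing import Iterable, Mapping, Optional

IN_FLIGHT = {"queued", "processing"}


def effective_status(job_status: Optional[Mapping[str, str]]) -> str:
    """Files with no job_status document (pre-Sprint-4) default to 'ready'."""
    if job_status is None:
        return "ready"
    return job_status["status"]


def should_keep_polling(statuses: Iterable[str]) -> bool:
    """Poll while any file is queued or processing; stop once all are ready/failed."""
    return any(status in IN_FLIGHT for status in statuses)


if __name__ == "__main__":
    statuses = [effective_status(None), effective_status({"status": "processing"})]
    print(statuses, should_keep_polling(statuses))
```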

UX5/6/7 — Upload area improvements:
- Post-upload confirmation shows filename + size + 'email on ready' message
- Single-file guidance text always visible below upload area
- Format list + real 256MB size limit (nginx client_max_body_size is the
  binding cap, not 5GB; corrected from the prompt's assumption)

UX8 — Downloads page:
- Original filename, friendly date, status badge, audio size, ready-only
  download button; all from extended /my-files response

UX9 — Empty state:
- 'No conversions yet' + upload CTA on downloads page and dashboard

Docs:
- IMPROVEMENT_ASSESSMENT.md: Sprint 4 table (UX1-UX9 IMPLEMENTED with
  deviations noted), Sprint 5 planned table (B1-B5), £0 cost row,
  updated overall assessment paragraph

Deviations from prompt all documented in assessment table.
Not applied to cluster. CI will rebuild all modified service images on push.
…x-and-notifications

Feature/improvement sprint 4 ux and notifications
… batch UI

Closes B1–B5. Completes five-sprint improvement programme.

B1 — Multi-file upload:
- /upload accepts N files via getlist('file'), MAX_BATCH_SIZE=20
- Per-file loop: one bad file doesn't abort the batch
- Returns JSON 202 {batch_id, results[], queued, failed} (was '200 success')
- util.upload refactored to return (video_fid, err)
- Single-file upload = N=1 case; path unchanged in behaviour
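
A sketch of the per-file loop and the 202 response shape, independent of Flask. `upload_batch` and `fake_upload` are hypothetical names; the 400 rejection for an empty or oversized batch and the per-result field names are assumptions, since the commit message does not state them.

```python
import uuid
from typing import Callable, List, Optional, Tuple

MAX_BATCH_SIZE = 20  # value from the commit message

UploadFn = Callable[[str], Tuple[Optional[str], Optional[str]]]


def upload_batch(filenames: List[str], upload: UploadFn) -> Tuple[dict, int]:
    """Upload each file independently and return (JSON body, HTTP status)."""
    if not filenames or len(filenames) > MAX_BATCH_SIZE:
        # Assumption: the real endpoint's rejection status is not stated.
        return {"error": f"expected 1 to {MAX_BATCH_SIZE} files"}, 400

    # Interpretation: batch_id is None for a single-file upload, a UUID otherwise.
    batch_id = str(uuid.uuid4()) if len(filenames) > 1 else None
    results = []
    for name in filenames:
        video_fid, err = upload(name)  # (video_fid, err) tuple, as in util.upload
        if err is not None:
            # One bad file is recorded and the loop moves on.
            results.append({"filename": name, "status": "failed", "error": err})
        else:
            results.append({"filename": name, "status": "queued", "fid": video_fid})

    queued = sum(1 for item in results if item["status"] == "queued")
    body = {
        "batch_id": batch_id,
        "results": results,
        "queued": queued,
        "failed": len(results) - queued,
    }
    return body, 202


def fake_upload(name: str) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical stand-in for util.upload: rejects non-MP4 names."""
    if not name.endswith(".mp4"):
        return None, "unsupported format"
    return f"fid-{name}", None


if __name__ == "__main__":
    print(upload_batch(["a.mp4", "notes.txt", "b.mp4"], fake_upload))
```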

B2 — Batch status tracking:
- batch_id (UUID per request) + batch_size added to RabbitMQ message
  and job_status documents; None/1 for single-file uploads
- New /batch/<id> endpoint returning per-file status + completion summary
- /my-files extended with batch_id and batch_size fields
- Converter unchanged — writes status by video_fid as before;
  KEDA scales on queue depth automatically

B3 — Batch summary email:
- Implemented entirely in email.py (consumer untouched)
- Last-file detection via job_status query with atomic per-batch claim
  to prevent duplicate summary emails from concurrent consumers
- Graceful fallback to per-file emails if MongoDB unreachable
- Deploy prerequisites (documented, not blocking app):
  (a) notification→mongodb:27017 NetworkPolicy egress rule needed
  (b) credentialed MONGODB_URI in notification-secret (not configmap)
  Until both land, falls back to one email per file — same pattern
  as Sprint 1 backup egress and Sprint 3 gateway→redis
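
The last-file detection with an atomic claim can be illustrated in memory. The commit message does not say how the claim is implemented against MongoDB, so the lock-protected set below is purely a stand-in for whatever single atomic write the real code uses; all names are hypothetical.

```python
import threading
from typing import Iterable

FINISHED = {"ready", "failed"}


class SummaryClaims:
    """In-memory stand-in for an atomic per-batch claim."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed = set()

    def claim(self, batch_id: str) -> bool:
        """Return True for exactly one caller per batch_id."""
        with self._lock:
            if batch_id in self._claimed:
                return False
            self._claimed.add(batch_id)
            return True


def should_send_summary(batch_id: str, statuses: Iterable[str], claims: SummaryClaims) -> bool:
    """Send the summary only when every file is finished and this caller wins the claim."""
    if not all(status in FINISHED for status in statuses):
        return False  # not the last file yet; do not consume the claim
    return claims.claim(batch_id)


if __name__ == "__main__":
    claims = SummaryClaims()
    print(should_send_summary("batch-1", ["ready", "failed"], claims))
    print(should_send_summary("batch-1", ["ready", "failed"], claims))
```

Checking completion before claiming matters: a consumer that sees an unfinished batch must not take the claim, or the summary would never be sent.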

B4 — Multi-file drop zone UI:
- Multi-select input with removable file list preview before upload
- 'Upload N files' button label; batch confirmation on success
- Downloads page groups batched files under batch header with
  per-file status pills; single uploads unchanged
- No new npm packages

B5 — Rate limiter per-file cost:
- @limiter.limit('20 per hour', cost=lambda: len(request.files.getlist('file')))
- Each file in a batch consumes one token (flask-limiter >= 3.5 cost param)
- Cleaner than limiter.check() loop; keeps the decorator pattern
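
The effect of a per-file cost can be shown with a toy counter. This does not reproduce flask-limiter's internals; it is a simplified single-window sketch with no time-based reset, using hypothetical names.

```python
class WindowBudget:
    """Toy limiter for one window: each hit spends `cost` tokens."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def hit(self, cost: int = 1) -> bool:
        """Spend `cost` tokens if they fit in the remaining budget."""
        if self.used + cost > self.limit:
            return False  # request rejected; nothing is spent
        self.used += cost
        return True


if __name__ == "__main__":
    budget = WindowBudget(limit=20)  # mirrors '20 per hour'
    print(budget.hit(cost=15))  # a 15-file batch
    print(budget.hit(cost=6))   # a 6-file batch would exceed the budget
    print(budget.hit(cost=5))   # a 5-file batch still fits
```

With a flat cost of 1 per request, a single 20-file batch would count the same as one single-file upload, which is the gap the per-file cost closes.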

Docs:
- IMPROVEMENT_ASSESSMENT.md: B1–B5 IMPLEMENTED with evidence, £0 cost
  row, closing assessment paragraph; now tracked in repo
- VIDCAST_PRODUCTION_NARRATIVE.md: now tracked in repo

Two deferred platform gaps remain documented:
- P4 workspace isolation (schema migration required)
- P2 S3 file storage (staged migration, documented upgrade path)
…atch-conversion

feat(batch): multi-file upload, batch status tracking, summary email,…
…x-and-notifications

Feature/improvement sprint 4 ux and notifications
…ngress-tls

Feature/improvement sprint 2 ingress tls
…bservability-and-abuse-protection

Feature/improvement sprint 3 observability and abuse protection
Comment/docstring-only across the Sprint 1–5 files: drop tutorial narration,
step-by-step annotations, obvious restatements, and sprint-ID tags; keep the
decision rationale, traps, backward-compat notes, and deploy prerequisites.
No logic changes (ruff + frontend build + terraform fmt all clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>