Docs/project guide #11

Open
johnnybabs wants to merge 47 commits into N4si:main from johnnybabs:docs/project-guide

Conversation

@johnnybabs

No description provided.

johnnybabs and others added 30 commits June 1, 2026 09:11
- Added comprehensive .gitignore covering Terraform state, k8s secrets,
  build artifacts, Python cache, Node modules, and IDE files
- Untracked 6 secret.yaml files that should never be in git history
- Created directory structure for terraform/, monitoring/, docs/,
  src/frontend/, .github/workflows/
- Added terraform.tfvars.example template
- Added CLAUDE.md and VIDCAST_UPGRADE_PLAN.md project context files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- VPC module: VPC, 2 public subnets (eu-west-2a/b), IGW, route table
- IAM module: EKS cluster role + node role with correct policy attachments
- EKS module: cluster v1.31, managed node group, OIDC provider for IRSA
  - Validation block rejects T-type instances (blocked by account SCP)
- Security groups module: NodePort rules for ports 30002-30008
- Dev environment: root module wiring all child modules + S3/DynamoDB backend
- All resources tagged: Project=vidcast, ManagedBy=terraform, Environment=dev

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g + Trivy)

- ci.yml: matrix build for 4 services — ruff lint, Trivy CRITICAL/HIGH scan,
  Docker build + push tagged with short git SHA (never :latest)
- cd.yml: EKS deployment triggered by workflow_run on CI success
- Jenkinsfile: parallel builds, Trivy scan, Docker Hub push, Swarm staging
  deploy, smoke test via /healthz, manual approval gate, EKS production
  deploy with automatic rollback on pipeline failure
- docker-compose.swarm.yml: overlay network, named volumes, rollback on
  failure for all services — mirrors EKS deployment for staging parity
- GITHUB_SECRETS_REQUIRED.md: documents all secrets needed for CI/CD

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…port

Auth service:
- Added /healthz endpoint testing PostgreSQL connectivity (200 ok / 503 error)

Gateway service:
- Added /healthz endpoint testing MongoDB + RabbitMQ connectivity
- Added flask-cors to requirements.txt; CORS(server) for frontend support

Converter + Notification services:
- Added pathlib.Path('/tmp/healthy').touch() after each successful message

All 4 deployment manifests:
- Liveness + readiness probes (HTTP for auth/gateway, exec for converter/notification)
- Resource requests/limits: auth 50m/200m 64Mi/128Mi, gateway 100m/300m 128Mi/256Mi,
  converter 250m/500m 256Mi/512Mi, notification 50m/100m 64Mi/128Mi
- securityContext: runAsNonRoot, runAsUser=1000, readOnlyRootFilesystem,
  allowPrivilegeEscalation=false, capabilities.drop ALL
- Converter + notification: emptyDir volume mounted at /tmp for temp file writes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… alerts

- monitoring/values.yaml: kube-prometheus-stack config — Grafana NodePort 30007
  (admin/vidcast-demo), Alertmanager NodePort 30008, 7d retention, 10Gi storage,
  etcd/scheduler/controller-manager disabled (EKS manages these)
- monitoring/dashboards/vidcast-operations.json: custom Grafana dashboard with
  pod status, restart counts, node CPU/memory gauges, RabbitMQ queue depth
  timeseries, per-pod CPU and memory usage
- monitoring/alerts/vidcast-alerts.yaml: PrometheusRule CRD with 5 alerts:
  PodCrashLoopBackOff (critical), HighNodeMemory >85% (warning),
  HighNodeCPU >85% (warning), RabbitMQQueueBacklog >10 msgs (warning),
  RabbitMQUnavailable (critical)
- monitoring/README.md: install, access, and uninstall instructions

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rchitecture

- React 18 + Vite + Tailwind CSS single-page application
- Pages: Login (JWT auth), Upload (drag-and-drop MP4), Download (file ID input),
  Dashboard (Grafana iframe + links), Architecture (interactive service diagram)
- src/api.js: axios wrapper for login, uploadVideo, downloadMp3
- Dockerfile: multi-stage — Node 18 build, nginx 1.25 serve as non-root (uid 1001)
- nginx.conf: proxy /api/ to gateway service, SPA routing, security headers
- manifest/: Deployment (NodePort 30006), Service, ConfigMap

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…notes

- README.md: rewritten for public GitHub — product overview, architecture
  diagram, quick-start deploy guide, CI/CD overview, security summary, teardown
- docs/architecture.md: full service inventory, data flow walkthrough
  (13-step upload path), port map, security architecture (implemented vs
  discussed-but-not-built)
- docs/deployment-guide.md: step-by-step guide for Terraform, Helm, PostgreSQL
  init, RabbitMQ queues, secret creation, microservice deploy, E2E test,
  monitoring install, operational commands, cost management, full teardown
- docs/presentation-notes.md: 12-15 min timing guide, opening script,
  architecture analogies (restaurant/post office/security badge), platform
  engineering walkthrough, what-I'd-do-next talking points, 7 common
  interview questions with full model answers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This edit triggers the CI process for Docker image builds.
Removed a line indicating an edit to trigger CI.
Split all multi-import lines (E401) across 7 files. Additional fixes:
- auth/server.py: bare except → except Exception (E722)
- auth/validate.py: not "x" in → "x" not in (E713)
- gateway/server.py: remove unused DispatcherMiddleware import (F401)
- converter/consumer.py: remove unused time import (F401)
- converter/to_mp3.py: remove unused err variable in except clause (F841)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
python:3.10-slim-bullseye (Debian 11) has CRITICAL/HIGH CVEs with fixes
available, causing Trivy to fail CI. python:3.10-slim-bookworm (Debian 12,
current stable) resolves these. Applied to all 4 service Dockerfiles.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
prometheus-client was declared in requirements.txt but never imported or
initialised. The only intended consumer was the unauth_count counter, whose
call sites (unauth_count.inc()) were already removed as a NameError crash fix.
Dropping the dependency shrinks the image and removes a dead transitive.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The notification service only reads the mp3 queue and sends email via smtplib.
It has no media-processing code path, so the ffmpeg install (~100MB) was pure
waste copied from the converter Dockerfile. Removing it shrinks the image and
reduces the CVE surface Trivy has to scan.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
None of the four Python service Dockerfiles dropped privileges; the final image
ran as root. Added USER 1000 before CMD in each, matching the Kubernetes
securityContext (runAsNonRoot: true, runAsUser: 1000) already enforced on the
deployments. This makes the images non-root by default even outside k8s (e.g.
the Docker Swarm staging environment). All listen ports are >1024 and the only
runtime writes target /tmp (1777), so no privileged access is required.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
No service had a .dockerignore, so docker build sent the entire context
(including manifest/, secret.yaml files, __pycache__, .git, and docs) to the
daemon. The new files exclude that cruft, keeping build contexts small and
ensuring Kubernetes secrets can never be baked into an image layer by accident.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The MongoDB connection strings (with embedded username/password) lived in
gateway-configmap and converter-configmap. ConfigMaps are not treated as
sensitive — they are trivially dumped via `kubectl get configmap -o yaml` and
were committed in plaintext. Moved them to the gateway-secret / converter-secret
Secret objects. Env var names are unchanged and the deployments already mount
both configMapRef and secretRef via envFrom, so this is transparent to the apps.

Also in this change:
- Removed unused VIDEO_QUEUE from notification-configmap (consumer only reads
  MP3_QUEUE; the video queue is the converter's).
- Added secret.yaml.example templates for all four services (committed) so
  operators have the key structure without any real secret entering git.
- Added imagePullPolicy: IfNotPresent to the four backend deployments, which CD
  re-tags with immutable git-SHA images. Left the frontend on the default
  (Always) since it still uses a mutable :latest tag.
- Updated the deployment guide's secret-creation step for the moved keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ning

Comment-only changes documenting known issues that cannot be safely fixed in a
surgical pass without coordinated schema/data work:
- auth-service/server.py + Postgres/init.sql: flag plaintext password storage
  and comparison; recommend bcrypt/argon2 + constant-time verify for production.
- MongoDB pvc.yaml: flag that the 1Gi claim binds a 10Gi PV, leaving ~9Gi unused.

No behaviour changes; these guide the next engineer toward the proper fixes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Trivy (CRITICAL,HIGH, ignore-unfixed) was failing on vulnerabilities that the
bookworm base-image bump alone did not clear, at two layers below the app deps:

- OS packages: added `apt-get upgrade -y` to pull patched libgnutls30
  (CRITICAL CVE-2026-33845, CVE-2026-42010) and the libkrb5* family (HIGH).
- Build toolchain: added `pip install --upgrade setuptools wheel` so the image
  ships patched wheel (CVE-2026-24049) and setuptools-vendored jaraco.context
  (CVE-2026-23949), neither of which the app imports but Trivy still scans.

Also: dropped the unused build-essential/libpq-dev/python3-dev from the
notification image (its deps are pure-Python wheels), and added apt-cache
cleanup (`rm -rf /var/lib/apt/lists/*`) to keep the images slim. Verified the
debian target reports 0 vulnerabilities on all four images locally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Rewrote all four requirements.txt as minimal >= floors so pip resolves patched
transitive deps (Jinja2, MarkupSafe, idna, charset-normalizer, etc.) instead of
the old fully-frozen 2022 pins. Dropped dev-only tooling (pylint/astroid/jedi/
isort) that was never imported at runtime, and auth's cryptography (the service
signs JWTs with HS256 = stdlib hmac; cryptography is only needed for RS256).

Key version floors (each clears a Trivy-flagged fixable CVE):
- Flask >=3.0.3 / Werkzeug >=3.0.3 — CVE-2024-34069 (debugger RCE) is only
  fixed in Werkzeug 3.0.3, which requires Flask 3. gateway's flask-pymongo
  bumped to >=3.0.1 for Flask-3 compatibility (the .db API it uses is unchanged).
- Flask-Cors >=4.0.2 — CVE-2024-6221 (CORS bypass).
- requests >=2.31.0 — CVE-2023-32681.
- certifi >=2023.7.22 — CVE-2023-37920.
- urllib3 >=2.6.0 — the latest 1.26.x still has 4 fixable HIGH CVEs
  (e.g. CVE-2025-66418) patched only in the 2.x line; safe because requests
  supports urllib3 2.x and no app code uses urllib3 directly.
- converter: numpy <2.0 (moviepy 1.0.3 compat) + Pillow >=10.3.0
  (CVE-2023-44271 / CVE-2023-50447, CRITICAL).

Verified locally: all four images pass `trivy image --severity CRITICAL,HIGH
--ignore-unfixed --exit-code 1` (0 findings), and Flask-3/Flask-PyMongo-3 and
moviepy imports were smoke-tested in-container.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…aform

Replaces static AWS access keys in the CD pipeline with short-lived,
OIDC-issued credentials — no long-lived secrets stored in GitHub.

Terraform:
- New module terraform/modules/github-oidc: creates the GitHub Actions OIDC
  identity provider and a deploy IAM role whose trust policy is scoped to
  repo:johnnybabs/microservices-python-app:* (aud sts.amazonaws.com). The role
  grants only eks:DescribeCluster (for `aws eks update-kubeconfig`).
- eks module: set access_config.authentication_mode = API_AND_CONFIG_MAP so
  EKS access entries work alongside aws-auth.
- root module: wire the github-oidc module and add an aws_eks_access_entry +
  access_policy_association granting the deploy role AmazonEKSEditPolicy at
  cluster scope — this is what lets `kubectl set image` actually run. Added
  github_org/github_repo variables and a github_actions_role_arn output.
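
For readers who think in JSON rather than HCL, the trust policy described above has roughly the following shape. This is a sketch built as a Python dict, not the module's actual output: the function name and the placeholder account ID are assumptions, while the `aud` and `sub` conditions follow the scoping stated in this commit.

```python
import json


def github_oidc_trust_policy(account_id: str, org: str, repo: str) -> dict:
    """Build an IAM trust policy limited to one GitHub repository."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    # The token must have been requested for AWS STS...
                    "StringEquals": {f"{provider}:aud": "sts.amazonaws.com"},
                    # ...and must come from a workflow in this repository.
                    "StringLike": {f"{provider}:sub": f"repo:{org}/{repo}:*"},
                },
            }
        ],
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID, not a real one.
    policy = github_oidc_trust_policy(
        "123456789012", "johnnybabs", "microservices-python-app"
    )
    print(json.dumps(policy, indent=2))
```

The trailing `:*` on the subject accepts any branch, tag, or environment in the repository; a stricter setup could pin it to a single ref.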

Workflow:
- cd.yml now uses aws-actions/configure-aws-credentials@v4 with role-to-assume
  and adds `permissions: id-token: write` to request the OIDC token. Drops the
  AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY inputs.
- GITHUB_SECRETS_REQUIRED.md: CD secrets section rewritten for OIDC
  (AWS_DEPLOY_ROLE_ARN from `terraform output github_actions_role_arn`).

Validated with `terraform fmt` + `terraform validate` (backend=false). Not yet
applied — cluster provisioning runs next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Both StatefulSets referenced a Secret (mongodb-secret, rabbitmq-secret)
that no chart template produced. Fresh helm installs hung in
ContainerCreating (Mongo: FailedMount) or CreateContainerConfigError
(RabbitMQ: secret not found) until the secrets were created manually.

- MongoDB: 5 keys (MONGO_ROOT_USERNAME/PASSWORD, MONGO_USERNAME/PASSWORD,
  MONGO_USERS_LIST) sourced from values.yaml.secret.*
- RabbitMQ: 2 keys (RABBITMQ_DEFAULT_USER/PASS) sourced from
  values.yaml.secret.* (new section - values.yaml had no secret config)

Postgres chart intentionally untouched: it has no referenced-but-missing
secret; it injects POSTGRES_USER/PASSWORD/DB directly as env vars from
values.yaml, so it renders and runs cleanly as-is.

.gitignore: the blanket **/secret.yaml rule (meant for real app-manifest
secrets) was also hiding these chart templates. Added scoped negations so
the templates are tracked; they hold no literal credentials, only
{{ .Values.secret.* }} references.

Manual secrets remain in place for the current deployment to avoid Helm
ownership conflicts. Charts are now self-contained for the next clean
install.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Without bootstrap_cluster_creator_admin_permissions=true, the principal
that runs terraform apply has no kubectl access to the resulting cluster
and must manually create their own access entry. This locked out
johnadmin today after the first terraform apply. Fix makes the access
grant automatic on cluster creation, preventing recurrence on rebuild.

NOT applied to the live cluster: this attribute is creation-only
(ForceNew in the AWS provider), so applying against the existing
vidcast-cluster would force-replace it. The fix takes effect on the next
greenfield rebuild. terraform CLI is also not present in this operator
environment, so fmt/validate/plan were not re-run here; the edit is a
single aligned attribute addition matching terraform fmt style.

Also gitignore the local 'tfplan'/'*.tfplan' binary plan artifacts.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Previously the pika connection was constructed with no credentials,
which silently defaulted to guest:guest. With the RabbitMQ Helm chart
now configuring rabbituser as the only user, connections failed with
ACCESS_REFUSED.

This change reads RABBITMQ_DEFAULT_USER and RABBITMQ_DEFAULT_PASS from
the container environment, with a guest:guest fallback so local
development without a secret still works. The env vars are injected in
production via envFrom: secretRef: rabbitmq-secret in each deployment
manifest.

Gateway has two connection sites (module-level publish channel and the
/healthz probe); both now use a shared PlainCredentials object.

Resolves the credential mismatch between the chart and the running
application code.
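
A minimal sketch of the environment lookup with the guest:guest fallback is shown here. The helper name `rabbitmq_credentials` is hypothetical, and the pika wiring is left as a comment so the snippet runs without the library installed.

```python
import os
from typing import Mapping, Tuple


def rabbitmq_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read RabbitMQ credentials from the environment, defaulting to guest:guest."""
    user = env.get("RABBITMQ_DEFAULT_USER", "guest")
    password = env.get("RABBITMQ_DEFAULT_PASS", "guest")
    return user, password


if __name__ == "__main__":
    user, password = rabbitmq_credentials()
    # With pika installed, the pair would typically be used like this:
    #   credentials = pika.PlainCredentials(user, password)
    #   params = pika.ConnectionParameters(host="rabbitmq", credentials=credentials)
    # The host name "rabbitmq" is an assumption for illustration.
    print(f"connecting as {user}")
```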

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Image references updated from nasi101/* (upstream tutorial) to
  johnbaabalola/*-service (this fork's CI-built images), pinned to commit
  SHA c91216a for deterministic deploys. Image names match the CI matrix
  (auth-service, gateway-service, etc.), not the short nasi101 names.
- Gateway, converter, and notification deployments now load RabbitMQ
  credentials from rabbitmq-secret via an additional envFrom: secretRef
  (appended to existing envFrom blocks, not replacing them).
- Auth service image bumped but no RabbitMQ secret added (it does not
  connect to RabbitMQ).

Works with the prior commit that reads RABBITMQ_DEFAULT_USER/PASS from
the environment.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The CVE dependency bump (5c224a3) upgraded PyMongo to a release that
requires MongoDB >= 4.2 (wire version 8). The chart pinned mongo:4.0.8
(wire version 7), so gateway and converter failed at runtime with:
  'Server at mongodb:27017 reports wire version 7, but this version of
   PyMongo requires at least 8 (MongoDB 4.2).'

This surfaced as gateway /healthz 503 (mongodb check) and would have
broken all GridFS upload/download. mongo:4.2 is the minimum compatible
version and the supported single-step upgrade from 4.0 (a direct jump to
4.4+ refuses to start against a 4.0 feature-compatibility-version data
dir).

Live cluster already bumped via 'kubectl set image statefulset/mongodb'
(no app data existed, so the in-place upgrade was non-destructive).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The converter and notification deployments use an exec liveness probe
(test -f /tmp/healthy), but the file was only created AFTER a message was
successfully processed. An idle consumer with no traffic therefore never
created the file and was killed by the probe (~45s), crash-looping
forever.

For notification this was unrecoverable: with a placeholder Gmail
password, email.notification() always errors -> basic_nack -> the
per-message touch never runs, so the pod could never become healthy.

Now each consumer touches /tmp/healthy once immediately after connecting
to RabbitMQ and being ready to consume (a meaningful 'connected and
consuming' signal), and still refreshes it after each processed message.
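
The ordering matters more than the mechanism, so here is a small sketch of it. The function names and the `handle` callback are assumptions; the real consumers use a RabbitMQ callback rather than a plain loop.

```python
import pathlib
import tempfile
from typing import Callable, Iterable


def mark_healthy(health_file: pathlib.Path) -> None:
    """Create the file, or refresh its mtime, that the exec probe tests for."""
    health_file.touch()


def run_consumer(
    messages: Iterable[bytes],
    handle: Callable[[bytes], bool],
    health_file: pathlib.Path,
) -> None:
    # Touch once at startup, so an idle consumer still passes the probe.
    mark_healthy(health_file)
    for body in messages:
        if handle(body):
            # Refresh only after a message was processed successfully.
            mark_healthy(health_file)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        probe_file = pathlib.Path(tmp) / "healthy"
        run_consumer([], lambda body: True, probe_file)  # no traffic at all
        print(probe_file.exists())
```

With the startup touch in place, a handler that always fails (as with the placeholder Gmail password) no longer prevents the pod from becoming healthy.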

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… to 16f49a0

Three deploy-time fixes found during the live rollout to vidcast-cluster:

- gateway: add an emptyDir volume mounted at /tmp. With
  readOnlyRootFilesystem=true and no writable temp dir, Werkzeug's
  multipart upload buffering failed -> POST /upload returned 500
  ('No usable temporary directory found'). The converter and notification
  deployments already had this volume; gateway was missing it.
- converter: 4 -> 2 replicas (and maxSurge 8 -> 1). The single
  m7i-flex.large node (2 vCPU) could not schedule 4 converters @ 250m
  CPU request alongside the rest; the extra pods sat Pending with
  'Insufficient cpu'. 2 replicas comfortably handle demo throughput.
- all four services pinned to johnbaabalola/<svc>:16f49a0 (the SHA that
  includes the RabbitMQ-credential and /tmp/healthy startup fixes).

End-to-end verified: login -> upload -> convert (MoviePy) -> mp3 queue ->
notification consume. Email itself fails by design (placeholder Gmail
App Password).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Uploads through the frontend /api proxy failed with 413 Request Entity
Too Large: nginx defaults client_max_body_size to 1m, but VidCast
uploads MP4s (the bundled test asset alone is 2.8MB). Direct gateway
uploads (NodePort 30002) were unaffected because they bypass nginx; only
the frontend path (30006 -> /api/) hit the limit. Raised to 256m.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI does not build the frontend (matrix covers only the 4 backend
services), so johnbaabalola/frontend:latest never existed on Docker Hub.
Built locally and pushed to this account's ECR
(501562869470.dkr.ecr.eu-west-2.amazonaws.com/vidcast-frontend); the EKS
node IAM role can pull from ECR in-account, so no registry credentials
or imagePullSecret are needed. Pinned to commit fd35335 (includes the
nginx client_max_body_size upload fix).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
johnnybabs and others added 17 commits June 2, 2026 22:36
Adds an account-creation flow so new users aren't limited to the single
seeded login.

- auth-service: new POST /register (JSON email+password). Rejects
  duplicates with 409, inserts into auth_user, and returns a JWT so the
  new user is signed in immediately. Password stored plaintext to match
  the existing /login comparison and seeded schema (hashing is a
  separate, coordinated change touching /login too).
- gateway: public POST /register proxying to auth-service via
  access.register().
- frontend: api.register() and a Sign In / Sign Up toggle on the Login
  page (with confirm-password + duplicate/mismatch error handling).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fix 1 of the frontend-improvements plan. Replaces the "every JWT says
admin=true" lie with genuine role-based access control, and closes a
privilege-escalation hole in self-registration.

auth-service:
- JWT now carries the user's real role: emits both admin (bool, for backward
  compatibility with existing gateway/frontend readers) and role (string, for
  forward compatibility).
- /login verifies against a bcrypt hash with checkpw (constant-time) and
  issues the role from the DB. Also fixes a latent psycopg2 bug: execute()
  always returns None, so the old `if res is None` made unknown users 500
  instead of 401 — login could not reliably say "no".
- /register hashes with bcrypt and inserts role='user'; returns a non-admin
  token. Previously it minted an admin JWT for anyone who signed up.
- add bcrypt>=4.1.2.
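
The dual-claim idea can be sketched as a pure function. The helper name `build_claims` is hypothetical, and the real token presumably carries further claims (such as an expiry) that are omitted here.

```python
def build_claims(email: str, role: str) -> dict:
    """Build the role-related JWT claims for one user."""
    return {
        "username": email,
        "role": role,              # new string claim for future readers
        "admin": role == "admin",  # legacy boolean that existing readers check
    }


if __name__ == "__main__":
    print(build_claims("new.user@example.com", "user"))
```

Deriving `admin` from `role` keeps the two claims from disagreeing, which is the failure mode the old always-true flag had.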

Postgres init.sql:
- add role (default 'user'), UNIQUE(email), created_at.
- seed admins (baabalola@, johnbsignups@) with bcrypt hashes + role=admin,
  idempotent via ON CONFLICT. Hashes generated locally from the gitignored
  plaintext; only the hashes are committed.

gateway:
- /upload and /download now require authentication, not admin
  (if not access -> 401). They were gated on access["admin"], which only
  worked while every token lied; real RBAC would have locked out all users.

frontend:
- auth.js decodes the JWT; App.jsx shows Dashboard/Architecture and routes
  to them only for admins (previously always shown, routes unguarded).

Breaking at deploy time: the bcrypt auth image and the new DB seed must land
together (a bcrypt image against a plaintext DB breaks all logins). Migration
runbook in src/auth-service/RBAC_EXPLAINED.md — run with John at merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rashing

Fix 3 of the frontend-improvements plan. Per-user email routing already
worked end-to-end (gateway puts the JWT username on the video message,
converter forwards it to the mp3 message, send/email.py uses it as the
recipient), so this commit is the robustness half the routing was missing.

send/email.py now obeys a clear contract and never raises:
- returns None  -> consumer ACKs (success, or a deliberate skip)
- returns a str -> consumer NACKs (retryable failure)

Changes:
- json.loads wrapped: unparseable bodies are dropped (ACK), not looped on.
- message.get("username"): messages from before per-user routing (no
  username) are skipped (ACK) instead of raising KeyError. Backward compatible.
- SMTP send wrapped in try/except: a send failure returns an error string so
  the consumer nacks gracefully. This removes the CrashLoopBackOff root cause
  (a bad/placeholder Gmail password let SMTPAuthenticationError propagate out
  of the callback and kill the pod; with a stuck message that was an infinite
  crash loop).
- friendlier subject/body.

Known limitation (documented): a permanently-bad credential requeues in a
loop (poison message). Bounding that needs a dead-letter queue + max-retry —
deliberately out of scope (no new infra). Not reachable today now that the
real Gmail app password is in the secret.
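
The None/str contract above can be sketched as follows. This is an illustration under assumptions, not the contents of send/email.py: the `send` callable stands in for the SMTP call, and the `mp3_fid` message key is hypothetical.

```python
import json
from typing import Callable, Optional


def notification(body: bytes, send: Callable[[str, str], None]) -> Optional[str]:
    """Return None to ACK (success or deliberate skip), or an error string to NACK."""
    try:
        message = json.loads(body)
    except (ValueError, TypeError):
        return None  # unparseable body: drop it rather than loop on it
    if not isinstance(message, dict):
        return None
    recipient = message.get("username")
    if not recipient:
        return None  # message from before per-user routing: skip
    try:
        send(recipient, str(message.get("mp3_fid")))  # "mp3_fid" is an assumed key
    except Exception as exc:
        return f"send failed: {exc}"  # retryable: the consumer should NACK
    return None


def on_message(body: bytes, send: Callable[[str, str], None]) -> str:
    """Map the contract onto the consumer's decision."""
    return "ack" if notification(body, send) is None else "nack"


if __name__ == "__main__":
    print(on_message(b"not json", lambda to, fid: None))
```

Because nothing escapes `notification`, a bad credential produces a NACK instead of killing the process, which is the crash-loop fix described above.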

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…bcrypt hash

Review follow-up F1-F. bcrypt.checkpw raises ValueError("Invalid salt") if the
stored password isn't a bcrypt hash — e.g. a legacy plaintext row from before the
migration. The unguarded call made /login 500 (and leak a stack trace) for such a
row. Wrap it: on ValueError/TypeError, log and treat as a failed login (401).
Defence-in-depth on top of the merge runbook, which ensures all rows are bcrypt.
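
A sketch of the guard follows. To keep it runnable without bcrypt installed, the checker is passed in as a parameter; in the service it would be `bcrypt.checkpw`, which takes the password and the stored hash as bytes. The wrapper name `verify_password` is hypothetical.

```python
import logging
from typing import Callable

logger = logging.getLogger("auth")


def verify_password(
    checkpw: Callable[[bytes, bytes], bool], password: str, stored: str
) -> bool:
    """Return True only for a verified match; never raise on a malformed hash."""
    try:
        return bool(checkpw(password.encode("utf-8"), stored.encode("utf-8")))
    except (ValueError, TypeError):
        # For example a legacy plaintext row: log it and fail the login (401).
        logger.warning("stored password is not a valid hash; rejecting login")
        return False


if __name__ == "__main__":
    def legacy_row_checker(password: bytes, hashed: bytes) -> bool:
        raise ValueError("Invalid salt")  # what bcrypt raises for a non-hash

    print(verify_password(legacy_row_checker, "hunter2", "hunter2"))
```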

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review follow-up F1-K. The runbook previously lived only in RBAC_EXPLAINED.md,
which is gitignored (*_EXPLAINED.md = local study aids), so it would not travel
with the branch/PR. Move it to a tracked operational doc. Parameterised — reads
PGPASSWORD from the gitignored config, commits no credentials.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…R + live)

Review follow-up BH-A. The frontend image 8582bf1 exists in account ECR and is
the image the live deployment is already running; the manifest just hadn't been
updated from fd35335. Commit it so the manifest matches reality. Confirmed
deliberate (not applied by CD — CD only set-images the 4 backends).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fix 2 of the frontend-improvements plan. Adds file ownership and an in-app
notification so users see when their conversion is ready without refreshing.

Ownership (metadata.owner_email, sourced from the uploader's JWT username):
- gateway storage/util.py: tag the stored video with owner_email + filename.
- converter to_mp3.py: copy the tag onto the resulting mp3 (.get so legacy
  messages without a username don't crash) + give it a filename.

Gateway endpoints (auth required, scoped to the caller's own files):
- GET /notifications/unseen-count?since=<ISO> -> {count} of the user's mp3s
  created after `since`. Uses count_documents on the GridFS files collection
  (PyMongo 4 removed Cursor.count()); bad `since` falls back to epoch.
- GET /my-files -> {files:[{fid,filename,size,created}]} newest first (feeds
  the My Conversions page in Feature 1).
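
The epoch fallback for a bad `since` value can be sketched like this. The function names are hypothetical, and the in-memory list is a stand-in for the GridFS files collection that the gateway queries with count_documents.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_since(raw: Optional[str]) -> datetime:
    """Parse an ISO-8601 `since` value; any bad input falls back to the epoch."""
    if not isinstance(raw, str) or not raw:
        return EPOCH
    try:
        # Older Python versions reject a trailing "Z", so normalise it first.
        parsed = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return EPOCH
    # Treat naive timestamps as UTC so comparisons never mix naive and aware values.
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def unseen_count(files: Iterable[dict], owner_email: str, since: datetime) -> int:
    """Count one owner's files created after `since` (in-memory stand-in)."""
    return sum(
        1
        for f in files
        if f["owner_email"] == owner_email and f["created"] > since
    )


if __name__ == "__main__":
    print(parse_since("not-a-date") == EPOCH)
```

Falling back to the epoch means a malformed query degrades to "count everything I own" rather than an error response.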

Frontend:
- api.js: unseenCount() + myFiles() helpers.
- hooks/useUnseenCount.js: 5s polling hook (deliberately polling, not SSE/WS,
  for a single-user demo), cancels cleanly on unmount/token change.
- App.jsx: a `since` "last seen" marker (resets on login and on visiting the
  Download tab); red badge on the Download nav link when count > 0.

No backfill for pre-ownership files (no correct owner to assign); they simply
don't appear in any user's list.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Feature 1 of the frontend-improvements plan. A token-guarded /my-files page
listing the user's converted MP3s (filename, date, size) newest-first, each with
a Download button. Almost entirely a view over Fix 2's work: it calls the
existing myFiles() helper / gateway /my-files endpoint and reuses Download.jsx's
blob-download pattern. No new backend or infra.

- pages/MyConversions.jsx: fetch on mount (with unmount-cancel guard), loading/
  error/empty states, per-row download with a per-row spinner, null-safe size/
  date formatting.
- App.jsx: "My Conversions" nav link + /my-files route (redirects to / if logged
  out).

The page is the concrete demo of per-user ownership: the gateway scopes results
to the caller's owner_email, so a user only ever sees their own files.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generated by npm install while building the frontend locally to verify the
RBAC/notifications + My Conversions changes. Committing it pins transitive
dependency versions so local and (future) CI builds resolve identically.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ns note

Fix 4 polish. The sign-up endpoint, gateway proxy and React form already existed
(commit 8582bf1 + the RBAC hardening); this adds the spec's remaining bits:

- auth /register: reject passwords shorter than 8 chars with a 400 (server-side
  is the real guard).
- Login.jsx: matching client-side length check (fails fast before the request),
  an "At least 8 characters" hint under the password field in signup mode, and an
  "About email notifications" info box explaining that the download link is
  emailed to the address they sign up with.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Feature 4 of the frontend-improvements plan. An admin-only /admin/users page that
makes RBAC concrete: list every user with role, signup date, and conversion
count, and promote/demote between user and admin.

auth-service (internal, ClusterIP — no role check of its own; the gateway
enforces admin):
- GET  /users           -> [{email, role, created_at}]
- PATCH /users/<email>  -> validate role in {user,admin}, UPDATE ... RETURNING;
                           404 if no such email.

gateway (enforces admin + guardrails):
- GET   /admin/users         -> admin only; merges the auth user list with
                                per-user conversion counts (Mongo aggregation on
                                fs.files by metadata.owner_email).
- PATCH /admin/users/<email> -> admin only; guardrails before proxying:
    * self-demotion -> 403 (no accidental self-lockout)
    * last-admin demotion -> 409 (no cluster-wide admin lockout)
    * unknown email -> 404 (passed through from auth)
  Emits an audit line: AUDIT admin_role_change admin=<caller> target=<email>
  new_role=<role> result=<status>.
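
The guardrail ordering above can be sketched as a pure decision function. The function name and the in-memory `roles` mapping are assumptions, and the 400 for an invalid role is also an assumption, since the commit does not state that status code.

```python
from typing import Dict


def role_change_status(
    caller: str, target: str, new_role: str, roles: Dict[str, str]
) -> int:
    """Return the HTTP status a role change would get under the guardrails."""
    if new_role not in {"user", "admin"}:
        return 400  # assumed status for an invalid role value
    if caller == target and new_role != "admin":
        return 403  # self-demotion: no accidental self-lockout
    admin_count = sum(1 for role in roles.values() if role == "admin")
    if roles.get(target) == "admin" and new_role != "admin" and admin_count <= 1:
        return 409  # last-admin demotion: no cluster-wide admin lockout
    if target not in roles:
        return 404  # unknown email (passed through from auth in the real flow)
    return 200


if __name__ == "__main__":
    roles = {"admin@example.com": "admin", "user@example.com": "user"}
    print(role_change_status("admin@example.com", "admin@example.com", "user", roles))
```

The 409 branch fires even when the caller is not the target, which covers a caller whose token still claims admin after their role has changed.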

frontend:
- api.js: adminUsers() + setUserRole().
- pages/AdminUsers.jsx: table with role badges + Promote/Demote buttons; disables
  the button on your own row (mirrors the 403 guard); maps 403/409/404 to clear
  messages and reloads after a change.
- App.jsx: admin-only "Users" nav link + admin-guarded /admin/users route.

No new dependencies, no new deployments. Known limitations (in-cluster trust gap;
stdout audit is not tamper-evident) documented in ADMIN_USERS_EXPLAINED.md with
the "real fix would be" framing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…tion)

Six decisions made on this branch, each as choose/alternatives/trade-off/where-it-
breaks/real-fix: bcrypt-now, polling-vs-SSE, stats-panel-skip, in-cluster trust
gap, stdout audit, admin guardrails.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ication

Python print() to stdout is block-buffered in the containers, so diagnostics —
notably the gateway admin role-change AUDIT line — never reached `kubectl logs`
(Werkzeug access logs did, because they go through logging->stderr). Setting
PYTHONUNBUFFERED=1 flushes stdout per line so the audit trail is visible
immediately. The same one-line env var is set on all three Python services that
print at runtime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ary)

Two learnings from the integration test: (A) the bcrypt migration is forward-only
— once Postgres holds bcrypt hashes the pre-bcrypt auth image can't verify them,
so post-migration recovery is fix-forward not rollback; (B) the self-demote 403
and last-admin 409 guards are complementary, not redundant — 409 is the defense
for the stale-admin-token case that 403 doesn't cover.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A single self-contained guide explaining VidCast from inception to current
state, written for three audiences at once (group members, technical assessors,
non-technical guests) with analogies inline rather than segregated.

16 sections: what it does, architecture, the microservices, data layer, the
upload->download journey, an authn/authz deep dive, infrastructure, the CI and
CD pipelines stage-by-stage, the Docker-Hub<->Git trust chain, dev-vs-prod
(GitHub Actions vs the written-but-not-yet-running Jenkins pipeline),
observability, the eight problems-faced stories, decisions & trade-offs, known
limitations, and a glossary.

Synthesised from the code, git history, DECISIONS_MADE.md, the merge runbook,
and the *_EXPLAINED companions, rather than stitched together from them. Corrects several aspirational
points to match reality (no unit-test stage, SHA-only image tags, MoviePy drives
ffmpeg, cluster-level monitoring) and parks genuine gaps honestly in section 15.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>