fix(deploy): Recreate strategy + 20m helm timeout (helm rollout timeout) by gaurav0107 · Pull Request #8 · tracebility-ai/tracebility

gaurav0107 · 2026-06-07T12:00:22Z

Symptom

After PR #7 unblocked the migrator, helm-deploy still failed:

Error: UPGRADE FAILED: release tracebility failed, and has been rolled back
due to atomic being set: context deadline exceeded

Run 27091541251.

Root cause

Cluster events during the failed rollout:

FailedScheduling  pod/tracebility-web-6cb65969dd-5rf59
  0/2 nodes are available: 1 Insufficient memory, ...

GKE Autopilot's just-in-time node scaling kicks in when a rolling update needs 2x the capacity of one pod. With replicaCount: 1 and the default maxSurge: 25% (which rounds up to 1 extra pod), every deploy briefly needs a second node. Autopilot scale-up takes ~5 min on top of image pull + readiness probes, so helm --wait --timeout 10m wasn't enough. --atomic then reverted everything to the previous SHA, so the deploy reported a failure while the live site, still running the old image, stayed up.
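The rounding behind "25% rounds up to 1 extra pod" can be sketched as follows. This is a minimal illustration, not code from the chart: the function name `rolling_update_bounds` is hypothetical, and it only models the documented Kubernetes rule that a percentage maxSurge rounds up while a percentage maxUnavailable rounds down.

```python
def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Return (peak_pods, min_available_pods) during a RollingUpdate.

    Hypothetical helper: Kubernetes rounds a percentage maxSurge up and a
    percentage maxUnavailable down when converting them to pod counts.
    """
    surge = -(-replicas * max_surge_pct // 100)          # ceiling division
    unavailable = replicas * max_unavailable_pct // 100  # floor division
    return replicas + surge, replicas - unavailable


# One replica with the defaults: the old pod stays up and one extra pod is added.
print(rolling_update_bounds(1))  # (2, 1)
```

With a single replica the old pod cannot be taken down first (maxUnavailable rounds to 0), so the new pod has to be scheduled alongside it, which is where the 2x capacity requirement comes from.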

The ingest-api deployment already had strategy: type: Recreate for exactly this reason (visible in its template comment: "recreate strategy keeps cluster topology simpler at the cost of a brief outage on rollout"). The other deployments missed the same treatment.

Fix

api and ingest-worker switch to strategy: type: Recreate. A rolling update needs 2x capacity; Recreate kills the old pod first so the new one schedules onto the freed slot. A brief outage during rollout is acceptable for these two: api sits behind the web frontend (not on the public LB path), and ingest-worker is a Redis-stream consumer where redelivery is already in the failure model.
web stays on RollingUpdate. Web fronts the public LB; a few seconds of 503 there is user-visible. With api + worker now on Recreate, web is the only deployment that still needs surge capacity, so the scheduling pressure during a rollout is much smaller and Autopilot may not even need to scale up.
--timeout 10m → 20m in the workflow. Belt-and-braces: even when the Recreate strategy avoids the worst case, a cold image pull + readiness delay can still push a single-replica rollout past 10m on Autopilot. 20m is generous without being absurd.
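To make the effect of the strategy change concrete, the sketch below lists which deployments briefly run an old and a new pod side by side, before and after this PR. It assumes one replica per deployment and that a deployment without a strategy: field uses the RollingUpdate default; the dictionary and function names are illustrative only and do not exist in the repository.

```python
# Strategy per deployment, as described in this PR (one replica each assumed).
BEFORE = {
    "web": "RollingUpdate",
    "api": "RollingUpdate",
    "ingest-worker": "RollingUpdate",
    "ingest-api": "Recreate",  # already Recreate before this PR
}
AFTER = {**BEFORE, "api": "Recreate", "ingest-worker": "Recreate"}


def needs_surge_slot(strategies):
    """Deployments that need room for an extra pod while rolling out."""
    return sorted(name for name, s in strategies.items() if s == "RollingUpdate")


print(needs_surge_slot(BEFORE))  # ['api', 'ingest-worker', 'web']
print(needs_surge_slot(AFTER))   # ['web']
```

Going from three surge pods to one is why a node scale-up becomes less likely, though not impossible, which is the case the longer timeout covers.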

Verified

helm template tracebility deploy/helm/tracebility -f values-gke.yaml --set image.tag=test renders all three deployments (api, ingest-api, ingest-worker) with the Recreate strategy block. web renders without a strategy: field (default = RollingUpdate).

Test plan

CI on this PR goes green.
Merge.
Post-merge helm-deploy succeeds within the 20m budget.
Live site at https://langprobe.daz.co.in serves the new image (web pod's image tag should match the merged commit SHA).

🤖 Generated with Claude Code

…ut to 20m

The previous deploy hit ``Error: UPGRADE FAILED ... context deadline exceeded`` after 10m. Cluster events showed:

    FailedScheduling pod/tracebility-web-...
    0/2 nodes are available: 1 Insufficient memory ...

GKE Autopilot's just-in-time node scaling kicks in when a rolling update needs 2x the capacity of one pod (the default maxSurge of 25% rounds up to 1 with a single replica). Scale-up takes ~5 min on top of image pull + probes; helm --wait at 10m wasn't enough. The atomic rollback then reverted everything to the previous SHA.

Two changes:

1. ``api`` and ``ingest-worker`` deployments switch to ``strategy: type: Recreate`` so a single-replica rollout doesn't briefly need 2x capacity. ``ingest-api`` already uses Recreate (predates this PR). ``web`` keeps RollingUpdate because it fronts the LB and a brief outage there is user-visible.
2. ``--timeout 10m`` -> ``20m`` in the workflow so an Autopilot scale-up (rare now but possible) doesn't trip the gate.

Verified: ``helm template`` renders all three deployments with the Recreate strategy block.

Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>

…est-api (#9)

Two crash-loops on the live deploy after PR #8 unblocked the rollout:

    ingest-api: RuntimeError: TRACEBILITY_CLICKHOUSE_URL is required
    api: clickhouse_connect.driver.exceptions.DatabaseError: Code: 516. Authentication failed: ... default user

Root cause (single pattern, two symptoms): PR #3 (multi-tenancy) added new ClickHouse-touching code paths in api and ingest-api but invented a credential-passing convention the helm chart didn't ship: splitting URL / USER / PASSWORD / DATABASE into four env vars. The chart's existing tracebility-clickhouse secret has ONE key (``url``) holding the full DSN with embedded credentials, matching the postgres / redis pattern. So in production:

- ingest-api template never set TRACEBILITY_CLICKHOUSE_URL at all → config.load() raised on the missing required env var.
- api template DID set TRACEBILITY_CLICKHOUSE_URL, but the new code passed username='default' / password='' as kwargs to clickhouse_connect.get_async_client(dsn=URL, username=...). Those kwargs override the DSN's embedded credentials, so it tried to auth as 'default' (which doesn't exist in the cluster).

Fix:

- AuditWriter.from_url(url) now accepts only the DSN; no override kwargs. The DSN's embedded credentials carry through.
- reconciler_loop() in both reconcilers: same simplification.
- Settings on api + ingest-api: drop clickhouse_user / clickhouse_password / clickhouse_database. Only clickhouse_url.
- ingest-api deployment template: add the missing TRACEBILITY_CLICKHOUSE_URL env-from-secret line.
- api deployment template: also pass TRACEBILITY_REDIS_URL (PR #3 added optional Redis support to the api for api-key invalidation + reconciler hooks).

Verified locally:

- ``uv run pytest services/`` → 68 passed.
- ``helm template`` confirms api + ingest-api templates render with PG_DSN + REDIS_URL + CLICKHOUSE_URL all sourced from secrets.
- ``ruff check`` and ``ruff format --check`` clean.

Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
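The credential precedence described in the #9 commit message can be illustrated with the standard library alone. This is a sketch of the failure mode as the message describes it, not the clickhouse_connect implementation: `effective_credentials` and the DSN value are hypothetical, and the function simply models "explicit kwargs win over credentials embedded in the DSN".

```python
from urllib.parse import urlsplit


def effective_credentials(dsn, username=None, password=None):
    """Model of the precedence described above: explicit kwargs override the DSN."""
    parts = urlsplit(dsn)
    user = username if username is not None else parts.username
    pw = password if password is not None else parts.password
    return user, pw


# Hypothetical DSN with embedded credentials, like the single `url` secret key.
DSN = "https://app_user:s3cret@clickhouse.example.internal:8443/traces"

# DSN only (the behaviour after the fix): embedded credentials carry through.
print(effective_credentials(DSN))  # ('app_user', 's3cret')

# Hard-coded kwargs (the bug): the DSN's user is silently replaced.
print(effective_credentials(DSN, username="default", password=""))  # ('default', '')
```

Accepting only the DSN removes the second call path entirely, so there is a single source of truth for credentials.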

gaurav0107 merged commit d32c5d0 into main Jun 7, 2026
3 checks passed

gaurav0107 mentioned this pull request Jun 7, 2026

fix(deploy): pass full CH DSN; wire TRACEBILITY_CLICKHOUSE_URL to ingest-api #9

Merged

4 tasks

gaurav0107 merged 1 commit into main from fix/deploy-recreate-strategy
