fix(deploy): Recreate strategy + 20m helm timeout (helm rollout timeout)#8

Merged
gaurav0107 merged 1 commit into main from fix/deploy-recreate-strategy on Jun 7, 2026
@gaurav0107

Collaborator

Symptom

After PR #7 unblocked the migrator, helm-deploy still failed:

Error: UPGRADE FAILED: release tracebility failed, and has been rolled back
due to atomic being set: context deadline exceeded

Run 27091541251.

Root cause

Cluster events during the failed rollout:

FailedScheduling  pod/tracebility-web-6cb65969dd-5rf59
  0/2 nodes are available: 1 Insufficient memory, ...

GKE Autopilot provisions nodes just in time, and a rolling update of a single-replica deployment briefly needs 2x the capacity of one pod. With replicaCount: 1 and the default maxSurge: 25% (which rounds up to 1 extra pod), every deploy briefly needs a second node. Autopilot scale-up takes ~5 min on top of image pull + readiness probes, so helm --wait --timeout 10m wasn't enough. --atomic then rolled everything back to the previous SHA, which is why the deploy "failed" while the live site (running the old image) stayed up.
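The surge arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rounding rule (Kubernetes rounds a percentage maxSurge up); the function names `surge_pods` and `peak_pods` are hypothetical and not part of the chart or the workflow.

```python
import math


def surge_pods(replicas: int, max_surge: str = "25%") -> int:
    """Extra pods a RollingUpdate may create above the desired count.

    A percentage maxSurge is rounded up, so 25% of 1 replica is 1 pod.
    """
    if max_surge.endswith("%"):
        return math.ceil(replicas * int(max_surge[:-1]) / 100)
    return int(max_surge)


def peak_pods(replicas: int, strategy: str = "RollingUpdate",
              max_surge: str = "25%") -> int:
    """Most pods that can exist at once during a rollout."""
    if strategy == "Recreate":
        # Recreate terminates the old pods before creating new ones.
        return replicas
    return replicas + surge_pods(replicas, max_surge)


print(peak_pods(1))              # 2: the rollout needs room for a second pod
print(peak_pods(1, "Recreate"))  # 1: the new pod reuses the freed slot
```

With a single replica the rolling update doubles the footprint, which is what pushed the scheduler into asking Autopilot for another node.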

The ingest-api deployment already had strategy: type: Recreate for exactly this reason (visible in its template comment: "recreate strategy keeps cluster topology simpler at the cost of a brief outage on rollout"). The other deployments missed the same treatment.

Fix

  1. api and ingest-worker switch to strategy: type: Recreate. A rolling update needs 2x capacity; Recreate kills the old pod first so the new one schedules onto the freed slot. A brief outage during rollout is acceptable for these two: api sits behind the web frontend (not on the public LB path), and ingest-worker is a Redis-stream consumer where redelivery is already in the failure model.

  2. web stays on RollingUpdate. Web fronts the public LB; a few seconds of 503 there is user-visible. With api + worker now Recreate, web is the only deployment that still needs a second node — so the scheduling pressure during a rollout is much smaller and Autopilot may not even need to scale up.

  3. --timeout 10m → 20m in the workflow. Belt-and-braces: even when the Recreate strategy avoids the worst case, a cold image pull + readiness delay can still push a single-replica rollout past 10m on Autopilot. 20m is generous without being absurd.

Verified

helm template tracebility deploy/helm/tracebility -f values-gke.yaml --set image.tag=test renders all three deployments (api, ingest-api, ingest-worker) with the Recreate strategy block. web renders without a strategy: field (default = RollingUpdate).
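The same check could be scripted rather than read by eye. The sketch below assumes the `helm template` output has already been parsed into Python dicts (for example with a YAML library, not shown here); `strategy_by_deployment` and the resource names in the usage example are hypothetical, not taken from the chart.

```python
from typing import Any


def strategy_by_deployment(manifests: list[dict[str, Any]]) -> dict[str, str]:
    """Map each Deployment name to its effective rollout strategy.

    A Deployment without a strategy block falls back to the Kubernetes
    default, RollingUpdate.
    """
    result: dict[str, str] = {}
    for doc in manifests:
        # Skip empty YAML documents and non-Deployment resources.
        if not doc or doc.get("kind") != "Deployment":
            continue
        name = doc["metadata"]["name"]
        strategy = doc.get("spec", {}).get("strategy") or {}
        result[name] = strategy.get("type", "RollingUpdate")
    return result


# Hypothetical rendered manifests, reduced to the fields the check reads.
rendered = [
    {"kind": "Deployment", "metadata": {"name": "tracebility-api"},
     "spec": {"strategy": {"type": "Recreate"}}},
    {"kind": "Deployment", "metadata": {"name": "tracebility-web"},
     "spec": {}},
    {"kind": "Service", "metadata": {"name": "tracebility-web"}},
]
print(strategy_by_deployment(rendered))
```

A check like this could run in CI so that a template losing its strategy block is caught before a deploy, though this PR does not add one.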

Test plan

  • CI on this PR goes green.
  • Merge.
  • Post-merge helm-deploy succeeds within the 20m budget.
  • Live site at https://langprobe.daz.co.in serves the new image (web pod's image tag should match the merged commit SHA).

🤖 Generated with Claude Code

…ut to 20m

The previous deploy hit ``Error: UPGRADE FAILED ... context deadline
exceeded`` after 10m. Cluster events showed:

    FailedScheduling pod/tracebility-web-...
    0/2 nodes are available: 1 Insufficient memory ...

GKE Autopilot's just-in-time node scaling kicks in when a rolling
update needs 2x the capacity of one pod (default maxSurge=1). Scale-up
takes ~5 min on top of image pull + probes; helm --wait at 10m wasn't
enough. The atomic rollback then reverted everything to the previous
SHA.

Two changes:

1. ``api`` and ``ingest-worker`` deployments switch to ``strategy:
   type: Recreate`` so a single-replica rollout doesn't briefly need
   2x capacity. ``ingest-api`` already uses Recreate (predates this
   PR). ``web`` keeps RollingUpdate because it fronts the LB and a
   brief outage there is user-visible.

2. ``--timeout 10m`` -> ``20m`` in the workflow so an Autopilot
   scale-up (rare now but possible) doesn't trip the gate.

Verified: ``helm template`` renders all three deployments with the
Recreate strategy block.

Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
@gaurav0107 gaurav0107 merged commit d32c5d0 into main Jun 7, 2026
3 checks passed
gaurav0107 added a commit that referenced this pull request Jun 7, 2026
…est-api (#9)

Two crash-loops on the live deploy after PR #8 unblocked the rollout:

  ingest-api: RuntimeError: TRACEBILITY_CLICKHOUSE_URL is required
  api:        clickhouse_connect.driver.exceptions.DatabaseError:
              Code: 516. Authentication failed: ... default user

Root cause (single pattern, two symptoms):

PR #3 (multi-tenancy) added new ClickHouse-touching code paths in api
and ingest-api but invented a credential-passing convention the helm
chart didn't ship — splitting URL / USER / PASSWORD / DATABASE into
four env vars. The chart's existing tracebility-clickhouse secret has
ONE key (``url``) holding the full DSN with embedded credentials,
matching the postgres / redis pattern.

So in production:
- ingest-api template never set TRACEBILITY_CLICKHOUSE_URL at all
  → config.load() raised on the missing required env var.
- api template DID set TRACEBILITY_CLICKHOUSE_URL, but the new code
  passed username='default' / password='' as kwargs to
  clickhouse_connect.get_async_client(dsn=URL, username=...).
  Those kwargs override the DSN's embedded credentials, so it tried
  to auth as 'default' (which doesn't exist in the cluster).
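
The override described above can be modelled with the standard library alone. This is a sketch of the precedence rule as the commit message describes it, not clickhouse_connect's actual implementation; `effective_credentials`, the DSN, and its host and user are hypothetical.

```python
from urllib.parse import urlsplit


def effective_credentials(dsn, username=None, password=None):
    """Return the (user, password) pair a client would authenticate with
    if explicit keyword arguments take precedence over the DSN."""
    parts = urlsplit(dsn)
    user = username if username is not None else parts.username
    # An empty string is still an explicit value, so it overrides too.
    pwd = password if password is not None else parts.password
    return user, pwd


# Hypothetical DSN in the single-key secret style (credentials embedded).
dsn = "clickhouse://svc_user:s3cret@clickhouse.example:8123/traces"

print(effective_credentials(dsn))                 # ('svc_user', 's3cret')
print(effective_credentials(dsn, "default", ""))  # ('default', '')
```

Passing only the DSN keeps the secret's credentials intact, which is the behaviour the fix restores.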

Fix:

- AuditWriter.from_url(url) now accepts only the DSN; no override
  kwargs. The DSN's embedded credentials carry through.
- reconciler_loop() in both reconcilers: same simplification.
- Settings on api + ingest-api: drop clickhouse_user /
  clickhouse_password / clickhouse_database. Only clickhouse_url.
- ingest-api deployment template: add the missing
  TRACEBILITY_CLICKHOUSE_URL env-from-secret line.
- api deployment template: also pass TRACEBILITY_REDIS_URL (PR #3
  added optional Redis support to the api for api-key invalidation
  + reconciler hooks).

Verified locally:
- ``uv run pytest services/`` → 68 passed.
- ``helm template`` confirms api + ingest-api templates render with
  PG_DSN + REDIS_URL + CLICKHOUSE_URL all sourced from secrets.
- ``ruff check`` and ``ruff format --check`` clean.

Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>