fix(deploy): Recreate strategy + 20m helm timeout (helm rollout timeout) #8
Merged
…ut to 20m
The previous deploy hit ``Error: UPGRADE FAILED ... context deadline
exceeded`` after 10m. Cluster events showed:
FailedScheduling pod/tracebility-web-...
0/2 nodes are available: 1 Insufficient memory ...
GKE Autopilot's just-in-time node scaling kicks in when a rolling
update needs 2x the capacity of one pod (the default maxSurge of 25%
rounds up to 1 extra pod at one replica). Scale-up takes ~5 min on top
of image pull + probes; helm --wait at 10m wasn't
enough. The atomic rollback then reverted everything to the previous
SHA.
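The capacity arithmetic behind this failure can be sketched as below. This is an illustrative calculation, not code from the chart or workflow: the function names are hypothetical, and it assumes the standard Kubernetes behaviour that a percentage ``maxSurge`` (default 25%) is rounded up to a whole pod.

```python
def surge_pods(replicas: int, max_surge_percent: int = 25) -> int:
    """Extra pods a RollingUpdate may create (percentage maxSurge rounds up)."""
    # Integer ceiling of replicas * percent / 100.
    return (replicas * max_surge_percent + 99) // 100


def peak_pods(replicas: int, strategy: str) -> int:
    """Most pods that can exist at once during a rollout (hypothetical helper)."""
    if strategy == "Recreate":
        # Recreate terminates the old pods before creating new ones.
        return replicas
    # RollingUpdate keeps the old pods while the surge pods start.
    return replicas + surge_pods(replicas)


if __name__ == "__main__":
    for strategy in ("RollingUpdate", "Recreate"):
        print(strategy, peak_pods(1, strategy))
```

With one replica, RollingUpdate briefly needs room for two pods while Recreate never needs more than one, which is why the single-replica deployments stop triggering a node scale-up.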
Two changes:
1. ``api`` and ``ingest-worker`` deployments switch to ``strategy:
type: Recreate`` so a single-replica rollout doesn't briefly need
2x capacity. ``ingest-api`` already uses Recreate (predates this
PR). ``web`` keeps RollingUpdate because it fronts the LB and a
brief outage there is user-visible.
2. ``--timeout 10m`` -> ``20m`` in the workflow so an Autopilot
scale-up (rare now but possible) doesn't trip the gate.
Verified: ``helm template`` renders all three deployments with the
Recreate strategy block.
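A check like the one described could be automated roughly as follows. This is a minimal sketch under stated assumptions: the manifests are hand-written stand-ins rather than real ``helm template`` output, ``rollout_strategy`` is a hypothetical helper, and parsing the rendered YAML into dictionaries is left out.

```python
def rollout_strategy(manifest: dict) -> str:
    """Return a Deployment's strategy type, applying the Kubernetes default."""
    # A Deployment without spec.strategy.type defaults to RollingUpdate.
    return manifest.get("spec", {}).get("strategy", {}).get("type", "RollingUpdate")


# Stand-ins for rendered manifests (assumed shape, not copied from the chart).
rendered = {
    "api": {"kind": "Deployment", "spec": {"strategy": {"type": "Recreate"}}},
    "ingest-api": {"kind": "Deployment", "spec": {"strategy": {"type": "Recreate"}}},
    "ingest-worker": {"kind": "Deployment", "spec": {"strategy": {"type": "Recreate"}}},
    "web": {"kind": "Deployment", "spec": {"replicas": 1}},
}

strategies = {name: rollout_strategy(m) for name, m in rendered.items()}

if __name__ == "__main__":
    for name, strategy in strategies.items():
        print(f"{name}: {strategy}")
```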
Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
gaurav0107 added a commit that referenced this pull request Jun 7, 2026
…est-api (#9)

Two crash-loops on the live deploy after PR #8 unblocked the rollout:

ingest-api: RuntimeError: TRACEBILITY_CLICKHOUSE_URL is required
api: clickhouse_connect.driver.exceptions.DatabaseError:
Code: 516. Authentication failed: ... default user

Root cause (single pattern, two symptoms):

PR #3 (multi-tenancy) added new ClickHouse-touching code paths in api
and ingest-api but invented a credential-passing convention the helm
chart didn't ship: splitting URL / USER / PASSWORD / DATABASE into
four env vars. The chart's existing tracebility-clickhouse secret has
ONE key (``url``) holding the full DSN with embedded credentials,
matching the postgres / redis pattern. So in production:

- ingest-api template never set TRACEBILITY_CLICKHOUSE_URL at all →
  config.load() raised on the missing required env var.
- api template DID set TRACEBILITY_CLICKHOUSE_URL, but the new code
  passed username='default' / password='' as kwargs to
  clickhouse_connect.get_async_client(dsn=URL, username=...). Those
  kwargs override the DSN's embedded credentials, so it tried to auth
  as 'default' (which doesn't exist in the cluster).

Fix:

- AuditWriter.from_url(url) now accepts only the DSN; no override
  kwargs. The DSN's embedded credentials carry through.
- reconciler_loop() in both reconcilers: same simplification.
- Settings on api + ingest-api: drop clickhouse_user /
  clickhouse_password / clickhouse_database. Only clickhouse_url.
- ingest-api deployment template: add the missing
  TRACEBILITY_CLICKHOUSE_URL env-from-secret line.
- api deployment template: also pass TRACEBILITY_REDIS_URL (PR #3
  added optional Redis support to the api for api-key invalidation +
  reconciler hooks).

Verified locally:

- ``uv run pytest services/`` → 68 passed.
- ``helm template`` confirms api + ingest-api templates render with
  PG_DSN + REDIS_URL + CLICKHOUSE_URL all sourced from secrets.
- ``ruff check`` and ``ruff format --check`` clean.

Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
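The override behaviour described in the root cause can be modelled with a small sketch. It deliberately does not call ``clickhouse_connect``; ``effective_credentials`` is a hypothetical helper that only mimics "explicit kwargs win over credentials embedded in the DSN", and the DSN below is a made-up placeholder, not the value of the real secret.

```python
from urllib.parse import urlsplit


def effective_credentials(dsn, username=None, password=None):
    """Model which credentials win: explicit kwargs, else those in the DSN."""
    parts = urlsplit(dsn)
    user = parts.username if username is None else username
    pwd = parts.password if password is None else password
    return user, pwd


# Placeholder DSN with embedded credentials (assumed format).
DSN = "clickhouse://svc_user:s3cret@clickhouse.internal:8123/traces"

# After the fix: only the DSN is passed, so its credentials carry through.
dsn_only = effective_credentials(DSN)

# Before the fix: default-valued kwargs silently replaced the DSN credentials.
overridden = effective_credentials(DSN, username="default", password="")

if __name__ == "__main__":
    print(dsn_only)
    print(overridden)
```

Accepting only the DSN removes the second source of truth, so a default-valued setting can no longer mask the credentials stored in the secret.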
Symptom
After PR #7 unblocked the migrator, helm-deploy still failed:
Run 27091541251.
Root cause
Cluster events during the failed rollout:
GKE Autopilot's just-in-time node scaling kicks in when a rolling update needs 2x the capacity of one pod. With ``replicaCount: 1`` + the default ``maxSurge: 25%`` (which rounds up to 1 extra pod), every deploy needs a second node briefly. Autopilot scale-up takes ~5 min on top of image pull + readiness probes; helm ``--wait --timeout 10m`` wasn't enough. ``--atomic`` then reverted everything to the previous SHA, hence the deploy "fails" but the live site (running the old image) stayed up.

The ``ingest-api`` deployment already had ``strategy: type: Recreate`` for exactly this reason (visible in its template comment: "recreate strategy keeps cluster topology simpler at the cost of a brief outage on rollout"). The other deployments missed the same treatment.

Fix

- ``api`` and ``ingest-worker`` switch to ``strategy: type: Recreate``. A rolling update requires 2x capacity; Recreate kills the old pod first so the new one schedules onto the freed slot. A brief outage during rollout is acceptable for these two: ``api`` sits behind the web frontend (not on the public LB path), and ``ingest-worker`` is a Redis-stream consumer where redelivery is already in the failure model.
- ``web`` stays on RollingUpdate. Web fronts the public LB; a few seconds of 503 there is user-visible. With api + worker now Recreate, web is the only deployment that still needs a second node, so the scheduling pressure during a rollout is much smaller and Autopilot may not even need to scale up.
- ``--timeout 10m`` → ``20m`` in the workflow. Belt-and-braces: even when the Recreate strategy avoids the worst case, a cold image pull + readiness delay can still push a single-replica rollout past 10m on Autopilot. 20m is generous without being absurd.

Verified

``helm template tracebility deploy/helm/tracebility -f values-gke.yaml --set image.tag=test`` renders all three deployments (``api``, ``ingest-api``, ``ingest-worker``) with the ``Recreate`` strategy block. ``web`` renders without a ``strategy:`` field (default = RollingUpdate).

Test plan

- ``helm-deploy`` succeeds within the 20m budget.

🤖 Generated with Claude Code
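For orientation, the deploy command with the new budget might be assembled along these lines. This is a sketch, not the actual workflow step: ``helm_upgrade_argv`` is a hypothetical helper, and reusing the release name, chart path, and values file from the ``helm template`` command above is an assumption. Only the ``--wait``, ``--atomic``, and ``--timeout`` flags are taken from the description.

```python
import shlex


def helm_upgrade_argv(image_tag: str, timeout: str = "20m") -> list[str]:
    """Build a helm upgrade command line with the rollout timeout (assumed layout)."""
    return [
        "helm", "upgrade",
        "tracebility", "deploy/helm/tracebility",  # assumed release name and chart path
        "-f", "values-gke.yaml",                   # assumed values file
        "--set", f"image.tag={image_tag}",
        "--wait",       # block until the rollout is ready
        "--atomic",     # roll back to the previous release on failure
        "--timeout", timeout,
    ]


if __name__ == "__main__":
    # Print the command instead of running it; "test" is a placeholder tag.
    print(shlex.join(helm_upgrade_argv("test")))
```

Keeping the timeout as a parameter makes the 10m to 20m change a one-line edit and keeps the budget visible next to ``--atomic``, which is the flag that turns a timeout into a rollback.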