fix(deploy): pass full CH DSN; wire TRACEBILITY_CLICKHOUSE_URL to ingest-api#9
Merged
Conversation
…est-api Two crash-loops on the live deploy after PR #8 unblocked the rollout: ingest-api: RuntimeError: TRACEBILITY_CLICKHOUSE_URL is required api: clickhouse_connect.driver.exceptions.DatabaseError: Code: 516. Authentication failed: ... default user Root cause (single pattern, two symptoms): PR #3 (multi-tenancy) added new ClickHouse-touching code paths in api and ingest-api but invented a credential-passing convention the helm chart didn't ship — splitting URL / USER / PASSWORD / DATABASE into four env vars. The chart's existing tracebility-clickhouse secret has ONE key (``url``) holding the full DSN with embedded credentials, matching the postgres / redis pattern. So in production: - ingest-api template never set TRACEBILITY_CLICKHOUSE_URL at all → config.load() raised on the missing required env var. - api template DID set TRACEBILITY_CLICKHOUSE_URL, but the new code passed username='default' / password='' as kwargs to clickhouse_connect.get_async_client(dsn=URL, username=...). Those kwargs override the DSN's embedded credentials, so it tried to auth as 'default' (which doesn't exist in the cluster). Fix: - AuditWriter.from_url(url) now accepts only the DSN; no override kwargs. The DSN's embedded credentials carry through. - reconciler_loop() in both reconcilers: same simplification. - Settings on api + ingest-api: drop clickhouse_user / clickhouse_password / clickhouse_database. Only clickhouse_url. - ingest-api deployment template: add the missing TRACEBILITY_CLICKHOUSE_URL env-from-secret line. - api deployment template: also pass TRACEBILITY_REDIS_URL (PR #3 added optional Redis support to the api for api-key invalidation + reconciler hooks). Verified locally: - ``uv run pytest services/`` → 68 passed. - ``helm template`` confirms api + ingest-api templates render with PG_DSN + REDIS_URL + CLICKHOUSE_URL all sourced from secrets. - ``ruff check`` and ``ruff format --check`` clean. Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com> Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Symptom
After PR #8 unblocked the rollout, two pods crash-looped on the new image:
Helm
--atomicrolled back to the previous (working) image, so the public site stayed up — but the deploy failed.Root cause (single pattern, two symptoms)
PR #3 (multi-tenancy) added new ClickHouse-touching code paths in
apiandingest-apibut invented a credential-passing convention the helm chart didn't ship: splitting URL / USER / PASSWORD / DATABASE into four env vars. The cluster's existingtracebility-clickhousesecret has one key (url) holding the full DSN with embedded credentials — matching the postgres / redis pattern that all prior services used.So in production:
TRACEBILITY_CLICKHOUSE_URLat all — PR Multi-tenancy seam: tenant columns, resolver, quotas, sharded ingest #3's newenforce_quotamiddleware (which writesquota.blockevents to the audit log) requires it. config.load() raised on the missing required env var.TRACEBILITY_CLICKHOUSE_URL, but my new code path (AuditWriter.from_url(url, username=..., password=..., database=...)) passedusername='default'andpassword=''as kwargs alongside the DSN. clickhouse-connect treats those kwargs as overrides to the DSN's embedded credentials, so the client tried to authenticate as thedefaultuser (which doesn't exist in the cluster — the DB was provisioned with atracebilityuser).This is a textbook case of "added new functionality, didn't verify the existing config plumbing supports it."
Fix
Stop splitting CH credentials. Match the existing pattern.
AuditWriter.from_url(url)now accepts only the DSN; the override kwargs are gone. The DSN's embedded creds carry through.reconciler_loop()in bothreconciler_quota.pyandreconciler_audit.py: same simplification.Settingson api + ingest-api: dropclickhouse_user/clickhouse_password/clickhouse_database. Onlyclickhouse_url.TRACEBILITY_CLICKHOUSE_URLenv-from-secret line.TRACEBILITY_REDIS_URL(PR Multi-tenancy seam: tenant columns, resolver, quotas, sharded ingest #3 added optional Redis support for the api-key invalidation publish path).Verified locally
uv run pytest services/→ 68 passed, 11 env-gated skipshelm template tracebility deploy/helm/tracebility -f values-gke.yaml --set image.tag=testuvx ruff check .— All checks passed!uvx ruff format --check .— cleanclickhouse_urlflows through.Test plan
helm-deployrolls cleanly: api + ingest-api come upReady 1/1(noCrashLoopBackOff).🤖 Generated with Claude Code