fix(deploy): pass full CH DSN; wire TRACEBILITY_CLICKHOUSE_URL to ingest-api #9

Merged
gaurav0107 merged 1 commit into main from fix/ch-dsn-only on Jun 7, 2026

@gaurav0107

Collaborator

Symptom

After PR #8 unblocked the rollout, two pods crash-looped on the new image:

ingest-api: RuntimeError: TRACEBILITY_CLICKHOUSE_URL is required

api: clickhouse_connect.driver.exceptions.DatabaseError:
     Code: 516. DB::Exception: default: Authentication failed:
     password is incorrect, or there is no user with such name.

Helm --atomic rolled back to the previous (working) image, so the public site stayed up — but the deploy failed.

Root cause (single pattern, two symptoms)

PR #3 (multi-tenancy) added new ClickHouse-touching code paths in api and ingest-api but invented a credential-passing convention the helm chart didn't ship: splitting URL / USER / PASSWORD / DATABASE into four env vars. The cluster's existing tracebility-clickhouse secret has one key (url) holding the full DSN with embedded credentials — matching the postgres / redis pattern that all prior services used.

So in production:

  • ingest-api template never set TRACEBILITY_CLICKHOUSE_URL at all. PR #3's new enforce_quota middleware (which writes quota.block events to the audit log) requires it, so config.load() raised on the missing required env var.
  • api template DID set TRACEBILITY_CLICKHOUSE_URL, but my new code path (AuditWriter.from_url(url, username=..., password=..., database=...)) passed username='default' and password='' as kwargs alongside the DSN. clickhouse-connect treats those kwargs as overrides to the DSN's embedded credentials, so the client tried to authenticate as the default user (which doesn't exist in the cluster — the DB was provisioned with a tracebility user).
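The second symptom comes down to credential precedence. The sketch below illustrates only the behaviour described above (explicit kwargs winning over credentials embedded in the DSN); it uses the standard library and is not clickhouse-connect's actual code. The `resolve_credentials` helper, the DSN value, and its host and password are hypothetical.

```python
from urllib.parse import urlparse


def resolve_credentials(dsn, username=None, password=None):
    """Return the (username, password) a client would end up using.

    Illustrative precedence only: explicit keyword arguments win over
    whatever is embedded in the DSN. Hypothetical helper, not library code.
    """
    parsed = urlparse(dsn)
    user = username if username is not None else parsed.username
    pw = password if password is not None else parsed.password
    return user, pw


# Hypothetical DSN in the shape the secret's `url` key holds.
DSN = "clickhouse://tracebility:s3cret@clickhouse:8123/tracebility"

# Pre-fix call shape: kwargs passed alongside the DSN replace its credentials.
broken = resolve_credentials(DSN, username="default", password="")

# Post-fix call shape: DSN only, so the embedded credentials carry through.
fixed = resolve_credentials(DSN)
```

Under this precedence, `broken` resolves to the `default` user even though the DSN names `tracebility`, which matches the Code 516 authentication failure seen on the api pod.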

This is a textbook case of "added new functionality, didn't verify the existing config plumbing supports it."

Fix

Stop splitting CH credentials. Match the existing pattern.

  • AuditWriter.from_url(url) now accepts only the DSN; the override kwargs are gone. The DSN's embedded creds carry through.
  • reconciler_loop() in both reconciler_quota.py and reconciler_audit.py: same simplification.
  • Settings on api + ingest-api: drop clickhouse_user / clickhouse_password / clickhouse_database. Only clickhouse_url.
  • ingest-api deployment template: add the missing TRACEBILITY_CLICKHOUSE_URL env-from-secret line.
  • api deployment template: also pass TRACEBILITY_REDIS_URL (PR #3 added optional Redis support for the api-key invalidation publish path).
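A minimal sketch of what the simplified settings could look like, assuming a plain dataclass and a `load()` that reads from an environment mapping. The real Settings class and `config.load()` signature are not shown in this PR, so the shapes here are assumptions; only the env var name and the error message are taken from the crash log above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    # Single DSN with embedded credentials; no separate user/password/database.
    clickhouse_url: str


def load(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, failing fast if the DSN is absent."""
    url = env.get("TRACEBILITY_CLICKHOUSE_URL", "").strip()
    if not url:
        # Same message the ingest-api pod reported when the template omitted it.
        raise RuntimeError("TRACEBILITY_CLICKHOUSE_URL is required")
    return Settings(clickhouse_url=url)
```

Keeping one required field means the deployment template and the application agree on a single contract: if the secret's `url` key is wired in, the service starts; if not, it fails at startup rather than at first query.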

Verified locally

  • uv run pytest services/: 68 passed, 11 env-gated skips
  • helm template tracebility deploy/helm/tracebility -f values-gke.yaml --set image.tag=test
    • ingest-api renders with PG_DSN + REDIS_URL + CLICKHOUSE_URL
    • api renders with PG_DSN + REDIS_URL + CLICKHOUSE_URL + SESSION_SECRET ✓
  • uvx ruff check . — All checks passed!
  • uvx ruff format --check . — clean
  • Smoke-imported both apps with realistic env, confirmed clickhouse_url flows through.

Test plan

  • CI on this PR goes green.
  • Merge.
  • Post-merge helm-deploy rolls cleanly: api + ingest-api come up Ready 1/1 (no CrashLoopBackOff).
  • Live site at https://langprobe.daz.co.in/login still serves 200 on the new image.

🤖 Generated with Claude Code

…est-api

Two crash-loops on the live deploy after PR #8 unblocked the rollout:

  ingest-api: RuntimeError: TRACEBILITY_CLICKHOUSE_URL is required
  api:        clickhouse_connect.driver.exceptions.DatabaseError:
              Code: 516. Authentication failed: ... default user

Root cause (single pattern, two symptoms):

PR #3 (multi-tenancy) added new ClickHouse-touching code paths in api
and ingest-api but invented a credential-passing convention the helm
chart didn't ship — splitting URL / USER / PASSWORD / DATABASE into
four env vars. The chart's existing tracebility-clickhouse secret has
ONE key (``url``) holding the full DSN with embedded credentials,
matching the postgres / redis pattern.

So in production:
- ingest-api template never set TRACEBILITY_CLICKHOUSE_URL at all
  → config.load() raised on the missing required env var.
- api template DID set TRACEBILITY_CLICKHOUSE_URL, but the new code
  passed username='default' / password='' as kwargs to
  clickhouse_connect.get_async_client(dsn=URL, username=...).
  Those kwargs override the DSN's embedded credentials, so it tried
  to auth as 'default' (which doesn't exist in the cluster).

Fix:

- AuditWriter.from_url(url) now accepts only the DSN; no override
  kwargs. The DSN's embedded credentials carry through.
- reconciler_loop() in both reconcilers: same simplification.
- Settings on api + ingest-api: drop clickhouse_user /
  clickhouse_password / clickhouse_database. Only clickhouse_url.
- ingest-api deployment template: add the missing
  TRACEBILITY_CLICKHOUSE_URL env-from-secret line.
- api deployment template: also pass TRACEBILITY_REDIS_URL (PR #3
  added optional Redis support to the api for api-key invalidation
  + reconciler hooks).

Verified locally:
- ``uv run pytest services/`` → 68 passed.
- ``helm template`` confirms api + ingest-api templates render with
  PG_DSN + REDIS_URL + CLICKHOUSE_URL all sourced from secrets.
- ``ruff check`` and ``ruff format --check`` clean.

Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
@gaurav0107 gaurav0107 merged commit 2affa8b into main Jun 7, 2026
3 checks passed