fix(migrator): version-track ClickHouse migrations (unblock helm-deploy) #7

Merged: gaurav0107 merged 1 commit into main from fix/ch-migrator-tracking on Jun 7, 2026

Conversation

@gaurav0107

Collaborator

Symptom

helm-deploy failed:

Error: UPGRADE FAILED: pre-upgrade hooks failed: 1 error occurred:
        * job tracebility-migrator-20 failed: BackoffLimitExceeded

Migrator pod logs:

==> ClickHouse: applying all migrations (CREATE IF NOT EXISTS)
  apply: 0006_tenant_columns
ERROR: ClickHouse returned HTTP 500 on statement from 0006_tenant_columns.sql
Code: 57. DB::Exception: Table tracebility.run_v1 already exists.

Root cause

The migrator's ClickHouse step re-applied every .sql file on every deploy. The runner's comment said "All ClickHouse migrations must be idempotent (CREATE TABLE IF NOT EXISTS). ALTER migrations require a version-tracking table before being added." However, PR #3 (multi-tenancy) added 0006_tenant_columns.sql, which does CREATE-INSERT-RENAME. That is safe on a fresh cluster, but on the second run:

  1. create table if not exists run_v2: no-op (already exists from the prior run).
  2. insert into run_v2 select ... from run: succeeds, but doubles the data.
  3. rename table run to run_v1, run_v2 to run: fails because run_v1 already exists from the prior run.

The state left behind in production was even worse: a leftover run_v2 table from a prior deploy attempt, which I had to drop manually.

Fix

Mirror the postgres pattern: a CH schema_migrations table the runner consults to skip already-applied files.

  • schemas/clickhouse/0000_schema_migrations.sql — bootstrap file (sorts first under glob ordering). The only CH file that still has to be self-idempotent (CREATE IF NOT EXISTS).
  • services/migrator/run.sh — applies 0000 unconditionally, queries schema_migrations, skips any other file already recorded, inserts the version after a successful apply. Each migration file may now use any DDL — CREATE / ALTER / RENAME / DROP — without needing to be self-idempotent.
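
The skip-and-record flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of run.sh: ClickHouse is replaced by a temp file so the loop can be run anywhere, and the function names (`apply_file`, `is_applied`, `record_version`, `run_migrations`) and the three sample files are hypothetical.

```shell
#!/bin/sh
# Sketch of a version-tracked migration loop (assumed structure, not run.sh).
set -eu

work=$(mktemp -d)
schemas="$work/schemas"
tracking="$work/schema_migrations.txt"   # stand-in for the CH schema_migrations table
mkdir -p "$schemas"
: > "$tracking"

# Sample migration files; the real ones live under schemas/clickhouse/.
for v in 0000_schema_migrations 0001_runs_and_spans 0006_tenant_columns; do
  echo "-- $v" > "$schemas/$v.sql"
done

apply_file()     { :; }                              # stand-in: would send the SQL to ClickHouse
is_applied()     { grep -qxF "$1" "$tracking"; }     # stand-in: lookup in schema_migrations
record_version() { echo "$1" >> "$tracking"; }       # stand-in: insert into schema_migrations

run_migrations() {
  applied=0
  skipped=0
  for f in "$schemas"/*.sql; do          # glob expands in sorted order, so 0000 comes first
    version=$(basename "$f" .sql)
    if [ "$version" = "0000_schema_migrations" ]; then
      apply_file "$f"                    # bootstrap file: always applied, must be idempotent
      continue
    fi
    if is_applied "$version"; then
      skipped=$((skipped + 1))
      continue
    fi
    if ! apply_file "$f"; then
      echo "ERROR: failed applying $version" >&2
      return 1
    fi
    record_version "$version"            # recorded only after a successful apply
    applied=$((applied + 1))
  done
  echo "==> ClickHouse: applied $applied, skipped $skipped"
}

first=$(run_migrations)
second=$(run_migrations)
echo "$first"
echo "$second"
rm -rf "$work"
```

With the three sample files, the first pass reports `applied 2, skipped 0` and the second reports `applied 0, skipped 2`, which is the same shape as the `applied 0, skipped 7` line the real migrator prints once every version is recorded.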

Production state already restored

I cleaned up the live GKE cluster as part of this work:

  1. Dropped the orphan run_v2 table — leftover from the failed migration attempt.

  2. Backfilled the new schema_migrations table with rows for 0001-0007 (otherwise the new runner would see them as un-applied and try to re-run them on the next deploy):

    insert into schema_migrations (version) values
      ('0000_schema_migrations'),
      ('0001_runs_and_spans'), ('0002_eval_scores'),
      ('0003_replay_captures'), ('0004_dataset_items'),
      ('0005_billing_meters'), ('0006_tenant_columns'),
      ('0007_audit_log');
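
For a cluster with many applied migrations, a backfill statement like the one above could be generated from the file names instead of typed by hand. The following is a sketch under the assumption that each version string is the file's basename without the .sql suffix, as in the values above; the temp directory and the three sample files are stand-ins for schemas/clickhouse/.

```shell
#!/bin/sh
# Build a backfill insert from migration file names (illustrative only).
set -eu

dir=$(mktemp -d)                         # stand-in for schemas/clickhouse/
for v in 0000_schema_migrations 0001_runs_and_spans 0002_eval_scores; do
  : > "$dir/$v.sql"
done

values=""
for f in "$dir"/*.sql; do                # sorted glob order
  version=$(basename "$f" .sql)
  values="${values:+$values, }('$version')"   # comma-separate all but the first
done

stmt="insert into schema_migrations (version) values $values;"
echo "$stmt"
rm -rf "$dir"
```

Review the generated statement before running it: it should list only files that have actually been applied to the target cluster.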

Verified locally

Built test-migrator:fix from this branch, ran against the local docker stack:

  • First run with no schema_migrations rows: reproduced the run_v1 already exists failure exactly as in production.
  • Backfilled tracking, then ran twice:
    ==> ClickHouse: applied 0, skipped 7
    ==> migrator: done
    
  • Migrator now exits 0 cleanly.

Test plan

  • CI on this PR goes green.
  • Merge.
  • helm-deploy job runs the new migrator → applied 0, skipped 7 → helm rollout proceeds → app pods come up → LB stays healthy.

🤖 Generated with Claude Code

The CH migration runner used to re-apply every file every deploy under
the assumption that all SQL was idempotent (CREATE IF NOT EXISTS). That
contract broke with 0006_tenant_columns.sql, which does
CREATE-INSERT-RENAME — not safe to run twice. The result was
helm-deploy failing the pre-upgrade migrator hook with
``BackoffLimitExceeded`` because the rename targeted an already-existing
``run_v1``.

Mirror the postgres pattern:

- 0000_schema_migrations.sql creates a CH ``schema_migrations`` table
  (sorted first via the 0000 prefix). The CREATE IF NOT EXISTS is the
  one piece that still has to be self-idempotent.
- run.sh applies 0000 unconditionally, then queries
  ``schema_migrations`` and skips any other file whose basename is
  already present. After a successful apply, it inserts the version.
- Each migration file may now use any DDL — CREATE / ALTER / RENAME /
  DROP — without needing to be self-idempotent.

For an existing cluster that's already past 0001-0007, manually
backfill before deploying:

    insert into schema_migrations (version) values
      ('0001_runs_and_spans'), ('0002_eval_scores'),
      ('0003_replay_captures'), ('0004_dataset_items'),
      ('0005_billing_meters'), ('0006_tenant_columns'),
      ('0007_audit_log');

(Already done on the live GKE cluster as part of this fix.)

Verified locally: built the migrator image, reproduced the
``run_v1 already exists`` failure on first run, backfilled tracking,
ran twice in a row — both pass with ``applied 0, skipped 7``.

Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
@gaurav0107 merged commit f77f6ca into main on Jun 7, 2026
3 checks passed