fix(migrator): version-track ClickHouse migrations (unblock helm-deploy)#7
Merged
Conversation
The CH migration runner used to re-apply every file every deploy under
the assumption that all SQL was idempotent (CREATE IF NOT EXISTS). That
contract broke with 0006_tenant_columns.sql, which does
CREATE-INSERT-RENAME — not safe to run twice. The result was
helm-deploy failing the pre-upgrade migrator hook with
``BackoffLimitExceeded`` because the rename targeted an already-existing
``run_v1``.
Mirror the postgres pattern:
- 0000_schema_migrations.sql creates a CH ``schema_migrations`` table
(sorted first via the 0000 prefix). The CREATE IF NOT EXISTS is the
one piece that still has to be self-idempotent.
- run.sh applies 0000 unconditionally, then queries
``schema_migrations`` and skips any other file whose basename is
already present. After a successful apply, it inserts the version.
- Each migration file may now use any DDL — CREATE / ALTER / RENAME /
DROP — without needing to be self-idempotent.
For an existing cluster that's already past 0001-0007, manually
backfill before deploying:
insert into schema_migrations (version) values
('0001_runs_and_spans'), ('0002_eval_scores'),
('0003_replay_captures'), ('0004_dataset_items'),
('0005_billing_meters'), ('0006_tenant_columns'),
('0007_audit_log');
(Already done on the live GKE cluster as part of this fix.)
Verified locally: built the migrator image, reproduced the
``run_v1 already exists`` failure on first run, backfilled tracking,
ran twice in a row — both pass with ``applied 0, skipped 7``.
Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Symptom
helm-deployfailed:Migrator pod logs:
Root cause
The migrator's ClickHouse step re-applied every
.sqlfile every deploy. The runner's comment said "All ClickHouse migrations must be idempotent (CREATE TABLE IF NOT EXISTS). ALTER migrations require a version-tracking table before being added." — but PR #3 (multi-tenancy) added0006_tenant_columns.sql, which doesCREATE-INSERT-RENAME. That's safe on a fresh cluster, but the second time it runs:create table if not exists run_v2— no-op (already exists from prior run).insert into run_v2 select ... from run— succeeds, doubles the data.rename table run to run_v1, run_v2 to run— fails becauserun_v1already exists from prior run.The fixture left behind in production was even worse — a leftover
run_v2table from a prior deploy attempt that I had to manually drop.Fix
Mirror the postgres pattern: a CH
schema_migrationstable the runner consults to skip already-applied files.schemas/clickhouse/0000_schema_migrations.sql— bootstrap file (sorts first under glob ordering). The only CH file that still has to be self-idempotent (CREATE IF NOT EXISTS).services/migrator/run.sh— applies 0000 unconditionally, queriesschema_migrations, skips any other file already recorded, inserts the version after a successful apply. Each migration file may now use any DDL —CREATE/ALTER/RENAME/DROP— without needing to be self-idempotent.Production state already restored
I cleaned up the live GKE cluster as part of this work:
Dropped the orphan
run_v2table — leftover from the failed migration attempt.Backfilled the new
schema_migrationstable with rows for 0001-0007 (otherwise the new runner would see them as un-applied and try to re-run them on the next deploy):Verified locally
Built
test-migrator:fixfrom this branch, ran against the local docker stack:schema_migrationsrows: reproduced therun_v1 already existsfailure exactly as production.Test plan
helm-deployjob runs the new migrator →applied 0, skipped 7→ helm rollout proceeds → app pods come up → LB stays healthy.🤖 Generated with Claude Code