refactor(consul-postgres-ha): collapse platform plumbing to single mesh-sidecar #96
Merged
h4x3rotab merged 1 commit into dstack-consul-ha-db on May 5, 2026
…decar image

Per-CVM container count drops from 7 → 3 on workers (sidecar + patroni + webdemo) and from 6 → 1 on coordinators (sidecar). The new sidecar image bundles bootstrap-secrets, mesh-conn, consul, and (workers only) envoy behind a tini-wrapped shell init that dispatches on ROLE; the old keepalive placeholder, the four-image lockstep, and the vestigial on-CVM signaling/coturn that had been documented as unused all drop.

CI matrix: 6 → 4 (sidecar, patroni, webdemo, signaling). The sidecar build uses the parent consul-postgres-ha/ as docker context so its multi-stage Dockerfile can pull bootstrap-secrets/ and mesh-conn/ Go sources from sibling subdirs.

cluster.tf: BOOTSTRAP_SECRETS_IMAGE, MESH_CONN_IMAGE, SIGNALING_IMAGE (coordinator) and the matching tfvars all collapse into SIDECAR_IMAGE.

Smoke-tested against a fresh terraform apply on dstack-pha-prod5 (2026-05-04). Soft-kill RTO 27s, hard-kill RTO 33s, cheap rejoin verified, disk-loss rejoin 26s: all within noise of the pre-Gap-2 baselines on the previous multi-container cluster.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
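The `ROLE` dispatch described in the commit message can be pictured with a small shell sketch. This is an illustration only, not the PR's `entrypoint.sh`: the function name `services_for_role` and the `ROLE` values `worker` and `coordinator` are assumptions, and the sketch only reports which bundled processes each role would run rather than starting them.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the PR's entrypoint.sh.
# Prints the bundled processes an init script would start for a given ROLE.
# The role names "worker" and "coordinator" are assumed for illustration.
services_for_role() {
  case "$1" in
    # Workers additionally run envoy.
    worker)      echo "bootstrap-secrets mesh-conn consul envoy" ;;
    # Coordinators run the mesh plumbing without envoy.
    coordinator) echo "bootstrap-secrets mesh-conn consul" ;;
    # Anything else is rejected so a typo in ROLE fails loudly.
    *)           echo "unknown ROLE: $1" >&2; return 1 ;;
  esac
}
```

Calling `services_for_role coordinator` prints `bootstrap-secrets mesh-conn consul`; an unrecognized role returns a non-zero status, which under an init wrapper would stop the container instead of starting a partial set of processes.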
Summary
- The `consul-postgres-ha-mesh-sidecar` image bundles `bootstrap-secrets`, `mesh-conn`, `consul`, and (workers only) `envoy` behind a tini-wrapped shell init script that dispatches on `ROLE`. The old `keepalive` placeholder, four-image lockstep, and vestigial on-CVM `signaling`/`coturn` (documented as unused, since Phala dstack apps don't have UDP ingress yet) all drop.
- CI matrix: 6 → 4 (`mesh-sidecar`, `patroni`, `webdemo`, `signaling`). The `mesh-sidecar` build uses the parent `consul-postgres-ha/` as docker context so its multi-stage Dockerfile can pull `bootstrap-secrets/` and `mesh-conn/` Go sources from sibling subdirs.
- `cluster.tf`: `BOOTSTRAP_SECRETS_IMAGE`, `MESH_CONN_IMAGE`, and the coordinator's `SIGNALING_IMAGE` (plus matching `tfvars`) all collapse into a single `MESH_SIDECAR_IMAGE` / `mesh_sidecar_image`.
- `patroni` and `webdemo` wait on `service_healthy` defined as `[ -s /run/instance/info.json ]`, replacing the old `service_completed_successfully` gate on bootstrap-secrets.

The compose-service name stays `sidecar` (so the per-CVM container is still `dstack-sidecar-1` regardless of which image it points at). The image name has the `mesh-` prefix to make it clear that it is the bundle of mesh plumbing (bootstrap-secrets + mesh-conn + consul + envoy) and not just an Envoy sidecar.

Design brief at `consul-postgres-ha/design/single-sidecar.md` was the spec for this change and is deleted by the PR per the design-doc convention. Open questions decided:

- `mesh-sidecar/`, image suffix `consul-postgres-ha-mesh-sidecar`, terraform var `mesh_sidecar_image`, env var `MESH_SIDECAR_IMAGE`. Old envoy-bootstrap shell folded into `entrypoint.sh`.
- `tester` (netshoot) → dropped from both compose files. The new mesh-sidecar image carries `curl`/`jq`, so `docker exec dstack-sidecar-1 curl ...` replaces it. One `FAILOVER.md` line touched.
- `WEBDEMO_ENABLED=1` conditional → punted. Compose profiles work but introduce a new concept; users adapting the template will be editing `webdemo` out anyway. Left as a note for a future small change.

Smoke test (2026-05-04, fresh `terraform apply` on dstack-pha-prod5)

Cluster: 3 coordinators + 3 workers, `cluster_name=gap2`. Fresh mesh-sidecar image at `ttl.sh/dstack-mesh-sidecar-1777870749:24h` (the rename-pass apply propagates as an env-only update, no CVM destroy/recreate).

…`pg_basebackup`)

The hard-kill rejoin path on this run hit the documented worker↔worker ICE-flake (the same case the `MESH_CONN_RELAY_ONLY=1` escape hatch in `compose/worker.yaml` already exists for): `worker-4` couldn't re-establish ICE handshakes after coming back. None of the consolidation work in this branch changed mesh-conn's behavior; the binary is identical, just packaged differently. The escape hatch is documented in both `worker.yaml` and `coordinator.yaml`. RTO measurements are unaffected by this rejoin slowness because the failover itself completed in the expected time on the surviving replicas: the cluster reached a new healthy quorum before the slow peer's rejoin became relevant.

Test plan

- `docker build -f consul-postgres-ha/mesh-sidecar/Dockerfile consul-postgres-ha` succeeds; image runs all four bundled binaries (`bootstrap-secrets`, `mesh-conn`, `consul` v1.19.2, `envoy` 1.30.11)
- `terraform apply -parallelism=1` brings up 6 CVMs in series; each worker shows exactly 3 containers (`sidecar`/healthy + `patroni` + `webdemo`); each coordinator shows exactly 1 container (`sidecar`/healthy)
- `consul members` shows 3 servers + 3 clients all `alive` after fresh deploy
- `docker logs dstack-sidecar-1`: lines prefixed `[init]`, `[bootstrap-secrets]`, `[mesh-conn]`, `[consul]`, `[envoy]`
- Rename-pass apply (`SIDECAR_IMAGE` → `MESH_SIDECAR_IMAGE`, `sidecar_image` → `mesh_sidecar_image`) propagates as `0 added, 6 changed, 0 destroyed`. Disks survive (no destroy/recreate).

🤖 Generated with Claude Code
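The `service_healthy` gate in the summary rests on the shell test `[ -s /run/instance/info.json ]`, which succeeds only when the file exists and is non-empty. The sketch below illustrates that semantics with a bounded polling loop. The helper names `instance_info_ready` and `wait_for_instance_info`, and the attempt and delay defaults, are hypothetical and not taken from the PR.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the `[ -s FILE ]` gate; not code from the PR.

# Succeeds only when the file exists AND has a size greater than zero.
# Defaults to the path named in the PR summary.
instance_info_ready() {
  [ -s "${1:-/run/instance/info.json}" ]
}

# Poll until the file is ready or the attempt budget runs out.
# $1 = path, $2 = max attempts (assumed default 30), $3 = delay in seconds (assumed default 1)
wait_for_instance_info() {
  path=$1
  attempts=${2:-30}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    instance_info_ready "$path" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

An empty `info.json` does not pass the gate, so a dependent service would not start on a file that has been created but not yet written. That is the practical difference from a bare existence check such as `[ -e ... ]`.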