
consul-postgres-ha stage 4 — Postgres HA across dstack-TEE workers#95

Draft
h4x3rotab wants to merge 47 commits into main from dstack-consul-ha-db

Conversation

@h4x3rotab
Contributor

Summary

Adds stage 4 of the consul-postgres-ha example: a 3-coordinator + 3-worker dstack cluster running highly-available Postgres via Patroni, with leader election driven by Consul KV. All replication and Consul gossip travel over a custom userspace mesh (mesh-conn) that hole-punches between TEE CVMs via pion/ICE and multiplexes streams over QUIC.

The PR is large (43 commits, ~9.5k LoC) because it brings the full example end-to-end — phase-0 ICE feasibility through stage-3 Consul Connect through stage-4 Patroni with TEE-derived secrets and in-place env updates. Each commit is independently reviewable, and the conventional-commits structure makes the chronological narrative easy to follow.

What ships in stage 4

  • mesh-conn — userspace UDP+TCP port-forwarder over pion/ICE + quic-go. Each peer multiplexes 8 identity ports per pair; the QUIC layer provides loss recovery + stream multiplexing on top of pion's lossy UDP underlay.
  • bootstrap-secrets — init container that calls the dstack SDK's getKey() and writes per-CVM TEE-derived secrets to /run/secrets/, plus the per-CVM identity (role, ordinal, port table) to /run/instance/info.json.
  • patroni — Postgres + Patroni baked together; reads identity from /run/instance/info.json, joins Consul via mesh-conn, participates in leader election.
  • cluster-example/cluster.tf — one terraform apply brings up 3 coordinators + 3 workers across CVMs, propagates PEERS_JSON via in-place env updates, preserves disks across topology changes (storage_fs = "zfs").
  • CI — new consul-postgres-ha-publish.yml workflow builds and publishes all six images (mesh-conn, bootstrap-secrets, signaling, webdemo, sidecar, patroni) to GHCR with Sigstore-backed GitHub Build Provenance attestations on every push to main. Consumers verify with gh attestation verify oci://...@<digest> --repo Dstack-TEE/dstack-examples.

What's verified end-to-end on a live cluster

Full reproducible recipes in consul-postgres-ha/stage4/{FAILOVER,PUBLISHING}.md, plus diagnostic artifacts.

| Property | Result |
| --- | --- |
| Multi-CVM Patroni HA via Consul KV | ✅ 3-replica streaming on timeline 22 |
| pg_basebackup over QUIC mesh-conn | ✅ ~25 MB/s sustained between dstack workers |
| Soft-kill failover RTO | ✅ ~24s (kill leader's patroni → first write on new leader) |
| Hard-kill failover RTO | ✅ ~33s (kill all containers on leader CVM) |
| Cheap rejoin (WAL replay) | ✅ ~31s on hard-kill, no pg_basebackup needed |
| Disk-loss rejoin (full pg_basebackup over mesh-conn) | ✅ 7s for 5.2 MB, picks correct bootstrap path |
| In-place env updates via terraform apply | ✅ requires phala-network/phala 0.2.0-beta.3 (see Phala-Network/terraform-provider-phala#8) |
| Disk persistence across CVM rebuilds | ✅ pgdata + Patroni state survive terraform apply |

Notable transport-layer story (the highlight reel)

The mesh-conn started life on yamux. Yamux assumes a reliable byte-stream underlay; pion/ice.Conn is UDP. Between dstack worker CVMs the UDP path is brutally lossy (~99% one direction on hairpin, ~78% on coturn relay), and yamux's keepalive/recv-window invariants tripped under any sustained load, surfacing as "keepalive timeout" / "recv window exceeded" errors whose real cause was dropped packets violating yamux's reliability assumptions.

Swapped yamux → quic-go with a net.PacketConn shim around ice.Conn. QUIC has loss recovery + stream multiplexing built in — exactly what an unreliable datagram underlay needs. Same hairpin path that killed yamux at 3 KB now sustains 25–28 MB/s for pg_basebackup. See consul-postgres-ha/stage4/RESUME.md for the full diagnosis and consul-postgres-ha/stage4/quic-on-ice/ for the standalone smoke test.

Why draft

Marking draft because:

  • The CI publish workflow has never run (this is the first push triggering it). Want to see the matrix build go green and confirm GHCR images + attestations land before flipping ready-for-review.
  • The 43-commit history is intentionally preserved to capture the engineering narrative; if reviewers prefer a squash strategy or a more aggressive history rewrite (e.g. dropping the early-experiment phase-0 / stage-1 commits if those have already merged elsewhere), happy to restructure.
  • Curious whether reviewers want me to fold the phala-cloud submodule pointer bump (chore(terraform): bump submodule to v0.2.0-beta.3 Phala-Network/phala-cloud#248) into this PR's narrative, or treat that as already-handled out-of-band.

Test plan

  • go test ./... clean across all stage-4 modules (mesh-conn, bootstrap-secrets, quic-on-ice)
  • All six Dockerfiles build clean (verified via docker build + push to ttl.sh + live deploy)
  • Failover demo (soft + hard) reproduces against the live cluster
  • Disk-loss rejoin reproduces against the live cluster
  • terraform apply env-update propagates without CVM destroy/recreate (verified against phala-network/phala 0.2.0-beta.3)
  • CI publish workflow (consul-postgres-ha-publish.yml) runs cleanly on first push and produces verifiable attestations on GHCR — pending this PR's first run

Known follow-ups (not blocking merge)

  • consul-postgres-ha/stage4/RESUME.md is now obsolete (originally a session-bridging doc; superseded by README.md + FAILOVER.md + PUBLISHING.md). Will delete once the PR is approaching ready-for-review, unless reviewers want to keep it as an engineering narrative.
  • If/when phala-network/phala ships a stable 0.2.0 release, the cluster.tf provider pin can move from exact (0.2.0-beta.3) to ~> 0.2.

🤖 Generated with Claude Code

h4x3rotab and others added 30 commits May 1, 2026 20:53
Adds a phase-0 experiment to verify whether dstack CVMs can establish
direct UDP paths via NAT hole-punching, as a prerequisite for running
Consul (or any UDP-gossip service mesh) across CVMs over the TCP-only
dstack-gateway.

Components:
- coordinator/: docker-compose for coturn (STUN+TURN, UDP+TCP) plus a
  tiny HTTP signaling broker, deployed on a user-provided public-IP host
- phase0/icetest/: single Go binary with two modes
  - signaling: ferries ICE candidates and ufrag/pwd between two peers
  - peer: runs pion/ice against coturn, exchanges candidates via the
    broker, performs connectivity check, sends 20 echo round-trips, and
    logs the winning candidate-pair type + RTT
- phase0/docker-compose.yaml: dstack-CVM compose that runs the peer
- deploy/phase0-results.md: result of the live run

Result: direct hole-punched UDP works between two dstack CVMs (srflx
candidates via NAT hairpinning), median RTT ~6.6 ms over the public
internet path, no TURN relay needed. TURN is available as fallback.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds on phase-0's "direct UDP hole-punch works on dstack" finding.
Stage 1 wraps a pion/ice connection in a TUN device so arbitrary IP
traffic can flow between two CVMs, not just hand-written echo packets.

Components:
- stage1/mesh-conn: ~280 LoC Go, single binary
  - opens TUN mesh0 with a virtual /24 IP
  - establishes one pion/ice connection to its partner via the same
    coturn + signaling broker phase-0 used
  - 1:1 pumps L3 packets between TUN and ice.Conn (no framing — ice
    rides on UDP, datagram boundaries are preserved)
  - logs the selected ICE candidate pair for visibility
- stage1/docker-compose.yaml: mesh-conn + nicolaka/netshoot tester,
  both on network_mode: host

Result (deploy/stage1-mvp-results.md): direct host<->srflx hole-punched
path, ICMP through the tunnel runs at 4.8–8.4 ms RTT, matching phase-0
native UDP latency. Confirms userspace overhead is negligible.

Caveat: docker-bridge networking forces ICE onto the TURN relay path
(observed 163 ms RTT in the broken run) because srflx replies can't
route through the bridge NAT. mesh-conn must run with
network_mode: host on dstack.

This MVP is a stepping stone — next iteration replaces TUN with a
userspace port-forwarding agent so apps just bind localhost:<port>
upstreams at peers, no virtual L3, no kernel routing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the TUN-based overlay with a much simpler userspace UDP
port-forwarding agent. No TUN device, no virtual L3, no NET_ADMIN
capability, no songgao/water dependency — just `net.ListenUDP` per
peer-pair, bridged 1:1 with one pion/ice connection.

Identity-port convention:
  Each peer has a unique 16-bit identity port. On every host:
  - the local app binds 127.0.0.1:<own_port>
  - mesh-conn binds 127.0.0.1:<other_peer_port> for every OTHER peer
  - apps reach peer X by sending UDP to 127.0.0.1:<X_port>

The source-port-preservation trick (mesh-conn's bound socket is the
*sender peer's* identity port) means the receiving app sees inbound
packets as coming from 127.0.0.1:<sender_id_port>, which is the address
the cluster's peer-discovery / membership protocol uses to identify the
sender. So Consul or any membership-aware service plugs in unchanged.
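For illustration, a PEERS_JSON at this single-port stage might look like
the following (field names and port values are illustrative; the real
schema lives in mesh-conn, and a later commit widens `port` to a list):

```json
[
  { "id": "ctrl", "port": 18000 },
  { "id": "w1",   "port": 18001 },
  { "id": "w2",   "port": 18002 }
]
```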

Verified end-to-end on two dstack CVMs (deploy/stage1-portfwd-results.md):
ICE selected the direct host<->prflx hole-punched path; 5/5 socat-based
UDP round-trips delivered the correct payload through the bridge.

Why this is the right shape:
  The TUN approach (committed earlier as a milestone) gave us a virtual
  L3 we didn't actually need. Apps in a service-mesh demo just want
  "send UDP to a stable peer address" — userspace bridge is enough,
  cheaper to operate (no TUN device on host), and easier to reason about
  for stage-2 attestation gating.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orkers)

Deploys the port-forwarder on 4 dstack CVMs with a shared PEERS_JSON,
verifies all 6 ICE links come up concurrently and traffic flows in every
direction without code changes — mesh-conn already iterates peers
generically. 12/12 cross-peer one-way UDP datagrams delivered, all paths
direct hole-punch (no TURN relay selected).

Sets up the next decision: how to carry TCP across CVMs so Consul's RPC
+ gossip-state-sync work. Plan documented at the bottom of the result
file: add a multiplexed TCP path to mesh-conn rather than routing TCP
via dstack-gateway, so apps only have to know about one transport.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layers TCP forwarding onto the existing UDP port-forwarder so Consul
(which uses both UDP gossip and TCP RPC + gossip-state-sync on the same
serf port) and any other TCP-needing service can run cross-CVM through
the same agent.

How it works:
  - Each peer-pair still has exactly one pion/ice connection.
  - That connection is wrapped in a yamux session; the lex-smaller peer
    is the yamux client, matching the ICE Dial/Accept convention.
  - Each yamux stream's first byte tags its purpose:
      0x55 streamUDP — long-lived control stream, length-prefixed UDP
                       datagrams flow both ways
      0x33 streamTCP — per-connection ephemeral, raw byte splice
  - mesh-conn now binds both a UDP socket and a TCP listener on
    127.0.0.1:<peer-port>; local Accept on either opens a stream of the
    matching tag. On the remote side a new TCP stream causes a Dial to
    127.0.0.1:<self-port> and bidirectional splice.

Verified end-to-end on the existing 4-CVM cluster: 12/12 cross-peer HTTP
curls succeeded through the bridge (deploy/stage1-tcp-results.md). UDP
fan-out from earlier still works.

Single ICE conn + yamux mux trades a small head-of-line risk for
halving NAT-mapping pressure vs running separate UDP and TCP ICE
connections. Acceptable for Consul-grade traffic; can split later if
jitter sensitivity demands it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pair

Extends mesh-conn so each peer can forward several ports through a
single ICE+yamux pair. Required for Consul, which advertises one bind
address but uses several ports for distinct protocols (serf-LAN gossip,
server-RPC, HTTP API, gRPC/xDS); each protocol needs its own per-peer
identity port for the source-port-preservation trick to work.

Changes:
  - Peer.Port int -> Peer.Ports []int (PEERS_JSON now carries a list per
    peer; index i is the same protocol across peers)
  - yamux stream header grew from 1 byte to 3:
    [tag (1)] [receiver-side port, uint16 BE (2)]
  - Per peer-pair: still one ICE conn + one yamux session
    * lex-smaller side opens N long-lived UDP streams up front, one per
      port, each tagged with the peer's port for that index
    * lex-larger side accepts, looks the port up in self.Ports, pairs
      with the matching local UDP socket
    * TCP: per-connection ephemeral streams, header carries the dst
      port so the receiver dials its own matching local listener
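The grown 3-byte header is cheap to encode and decode; a sketch
(helper names are illustrative, not the actual mesh-conn code):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeStreamHeader builds the 3-byte stream header described above:
// a purpose tag, then the receiver-side port as big-endian uint16.
func encodeStreamHeader(tag byte, port uint16) [3]byte {
	var h [3]byte
	h[0] = tag
	binary.BigEndian.PutUint16(h[1:], port)
	return h
}

// decodeStreamHeader splits the header back into tag and port.
func decodeStreamHeader(h [3]byte) (tag byte, port uint16) {
	return h[0], binary.BigEndian.Uint16(h[1:])
}

func main() {
	h := encodeStreamHeader(0x33, 18102) // TCP stream headed for port 18102
	tag, port := decodeStreamHeader(h)
	fmt.Printf("0x%02x %d\n", tag, port) // prints: 0x33 18102
}
```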

Verified: 4-CVM cluster (ctrl + 3 workers), 4 ports per peer.
deploy/stage1-multiport-results.md — 48/48 cross-peer HTTP fetches
through the bridge succeeded (4 protocol slots × 12 directed peer-pairs).
All ICE pairs landed on direct host<->{prflx,srflx} paths, no relay.

Trade-offs in design notes: one ICE+yamux per pair was preferred over
one ICE per port to keep NAT-mapping pressure low (6 pairs vs 24) and
to give an all-or-none readiness guarantee for the protocol slots.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…verlay

Stands up a real HashiCorp Consul cluster (1 server + 3 clients) on
four TEE-isolated dstack CVMs whose only inter-CVM data path is the
userspace mesh-conn ICE+yamux port-forwarder built in stage 1.

What's new:
  - stage2/docker-compose.yaml: per-peer compose with mesh-conn,
    a hashicorp/consul:1.19 agent, and a netshoot tester sidecar,
    all on network_mode: host.
  - Consul launched via shell wrapper that branches on ROLE env var:
    server (-server -bootstrap-expect=1 -ui) for ctrl,
    client (-retry-join=127.0.0.1:CTRL_SERF_LAN_PORT) for workers.
  - Each peer's agent binds to 127.0.0.1 with its own per-protocol
    identity ports (serf=180XX, RPC=181XX, HTTP=182XX, gRPC=183XX),
    matching the mesh-conn port plan; mesh-conn forwards each port
    to the corresponding peer and source-port-preservation makes the
    addresses look right from every Consul agent's perspective.
  - stage2/README.md documents the port plan and how Consul gossips
    peer ports so workers can dial the leader's RPC port through the
    overlay.

Verified (deploy/stage2-results.md):
  - All 4 peers see all 4 members alive in /v1/agent/members.
  - All 4 peers agree leader = 127.0.0.1:18100 (ctrl's RPC port via
    the overlay).
  - KV write from w1 (curl PUT) is readable from w3 (curl GET) — RPC
    to the leader and Raft replication both work across the overlay.

Confirms that Consul's three transport classes (UDP gossip, TCP RPC,
TCP HTTP API) all round-trip cleanly through one yamux session per
peer-pair.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…the overlay

Walks through the exact recipe for running a Consul cluster across
dstack CVMs that have no direct connectivity to each other:

- Why the per-protocol identity-port plan exists and how it falls out
  of mesh-conn's source-port-preservation behaviour.
- The compose layout (mesh-conn + consul + tester, all on host
  networking) and each non-obvious flag explained: bind/advertise on
  127.0.0.1, per-peer -serf-lan-port / -server-port / -http-port /
  -grpc-port overrides, why -dns-port=-1.
- The per-CVM env-var matrix for PEER_ID / ROLE / *_PORT /
  CTRL_SERF_LAN_PORT / PEERS_JSON.
- What the boot sequence actually looks like (mesh-conn → Consul
  agents → leader election → workers join).
- How to verify membership, leader, and cross-peer KV.

The aim is that the next person setting this up doesn't have to
reverse-engineer the trick from the compose file. The whole thing
collapses to: "Consul never sees the overlay; identity ports +
source-port preservation make every peer look like it's on the same
loopback."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fan-out

Layers a small user-facing demo on top of the stage-2 Consul cluster.
Each peer runs a tiny Go service (stage3a/webdemo, ~150 LoC) that:
  - registers with the local Consul agent on startup as service
    "webdemo" with an HTTP health-check on /hello
  - exposes /hello returning "hello from <peer>"
  - exposes /all that queries /v1/catalog/service/webdemo on local
    Consul and fans out /hello calls to every instance returned

The addresses Consul hands back (127.0.0.1:<peer's webdemo port>)
are routed through mesh-conn to the right peer with no app-side
awareness of the overlay. Per-peer port plan grew by one slot
(index 4 = webdemo HTTP, ports 18500-18503).

Verified end-to-end across 4 CVMs (deploy/stage3a-results.md):
  - all 4 webdemos register with the cluster (catalog visible from
    every peer)
  - /all from every peer returns 4 hellos: ctrl, w1, w2, w3
  - HTTP fan-out crosses CVM boundaries via mesh-conn for every
    non-self peer

Bug caught and fixed in this round: Consul's
/v1/agent/service/register requires PUT, not POST (returned 405 on
first try).

Sets up stage 3b: replace the plain HTTP path with Connect sidecars
and explicit intentions for mTLS between services.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er the overlay

Replaces stage-3a's plain HTTP service-to-service calls with a real
Consul Connect mesh. Each peer now also runs an Envoy sidecar in front
of its webdemo; sidecars do mTLS to each other across the overlay and
intentions gate the connections.

What's new:
  - stage3b/sidecar/Dockerfile: small custom image combining the
    consul CLI (for `consul connect envoy -bootstrap`) with Envoy
    contrib v1.30. Tiny — no full Consul agent, just the CLI.
  - stage3b/webdemo: webdemo registers with a Connect.SidecarService
    block telling Consul to manage a sidecar that listens on the
    per-peer sidecar_public port and exposes one upstream
    "webdemo" on local 127.0.0.1:19000. /all hits the upstream N
    times so Envoy's LB rotates across all 4 instances.
  - stage3b/docker-compose.yaml: adds the sidecar service, enables
    Connect on the Consul agent (-hcl 'connect{enabled=true}'),
    PEERS_JSON now has 6-element ports lists (the new sidecar_public
    slot, 18600..18603) so mesh-conn forwards mTLS traffic between
    peer sidecars.

Verified end-to-end (deploy/stage3b-results.md):
  - All 4 sidecars boot cleanly; Envoy logs show clusters loaded and
    listeners up (public_listener and webdemo upstream).
  - With intention webdemo->webdemo: allow, /all from w1 returns
    perfectly balanced load: 2/2/2/2 across ctrl, w1, w2, w3.
  - Flip intention to deny: 6/8 calls fail with EOF (peer sidecars
    reject the mTLS handshake). Flip back to allow: full balance
    restored. Intention enforcement is real.

Bug caught: Consul's /v1/connect/intentions create wants POST (not
PUT). Update-by-ID uses PUT. Two endpoints, two methods — easy to
trip on; called out in the results doc.

Combined picture: a HashiCorp Consul service mesh — Envoy sidecars,
mTLS, intention enforcement — running across four TEE-isolated dstack
CVMs whose only inter-CVM data path is our userspace ICE+yamux
overlay. Apps and Envoy never see the overlay; from any CVM the mesh
looks like a single loopback-only host with peers on 127.0.0.1:<port>.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ARCHITECTURE.md walks the four-layer data plane (rendezvous infra →
ICE+yamux overlay → identity-port forwarder → apps), traces a single
Connect mTLS call all the way through, and is precise about the
mesh-conn × yamux wire format: one ICE conn per peer-pair, one yamux
session per ICE conn, 3-byte stream header (tag, receiver-side port),
2-byte length prefix on UDP datagrams.  Includes the four-pump
diagram and the actual pump bodies for both UDP and TCP paths.

ROBUSTNESS.md is an honest review: per-layer failure modes, what
recovers automatically, what doesn't, plus a prioritised punch list.
Headlines:
  - mesh-conn has one real bug today (auth-channel reconnect
    deadlock) that will bite the first ICE drop. ~30 LoC fix.
  - Single Consul server is the biggest structural SPOF; 3-server
    quorum is the obvious "leave it running" upgrade.
  - Gossip key + RPC TLS not configured today; defence-in-depth gap
    masked by Layer-3 mTLS but should be closed.
  - Coordinator is a SPOF for new joins (not for established
    traffic); two-coordinator setup + signed signalling messages
    closes both that and the metadata-spoof gap.
  - "Are we playing too many tricks?" — no. The clever-and-ours
    surface is just mesh-conn (~330 LoC) and the identity-port
    plan; everything else is well-trodden libraries (pion/ice,
    yamux, Consul, Envoy). Risk is concentrated in the small
    custom shim, not in the count of layers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, validation

Three robustness fixes from ROBUSTNESS.md's punch list, paired together
because they all touch mesh-conn / Consul agent config.

#1: mesh-conn auth-channel reconnect deadlock
  Each dialICE attempt now installs a fresh peerSession (new
  *ice.Agent + new authCh), replacing any prior one in the global
  map. pollLoop looks up currentSession() per message; if no active
  attempt exists, the message is dropped (rather than buffered into
  a stale channel that would later poison a reconnect). Fixes the
  hang where, after an ICE drop, the next dialICE blocks forever on
  <-sess.authCh because the channel still held a stale auth from
  the previous attempt.

#4: PEERS_JSON validation at mesh-conn startup
  validatePeers() in main.go fails fast on:
    - <2 peers
    - empty peer id, duplicate id
    - empty Ports list, port out of [1, 65535]
    - duplicate port within a peer's own Ports list
    - port collision between two peers (must be globally unique
      because mesh-conn binds OTHER peers' ports on 127.0.0.1)
    - port-list length mismatch across peers (every peer must use
      the same number of protocol slots, by index)
    - PEER_ID not in PEERS_JSON
  Also logs a digest of the canonical PEERS_JSON so operators can
  grep across CVM logs to confirm every peer sees the same config.
  Tests in validate_test.go cover all cases (8 tests, all passing).
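A condensed sketch of these checks (not the actual validatePeers;
several of the listed cases are folded together here):

```go
package main

import "fmt"

// Peer is the post-change shape: one id, a list of protocol ports.
type Peer struct {
	ID    string
	Ports []int
}

// validatePeers fails fast on the config errors listed above: too few
// peers, empty/duplicate ids, out-of-range or non-unique ports, uneven
// protocol-slot counts, and a self id missing from the list.
func validatePeers(selfID string, peers []Peer) error {
	if len(peers) < 2 {
		return fmt.Errorf("need at least 2 peers, got %d", len(peers))
	}
	ids := map[string]bool{}
	owner := map[int]string{} // global uniqueness: mesh-conn binds others' ports locally
	slots := len(peers[0].Ports)
	for _, p := range peers {
		if p.ID == "" || ids[p.ID] {
			return fmt.Errorf("empty or duplicate peer id %q", p.ID)
		}
		ids[p.ID] = true
		if len(p.Ports) == 0 || len(p.Ports) != slots {
			return fmt.Errorf("peer %s: want %d protocol slots, got %d", p.ID, slots, len(p.Ports))
		}
		for _, port := range p.Ports {
			if port < 1 || port > 65535 {
				return fmt.Errorf("peer %s: port %d out of range", p.ID, port)
			}
			if prev, dup := owner[port]; dup {
				return fmt.Errorf("port %d used by both %s and %s", port, prev, p.ID)
			}
			owner[port] = p.ID
		}
	}
	if !ids[selfID] {
		return fmt.Errorf("PEER_ID %q not present in PEERS_JSON", selfID)
	}
	return nil
}

func main() {
	peers := []Peer{{ID: "ctrl", Ports: []int{18000, 18100}}, {ID: "w1", Ports: []int{18001, 18101}}}
	fmt.Println(validatePeers("w1", peers)) // prints: <nil>
	fmt.Println(validatePeers("w9", peers) != nil)
}
```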

#2: Consul gossip key
  Stage 2/3a/3b composes now require a GOSSIP_KEY env var and pass
  it to consul agent via -encrypt=$GOSSIP_KEY. Encrypts serf-LAN
  gossip end-to-end (UDP+TCP) on every agent. Generated at deploy
  time via openssl rand -base64 32. Layer-3 mTLS already protects
  payloads; this hardens the membership/check-result path which
  rides outside Connect.

  RPC TLS deferred to the dev-experience restructure where central
  cert provisioning fits naturally; gossip key is the bigger gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Plan for collapsing the per-stage shell-script + per-peer env-var
matrix into a single cluster.yaml + a small `cluster` CLI that drives
phala deploy.

Headlines:
  - cluster.yaml is the single source of truth: peers, protocol port
    plan, intentions, secrets policy, deploy params.
  - One CLI: validate / plan / up / down / status / logs.
  - Control plane is an "embedded" mode where one dstack CVM bundles
    coturn + signaling + Consul server, removing the external Vultr
    box; requires Phala admin to enable UDP ingress on that CVM.
    Falls back to "external" mode (separate non-TEE coordinator host)
    when UDP ingress isn't available.
  - Mesh-conn / webdemo / sidecar code stays unchanged; the change is
    entirely in deploy ergonomics.
  - TEE-app constraint is respected: one compose template per role,
    only env vars vary per peer; compose-hash audit surface is small.

Future direction noted but not in this stage: derive GOSSIP_KEY /
TURN_SHARED_SECRET inside each TEE via dstack-sdk getKey() so the
deploy host never sees them. Requires AppAuth-shared app-id across
peers; reuses stage-2 attestation work.

Open questions for the user listed at the end of the doc (CLI
language, secret handling, control-plane HA, redeploy semantics).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three live on the redeployed stage-3b cluster:
- mesh-conn validation logs an identical PEERS_JSON digest
  (NiNhinoUekif) on every peer, confirming cross-peer config
  consistency.
- Consul logs Gossip=true → serf-LAN encrypted with the shared
  gossip key.
- Connect mTLS /all still perfectly balanced 2/2/2/2 across the four
  webdemo instances; cluster operation unchanged by the fixes.

#1 (reconnect bug) is verified by code review + the new
validate_test.go test suite; live failure-injection (kill mesh-conn
mid-run) is queued for the stage-4 CI work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, in-place updates

Reshapes stage-4 around four user decisions:

  1. No new CLI — use Phala's official
     terraform-provider-phala (resource phala_app supports
     replicas, env, encrypted_env, custom_app_id+nonce, in-place
     update). One cluster.tf is the source of truth.

  2. Secrets never in human hands. A small bootstrap-secrets init
     container per peer mounts /var/run/dstack.sock, derives
     gossip key / TURN secret / Connect-CA seed via getKey(),
     writes them to a tmpfs volume, exits. consul + mesh-conn
     read those files at startup. All peers share the same
     app_id (via custom_app_id + cluster_nonce) so getKey()
     returns the same bytes on every peer.

  3. Multi-server Consul stays the next stage but unlocks
     self-discovering rendezvous: each control CVM registers as
     service "mesh-coordinator" and "mesh-turn" in Consul; new
     peers know ONE bootstrap endpoint and learn the rest from
     the catalog. Topology of the rendezvous becomes a
     service-mesh-managed concern.

  4. In-place updates preserve disk volumes (Consul Raft state,
     KV, sidecar certs, future Patroni WAL). Compose/env diffs
     update existing CVMs without recreate; only
     custom_app_id/nonce changes rotate identity. Per-node
     rollout for the control plane via terraform -target.

Includes a full HCL skeleton, the bootstrap-secrets sketch, and
maps each ROBUSTNESS.md punch-list item to stage 4 vs the next
stage.

Open item: confirm phala_app behaviour (replicas, encrypted_env,
in-place env update, custom_app_id) on the 0.2.0-beta.1 provider
before committing the dev-experience to it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… provider

Empirical verification before committing the dev-experience to
terraform. Spun up a tdx.small nginx via terraform apply, exercised
in-place updates, replicas, then destroyed.

Outcomes (stage4-experiments/tf-shakedown/RESULTS.md):
  - create works (~2 min for tdx.small)
  - in-place compose+env update preserves app_id and primary_cvm_id
    (~3m39s for the upgrade flow); disk volumes survive
  - replicas: 1 -> 2 plans in-place; both CVMs land under the same
    app_id, which is exactly what TEE-derived secrets via getKey()
    need across replicas (no out-of-band coordination required)
  - destroy clean (~23s)

Three gotchas baked into RESULTS.md:
  - storage_fs MUST be pinned in HCL ("zfs"); otherwise the next
    apply diffs "zfs -> (known after apply)" which the provider
    treats as ForceNew → destroys the CVM. Without pin, every diff
    becomes a recreate.
  - provider is at 0.2.0-beta.2; Terraform's >= constraint excludes
    pre-release by default — pin exactly.
  - field-name shape is positive (listed/public_logs/public_sysinfo),
    not the CLI's --no-... shape.
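In HCL, the two pins look roughly like this (a sketch; the resource and
any attributes beyond storage_fs are illustrative, not the actual
cluster.tf):

```hcl
terraform {
  required_providers {
    phala = {
      source = "phala-network/phala"
      # Exact pin: Terraform's ">=" constraints skip pre-releases.
      version = "0.2.0-beta.2"
    }
  }
}

resource "phala_app" "node" {
  # Pin storage_fs explicitly; leaving it computed diffs to
  # "(known after apply)", which the provider treats as ForceNew
  # and recreates the CVM on every apply.
  storage_fs = "zfs"
  # ... remaining arguments elided ...
}
```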

Verdict: provider is good enough for stage 4. Open follow-ups
listed in RESULTS.md (encrypted_env behaviour, custom_app_id +
nonce determinism, failure-mode handling, AppAuth-shared-id
pattern via on-chain KMS).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two closeout pieces for the architecture-testing phase before
implementing stage 4 itself.

STAGE4_PLAN.md revision 2:
  - Drops the per-peer phala_app pattern from rev-1 in favour of
    "one phala_app per role with replicas: N", matching dstack's
    native app->instance grain.
  - Each instance reads its identity from a UUID file written to
    its persisted disk on first boot. No PEER_ID env var; no
    PEERS_JSON env var.
  - Peer discovery via Consul: each instance registers itself with
    role + ordinal + identity-port set as service tags. Adding a
    peer is a `replicas` bump.
  - bootstrap-secrets init container is the keystone: derives all
    cluster-wide secrets (gossip, TURN, Connect-CA seed) via
    getKey() AND manages per-instance UUID + ordinal claim via
    Consul KV CAS.
  - Rolling updates without per-instance Terraform resources: a
    rollout.sh that calls workload-aware drain verbs (consul
    operator raft transfer-leader, etc.) gates each replica.
    Once phala-cloud#243 lands `update_policy`, most of this
    collapses into HCL.
  - Updated migration notes: stages 0-3b stay frozen as historical
    reference; stage 4 is the integrated product.

stage4-experiments/disk-persistence/ — empirical verification of
THE keystone assumption: docker named volumes survive in-place
phala_app compose updates.

Test: deploy a CVM, write UUID 90ce33e5... to a named volume, bump
a tfvar that flips the compose body, terraform apply. After ~3 min
in-place update (same app_id, same primary_cvm_id), curl the
volume-served file -> identical UUID. Disk persisted. ✅

Caveats noted in RESULTS.md: didn't test under replica scaling
or image bumps. Will exercise both inline during stage-4 build.

Live state cleanup: stage3b cluster (4 CVMs at $0.058/hr each)
torn down. coturn + signaling on 155.138.146.255 still up
(dirt cheap, useful as TURN fallback for any future test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng the actual repo

Verified the Go SDK at github.com/Dstack-TEE/dstack/sdk/go/dstack —
two corrections to the previous draft:

1. Per-instance identity comes from client.Info(ctx).InstanceID
   directly. The plan's "write UUID to /var/lib/dstack/instance-id
   on first boot, read it back on subsequent boots" was redundant —
   dstack already exposes a stable per-CVM ID through the SDK,
   rooted in the platform rather than a file we wrote. Drop the
   on-disk UUID dance.

2. GetKey signature is (path, purpose, algorithm) returning a
   hex-encoded secp256k1 (or other) key, decoded via .DecodeKey()
   to 32 bytes. Pseudo-call shape gossipKey = GetKey("...:gossip")
   was wrong; real shape is
     seed, _ := client.GetKey(ctx, "dstack-mesh/gossip", "cluster", "secp256k1")
     gossipBytes, _ := seed.DecodeKey()
   The 32-byte output is fine to use as the gossip key directly,
   or to HKDF for multiple sub-keys.

bootstrap-secrets simplifies as a result: no on-disk UUID
write/read logic, just GetKey() + Info() into tmpfs. ~80 LoC.

Bonus finding: same SDK exposes Sign(), Verify(), GetQuote() — so
the deferred "attestation-gated mesh join" work (originally Stage 2
in the plan) now fits cleanly into stage 4 with no new tooling.
Each peer signs its mesh-conn auth message, the coordinator
verifies before letting it onto the overlay. Noted as a bonus
add-on, not a stage-4 requirement.

Open items list updated: disk-persistence ✅, SDK existence ✅;
container ordering + Consul CAS-vs-hash for ordinal claim still
TBD inline during build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The keystone of the stage-4 design: the only piece that holds
plaintext cluster secrets, and it does so entirely inside the TEE.

What it does (one-shot, ~250 LoC):
  1. Connects to /var/run/dstack.sock via the official Go SDK
     (github.com/Dstack-TEE/dstack/sdk/go/dstack).
  2. client.Info(ctx) -> self identity (AppID, InstanceID,
     ComposeHash). Per-CVM identity comes from the SDK directly;
     no on-disk UUID write/read.
  3. client.GetKey(ctx, path, purpose, "secp256k1") for each of:
       - dstack-mesh/gossip
       - dstack-mesh/turn
       - dstack-mesh/connect-ca
     Same path/purpose/algorithm tuple yields the same 32 bytes on
     every replica that shares an app_id (which all replicas of
     one phala_app do). No secret material ever transits the
     deploy host.
  4. Workers claim a stable ordinal (0..N-1) via Consul KV CAS on
     `cluster/<name>/slots/<i>`. InstanceID is the slot's permanent
     owner so restarts re-find their own slot. Coordinator skips
     this — it's always ordinal 0 (chicken-and-egg: it IS Consul).
  5. Computes per-protocol ports from PROTOCOL_BASES env +
     ordinal.
  6. Writes secrets (hex-encoded, mode 0400) to /run/secrets/* on
     a tmpfs volume. Writes /run/instance/info.json with identity
     + ports for sibling services to read.
  7. Exits cleanly so docker-compose `depends_on` with
     `condition: service_completed_successfully` releases consul,
     mesh-conn, sidecar, etc.

Required env:
  CLUSTER_NAME, ROLE, PROTOCOL_BASES (JSON).
  Workers also need CONSUL_HTTP_ADDR (the local agent).

Compile chain:
  - Go module pinned to dstack/sdk/go @5cfd7db (2026-03-19; latest
    commit on master at the time of writing).
  - SDK requires Go >= 1.24; the local toolchain auto-upgrades via
    GOTOOLCHAIN=auto.
  - Multi-stage Dockerfile produces a ~11MB static binary on
    alpine.

Note on stale slots: when an instance is permanently retired (vs
restarted), its slot's KV entry stays. Cleanup is an operator
task today; production version would key the KV entry with a
Consul Session that has a TTL so stale slots auto-clear. Flagged
in code comments.


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The whole cluster is now defined in one HCL file and brought up with
one `terraform apply`.

What's new under stage4/:
  - mesh-conn/: clone of stage1 with two small additions —
    self identity loaded from /run/instance/info.json (written by
    bootstrap-secrets from the dstack SDK's Info()), and TURN secret
    loaded from /run/secrets/turn (also bootstrap-secrets-derived).
    PEERS_JSON still env-passed; cluster.tf computes it from the
    `replicas` count so adding a peer is a `replicas` bump.
  - compose/coordinator.yaml + compose/worker.yaml: frozen
    templates that wire bootstrap-secrets + mesh-conn + consul +
    {coturn,signaling} (coord) or {webdemo,sidecar} (worker), all on
    network_mode: host, with a tmpfs volume for /run/secrets and
    /run/instance so derived state never touches the persistent disk.
  - cluster-example/cluster.tf: the user-facing surface. Two
    phala_app resources (coordinator replicas:1, worker
    replicas:N), shared protocol_bases, computed peers_json. Adding
    a peer = `worker_replicas` bump + apply.
  - cluster-example/rollout.sh: workload-aware rolling update
    driver. Snapshots Consul, applies one app at a time via
    -target, waits for cluster green between steps. Stands in for
    the update_policy block we'd use once phala-cloud#243 lands.
  - stage4/README.md: how a deploy works, how to add a peer,
    how to update images, what was deferred.

Boot sequence end-to-end:
  1. terraform apply provisions both phala_apps; CVMs come up.
  2. bootstrap-secrets (init container) calls dstack SDK
     Info()+GetKey(), writes /run/secrets/{gossip,turn,ca-seed} +
     /run/instance/info.json (identity + ordinal + ports), exits.
  3. consul + mesh-conn + sidecar + workload start in dependency
     order via `depends_on: { bootstrap-secrets: { condition:
     service_completed_successfully } }`. They read their config
     from the tmpfs files written in step 2.
  4. mesh-conn opens ICE+yamux per peer-pair; consul forms its
     cluster through the overlay; Connect mTLS works between
     workers via Envoy sidecars.

Three properties this delivers vs the per-stage scripts we had
before:
  - Single source of truth (cluster.tf), no per-peer env-var matrix
    duplicated across deploys.
  - Secrets never seen by the deploy host — bootstrap-secrets is the
    only piece that holds plaintext keys, and it does so entirely
    inside the TEE.
  - Disk volumes preserved across in-place updates (verified in
    stage4-experiments/disk-persistence/RESULTS.md), so Consul Raft
    state, KV, and any future Patroni WAL survive image bumps and
    config changes.

Carry-overs to next iteration: stale-slot cleanup needs Consul
Sessions with TTL (not unconditional CAS-claim); multi-server Consul
HA is a one-line `replicas: 3` change but pulls that question
forward. README spells out what's deferred and why.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iterating toward the first green stage-4 deploy. Six fixes that
came out of the smoke test:

1. Worker compose was missing WORKER_ORDINAL in bootstrap-secrets's
   environment block. cluster.tf passed it via --env to the CVM,
   and bootstrap-secrets's Go code read it from env, but compose
   never plumbed it into the container. Result: bootstrap-secrets
   fell into the Consul-CAS-claim path, found no Consul (it's on
   the unreachable coordinator), exited 1, and dstack tore the
   whole CVM down with `service "bootstrap-secrets" didn't
   complete successfully: exit 1`. One missing line, ~3 hours of
   serial-log archaeology to find.

2. Workers now declared as N separate phala_app resources via
   for_each (not one app with replicas:N). Each gets its own
   WORKER_ORDINAL env so bootstrap-secrets can compute the ports
   without Consul-side coordination. The replicas-N path requires
   per-instance env which phala_app doesn't expose today (filed as
   phala-cloud#243).

3. bootstrap-secrets now picks an ordinal source explicitly:
     a. WORKER_ORDINAL env (preferred when present)
     b. ROLE=coordinator → ordinal 0
     c. Consul KV CAS (fallback for the eventual replicas:N path)
   This breaks the chicken-and-egg between bootstrap-secrets and
   Consul that the worker hit.

4. Gossip/turn/ca-seed each emitted in a format the consumer can
   actually use: gossip is base64 (consul -encrypt), turn is hex
   (coturn --static-auth-secret), ca-seed is hex (HKDF-friendly
   bytes). Previously everything was hex which made consul reject
   the gossip key.

5. Compose templates now use bind-mounts to /tmp/dstack-runtime
   instead of named docker volumes — initially debugged thinking
   named volumes didn't share on dstack (filed phala-cloud#245,
   then closed as user error after retesting cleanly). Bind
   mounts work fine, and a code comment notes that secrets are
   re-derived from getKey() each boot anyway, so /tmp ephemerality
   is fine.

6. Added compose/worker-debug.yaml — minimal worker (just
   bootstrap-secrets + a no-depends sleeper) for diagnosing
   future boot-sequence regressions in isolation.

Coordinator still needs Phala admin to enable UDP ingress on its
app to make embedded mode (coturn + signaling on the same CVM)
fully functional. Next iteration: fall back to external
coordinator (the existing Vultr coturn+signaling) so we can land
end-to-end smoke without that gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end smoke now passes — `/all` on each worker fans out across all
3 webdemo instances via Consul Connect mTLS over the mesh-conn UDP path.

Three independent issues surfaced during the smoke and are fixed:

1. mesh-conn ICE wedge after first failure
   pion/ice's `agent.Dial`/`Accept` blocks indefinitely once ICE
   transitions to Failed, so the outer `runPeerLink` retry loop never
   fires and the peer slot stays dead until the container is bounced.
   Cancel the dial context from the state callback (Failed/Closed) and
   add a 60s belt-and-suspenders timeout. Tighten the auth wait from
   10 min to 60s for the same reason — the long timeout was the only
   reason a retry was even *theoretically* possible, and it left a
   10-minute window where the slot looked silent. Also call
   `agent.Close()` on every error path so a stuck attempt doesn't
   hold pion goroutines.

2. webdemo + sidecar entrypoints needed jq
   Both compose entrypoints parse `/run/instance/info.json` with jq;
   the alpine/envoy base images don't ship it. Add jq to both
   Dockerfiles. Fast-failing the workload service is what kept
   webdemo + sidecar in a restart loop on every smoke until now.

3. Coordinator-internal coturn/signaling were unreachable from workers
   dstack-gateway is TCP-only and doesn't surface arbitrary CVM ports,
   so `SIGNALING_URL=http://<coord-app-id>...:7000` and TURN against
   the coordinator's own coturn never worked. Switch both coordinator
   and worker mesh-conn to the external (Vultr) signaling+coturn that
   workers were already using; the coordinator's embedded copies still
   run but are unused. Wire the new paths through cluster.tf as
   `external_*` variables. Drop `-encrypt` from the consul launch —
   we'd already removed gossip encryption to unstick the cluster, and
   the now-unused TURN_SHARED_SECRET-from-/run/secrets path is
   replaced by env-first resolution in mesh-conn.
Add two follow-up notes to stage4/README.md based on what the smoke
turned up: shared TEE-derived secrets across separate phala_apps need
a shared AppAuth contract (gating for stage 2 attestation-gated join),
and mesh-conn's ICE recovery is now in-process but the signaling
broker should also age out stale auth/candidate entries. Cross-link
the new terraform-provider-phala#6 env-drift issue.
The provider issue (#6) was a downstream symptom; root cause likely
lives in the API surface, so the bug was moved to phala-cloud#246.
Adds the original-goal Patroni service to the worker compose. Each
worker runs:

  - patroni 4.0 + PostgreSQL 16 (single image, ~250MB)
  - entrypoint.sh renders /etc/patroni.yml from /run/instance/info.json
    (ordinal, postgres + patroni_rest ports) plus CLUSTER_NAME
  - data dir lives in the named docker volume `patroni-pgdata` so
    it survives container restarts (CVM reboots wipe it; persistence
    across reboots is a future stage4-experiments topic)

Cluster wiring:

  - cluster.tf grows two new protocol slots: postgres=18700 and
    patroni_rest=18800. Adds `var.patroni_image` + threads
    PATRONI_IMAGE through worker env.
  - bootstrap-secrets derives two more cluster-wide secrets via
    getKey() — patroni-superuser and patroni-replication. They're
    identical on every replica because all peers derive against the
    same path + ClusterName, so any peer can bootstrap as leader
    without out-of-band secret distribution.
  - All Patroni instances point at 127.0.0.1:<own_consul_http>; cross-
    peer replication uses 127.0.0.1:<peer_postgres_port>, which the
    mesh-conn UDP forwarder maps to the right CVM transparently.

Patroni's own leader election runs through Consul KV — no separate
DCS needed. With three workers we get fault tolerance of one (1
leader + 2 replicas).
…, not the first

After a peer bounce, multiple auths from that peer can reach pollLoop
in a single batch. The original `select case authCh <- ... default`
kept the FIRST auth and silently dropped every later one. dialICE
then consumed the stale auth, called `agent.Dial` against the wrong
ufrag/pwd, and ICE Failed.

The earlier ICE-state cancel fix correctly aborts and retries — but
on retry pollLoop has no fresh auth in the queue (already drained),
so dialICE waits 60s and retries again, while the *peer* in turn
publishes a NEW auth that pollLoop also drops because the channel is
still buffered with the original stale auth. Both sides repeat
forever and the link never re-establishes.

Drain-then-push so the channel always holds the most-recent auth.
The channel is buffered to 1 and only one goroutine writes (pollLoop),
so there is no contention and the drain is safe.
Coordinator goes from a single phala_app with replicas:1 to a
for_each over `var.coordinator_replicas` (default 3), giving an actual
Raft-replicated 3-server Consul cluster instead of bootstrap-expect=1.

Per-instance ordinal is passed in via env (`COORDINATOR_ORDINAL`),
mirroring the worker pattern, since bootstrap-secrets needs to know
its own ordinal before Consul KV is reachable (we can't ask Consul
KV for the ordinal because Consul is *on* the coordinators we're
trying to bootstrap). The KV-CAS claim path stays as a fallback for
the eventual replicas:N future once phala-cloud#243 lands.

Worker ordinals shift by `coordinator_replicas` so the peer ID space
stays contiguous (coordinators 0..C-1, workers C..C+W-1). Workers
retry-join *every* coordinator's serf port (mesh-conn forwards each
one), and pick any coordinator's HTTP port for KV calls.

Coordinator's consul launches with `-server -bootstrap-expect=N` and
loops over COORDINATOR_SERF_PORTS to retry-join its server peers
(skipping its own).

What this gets us: fault tolerance of 1 (3-server quorum) with the
Consul UI/API still served from any coordinator. Patroni's DCS now
sits on top of a real HA Consul, not a single point of failure.
…es on new auth

The mailbox previously kept appending forever — and because mesh-conn
republishes auth+candidates on every dialICE retry, a recipient would
drain a long backlog where the FIRST auth was the oldest. After my
recent mesh-conn pollLoop fix that backlog became less catastrophic
(the latest auth wins in the buffered channel), but the candidates
in between are still added to the new ICE agent. pion then dials
against addresses whose UDP sockets are gone, ICE Fails, and the
loop repeats forever for a peer that bounced.

Drop all stale messages from a sender when a NEW auth from that
sender lands in the recipient's queue. Auth marks the start of a
fresh epoch — mesh-conn always publishes auth BEFORE its candidates
(candidates come from OnCandidate AFTER GatherCandidates, which
happens after the auth publish), so anything in queue from before
this auth is by definition stale.

This is the signaling-broker mate of the mesh-conn drain-then-push
fix from 4c36c76 — the broker now actively reaps the backlog instead
of relying on the consumer to do it correctly.

Note: the same mailbox impl is used by the stage4 signaling image
(which is built from this phase0 source). Deploying this requires
rebuilding + pushing the signaling image and restarting it on the
Vultr coordinator host.
Concurrent phala_app creates against the same workspace return
400 'parameters not compatible'. Workaround: terraform apply
-parallelism=1. Track upstream fix for the misleading error code.
h4x3rotab and others added 17 commits May 2, 2026 23:59
mesh-conn computes its self_id as `role-ordinal` from
/run/instance/info.json, then looks for that ID in PEERS_JSON. The
multi-coord change shifted worker ordinals to start at C
(coordinator_replicas), but the peer-list IDs were still using slot
(`worker-1`, `worker-2`, `worker-3`) — so e.g. worker-1's mesh-conn
saw self_id="worker-3" but PEERS_JSON only had "worker-1", and
exited with `PEER_ID "worker-3" not in PEERS_JSON`.

Use ordinal in the peer ID. The phala_app name still uses the
1-based slot for human-friendly CVM names ("stage4-worker-1"), but
the peer-id and the in-CVM identifier are now consistent.
worker↔worker instability under load

Adds MESH_CONN_RELAY_ONLY env (default off) that restricts pion's
ICE candidate gathering to Relay only — useful as an escape hatch
when direct (host/srflx/prflx) candidates establish but flap.

Tested on the live stage4 cluster: relay-only made things WORSE for
this dstack worker NAT pattern (pion's relay-relay pair selection
isn't reliable, observable as TURN allocation churn on coturn). Left
the flag in as a debug switch but documented it as not-the-fix in
README.

The actual symptom — `srflx <-> prflx` link goes Connected, yamux
throws `accept: short buffer` 5–60s later, pg_basebackup keeps
failing — is captured in the new "Known limitation" section with a
concrete next-steps list (instrumentation, MaxStreamWindowSize cap,
QUIC, WireGuard).
The instrumentation pass added byte counters per-link, yamux's own
log output (was io.Discard), full ICE selected-pair addresses (not
just types), and a 10s telemetry tick. That trace pinpointed two
bugs that were previously silent:

1. ice.Conn.Read returned io.ErrShortBuffer because pion is
   packet-oriented — when the caller's buffer is smaller than the
   next UDP datagram, pion truncates. yamux's 4096-byte bufio.Reader
   was too small for TURN-encapsulated datagrams. Fixed by a
   65535-byte packetizing adapter (countingConn) that always reads
   full datagrams and re-serves them to yamux as a stream.
2. My own attempted 5s yamux keepalive killed the link under load
   when a pg_basebackup burst delayed a keepalive past the timeout.
   Reverted to 30s/10s defaults.

Adds two debug env switches that didn't pan out for our specific
NAT environment but are kept as escape hatches:
- MESH_CONN_RELAY_ONLY=1: only Relay candidates. Made things worse
  on dstack (relay-relay pair selection unreliable).
- MESH_CONN_TCP_ONLY=1: TCP NetworkTypes + filter URLs to Proto=TCP.
  pion still picks `relay (proto=udp)` because relay transport is
  the *relayed* leg, always UDP unless RFC 6062 TCP allocation is
  requested (pion's TURN client doesn't).

End state for stage 4: Consul (3-server Raft + 6 members) and
Patroni leader election are solid. Patroni replication still
requires sustained worker↔worker bulk transfer, which hits the
yamux-on-lossy-UDP wall documented in the README "Known limitation"
section. Real fix needs a different transport (QUIC, WireGuard, or
TCP-relay end-to-end).
Captures the live cluster's app IDs, SSH command pattern, terraform.tfvars
image tags, the 60-second reproducer for the open worker↔worker mesh-conn
drop, what was already tried (so the next session doesn't re-walk the
same paths), and open hypotheses to investigate with fresh eyes —
deliberately without committing to a fix direction.
…working tree

Working-tree mesh-conn/main.go has been swapped from yamux to
quic-go on top of the same pion/ice packet conn, plus a sibling
stage4/quic-on-ice/ experimental module. Neither is committed and
the live cluster still runs the previous yamux image. RESUME now
flags the discrepancy so tomorrow's session sees it on first read.
yamux assumes a reliable byte-stream underlay, but pion/ice.Conn is
UDP and the path between dstack worker CVMs is extremely lossy
(~99% direction-asymmetric loss when same-NAT hairpinning, ~78%
on the coturn-relay path). The "keepalive timeout" / "recv window
exceeded" errors we kept seeing were yamux's reliability invariants
firing on dropped packets, not yamux bugs.

Replace yamux with quic-go on the same pion/ice.Conn (wrapped as a
net.PacketConn). QUIC has built-in loss recovery + stream
multiplexing, so a lossy UDP underlay is exactly what it expects.
TLS uses a self-signed cert because mesh peer trust is established
out-of-band by the dstack TEE layer + TURN HMAC. The 3-byte
(tag, port) stream header convention is unchanged; runAcceptLoop
and the TCP/UDP pumps are line-for-line near-equivalents on
*quic.Stream.

Same hairpin path that killed yamux at 3 KB now sustains 25-28 MB/s
for pg_basebackup. Both replicas (worker-4, worker-5) bootstrap
and stream cleanly from leader worker-3.

Also drops the old packetizing read-buffer in countingConn (no
longer needed — quic-go reads through the PacketConn shim, which
preserves datagram boundaries) and introduces a sibling smoke-test
module stage4/quic-on-ice/ that proves QUIC over pion/ice.Conn end
to end (10 MB worker↔worker hairpin in ~1s).

RESUME.md rewritten as a "done" note with the QUIC story and
verification recipes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Soft-kill leader-failover walkthrough verified end-to-end on the live
cluster: Patroni elects via Consul KV, worker-4 promotes, writes resume,
worker-3 rejoins as a streaming replica without pg_basebackup. Measured
RTO ~24s (kill → first successful write on new leader), well within
Patroni's default ttl=30s.

Captures the reproducible recipe, a measured timeline, knobs for the
RTO/availability tradeoff, and what's still untested (hard CVM kill,
network partition, disk-loss rejoin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed RTO

Extends FAILOVER.md with the whole-userspace failure scenario: kill all
containers on the leader CVM simultaneously, then bring them back via
`docker compose up -d`. Measured RTO ~33s (9s longer than soft-kill due
to Consul gossip-failure detection on top of Patroni's TTL). Also
confirms best-replica selection under uneven replica lag, QUIC mesh-conn
ICE redial after a peer's userspace evaporates, and cheap rejoin via
local WAL replay (no pg_basebackup).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the MESH_CONN_TCP_ONLY env knob entirely (from dialICE, both
compose templates, and reportLinkStats's tick cadence). The flag was
investigated as a yamux-era escape hatch and proven non-helpful — pion
still selects relay-UDP candidates regardless because the relay
candidate's transport comes from the TURN allocation's relayed leg
(always UDP unless RFC 6062 TCP-allocation requested), not from the
client→TURN leg. With the QUIC switch, the underlying loss is handled
by the transport layer, so the knob has no remaining purpose and was
becoming misleading.

Also quiets reportLinkStats: tick 10s → 60s and skip the log line
entirely when bytes haven't moved since the last tick. Idle peer pairs
no longer spam every 10 seconds. Final-stats line on stop is unchanged
so postmortems still get a summary regardless of activity.

Drops the unused *quic.Conn parameter from reportLinkStats, refreshes
the stale "log every 10s" banner, and tightens the MESH_CONN_RELAY_ONLY
comment in worker.yaml so the rationale ("flip on if worker-to-worker
direct pairs fail") doesn't contradict itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…venance

Adds .github/workflows/consul-postgres-ha-publish.yml — a matrix build
that builds and pushes the six stage-4 images (mesh-conn,
bootstrap-secrets, signaling, webdemo, sidecar, patroni) to
ghcr.io/dstack-tee/dstack-examples/consul-postgres-ha-* on push to
main, tagged with both the long-form commit SHA and `latest`. PRs build
to verify but do not push.

Each push is signed with a Sigstore-backed GitHub Build Provenance
attestation via actions/attest-build-provenance@v2 — the workflow's
GitHub OIDC token gets a short-lived Sigstore cert, no keys we manage.
Consumers verify with `gh attestation verify oci://...@<digest>
--repo Dstack-TEE/dstack-examples`, which proves the image came from
this commit of this workflow.

Replaces ttl.sh references in terraform.tfvars.example with the GHCR
ones, fills in the previously-missing patroni_image and
coordinator_replicas lines, and adds inline docs on pinning to a
sha-tag for prod stability and on running the verification command.

PUBLISHING.md walks through the three paths a stage-4 user actually
hits: the CI publish (steady state), manual one-off ttl.sh / personal-
GHCR builds for dev iteration, and the on-CVM hot-patch flow that
sidesteps phala-cloud#246 when iterating on a running cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…patch

terraform-provider-phala#8 fixed the env-block in-place-update bug
(phala-cloud#246) and shipped as v0.2.0-beta.3, so:

- cluster.tf required_providers now pins ">= 0.2.0-beta.3" with a
  comment explaining why earlier versions are unusable for this stack.
- PUBLISHING.md's hot-patch section reframes its motivation: the
  per-CVM hot-patch path remains useful as a dev shortcut and as the
  only option on clusters still running 0.2.0-beta.2, but it is no
  longer the workaround for env updates not landing — terraform apply
  works correctly now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rejoin

Two small follow-ups after verifying the v0.2.0-beta.3 env-update path
against the live cluster:

1. Provider pin in cluster.tf changed from `>= 0.2.0-beta.3` to
   `0.2.0-beta.3` exactly. Terraform's `>=` operator does NOT include
   later prerelease versions, so `>= 0.2.0-beta.3` only matches stable
   `>= 0.2.0` — `terraform init` failed with "no available releases
   match the given constraints". Pin exactly until we hit a stable.

2. FAILOVER.md gains a disk-loss rejoin section: stop patroni, wipe
   the patroni-pgdata volume, restart, watch Patroni's bootstrap path
   pull a full pg_basebackup from the leader over mesh-conn's QUIC
   tunnel. Measured 5.2 MB / 7s end-to-end on the live cluster
   (handshake-dominated for a small dataset; the real throughput
   number remains the ~25 MB/s pg_basebackup observed during the
   soft-kill section). Closes the last "What this demo does NOT
   cover" item.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lines

Discovered while verifying terraform's beta.3 env-update path against
the live cluster on 2026-05-04: when terraform recreates a CVM, the
peers' QUIC links to it die, but the *redial* path can hang.

Specifically: dialICE returns a Connected ice.Conn, dialAndPump enters
quic.Dial, ICE later goes Failed (peer went away again, hairpin lost,
etc.). quic.Dial's context times out and quic-go calls
SetReadDeadline(past) to interrupt the blocked ReadFrom in our
iceConnPacketConn shim. The shim was returning nil from
SetReadDeadline, so the call had no effect on the underlying
ice.Conn.Read, and the goroutine hung forever. The surrounding
runPeerLink retry loop never got to retry, leaving the peer slot
permanently dead until the entire mesh-conn process was restarted.

Fix: delegate SetDeadline / SetReadDeadline / SetWriteDeadline to the
underlying conn (pion/ice.Conn implements net.Conn deadlines properly).
Same fix applied to the stage4/quic-on-ice smoke test so future
debugging stays trustworthy.

Adds a regression test using net.Pipe (which honors deadlines) that
asserts ReadFrom returns a Timeout-flagged net.Error within ~50ms of
SetReadDeadline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The directory was an engineering log — phase0/, stage1/, stage2/,
stage3a/, stage3b/, stage4/, stage4-experiments/ — useful while
building, useless to a user landing here cold who just wants to
deploy HA Postgres on dstack-TEE.

Promote stage4/ contents up one level to consul-postgres-ha/ as the
canonical, opinionated shape. Rename phase0/icetest → signaling/.
Move stage3b/{webdemo,sidecar} up. Drop the predecessor stage*/ +
phase0/ + stage4-experiments/ + deploy/ (historical results) +
STAGE4_PLAN.md. Git history preserves everything.

Final layout:

  consul-postgres-ha/
  ├── README.md / ARCHITECTURE.md / FAILOVER.md / PUBLISHING.md / ROBUSTNESS.md
  ├── cluster-example/   one cluster.tf
  ├── compose/           coordinator.yaml + worker.yaml templates
  ├── coordinator/       external-coordinator docker-compose
  ├── mesh-conn/         QUIC-over-pion/ICE overlay
  ├── bootstrap-secrets/ TEE-derives per-CVM secrets
  ├── patroni/           Patroni + Postgres
  ├── webdemo/ sidecar/  example workload + Envoy bootstrapper
  ├── signaling/         HTTP /publish + /poll broker for ICE rendezvous
  └── quic-on-ice/       standalone smoke test for the QUIC-over-ICE transport

Updates beyond the moves:
- README.md rewritten as a deploy-first story; the old stage-4-internal
  README's "Known limitation" section and punch-list (yamux + worker-pair
  instability) are obsolete since the QUIC swap and aren't preserved.
- ARCHITECTURE.md: 4-CVM topology (ctrl+w1/w2/w3) → 6-CVM (3+3),
  yamux deep-dive section replaced with a tight QUIC summary that
  matches the actual code.
- ROBUSTNESS.md: yamux → QUIC mentions, "single Consul server SPOF"
  section updated to reflect the 3-server quorum that's been live
  since `17f4642`, "real registry" recommended-fix moved to "already
  shipped" since GHCR + Sigstore is now the publish path.
- All Go module paths bumped: github.com/Dstack-TEE/dstack-examples
  /consul-postgres-ha/<name> (no stage4/ or phase0/ infix).
- CI workflow path filters + matrix `context:` paths updated.
- .gitignore rewritten to match the new layout.
- Builds + tests pass on all 5 Go modules under the new paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion admission

Two open architectural gaps surfaced during the consolidation pass on
PR #95 — both deserve their own focused implementation passes rather
than being squeezed into the mega PR. Capturing them as design docs
so a future agent (or a future-me) can pick up either one cold and
start.

- design/single-sidecar.md: collapse the 5 platform-plumbing
  containers (keepalive, bootstrap-secrets, mesh-conn, consul,
  sidecar/Envoy) into one image with a shell-init multi-process
  supervisor. Per-CVM container count goes 8 → 3.

- design/attestation-admission.md: replace the TURN-HMAC-only mesh
  admission with dstack TEE attestation as the credential. Phased
  plan: per-app-id check first (Phase 1, smallest delta, no
  rolling-upgrade pain), Consul-KV-rooted policy doc later (Phase 2).
  Recommends the post-QUIC-handshake-stream insertion point over
  the public signaling broker for privacy.

Both docs include current state, approach, risks, open questions,
and explicit hand-off instructions. Each is ~250-350 lines, written
to be self-contained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…decar image

Per-CVM container count drops from 7 → 3 on workers (sidecar + patroni
+ webdemo) and from 6 → 1 on coordinators (sidecar). The new sidecar
image bundles bootstrap-secrets, mesh-conn, consul, and (workers only)
envoy behind a tini-wrapped shell init that dispatches on ROLE; the
old keepalive placeholder, the four-image lockstep, and the vestigial
on-CVM signaling/coturn that had been documented as unused all drop.

CI matrix: 6 → 4 (sidecar, patroni, webdemo, signaling). The sidecar
build uses the parent consul-postgres-ha/ as docker context so its
multi-stage Dockerfile can pull bootstrap-secrets/ and mesh-conn/ Go
sources from sibling subdirs.

cluster.tf: BOOTSTRAP_SECRETS_IMAGE, MESH_CONN_IMAGE, SIGNALING_IMAGE
(coordinator) and the matching tfvars all collapse into SIDECAR_IMAGE.

Smoke-tested against a fresh terraform apply on dstack-pha-prod5
(2026-05-04). Soft-kill RTO 27s, hard-kill RTO 33s, cheap rejoin
verified, disk-loss rejoin 26s — all within noise of the pre-Gap-2
baselines on the previous multi-container cluster.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
refactor(consul-postgres-ha): collapse platform plumbing to single mesh-sidecar