fix(gke): run cloudsql-proxy as a native sidecar to end the startup race #347
Cre-eD wants to merge 3 commits into
The runtime CloudSQL proxy was injected as a plain sidecar container, so it started in parallel with the app container with no ordering guarantee. On every pod (re)start -- node scale-down, VPA eviction, rollout -- the app dialed localhost:5432 before the proxy was listening and logged connection-refused; app containers carrying a startup probe took an extra restart to recover. The probe machinery only ever gated the ingress/app container, never the proxy, and port-less worker containers got no gate at all.

Run the runtime proxy instead as a native sidecar: an init container with RestartPolicy: Always, gated by a startup probe against the proxy's built-in --health-check server. Kubernetes will not start the app containers until the proxy is listening, so the race is gone for web and worker tiers alike.

Scoped to the long-lived Deployment proxy (timeout == 0). The init-Job proxy (timeout > 0) stays an ordinary terminating container on purpose -- RestartPolicy: Always there would keep a RestartPolicy: Never Job from ever completing.

Requires GKE >= 1.29 (SidecarContainers, GA in 1.33); prod is 1.34.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
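The timeout split described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in cloudsql_proxy.go: the proxyContainer type, the buildProxy function, the /cloud-sql-proxy binary path, and the exact shell wrapper are all assumptions made for the example. Only the flags, the 9090 health port, the /startup path, and the cloudsql-proxy container name come from the PR text.

```go
package main

import "fmt"

// proxyContainer is a hypothetical, trimmed-down stand-in for the container
// spec the PR builds. It is not the project's real type.
type proxyContainer struct {
	Name          string
	Command       []string
	RestartPolicy string // "Always" turns an init container into a native sidecar
	StartupPath   string // HTTP path of the startup probe; empty means no probe
	HealthPort    int    // 0 means no health port is declared
}

// buildProxy sketches the split: timeoutSeconds == 0 is the long-lived runtime
// proxy, timeoutSeconds > 0 is the self-terminating init-Job proxy.
func buildProxy(instance string, timeoutSeconds int) proxyContainer {
	c := proxyContainer{Name: "cloudsql-proxy"}

	if timeoutSeconds > 0 {
		// Init-Job variant: shell-wrapped and self-killing, with no
		// RestartPolicy, no probe and no health server, so the Job can finish.
		// The script text is illustrative only.
		script := fmt.Sprintf("/cloud-sql-proxy %s & pid=$!; sleep %d; kill -9 $pid",
			instance, timeoutSeconds)
		c.Command = []string{"sh", "-c", script}
		return c
	}

	// Runtime variant: run the binary directly, expose the health server on
	// all interfaces so the kubelet can probe it, and gate app start-up on it.
	c.Command = []string{
		"/cloud-sql-proxy", instance,
		"--health-check", "--http-address=0.0.0.0", "--http-port=9090",
	}
	c.RestartPolicy = "Always"
	c.StartupPath = "/startup"
	c.HealthPort = 9090
	return c
}

func main() {
	for _, timeout := range []int{0, 30} {
		c := buildProxy("my-project:my-region:my-db", timeout)
		fmt.Printf("timeout=%d restartPolicy=%q startup=%q cmd=%v\n",
			timeout, c.RestartPolicy, c.StartupPath, c.Command)
	}
}
```

Keeping the two variants in one pure function makes the "init-Job proxy must never become a sidecar" invariant easy to assert in a unit test.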
Addresses a multimodel review of the cloudsql-proxy native-sidecar change:

 - Add a LivenessProbe (/liveness) to the runtime proxy. On a native sidecar a failing readiness probe neither restarts the container nor gates pod readiness; only liveness recovers a proxy that passed startup then hung (deadlock / pool exhaustion / partial-OOM). /liveness is already served by --health-check.
 - Extract attachCloudsqlProxyAsNativeSidecar() so the load-bearing placement (InitContainerOutputs, NOT SidecarOutputs) is unit-tested -- a future refactor appending to SidecarOutputs would otherwise silently reintroduce the race.
 - Pin the previously NotNil-only assertions: exact health port (9090/csql-hc), probe paths (/startup, /readiness, /liveness), and probe<->port-name agreement, plus the base command flags (--address, --credentials-file) and the credential VolumeMount name/path. The init-Job proxy is asserted to carry none of these.
 - docs: document the native-sidecar Cloud SQL Auth Proxy connectivity model in the gcp-cloudsql-postgres reference.

No behavior change to the shipped wiring (review verdict: APPROVE-WITH-NITS, no blockers); these are robustness + coverage hardening.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
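The probe<->port-name agreement pinned by this commit could be checked along these lines. This is a sketch under assumptions: the port, httpProbe, and sidecar types and the runtimeProxy and checkProbes functions are hypothetical names invented for the example, and the real tests presumably assert against the project's own container types. The 9090/csql-hc port and the three probe paths are taken from the commit message.

```go
package main

import "fmt"

// Hypothetical minimal shapes; the real code uses the project's container types.
type port struct {
	Name   string
	Number int
}

type httpProbe struct {
	Path     string
	PortName string
}

type sidecar struct {
	Ports     []port
	Startup   *httpProbe
	Readiness *httpProbe
	Liveness  *httpProbe
}

// runtimeProxy returns the probe wiring described in the commit message.
func runtimeProxy() sidecar {
	const healthPortName = "csql-hc"
	return sidecar{
		Ports:     []port{{Name: healthPortName, Number: 9090}},
		Startup:   &httpProbe{Path: "/startup", PortName: healthPortName},
		Readiness: &httpProbe{Path: "/readiness", PortName: healthPortName},
		Liveness:  &httpProbe{Path: "/liveness", PortName: healthPortName},
	}
}

// checkProbes returns an error for the first pinned expectation that fails:
// each probe must exist, use its exact path, and reference a declared port
// name that resolves to the 9090 health port.
func checkProbes(s sidecar) error {
	declared := map[string]int{}
	for _, p := range s.Ports {
		declared[p.Name] = p.Number
	}
	expected := []struct {
		kind  string
		probe *httpProbe
		path  string
	}{
		{"startup", s.Startup, "/startup"},
		{"readiness", s.Readiness, "/readiness"},
		{"liveness", s.Liveness, "/liveness"},
	}
	for _, e := range expected {
		if e.probe == nil {
			return fmt.Errorf("%s probe is missing", e.kind)
		}
		if e.probe.Path != e.path {
			return fmt.Errorf("%s probe path is %q, want %q", e.kind, e.probe.Path, e.path)
		}
		if declared[e.probe.PortName] != 9090 {
			return fmt.Errorf("%s probe port %q is not the declared 9090 health port",
				e.kind, e.probe.PortName)
		}
	}
	return nil
}

func main() {
	if err := checkProbes(runtimeProxy()); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("probes and health port agree")
}
```

Checking the port by name rather than by number alone catches the case where a probe is retargeted to a different named port that no longer matches the health server.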
Multimodel review: verdict APPROVE-WITH-NITS (no blockers)

Ran a 4-lens review (k8s native-sidecar correctness, regression/blast-radius, test-coverage, Go-correctness) across two model families, then an adversarial verify pass on every finding (6 confirmed / 0 rejected) and a synthesis.

Core change is SOUND: native sidecar via init container +

Follow-up commit

should-fix

nice-to-have

All assertions also verify the init-Job proxy carries none of these (no probes, no RestartPolicy, no health port).

Build +

Still staging-first before bumping the pinned SC version in consumers.
Problem
The runtime CloudSQL proxy is injected as a plain sidecar container (compute_proc.go → SidecarOutputs → PodSpec.Containers). Sidecar containers start in parallel with the app container with no ordering guarantee, and the proxy container carries no probe. So on every pod (re)start (GKE Autopilot node scale-down, VPA updateMode: Auto eviction, or a rollout) the app process dials localhost:5432 before the proxy is listening and logs a connection-refused error.

It self-heals (the client retries), so there's no data loss, but it's recurring log noise across every GCP SC stack, and for app containers that have a startup probe it costs an extra restart to recover (observed in prod: a single-replica web pod took a 137/startup-probe-fail restart on relocation, briefly dropping its uptime check).

The existing probe machinery in deployment.go can't help: readiness/startup probes are only attached to the ingress/app container (or a container with a single port), never to the proxy, and port-less worker containers get no gate at all.

Fix
Run the runtime proxy as a native sidecar (an init container with RestartPolicy: Always) gated by a startup probe against the proxy's built-in --health-check HTTP server. Per the Kubernetes SidecarContainers contract, the app containers don't start until the proxy's startup probe passes, i.e. until the proxy is actually listening. The race is eliminated for web and worker tiers alike.

Two files:

- cloudsql_proxy.go: when timeout == 0 (runtime proxy), enable --health-check (--http-address=0.0.0.0 --http-port=9090), declare the health port, and add RestartPolicy: Always plus StartupProbe/ReadinessProbe on /startup and /readiness. Refactored into pure cloudsqlProxyCommandArgs / cloudsqlProxyContainerArgs helpers so the behaviour is unit-testable.
- compute_proc.go: append the runtime proxy to InitContainerOutputs instead of SidecarOutputs.

Scope / safety
- Scoped to the runtime proxy (timeout == 0). The init-Job proxy (timeout > 0, the self-killing variant used by the db-user-init Job) stays an ordinary terminating container: RestartPolicy: Always on a RestartPolicy: Never Job's container would keep the Job from ever completing. This invariant is now covered by a test.
- The --health-check server binds 0.0.0.0 (the default is localhost) so the kubelet probe can reach it.
- The #340 per-container VPA policy matches the proxy by name (cloudsql-proxy), so it is unaffected by the init/regular placement.
- Requires GKE >= 1.29 (SidecarContainers, beta-on-by-default in 1.29, GA in 1.33). Target clusters are on 1.34.

Follow-up (not in this PR)
Native sidecars also make the init-Job proxy's sleep / kill -9 timeout hack unnecessary (native sidecars in Jobs terminate when the main container exits), so it can be simplified in a later change.

Tests
- TestCloudsqlProxyCommandArgs_RuntimeEnablesHealthCheck: runtime proxy runs the binary directly with --health-check + --http-address=0.0.0.0, not shell-wrapped.
- TestCloudsqlProxyCommandArgs_InitJobSelfKills: init-Job proxy is sh -c wrapped, self-kills, no health server.
- TestCloudsqlProxyContainerArgs_RuntimeIsNativeSidecar: RestartPolicy: Always + startup/readiness probes + health port present.
- TestCloudsqlProxyContainerArgs_InitJobIsNotSidecar: no RestartPolicy, no probe, no port (regression guard against accidentally making the Job proxy a sidecar).

go build ./pkg/clouds/..., go vet, and the full gcp + kubernetes package suites pass.

Rollout
Staging-first. This changes pod startup ordering; validate on a staging stack (confirm the proxy appears under initContainers with restartPolicy: Always, the app waits for it, and no connection-refused on a forced pod delete) before bumping the pinned SC version in the consuming repos.
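The first staging check (proxy under initContainers with restartPolicy: Always) could be automated with a small helper like the one below. It is only a sketch: the pod and container types and the proxyIsNativeSidecar function are hypothetical, and the input is assumed to be the JSON form of a pod, for example as printed by kubectl get pod -o json. The cloudsql-proxy container name is the one mentioned in the scope notes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// container and pod decode only the fields this check needs from a pod's JSON.
type container struct {
	Name          string `json:"name"`
	RestartPolicy string `json:"restartPolicy"`
}

type pod struct {
	Spec struct {
		InitContainers []container `json:"initContainers"`
		Containers     []container `json:"containers"`
	} `json:"spec"`
}

// proxyIsNativeSidecar reports whether cloudsql-proxy is absent from the
// regular containers and present as an init container with
// restartPolicy: Always.
func proxyIsNativeSidecar(podJSON []byte) (bool, error) {
	var p pod
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return false, fmt.Errorf("decode pod: %w", err)
	}
	for _, c := range p.Spec.Containers {
		if c.Name == "cloudsql-proxy" {
			return false, nil // still a plain sidecar: the race is back
		}
	}
	for _, c := range p.Spec.InitContainers {
		if c.Name == "cloudsql-proxy" {
			return c.RestartPolicy == "Always", nil
		}
	}
	return false, nil // proxy not found at all
}

func main() {
	sample := []byte(`{"spec":{
		"initContainers":[{"name":"cloudsql-proxy","restartPolicy":"Always"}],
		"containers":[{"name":"app"}]}}`)
	ok, err := proxyIsNativeSidecar(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("native sidecar:", ok)
}
```

The other two checks (the app waits for the proxy, and no connection-refused on a forced pod delete) depend on runtime behaviour and logs, so they are not covered by this spec-only helper.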