fix(gke): run cloudsql-proxy as a native sidecar to end the startup race#347

Open
Cre-eD wants to merge 3 commits into main from fix/cloudsql-proxy-native-sidecar

Cre-eD commented Jun 27, 2026

Problem

The runtime CloudSQL proxy is injected as a plain sidecar container (compute_proc.go → SidecarOutputs → PodSpec.Containers). Sidecar containers start in parallel with the app container with no ordering guarantee, and the proxy container carries no probe. So on every pod (re)start (GKE Autopilot node scale-down, VPA updateMode: Auto eviction, or a rollout) the app process dials localhost:5432 before the proxy is listening and logs:

error connecting in 'pool-1': connection to server at "127.0.0.1", port 5432 failed: Connection refused

It self-heals (the client retries), so there's no data loss — but it's recurring log noise across every GCP SC stack, and for app containers that have a startup probe it costs an extra restart to recover (observed in prod: a single-replica web pod took a 137/startup-probe-fail restart on relocation, briefly dropping its uptime check).

The existing probe machinery in deployment.go can't help: readiness/startup probes are only attached to the ingress/app container (or a container with a single port), never to the proxy — and port-less worker containers get no gate at all.
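The race described above can be reproduced outside Kubernetes. The following is a minimal sketch, not code from this PR: a stand-in "proxy" starts listening only after a delay, and a client dials it with retries. The helper names (dialWithRetry, reserveAddr, startDelayedListener) and all timings are illustrative assumptions.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialWithRetry tries to open a TCP connection to addr up to maxAttempts
// times, sleeping backoff between failures. It returns the number of
// attempts made and the last error (nil on success).
func dialWithRetry(addr string, maxAttempts int, backoff time.Duration) (int, error) {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		conn, err := net.DialTimeout("tcp", addr, time.Second)
		if err == nil {
			conn.Close()
			return attempt, nil
		}
		lastErr = err
		if attempt < maxAttempts {
			time.Sleep(backoff)
		}
	}
	return maxAttempts, lastErr
}

// reserveAddr returns a loopback address that was free a moment ago.
func reserveAddr() (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	addr := ln.Addr().String()
	return addr, ln.Close()
}

// startDelayedListener is a stand-in for the proxy: it begins accepting
// connections on addr only after delay has elapsed.
func startDelayedListener(addr string, delay time.Duration) {
	go func() {
		time.Sleep(delay)
		ln, err := net.Listen("tcp", addr)
		if err != nil {
			return
		}
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			conn.Close()
		}
	}()
}

func main() {
	addr, err := reserveAddr()
	if err != nil {
		panic(err)
	}
	// The "app" dials immediately; the "proxy" is not listening yet.
	startDelayedListener(addr, 100*time.Millisecond)
	attempts, err := dialWithRetry(addr, 20, 50*time.Millisecond)
	fmt.Printf("connected after %d attempt(s), err=%v\n", attempts, err)
}
```

On a typical run the first attempts fail with connection refused and a later one succeeds, which is the self-healing behaviour the logs show. A client without retries, or one behind a tight startup probe, would instead fail outright.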

Fix

Run the runtime proxy as a native sidecar — an init container with RestartPolicy: Always — gated by a startup probe against the proxy's built-in --health-check HTTP server. Per the Kubernetes SidecarContainers contract, the app containers don't start until the proxy's startup probe passes, i.e. until the proxy is actually listening. The race is eliminated for web and worker tiers alike.

Two files:

  • cloudsql_proxy.go — when timeout == 0 (runtime proxy): enable --health-check (--http-address=0.0.0.0 --http-port=9090), declare the health port, add RestartPolicy: Always + StartupProbe/ReadinessProbe on /startup & /readiness. Refactored into pure cloudsqlProxyCommandArgs / cloudsqlProxyContainerArgs helpers so the behaviour is unit-testable.
  • compute_proc.go — append the runtime proxy to InitContainerOutputs instead of SidecarOutputs.
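To make the timeout == 0 versus timeout > 0 split concrete, here is a hedged sketch of what a pure command/args helper could look like. It is not the PR's cloudsqlProxyCommandArgs: the function name, the proxyParams struct, the binary path, and the exact shape of the self-killing shell script are assumptions for illustration. Only the flags themselves (--address, --credentials-file, --health-check, --http-address, --http-port) are taken from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// proxyBinary is an assumed path to the proxy binary inside the image.
const proxyBinary = "/cloud-sql-proxy"

// proxyParams is a hypothetical input struct for this sketch.
type proxyParams struct {
	Instance        string // instance connection name (placeholder value below)
	CredentialsFile string // path of the mounted credentials file
	ListenAddress   string // address the proxy listens on for database clients
	TimeoutSeconds  int    // 0 = long-lived runtime proxy, >0 = init-Job proxy
}

// proxyCommandArgs returns the container command and args.
// Runtime proxy: the binary runs directly with the health server enabled.
// Init-Job proxy: the binary is wrapped in sh -c and killed after the timeout.
func proxyCommandArgs(p proxyParams) (command, args []string) {
	base := []string{
		"--address=" + p.ListenAddress,
		"--credentials-file=" + p.CredentialsFile,
		p.Instance,
	}
	if p.TimeoutSeconds == 0 {
		health := []string{"--health-check", "--http-address=0.0.0.0", "--http-port=9090"}
		return []string{proxyBinary}, append(base, health...)
	}
	// Assumed script shape: background the proxy, sleep, then kill it.
	script := fmt.Sprintf("%s %s & PID=$!; sleep %d; kill -9 $PID",
		proxyBinary, strings.Join(base, " "), p.TimeoutSeconds)
	return []string{"sh", "-c"}, []string{script}
}

// hasArg reports whether args contains want as an exact element.
func hasArg(args []string, want string) bool {
	for _, a := range args {
		if a == want {
			return true
		}
	}
	return false
}

func main() {
	p := proxyParams{
		Instance:        "proj:region:inst",
		CredentialsFile: "/creds/key.json",
		ListenAddress:   "0.0.0.0",
	}
	cmd, args := proxyCommandArgs(p)
	fmt.Println("runtime:", cmd, args)

	p.TimeoutSeconds = 30
	cmd, args = proxyCommandArgs(p)
	fmt.Println("init-job:", cmd, args)
}
```

Keeping the helper free of cluster or SDK types is what makes it testable with plain slice comparisons.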

Scope / safety

  • Deployment proxy only. The init-Job proxy (timeout > 0, the self-killing variant used by the db-user-init Job) stays an ordinary terminating container — RestartPolicy: Always on a RestartPolicy: Never Job's container would keep the Job from ever completing. This invariant is now covered by a test.
  • --health-check binds 0.0.0.0 (default is localhost) so the kubelet probe can reach it.
  • The #340 per-container VPA policy matches the proxy by name (cloudsql-proxy), unaffected by the init/regular placement.
  • Requires GKE ≥ 1.29 (SidecarContainers, beta-on-by-default in 1.29, GA in 1.33). Target clusters are on 1.34.

Follow-up (not in this PR)

Native sidecars also make the init-Job proxy's sleep/kill -9 timeout hack unnecessary (native sidecars in Jobs terminate when the main container exits) — can be simplified in a later change.

Tests

  • TestCloudsqlProxyCommandArgs_RuntimeEnablesHealthCheck — runtime proxy runs the binary directly with --health-check + --http-address=0.0.0.0, not shell-wrapped.
  • TestCloudsqlProxyCommandArgs_InitJobSelfKills — init-Job proxy is sh -c wrapped, self-kills, no health server.
  • TestCloudsqlProxyContainerArgs_RuntimeIsNativeSidecar: the runtime proxy carries RestartPolicy: Always, startup/readiness probes, and the health port.
  • TestCloudsqlProxyContainerArgs_InitJobIsNotSidecar — no RestartPolicy, no probe, no port (regression guard against accidentally making the Job proxy a sidecar).
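The two container-shape tests above encode one invariant: sidecar fields appear only when timeout == 0. The sketch below shows that invariant with local stand-in types rather than the repository's real container args type, which is not shown in this PR description. All type and function names here are hypothetical; the port name csql-hc, port 9090, and the probe paths come from the PR text.

```go
package main

import (
	"errors"
	"fmt"
)

// probe, port, and container are simplified stand-ins for the real types.
type probe struct {
	Path     string
	PortName string
}

type port struct {
	Name   string
	Number int
}

type container struct {
	Name           string
	RestartPolicy  string // "" means the field is left unset
	Ports          []port
	StartupProbe   *probe
	ReadinessProbe *probe
}

// proxyContainer builds the proxy container for the given timeout.
// Only the runtime proxy (timeout == 0) gets the native-sidecar fields.
func proxyContainer(timeoutSeconds int) container {
	c := container{Name: "cloudsql-proxy"}
	if timeoutSeconds > 0 {
		return c // init-Job proxy: ordinary terminating container
	}
	c.RestartPolicy = "Always"
	c.Ports = []port{{Name: "csql-hc", Number: 9090}}
	c.StartupProbe = &probe{Path: "/startup", PortName: "csql-hc"}
	c.ReadinessProbe = &probe{Path: "/readiness", PortName: "csql-hc"}
	return c
}

// checkShape verifies that c has the shape expected for the given timeout.
func checkShape(c container, timeoutSeconds int) error {
	hasProbe := c.StartupProbe != nil || c.ReadinessProbe != nil
	if timeoutSeconds > 0 {
		if c.RestartPolicy != "" || hasProbe || len(c.Ports) > 0 {
			return errors.New("init-Job proxy must not carry RestartPolicy, probes, or ports")
		}
		return nil
	}
	if c.RestartPolicy != "Always" {
		return fmt.Errorf("runtime proxy RestartPolicy = %q, want Always", c.RestartPolicy)
	}
	if c.StartupProbe == nil || c.ReadinessProbe == nil {
		return errors.New("runtime proxy needs startup and readiness probes")
	}
	if len(c.Ports) == 0 {
		return errors.New("runtime proxy must declare the health port")
	}
	return nil
}

func main() {
	for _, timeout := range []int{0, 30} {
		err := checkShape(proxyContainer(timeout), timeout)
		fmt.Printf("timeout=%d shape ok: %v\n", timeout, err == nil)
	}
}
```

The regression guard is the mismatched case: a container built for the runtime path must fail the check when treated as the Job proxy.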

go build ./pkg/clouds/..., go vet, and the full gcp + kubernetes package suites pass.

Rollout

Staging-first. This changes pod startup ordering; validate on a staging stack (confirm the proxy appears under initContainers with restartPolicy: Always, the app waits for it, and no connection-refused on a forced pod delete) before bumping the pinned SC version in the consuming repos.
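The first staging check (proxy under initContainers with restartPolicy: Always) can be automated. This is a rough sketch, assuming the input is a single pod manifest in the JSON form that `kubectl get pod <name> -o json` emits. The function name verifyNativeSidecar and the sample manifest are illustrative and not part of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// containerInfo holds the only fields this check needs.
type containerInfo struct {
	Name          string `json:"name"`
	RestartPolicy string `json:"restartPolicy"`
}

// podManifest mirrors the relevant part of a Pod's JSON representation.
type podManifest struct {
	Spec struct {
		InitContainers []containerInfo `json:"initContainers"`
		Containers     []containerInfo `json:"containers"`
	} `json:"spec"`
}

// verifyNativeSidecar returns nil when proxyName is an init container with
// restartPolicy Always and is absent from the regular containers.
func verifyNativeSidecar(podJSON []byte, proxyName string) error {
	var pod podManifest
	if err := json.Unmarshal(podJSON, &pod); err != nil {
		return fmt.Errorf("decode pod: %w", err)
	}
	for _, c := range pod.Spec.Containers {
		if c.Name == proxyName {
			return fmt.Errorf("%s is still a regular container", proxyName)
		}
	}
	for _, c := range pod.Spec.InitContainers {
		if c.Name == proxyName {
			if c.RestartPolicy != "Always" {
				return fmt.Errorf("%s restartPolicy = %q, want Always", proxyName, c.RestartPolicy)
			}
			return nil
		}
	}
	return fmt.Errorf("%s not found under initContainers", proxyName)
}

// samplePod is a trimmed, made-up manifest used only to exercise the check.
const samplePod = `{
  "spec": {
    "initContainers": [{"name": "cloudsql-proxy", "restartPolicy": "Always"}],
    "containers": [{"name": "app"}]
  }
}`

func main() {
	if err := verifyNativeSidecar([]byte(samplePod), "cloudsql-proxy"); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok: cloudsql-proxy is a native sidecar")
}
```

This covers only the placement check. Whether the app actually waits for the proxy, and whether connection-refused disappears on a forced pod delete, still has to be observed on the staging stack itself.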

The runtime CloudSQL proxy was injected as a plain sidecar container, so it
started in parallel with the app container with no ordering guarantee. On
every pod (re)start -- node scale-down, VPA eviction, rollout -- the app
dialed localhost:5432 before the proxy was listening and logged
connection-refused; app containers carrying a startup probe took an extra
restart to recover. The probe machinery only ever gated the ingress/app
container, never the proxy, and port-less worker containers got no gate at all.

Run the runtime proxy instead as a native sidecar: an init container with
RestartPolicy: Always, gated by a startup probe against the proxy's built-in
--health-check server. Kubernetes will not start the app containers until the
proxy is listening, so the race is gone for web and worker tiers alike.

Scoped to the long-lived Deployment proxy (timeout == 0). The init-Job proxy
(timeout > 0) stays an ordinary terminating container on purpose --
RestartPolicy: Always there would keep a RestartPolicy: Never Job from ever
completing. Requires GKE >= 1.29 (SidecarContainers, GA in 1.33); prod is 1.34.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

github-actions Bot commented Jun 27, 2026

Semgrep Scan Results

Repository: api | Commit: 641fbae

Check | Status | Details
⚠️ Semgrep | Warning | 1 warning(s), 1 total

Scanned at 2026-06-27 20:03 UTC

github-actions Bot commented Jun 27, 2026

Security Scan Results

Repository: api | Commit: 641fbae

Check | Status | Details
✅ Secret Scan | Pass | No secrets detected
✅ Dependencies (Trivy) | Pass | 0 total (no critical/high)
✅ Dependencies (Grype) | Pass | 0 total (no critical/high)
📦 SBOM | Generated | 523 components (CycloneDX)

Scanned at 2026-06-27 20:03 UTC

github-actions Bot commented Jun 27, 2026

📊 Statement coverage

Measured on the documented included set (see docs/TESTING.md → Coverage scope). Observe-only — no regression gate is enforced yet.

Scope | This PR | main baseline | Δ
Included set (Gold-tier denominator) | 90.3% | 90.3% | +0.0 pp
Full set (whole repo, transparency) | 27.9% | 27.9% | +0.0 pp

Baseline: main @ 842404b

smecsia added the ci-run label Jun 27, 2026

Addresses a multimodel review of the cloudsql-proxy native-sidecar change:

- Add a LivenessProbe (/liveness) to the runtime proxy. On a native sidecar a
  failing readiness probe only marks the pod not-ready and never restarts the
  container; only liveness recovers a proxy that passed startup then hung
  (deadlock / pool exhaustion / partial-OOM). /liveness is already served by
  --health-check.
- Extract attachCloudsqlProxyAsNativeSidecar() so the load-bearing placement
  (InitContainerOutputs, NOT SidecarOutputs) is unit-tested -- a future refactor
  appending to SidecarOutputs would otherwise silently reintroduce the race.
- Pin the previously NotNil-only assertions: exact health port (9090/csql-hc),
  probe paths (/startup,/readiness,/liveness), and probe<->port-name agreement,
  plus the base command flags (--address, --credentials-file) and the credential
  VolumeMount name/path. The init-Job proxy is asserted to carry none of these.
- docs: document the native-sidecar Cloud SQL Auth Proxy connectivity model in
  the gcp-cloudsql-postgres reference.

No behavior change to the shipped wiring (review verdict: APPROVE-WITH-NITS, no
blockers); these are robustness + coverage hardening.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD commented Jun 27, 2026

Multimodel review — verdict: APPROVE-WITH-NITS (no blockers)

Ran a 4-lens review (k8s native-sidecar correctness, regression/blast-radius, test-coverage, Go-correctness) across two model families, then an adversarial verify pass on every finding (6 confirmed / 0 rejected) and a synthesis. Core change is SOUND: native sidecar via init container + RestartPolicy: Always + StartupProbe is the correct, GA (GKE ≥1.33) way to gate app start on proxy readiness; InitContainerOutputs → PodSpec.InitContainers placement verified; the init-Job (timeout>0) proxy correctly left as a plain terminating container; Deployment-only scope is right. No correctness defects.

Follow-up commit 6c55231 addresses every confirmed finding:

should-fix

  1. No LivenessProbe: on a native sidecar a failing readiness probe only marks the pod not-ready and never restarts the container; only liveness recovers a proxy that passed startup then hung (deadlock / pool exhaustion / partial-OOM). Added a /liveness probe (already served by --health-check).
  2. Load-bearing placement untested — the one line that matters (InitContainerOutputs, not SidecarOutputs) had no test. Extracted attachCloudsqlProxyAsNativeSidecar() and unit-tested that the proxy lands in init containers and not in regular containers.

nice-to-have

  3. Probe/port wiring was asserted only NotNil: pinned exact health port (9090/csql-hc), probe paths (/startup, /readiness, /liveness), and the probe↔port-name agreement check.
  4. Base command flags unpinned: added --address / --credentials-file=… assertions (the credentials flag is now tied to the VolumeMount path).
  5. VolumeMount name untested after the refactor: added a mount name/path/readOnly test.

All assertions also verify the init-Job proxy carries none of these (no probes, no RestartPolicy, no health port). Build + go vet + full gcp/kubernetes suites green. Docs: documented the native-sidecar connectivity model under gcp-cloudsql-postgres in the resource reference.
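Findings 2 and 3 can be illustrated together: the placement helper and the probe↔port-name agreement check. The sketch below uses local stand-in types. podOutputs and its fields only loosely mirror InitContainerOutputs and SidecarOutputs, and attachProxyAsNativeSidecar and probesMatchPorts are hypothetical names, not the PR's attachCloudsqlProxyAsNativeSidecar().

```go
package main

import "fmt"

// probe, port, container, and podOutputs are simplified stand-ins.
type probe struct {
	Path     string
	PortName string
}

type port struct {
	Name   string
	Number int
}

type container struct {
	Name          string
	RestartPolicy string
	Ports         []port
	Startup       *probe
	Readiness     *probe
	Liveness      *probe
}

type podOutputs struct {
	InitContainers []container // where a native sidecar must land
	Sidecars       []container // plain sidecars: started in parallel with the app
}

// attachProxyAsNativeSidecar is the load-bearing placement: the proxy goes
// into the init containers, never into the plain sidecar list.
func attachProxyAsNativeSidecar(out *podOutputs, proxy container) {
	out.InitContainers = append(out.InitContainers, proxy)
}

// probesMatchPorts checks that every probe targets a declared port name.
func probesMatchPorts(c container) error {
	names := map[string]bool{}
	for _, p := range c.Ports {
		names[p.Name] = true
	}
	for _, pr := range []*probe{c.Startup, c.Readiness, c.Liveness} {
		if pr != nil && !names[pr.PortName] {
			return fmt.Errorf("probe %s targets undeclared port %q", pr.Path, pr.PortName)
		}
	}
	return nil
}

// runtimeProxy builds a runtime proxy container with all three probes.
func runtimeProxy() container {
	const hc = "csql-hc"
	return container{
		Name:          "cloudsql-proxy",
		RestartPolicy: "Always",
		Ports:         []port{{Name: hc, Number: 9090}},
		Startup:       &probe{Path: "/startup", PortName: hc},
		Readiness:     &probe{Path: "/readiness", PortName: hc},
		Liveness:      &probe{Path: "/liveness", PortName: hc},
	}
}

func main() {
	var out podOutputs
	attachProxyAsNativeSidecar(&out, runtimeProxy())
	fmt.Println("init containers:", len(out.InitContainers), "plain sidecars:", len(out.Sidecars))
	fmt.Println("probe/port agreement error:", probesMatchPorts(out.InitContainers[0]))
}
```

Testing the placement through a named helper means a refactor that appends to the plain sidecar list fails a unit test instead of silently bringing the race back.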

Still staging-first before bumping the pinned SC version in consumers.
