fix(gke): run cloudsql-proxy as a native sidecar to end the startup race by Cre-eD · Pull Request #347 · simple-container-com/api

Cre-eD · 2026-06-27T17:14:35Z

Problem

The runtime CloudSQL proxy is injected as a plain sidecar container (compute_proc.go → SidecarOutputs → PodSpec.Containers). Sidecar containers start in parallel with the app container with no ordering guarantee, and the proxy container carries no probe. So on every pod (re)start — GKE Autopilot node scale-down, VPA updateMode: Auto eviction, or a rollout — the app process dials localhost:5432 before the proxy is listening and logs:

error connecting in 'pool-1': connection to server at "127.0.0.1", port 5432 failed: Connection refused

It self-heals (the client retries), so there's no data loss — but it's recurring log noise across every GCP SC stack, and for app containers that have a startup probe it costs an extra restart to recover (observed in prod: a single-replica web pod took a 137/startup-probe-fail restart on relocation, briefly dropping its uptime check).

The existing probe machinery in deployment.go can't help: readiness/startup probes are only attached to the ingress/app container (or a container with a single port), never to the proxy — and port-less worker containers get no gate at all.

Fix

Run the runtime proxy as a native sidecar — an init container with RestartPolicy: Always — gated by a startup probe against the proxy's built-in --health-check HTTP server. Per the Kubernetes SidecarContainers contract, the app containers don't start until the proxy's startup probe passes, i.e. until the proxy is actually listening. The race is eliminated for web and worker tiers alike.

Two files:

cloudsql_proxy.go — when timeout == 0 (runtime proxy): enable --health-check (--http-address=0.0.0.0 --http-port=9090), declare the health port, add RestartPolicy: Always + StartupProbe/ReadinessProbe on /startup & /readiness. Refactored into pure cloudsqlProxyCommandArgs / cloudsqlProxyContainerArgs helpers so the behaviour is unit-testable.
compute_proc.go — append the runtime proxy to InitContainerOutputs instead of SidecarOutputs.

Scope / safety

Deployment proxy only. The init-Job proxy (timeout > 0, the self-killing variant used by the db-user-init Job) stays an ordinary terminating container — RestartPolicy: Always on a RestartPolicy: Never Job's container would keep the Job from ever completing. This invariant is now covered by a test.
--health-check binds 0.0.0.0 (default is localhost) so the kubelet probe can reach it.
The #340 per-container VPA policy matches the proxy by name (cloudsql-proxy), unaffected by the init/regular placement.
Requires GKE ≥ 1.29 (SidecarContainers, beta-on-by-default in 1.29, GA in 1.33). Target clusters are on 1.34.
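The Deployment-only invariant from the first item can be sketched as follows. The container name, health port 9090, port name csql-hc, and probe paths are taken from this PR and its discussion; the `probe` and `containerSpec` types and the `proxyContainer` function are simplified, hypothetical stand-ins rather than the real Kubernetes API types or the PR's code.

```go
package main

import "fmt"

// probe and containerSpec are simplified stand-ins for the container fields
// discussed above; they are not the real Kubernetes API types.
type probe struct {
	Path     string
	PortName string
}

type containerSpec struct {
	Name           string
	RestartPolicy  string // "" means the field is left unset
	HealthPortName string
	HealthPort     int
	StartupProbe   *probe
	ReadinessProbe *probe
}

const (
	healthPortName = "csql-hc"
	healthPort     = 9090
)

// proxyContainer is a hypothetical builder. Only the runtime proxy
// (timeoutSec == 0) becomes a native sidecar; the init-Job proxy stays a
// plain terminating container so the Job can complete.
func proxyContainer(timeoutSec int) containerSpec {
	c := containerSpec{Name: "cloudsql-proxy"}
	if timeoutSec > 0 {
		return c
	}
	c.RestartPolicy = "Always"
	c.HealthPortName = healthPortName
	c.HealthPort = healthPort
	c.StartupProbe = &probe{Path: "/startup", PortName: healthPortName}
	c.ReadinessProbe = &probe{Path: "/readiness", PortName: healthPortName}
	return c
}

// isNativeSidecar reports whether the spec carries the native-sidecar markers.
func isNativeSidecar(c containerSpec) bool {
	return c.RestartPolicy == "Always" && c.StartupProbe != nil
}

func main() {
	for _, timeout := range []int{0, 30} {
		c := proxyContainer(timeout)
		fmt.Printf("timeout=%d nativeSidecar=%t healthPort=%d\n",
			timeout, isNativeSidecar(c), c.HealthPort)
	}
}
```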

Follow-up (not in this PR)

Native sidecars also make the init-Job proxy's sleep/kill -9 timeout hack unnecessary (native sidecars in Jobs terminate when the main container exits) — can be simplified in a later change.

Tests

TestCloudsqlProxyCommandArgs_RuntimeEnablesHealthCheck — runtime proxy runs the binary directly with --health-check + --http-address=0.0.0.0, not shell-wrapped.
TestCloudsqlProxyCommandArgs_InitJobSelfKills — init-Job proxy is sh -c wrapped, self-kills, no health server.
TestCloudsqlProxyContainerArgs_RuntimeIsNativeSidecar — RestartPolicy: Always + startup/readiness probes + health port present.
TestCloudsqlProxyContainerArgs_InitJobIsNotSidecar — no RestartPolicy, no probe, no port (regression guard against accidentally making the Job proxy a sidecar).

go build ./pkg/clouds/..., go vet, and the full gcp + kubernetes package suites pass.

Rollout

Staging-first. This changes pod startup ordering; validate on a staging stack (confirm the proxy appears under initContainers with restartPolicy: Always, the app waits for it, and no connection-refused on a forced pod delete) before bumping the pinned SC version in the consuming repos.
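One way to script the first staging check is to inspect the pod manifest, for example the output of `kubectl get pod <pod> -o json`. The sketch below assumes that JSON is already in hand; the helper name `proxyIsNativeSidecar` and the sample manifest are hypothetical, while `spec.initContainers`, `spec.containers`, and `restartPolicy` are the standard pod fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// container and podManifest decode only the fields needed for the check.
type container struct {
	Name          string `json:"name"`
	RestartPolicy string `json:"restartPolicy"`
}

type podManifest struct {
	Spec struct {
		InitContainers []container `json:"initContainers"`
		Containers     []container `json:"containers"`
	} `json:"spec"`
}

// proxyIsNativeSidecar reports whether the named container appears under
// initContainers with restartPolicy Always and not under containers.
func proxyIsNativeSidecar(podJSON []byte, name string) (bool, error) {
	var p podManifest
	if err := json.Unmarshal(podJSON, &p); err != nil {
		return false, err
	}
	for _, c := range p.Spec.Containers {
		if c.Name == name {
			return false, nil // still a plain sidecar
		}
	}
	for _, c := range p.Spec.InitContainers {
		if c.Name == name && c.RestartPolicy == "Always" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// Illustrative manifest, not captured from a real cluster.
	sample := []byte(`{"spec":{
		"initContainers":[{"name":"cloudsql-proxy","restartPolicy":"Always"}],
		"containers":[{"name":"web"}]}}`)

	ok, err := proxyIsNativeSidecar(sample, "cloudsql-proxy")
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println("native sidecar:", ok)
}
```

The connection-refused check on a forced pod delete still has to be confirmed from the app logs; this only covers placement.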

The runtime CloudSQL proxy was injected as a plain sidecar container, so it started in parallel with the app container with no ordering guarantee. On every pod (re)start -- node scale-down, VPA eviction, rollout -- the app dialed localhost:5432 before the proxy was listening and logged connection-refused; app containers carrying a startup probe took an extra restart to recover. The probe machinery only ever gated the ingress/app container, never the proxy, and port-less worker containers got no gate at all.

Run the runtime proxy instead as a native sidecar: an init container with RestartPolicy: Always, gated by a startup probe against the proxy's built-in --health-check server. Kubernetes will not start the app containers until the proxy is listening, so the race is gone for web and worker tiers alike.

Scoped to the long-lived Deployment proxy (timeout == 0). The init-Job proxy (timeout > 0) stays an ordinary terminating container on purpose -- RestartPolicy: Always there would keep a RestartPolicy: Never Job from ever completing.

Requires GKE >= 1.29 (SidecarContainers, GA in 1.33); prod is 1.34.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

github-actions · 2026-06-27T17:15:37Z

Semgrep Scan Results

Repository: api | Commit: 641fbae

| Check | Status | Details |
| --- | --- | --- |
| ⚠️ Semgrep | Warning | 1 warning(s), 1 total |

Scanned at 2026-06-27 20:03 UTC

github-actions · 2026-06-27T17:15:59Z

Security Scan Results

Repository: api | Commit: 641fbae

| Check | Status | Details |
| --- | --- | --- |
| ✅ Secret Scan | Pass | No secrets detected |
| ✅ Dependencies (Trivy) | Pass | 0 total (no critical/high) |
| ✅ Dependencies (Grype) | Pass | 0 total (no critical/high) |
| 📦 SBOM | Generated | 523 components (CycloneDX) |

Scanned at 2026-06-27 20:03 UTC

github-actions · 2026-06-27T17:18:36Z

📊 Statement coverage

Measured on the documented included set (see docs/TESTING.md → Coverage scope). Observe-only — no regression gate is enforced yet.

| Scope | This PR | main baseline | Δ |
| --- | --- | --- | --- |
| Included set (Gold-tier denominator) | `90.3%` | `90.3%` | +0.0 pp |
| Full set (whole repo, transparency) | `27.9%` | `27.9%` | +0.0 pp |

Baseline: main @ 842404b

Addresses a multimodel review of the cloudsql-proxy native-sidecar change:

- Add a LivenessProbe (/liveness) to the runtime proxy. On a native sidecar a failing readiness probe neither restarts the container nor gates pod readiness; only liveness recovers a proxy that passed startup then hung (deadlock / pool exhaustion / partial-OOM). /liveness is already served by --health-check.
- Extract attachCloudsqlProxyAsNativeSidecar() so the load-bearing placement (InitContainerOutputs, NOT SidecarOutputs) is unit-tested -- a future refactor appending to SidecarOutputs would otherwise silently reintroduce the race.
- Pin the previously NotNil-only assertions: exact health port (9090/csql-hc), probe paths (/startup,/readiness,/liveness), and probe<->port-name agreement, plus the base command flags (--address, --credentials-file) and the credential VolumeMount name/path. The init-Job proxy is asserted to carry none of these.
- docs: document the native-sidecar Cloud SQL Auth Proxy connectivity model in the gcp-cloudsql-postgres reference.

No behavior change to the shipped wiring (review verdict: APPROVE-WITH-NITS, no blockers); these are robustness + coverage hardening.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>

Cre-eD · 2026-06-27T19:45:53Z

Multimodel review — verdict: APPROVE-WITH-NITS (no blockers)

Ran a 4-lens review (k8s native-sidecar correctness, regression/blast-radius, test-coverage, Go-correctness) across two model families, then an adversarial verify pass on every finding (6 confirmed / 0 rejected) and a synthesis. Core change is SOUND: native sidecar via init container + RestartPolicy: Always + StartupProbe is the correct, GA (GKE ≥1.33) way to gate app start on proxy readiness; InitContainerOutputs → PodSpec.InitContainers placement verified; the init-Job (timeout>0) proxy correctly left as a plain terminating container; Deployment-only scope is right. No correctness defects.

Follow-up commit 6c55231 addresses every confirmed finding:

should-fix

1. No LivenessProbe: on a native sidecar a failing readiness probe neither restarts the container nor gates pod readiness; only liveness recovers a proxy that passed startup then hung (deadlock / pool exhaustion / partial-OOM). Added a /liveness probe (already served by --health-check).
2. Load-bearing placement untested: the one line that matters (InitContainerOutputs, not SidecarOutputs) had no test. Extracted attachCloudsqlProxyAsNativeSidecar() and unit-tested that the proxy lands in init containers and not in regular containers.

nice-to-have

3. Probe/port wiring was asserted only NotNil: pinned exact health port (9090/csql-hc), probe paths (/startup,/readiness,/liveness), and the probe↔port-name agreement check.
4. Base command flags unpinned: added --address / --credentials-file=… assertions (the credentials flag is now tied to the VolumeMount path).
5. VolumeMount name untested after the refactor: added a mount name/path/readOnly test.

All assertions also verify the init-Job proxy carries none of these (no probes, no RestartPolicy, no health port). Build + go vet + full gcp/kubernetes suites green. Docs: documented the native-sidecar connectivity model under gcp-cloudsql-postgres in the resource reference.
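The placement finding can be made concrete with a minimal sketch of such a helper and the assertion that guards it. InitContainerOutputs and SidecarOutputs are the field names used in this PR; the `podOutputs` and `container` types, the helper `attachAsNativeSidecar`, and its signature are hypothetical simplifications, since the real attachCloudsqlProxyAsNativeSidecar() is not shown here.

```go
package main

import "fmt"

// container and podOutputs are simplified, hypothetical stand-ins for the
// structures that feed PodSpec.InitContainers and PodSpec.Containers.
type container struct {
	Name          string
	RestartPolicy string
}

type podOutputs struct {
	InitContainerOutputs []container
	SidecarOutputs       []container
}

// attachAsNativeSidecar appends the proxy to the init containers with
// RestartPolicy Always. Appending to SidecarOutputs instead would start the
// proxy in parallel with the app and reintroduce the race.
func attachAsNativeSidecar(out *podOutputs, proxy container) {
	proxy.RestartPolicy = "Always"
	out.InitContainerOutputs = append(out.InitContainerOutputs, proxy)
}

// contains reports whether a container with the given name is in the list.
func contains(cs []container, name string) bool {
	for _, c := range cs {
		if c.Name == name {
			return true
		}
	}
	return false
}

func main() {
	out := &podOutputs{SidecarOutputs: []container{{Name: "other-sidecar"}}}
	attachAsNativeSidecar(out, container{Name: "cloudsql-proxy"})

	fmt.Println("in init containers:", contains(out.InitContainerOutputs, "cloudsql-proxy"))
	fmt.Println("in regular sidecars:", contains(out.SidecarOutputs, "cloudsql-proxy"))
}
```

Isolating the one placement line in a helper is what lets a test fail loudly if a later refactor moves the proxy back to the regular containers.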

Still staging-first before bumping the pinned SC version in consumers.

Cre-eD requested review from Laboratory, smecsia and universe-ops as code owners June 27, 2026 17:14

smecsia added the ci-run label Jun 27, 2026

Merge branch 'main' into fix/cloudsql-proxy-native-sidecar

4c699cb

Cre-eD wants to merge 3 commits into main from fix/cloudsql-proxy-native-sidecar
