fix(gke): point web backend health check at /login (no healthy upstream)#6

Merged: gaurav0107 merged 1 commit into main from fix/web-lb-health-check-path on Jun 7, 2026

@gaurav0107 (Collaborator)

Symptom

https://langprobe.daz.co.in/ returns "no healthy upstream" even though the tracebility-web pod is Running and 1/1 Ready.

Root cause

The GCP L7 load balancer provisioned by the tracebility-web Gateway was health-checking the web backend at path /. Next.js answers / with 307 → /login when the request carries no session cookie, which is always the case for a health-check probe. GCP HTTP health checks do not follow redirects and count only an HTTP 200 response as healthy, so the backend NEG sat permanently UNHEALTHY and the LB returned 503s.
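The verdict described above can be modelled in a few lines. This is only an illustrative sketch, not the load balancer's actual logic: the function name `probe_is_healthy` is hypothetical, and it assumes the strict rule that only an HTTP 200 passes (a 307 fails under a looser 2xx rule as well). The status codes are the ones reported in this PR.

```python
def probe_is_healthy(status_code: int) -> bool:
    """Toy model of an HTTP health-check verdict.

    Assumption: only HTTP 200 counts as a pass and redirects are never followed.
    """
    return status_code == 200


# Status codes reported in this PR for each candidate health-check path.
observed = {"/": 307, "/login": 200}

verdicts = {
    path: ("HEALTHY" if probe_is_healthy(code) else "UNHEALTHY")
    for path, code in observed.items()
}
print(verdicts)  # {'/': 'UNHEALTHY', '/login': 'HEALTHY'}
```

Under this model the default path / can never pass, which matches the permanently UNHEALTHY backend seen before the fix.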

A prior fix (commit 31d8f81 "point web probes at /login") fixed the kubelet-side livenessProbe / readinessProbe, but the LB-side health check is configured separately — by default the GKE Gateway controller derives it from the backend Service and falls back to / on port 7090. The override mechanism is HealthCheckPolicy.networking.gke.io targeting the Service.

$ gcloud compute backend-services get-health gkegw1-1b53-tracebility-tracebility-web-7090-xbgwu0fpd8mr --global
... healthState: UNHEALTHY
... ipAddress: 10.7.0.98
... port: 7090

Fix

Add deploy/k8s/gke-gateway/healthcheckpolicy.yaml:

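apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: tracebility-web   # assumed name; the PR description omits metadata
spec:
  default:
    config:
      type: HTTP
      httpHealthCheck: { port: 7090, requestPath: /login }
  targetRef:
    group: ""             # core API group, where Service lives
    kind: Service
    name: tracebility-web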

Updated README.md to document the file alongside gateway.yaml / httproute.yaml.

Verified live

  • Applied the policy via kubectl apply -f.
  • gcloud compute health-checks describe … confirms requestPath: /login.
  • Backend health flipped UNHEALTHY → HEALTHY ~45s after apply.
  • curl -sI https://langprobe.daz.co.in/login → 200.
  • curl -sI https://langprobe.daz.co.in/ → 307 (correct Next redirect, not the LB 503).
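The two curl checks above could be scripted roughly as follows. This is a minimal sketch, not part of the PR: the helper names `head_status`, `mismatches`, and `main` are hypothetical, and the expected codes are simply the ones listed above. `http.client` sends a single HEAD request and does not follow redirects, so a 307 is reported as 307.

```python
import http.client
from urllib.parse import urlsplit

# Expected status codes, taken from the "Verified live" list in this PR.
EXPECTED = {
    "https://langprobe.daz.co.in/login": 200,
    "https://langprobe.daz.co.in/": 307,
}


def head_status(url: str, timeout: float = 10.0) -> int:
    """Send one HEAD request (like `curl -sI`) and return the raw status code."""
    parts = urlsplit(url)
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()


def mismatches(observed: dict, expected: dict = EXPECTED) -> list:
    """Return one message per URL whose observed status differs from the expected one."""
    return [
        f"{url}: expected {want}, got {observed.get(url)}"
        for url, want in expected.items()
        if observed.get(url) != want
    ]


def main() -> None:
    """Probe the live site; requires network access."""
    observed = {url: head_status(url) for url in EXPECTED}
    problems = mismatches(observed)
    print("\n".join(problems) if problems else "all endpoints returned the expected status")
```

Calling `main()` performs the live check. A 503 on either URL would point back at the load balancer rather than the application.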

Test plan

  • CI on this PR goes green (no app-code change; just yaml + a docs hunk).
  • Merge.
  • Subsequent helm-deploy rollouts that recreate the Service don't lose the policy (HealthCheckPolicy is namespaced and survives Service recreation as long as the targetRef name stays tracebility-web).
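The name-matching condition in the last item can be expressed as a small check. This is a hedged sketch only: `policy_targets_service` is a hypothetical helper, the policy is written out as a plain dict mirroring the fields shown in this PR, and `tracebility-web-v2` is an invented name used to show the failure case. It does not query a cluster.

```python
def policy_targets_service(policy: dict, service_name: str) -> bool:
    """Return True if the policy's targetRef points at a Service with this name."""
    ref = policy.get("spec", {}).get("targetRef", {})
    return ref.get("kind") == "Service" and ref.get("name") == service_name


# The HealthCheckPolicy from this PR, limited to the fields shown in the description.
policy = {
    "apiVersion": "networking.gke.io/v1",
    "kind": "HealthCheckPolicy",
    "spec": {
        "default": {
            "config": {
                "type": "HTTP",
                "httpHealthCheck": {"port": 7090, "requestPath": "/login"},
            }
        },
        "targetRef": {"kind": "Service", "name": "tracebility-web"},
    },
}

# A recreated Service with the same name is still covered by the policy.
print(policy_targets_service(policy, "tracebility-web"))     # True
# A renamed Service (hypothetical name) would fall back to the default check.
print(policy_targets_service(policy, "tracebility-web-v2"))  # False
```

A check of this kind could run in CI against the rendered manifests, so that a Service rename does not silently detach the policy.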

🤖 Generated with Claude Code

The GKE Gateway-provisioned GCP LB was health-checking the web backend
at the default path '/'. Next.js returns 307 → /login at '/' (signed-in
state lives in a session cookie); GCP LBs treat anything other than 200
as UNHEALTHY, so the backend NEG was perma-UNHEALTHY and the public
domain served "no healthy upstream" 503s on every deploy.

Earlier fix (31d8f81 "point web probes at /login") fixed the kubelet-
side livenessProbe / readinessProbe but not the LB-side health check —
those are configured separately via HealthCheckPolicy on the Service.

Verified live:
- Backend service health flips UNHEALTHY → HEALTHY ~45s after apply.
- `curl -sI https://langprobe.daz.co.in/login` returns 200.
- `curl -sI https://langprobe.daz.co.in/` returns 307 (correct
  Next.js redirect, not the LB error).

Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
@gaurav0107 gaurav0107 merged commit 8c2c8a2 into main Jun 7, 2026
3 checks passed