Description
Add support for active/passive (primary/fallback) failover between backend services. Users should be able to designate one backend as primary and another as a fallback that only receives traffic when the primary's health degrades, with automatic recovery back to the primary when it becomes healthy again.
This has been requested in #13507.
Motivation
Currently, when an HTTPRoute references multiple backendRefs, they are translated as weighted clusters — traffic is split proportionally. There is no way to express "only use backend B if backend A is unhealthy."
Proposed Approach
Use Envoy's priority-based load balancing. Endpoints from both the primary and fallback backends would be placed in the same Envoy cluster but in different LocalityLbEndpoints groups with different priority levels:
- Primary backend endpoints → priority: 0
- Fallback backend endpoints → priority: 1
Envoy's built-in priority load balancing handles the rest: traffic goes to priority-0 endpoints first, and only spills over to priority-1 when the health of priority-0 drops below a threshold (controlled by the overprovisioning factor — at the default of 1.4, failover triggers when active health drops below ~72%).
Health checks (active or passive via outlier detection) are required for failover detection and automatic recovery.
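For illustration, the translation described above might produce an Envoy cluster roughly like the following. This is a hand-written sketch, not actual kgateway output; the cluster name, endpoint addresses, and outlier-detection thresholds are all illustrative:

```yaml
clusters:
- name: checkout_failover          # illustrative name
  type: STRICT_DNS
  connect_timeout: 1s
  outlier_detection:               # passive health checking (example thresholds)
    consecutive_5xx: 5
    interval: 10s
    base_ejection_time: 30s
  load_assignment:
    cluster_name: checkout_failover
    policy:
      overprovisioning_factor: 140 # Envoy's default, i.e. 1.4
    endpoints:
    - priority: 0                  # primary backend: receives all traffic while healthy
      lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: checkout-primary.default.svc, port_value: 8080}
    - priority: 1                  # fallback backend: only used on spillover
      lb_endpoints:
      - endpoint:
          address:
            socket_address: {address: checkout-fallback.default.svc, port_value: 8080}
```

With this shape, Envoy keeps traffic on priority 0 until its healthy fraction drops below 1/1.4 (~72%), then shifts the remainder to priority 1, and shifts back automatically once priority 0 recovers.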
API Options
A few options for how to expose this in the kgateway API:
- Annotation or field on backendRefs: e.g., a fallback: true annotation or a new field on the backendRef to indicate it is a fallback destination. This works for Backends and kube resources.
- New policy CRD field: e.g., a field on TrafficPolicy or BackendConfigPolicy that designates failover relationships between services. A policy can apply to Backends and kube resources.
- Extension to the Backend CRD: a fallback field on the Backend resource itself. The problem with this is that we may need to add the kube type to the Backend for it to work, and it would change the existing UX for users who reference Backends directly in routes.
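As a rough sketch of the first option, a fallback marker on a backendRef might look like this. The fallback field is hypothetical and not part of Gateway API or kgateway today; route and service names are illustrative:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: checkout                 # hypothetical route
spec:
  parentRefs:
  - name: example-gateway        # hypothetical gateway
  rules:
  - backendRefs:
    - name: checkout-primary     # primary: receives all traffic while healthy
      port: 8080
    - name: checkout-fallback    # fallback: only used when primary degrades
      port: 8080
      fallback: true             # hypothetical new extension field
```

Note that since weight is the existing semantics for multiple backendRefs, this option would need clear validation rules for how fallback interacts with weight on the same list.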
Considerations
- Weighted routing and failover are mutually exclusive semantics for the same backendRefs list — the API should make this clear.
- Both primary and fallback backends share the same Envoy cluster, so they share cluster-level settings (timeouts, circuit breakers, LB algorithm). This is a limitation of the priority-based approach, and our choice of API should respect this constraint as well.
- Users should be strongly encouraged to configure health checks or outlier detection alongside failover, otherwise there's no mechanism to detect backend failure.
- The Gateway API spec does not currently have a standard for this pattern, so this would be a kgateway extension. Maybe a good topic for a GEP?