Support active/passive failover between backend services #13643

@josh-pritchard

Description

Add support for active/passive (primary/fallback) failover between backend services. Users should be able to designate one backend as primary and another as a fallback that only receives traffic when the primary's health degrades, with automatic recovery back to the primary when it becomes healthy again.
This has been requested in #13507.

Motivation

Currently, when an HTTPRoute references multiple backendRefs, they are translated as weighted clusters — traffic is split proportionally. There is no way to express "only use backend B if backend A is unhealthy."

Proposed Approach

Use Envoy's priority-based load balancing. Endpoints from both the primary and fallback backends would be placed in the same Envoy cluster but in different LocalityLbEndpoints groups with different priority levels:

  • Primary backend endpoints → priority: 0
  • Fallback backend endpoints → priority: 1

Envoy's built-in priority load balancing handles the rest: traffic goes to priority-0 endpoints first and only spills over to priority-1 when the health of priority-0 drops below a threshold. The threshold is controlled by the overprovisioning factor: at the default of 1.4, spillover begins once fewer than ~72% of priority-0 endpoints are healthy, and the share sent to priority-1 grows as priority-0 health falls further.
Health checks (active, or passive via outlier detection) are required for failover detection and automatic recovery.
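To make the spillover threshold concrete, here is a rough Python sketch of the two-step calculation described in Envoy's priority-level documentation. The function names are made up for illustration, the overprovisioning factor is expressed in percent (140 for 1.4), and degraded endpoints, panic mode, and rounding remainders are deliberately not modeled.

```python
def level_health(healthy: int, total: int, overprovisioning_factor: int = 140) -> int:
    """Effective health (0-100) of one priority level.

    The factor is in percent, so 140 corresponds to 1.4.
    """
    if total == 0:
        return 0
    return min(100, healthy * overprovisioning_factor // total)


def priority_loads(healths: list) -> list:
    """Percent of traffic per priority level, priority 0 first."""
    normalized_total = min(100, sum(healths))
    if normalized_total == 0:
        # Every level is unhealthy; Envoy's panic handling is not modeled here.
        return [0] * len(healths)
    loads = []
    remaining = 100
    for health in healths:
        # Each level takes what its health allows, capped by what is left.
        load = min(remaining, health * 100 // normalized_total)
        loads.append(load)
        remaining -= load
    return loads


if __name__ == "__main__":
    fallback_health = level_health(100, 100)  # fallback assumed fully healthy
    for healthy in (100, 72, 71, 50, 0):
        primary_health = level_health(healthy, 100)
        print(healthy, priority_loads([primary_health, fallback_health]))
```

Under these assumptions, 72 healthy primary endpoints out of 100 still keep all traffic on the primary, 71 out of 100 gives a 99/1 split, and 50 out of 100 gives 70/30, which shows that failover is gradual rather than an all-or-nothing switch.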

API Options

A few options for how to expose this in the kgateway API:

  1. Annotation or field on backendRefs: e.g., a fallback: true annotation or a new field on the backendRef to indicate that it is a fallback destination. This works for both Backends and Kubernetes resources.
  2. New policy CRD field: e.g., a field on TrafficPolicy or BackendConfigPolicy that designates failover relationships between services. A policy can apply to both Backends and Kubernetes resources.
  3. Extension to the Backend CRD: a fallback field on the Backend resource itself. The problem is that we might need to add a Kubernetes type to the Backend for this to work, and that would change the UX users already have when Kubernetes resources are referenced directly in routes.
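As an illustration of how option 1 could map onto the proposed approach, the sketch below groups already-resolved endpoints into one LocalityLbEndpoints entry per priority level. The input shape (a list of dicts with a `fallback` flag and an `endpoints` list) and the function name are hypothetical and not part of any existing kgateway API; the output dict mirrors the field names of Envoy's ClusterLoadAssignment. It is written in Python only for brevity.

```python
def build_load_assignment(cluster_name, backend_refs):
    """Group endpoints by priority: primaries at 0, fallbacks at 1.

    backend_refs is a hypothetical, already-resolved structure:
    [{"name": str, "fallback": bool, "endpoints": [(host, port), ...]}, ...]
    """
    if all(ref.get("fallback", False) for ref in backend_refs):
        # Envoy expects priorities to start at 0 without gaps, so a
        # fallback-only list has no valid translation in this sketch.
        raise ValueError("at least one primary backendRef is required")

    groups = {0: [], 1: []}
    for ref in backend_refs:
        priority = 1 if ref.get("fallback", False) else 0
        for host, port in ref["endpoints"]:
            groups[priority].append(
                {
                    "endpoint": {
                        "address": {
                            "socket_address": {"address": host, "port_value": port}
                        }
                    }
                }
            )

    localities = [{"priority": 0, "lb_endpoints": groups[0]}]
    if groups[1]:
        localities.append({"priority": 1, "lb_endpoints": groups[1]})
    return {"cluster_name": cluster_name, "endpoints": localities}


if __name__ == "__main__":
    import json

    refs = [
        {"name": "primary-svc", "endpoints": [("10.0.0.1", 8080), ("10.0.0.2", 8080)]},
        {"name": "fallback-svc", "fallback": True, "endpoints": [("10.0.1.1", 8080)]},
    ]
    print(json.dumps(build_load_assignment("example-cluster", refs), indent=2))
```

The same grouping would apply to options 2 and 3; only the place where the primary/fallback designation is read from would differ.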

Considerations

  • Weighted routing and failover are mutually exclusive semantics for the same backendRefs list — the API should make this clear.
  • Both primary and fallback backends share the same Envoy cluster, so they share cluster-level settings (timeouts, circuit breakers, LB algorithm). This is a limitation of the priority-based approach, and our choice of API should respect it.
  • Users should be strongly encouraged to configure health checks or outlier detection alongside failover; otherwise, there is no mechanism to detect backend failure.
  • The Gateway API spec does not currently have a standard for this pattern, so this would be a kgateway extension. It may be a good topic for a GEP.
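One way the first consideration could be enforced is a validation step that rejects a backendRefs list mixing explicit weights with failover. The sketch below assumes the hypothetical `fallback` flag from option 1 and treats a `weight` of `None` as "not set"; neither the function nor the error messages exist in kgateway today.

```python
def validate_backend_refs(backend_refs):
    """Return a list of error strings; an empty list means the refs are acceptable."""
    errors = []
    fallbacks = [ref for ref in backend_refs if ref.get("fallback", False)]
    if not fallbacks:
        # Plain weighted routing: nothing failover-specific to check.
        return errors

    if len(fallbacks) == len(backend_refs):
        errors.append("at least one backendRef must be a primary (not marked fallback)")

    for ref in backend_refs:
        # Failover and weighted splitting are treated as mutually exclusive.
        if ref.get("weight") is not None:
            errors.append(
                f"backendRef {ref['name']!r} sets weight, which cannot be combined with fallback"
            )
    return errors


if __name__ == "__main__":
    refs = [
        {"name": "primary-svc", "weight": 80},
        {"name": "fallback-svc", "fallback": True},
    ]
    for problem in validate_backend_refs(refs):
        print(problem)
```

A similar check could emit a warning when failover is requested but no health check or outlier detection is configured for the backends involved.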
