
RFC: Circuit Breaker with Fallback #8257

@leandrodamascena


⚠️ To GenAI bots and contributors: Please do not implement this feature without proper discussion first. This is a design proposal under review, not an approved spec. PRs submitted without prior discussion will be closed.

Is this related to an existing feature request or issue?

Circuit Breaker

Which Powertools for AWS Lambda (Python) utility does this relate to?

Other

Summary

A circuit breaker utility that stops sending traffic to an unhealthy downstream and, instead of dropping the request, hands the payload to a fallback (S3, SQS, or your own callable) so nothing is lost. It keeps shared circuit state in a dedicated persistence layer modeled on Idempotency's (see State Store), keeps the failure counter in memory so a healthy circuit performs no writes, and exposes an explicit half-open probe to test recovery.

The circuit breaker pattern is well-established, so we should be explicit about what existing approaches don't cover for Lambda.

  • AWS SDK retries / token buckets: The Amazon Builders' Library is deliberately skeptical of circuit breakers (they "introduce modal behavior that can be difficult to test") and prefers a local token bucket (retry budget) that throttles retries to a fixed rate. This is great for protecting a single client→service hop, but it does not buffer the payload: when the budget is exhausted, the request still fails and the data is lost. Our value-add is the fallback, not just the tripping.
  • AWS Prescriptive Guidance / Compute Blog (Step Functions + DynamoDB): AWS's reference implementation externalizes circuit state in a CircuitStatus DynamoDB table and uses TTL-based expiry instead of a true half-open state. The blog itself admits two trade-offs: DynamoDB TTL deletion is not instantaneous (stale OPEN records linger), and there is no gradual traffic restoration. We improve on this with an explicit half-open probe.
  • pybreaker / resilience4j: Mature in-process breakers, but they assume a long-lived process and in-memory state. That is a poor fit for standard Lambda, where each execution environment is short-lived and state must be shared across environments.

Where we differentiate: (1) a managed fallback that buffers the payload instead of dropping it, and (2) explicit half-open probing rather than blind TTL expiry. If a customer only needs to protect a downstream and is fine dropping requests, they should use a retry budget instead.

Use case

Lambda functions calling downstream services that can't scale or have outages need a way to:

  1. Stop sending traffic to an unhealthy backend (protect the downstream)
  2. Not lose messages when the backend is unavailable (protect the data)

Today, there's no managed circuit breaker for Lambda. Customers either build their own or let the backend get overwhelmed during incidents.

Proposal

Developer Experience

The common case should be a decorator. You name the circuit, point it at a state store and a fallback, and wrap the function that calls the downstream:

from aws_lambda_powertools.utilities.circuit_breaker import (
    CircuitBreaker,
    CircuitBreakerConfig,
)
from aws_lambda_powertools.utilities.circuit_breaker.fallbacks import S3Fallback
from aws_lambda_powertools.utilities.circuit_breaker.persistence import DynamoDBPersistenceLayer

persistence = DynamoDBPersistenceLayer(table_name="CircuitBreakerState")

config = CircuitBreakerConfig(
    failure_threshold=5,          # consecutive failures before opening
    recovery_timeout=30,          # seconds in OPEN before a half-open probe
    # handled_exceptions defaults to (Exception,): any error counts as a failure.
    # Narrow it when only some errors signal an unhealthy downstream:
    handled_exceptions=(TimeoutError, ConnectionError),
)

breaker = CircuitBreaker(
    name="payment-backend",
    persistence_store=persistence,
    fallback=S3Fallback(bucket="payment-overflow"),
    config=config,
)

@breaker
def charge(order: dict) -> dict:
    return payment_api.charge(order)   # the protected call


def handler(event, context):
    # When the circuit is OPEN, `charge` never runs: the payload goes to S3
    # and a CircuitBreakerFallbackResponse comes back instead.
    return charge(event)

For callers that need to react when the payload was buffered instead of delivered, the return value is inspectable:

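result = charge(event)
# On the normal path `charge` returns the downstream's dict, which has no
# `served_by_fallback` attribute, so check defensively.
if getattr(result, "served_by_fallback", False):
    logger.info("payment buffered to S3", circuit=result.circuit_name)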

Flow

Circuit States

stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN: N consecutive failures
    OPEN --> HALF_OPEN: recovery timeout elapsed
    HALF_OPEN --> CLOSED: probe succeeds
    HALF_OPEN --> OPEN: probe fails
  • CLOSED: normal operation. Requests go to the downstream. Failures are counted.
  • OPEN: downstream is unhealthy. Requests go straight to the fallback. No traffic hits the backend.
  • HALF_OPEN: testing recovery. One request is allowed through. If it succeeds, the circuit closes. If it fails, it reopens.
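
The state diagram can be written down as a small transition table. This is a minimal sketch, not the proposed implementation: `CircuitState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical names, and the manual force open / force close described under Operational Controls is deliberately left out.

```python
from enum import Enum


class CircuitState(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


# Edges of the state diagram: each state maps to the states it may move to.
ALLOWED_TRANSITIONS = {
    CircuitState.CLOSED: {CircuitState.OPEN},
    CircuitState.OPEN: {CircuitState.HALF_OPEN},
    CircuitState.HALF_OPEN: {CircuitState.CLOSED, CircuitState.OPEN},
}


def transition(current: CircuitState, target: CircuitState) -> CircuitState:
    """Return the target state, or raise if the diagram has no such edge."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Note that OPEN never goes straight back to CLOSED in this table: recovery always passes through a HALF_OPEN probe.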

What triggers the circuit to open?

Consecutive failures. If N requests in a row fail with a trackable exception (connection error, timeout, 5xx), the circuit opens. We avoid sliding time windows to keep the implementation simple and predictable.

Why consecutive and not time-based: it's predictable and needs no window bookkeeping. If the backend is actually down, you'll hit the threshold in a handful of invocations anyway.

Trade-off we're accepting: Martin Fowler describes, and resilience4j supports, error-rate thresholds (e.g., open at 50% failures over a rolling window). These catch a degraded-but-not-dead backend that a consecutive counter would miss, because a single intermittent success resets the counter. We start with consecutive failures for v1 (predictable, no window bookkeeping) and leave rate-based thresholds as a future failure_rate_threshold option.

Which exceptions count. By default, any exception counts as a failure. That's the least surprising behavior and what pybreaker does. But not every error means the downstream is unhealthy: a 400 is the caller's fault, a 503 is not. If those "caller errors" count toward the threshold, the circuit opens for the wrong reason. So we let the customer scope it from either side:

  • handled_exceptions (allowlist): only these count (e.g., (TimeoutError, ConnectionError)). Everything else propagates normally and does not trip the circuit.
  • ignored_exceptions (denylist): everything counts except these (e.g., ignore ValidationError). Handy when failures are the norm and only a few are benign.

Passing both is a config error. An exception that doesn't count as a failure is simply re-raised to the caller, so the circuit breaker stays out of the way.
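
The scoping rules above could be expressed roughly as follows. This is a sketch under assumptions: `validate_scoping` and `counts_as_failure` are hypothetical helper names, and raising `ValueError` for the config error is one possible choice, not a decided behavior.

```python
from typing import Optional, Tuple, Type

ExcTypes = Tuple[Type[BaseException], ...]


def validate_scoping(handled: Optional[ExcTypes], ignored: Optional[ExcTypes]) -> None:
    """Reject configurations that set both the allowlist and the denylist."""
    if handled is not None and ignored is not None:
        raise ValueError("pass handled_exceptions or ignored_exceptions, not both")


def counts_as_failure(
    exc: BaseException,
    handled: Optional[ExcTypes] = None,
    ignored: Optional[ExcTypes] = None,
) -> bool:
    """Decide whether an exception should move the circuit toward OPEN."""
    validate_scoping(handled, ignored)
    if handled is not None:
        return isinstance(exc, handled)      # allowlist: only these count
    if ignored is not None:
        return not isinstance(exc, ignored)  # denylist: everything else counts
    return isinstance(exc, Exception)        # default: any exception counts
```

An exception for which `counts_as_failure` returns False would simply be re-raised to the caller without touching the counter.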

What triggers the circuit to close?

A successful request during half-open state. After a configurable recovery timeout (e.g., 30 seconds), the circuit moves to half-open and allows exactly one request to pass through. If the downstream responds successfully, the circuit closes and normal traffic resumes.

What happens when the circuit is open?

The fallback executes. The payload is stored somewhere safe (S3, SQS, or a custom handler). The caller receives a success response: from the caller's perspective, the message was accepted. The difference is where it landed.

State Coordination Across Environments

Each Lambda execution environment handles one request at a time, and a function scales out to many environments. Circuit state therefore has to be shared, not in-process. The naive way to do this is "read the state and update the failure counter on every invocation", but that means a DynamoDB write on essentially every call, which adds cost (~2 WCU/call) and latency (~5-10 ms) to the happy path, where the circuit is healthy and we want it to be invisible.

We avoid that by splitting state into two things that are managed differently:

Failure counter: local, in-memory

The count of consecutive failures lives in memory, per execution environment:

  • Success → reset the local counter. No write.
  • Failure → increment the local counter. No write until it hits the threshold.
  • Only when an environment reaches N consecutive failures does it persist OPEN to the store.

So writes are O(state transitions), not O(invocations). A circuit that stays healthy writes nothing. You only pay for writes during an actual incident, which is exactly when you want to.

| Transition | Writes |
| --- | --- |
| Healthy operation (CLOSED, no failures) | 0 |
| CLOSED → OPEN | 1 (the env that trips) |
| OPEN → HALF_OPEN | 1 (conditional write = the probe lock) |
| HALF_OPEN → CLOSED / OPEN | 1 |
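
The counter rules above can be sketched as a small per-environment object. `LocalFailureCounter` and its `persist_open` callback are hypothetical names; the callback stands in for the single write that persists OPEN to the shared store. Resetting the counter after the write is an assumption made here to avoid repeated writes.

```python
from typing import Callable


class LocalFailureCounter:
    """Counts consecutive failures in memory; writes only when the threshold is hit."""

    def __init__(self, threshold: int, persist_open: Callable[[int], None]) -> None:
        self.threshold = threshold
        self.persist_open = persist_open  # called with the failure count that tripped
        self.consecutive_failures = 0

    def record_success(self) -> None:
        # Success resets the streak. No write.
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        # Failure extends the streak. No write until the threshold is reached.
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold:
            self.persist_open(self.consecutive_failures)
            self.consecutive_failures = 0  # assumption: start fresh after tripping
```

With a threshold of 5, four failures followed by a success and four more failures produce no write; only the fifth failure in a row does.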

Circuit state: persisted, cached on read

The OPEN / HALF_OPEN / CLOSED flag is the shared truth and lives in the store (DynamoDB or Redis/Valkey, see State Store below). To avoid a read per invocation:

  • Local cache with TTL (reusing the LRUDict from shared/): each environment reads the shared state at most once per cache TTL (default 5s, see Defaults & Decisions), not per call.
  • Reads can be eventually consistent (half the cost). Tolerating state that's a few seconds stale is the same trade-off the cache already makes.
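
A TTL-cached read might look like the sketch below. It uses a plain dict instead of LRUDict to stay self-contained, and `CachedStateReader`, `fetch_state`, and the injectable `clock` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class CachedStateReader:
    """Reads circuit state from the shared store at most once per TTL."""

    def __init__(
        self,
        fetch_state: Callable[[str], str],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_state = fetch_state  # e.g. an eventually consistent store read
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (state, fetched_at)

    def get_state(self, circuit_name: str) -> str:
        now = self._clock()
        cached = self._cache.get(circuit_name)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # still fresh: no store read
        state = self._fetch_state(circuit_name)
        self._cache[circuit_name] = (state, now)
        return state
```

Injecting the clock keeps the staleness window easy to test without sleeping.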

The trade-off we accept

The counter is per-environment, not aggregated. With many environments and a threshold of N, the backend may absorb more than N failures before every environment trips. We accept this because:

  • If the backend is genuinely down, each environment reaches N failures within its next N invocations anyway.
  • The first environment to trip persists OPEN, and every other environment honors it on its next cached read, so one environment's detection protects the rest without each having to see N failures itself.

This is how in-process breakers like resilience4j behave per instance; the shared store turns "per instance" into "first instance protects all."

Distributed half-open, anchored recovery

  • Half-open coordination is distributed: when the recovery timeout expires, multiple environments may attempt the probe simultaneously. A DynamoDB conditional write elects exactly one: first wins, the rest go to fallback.
  • Recovery timeout is anchored, not sliding: AWS Prescriptive Guidance warns that with multiple concurrent callers, the first failure must define the recovery window. Later failures while OPEN must not keep pushing opened_at forward, or the circuit never reaches half-open. We compute the half-open transition from a fixed opened_at, and only reset it on a confirmed state change.
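
Both rules can be illustrated with an in-memory stand-in for the shared store. This is a sketch only: `InMemoryStateStore` and `StoredCircuit` are hypothetical, and the `threading.Lock` merely imitates the atomicity that a conditional write would provide in a real backend.

```python
import threading
from dataclasses import dataclass
from typing import Dict


@dataclass
class StoredCircuit:
    state: str
    opened_at: float  # epoch seconds; fixed while the circuit stays OPEN


class InMemoryStateStore:
    """Hypothetical stand-in for the shared store."""

    def __init__(self) -> None:
        self._records: Dict[str, StoredCircuit] = {}
        self._lock = threading.Lock()  # imitates an atomic conditional write

    def open_circuit(self, name: str, now: float) -> bool:
        """Persist OPEN unless the circuit is already OPEN (anchored opened_at)."""
        with self._lock:
            record = self._records.get(name)
            if record is not None and record.state == "OPEN":
                return False  # a later failure must not push opened_at forward
            self._records[name] = StoredCircuit(state="OPEN", opened_at=now)
            return True

    def try_acquire_probe(self, name: str, now: float, recovery_timeout: float) -> bool:
        """Move OPEN -> HALF_OPEN for exactly one caller once the timeout elapsed."""
        with self._lock:
            record = self._records.get(name)
            if record is None or record.state != "OPEN":
                return False  # not OPEN: unknown, or another caller won the probe
            if now - record.opened_at < recovery_timeout:
                return False  # still inside the recovery window
            record.state = "HALF_OPEN"
            return True

    def opened_at(self, name: str) -> float:
        return self._records[name].opened_at
```

The caller that gets True from `try_acquire_probe` sends the probe; every other caller keeps going to the fallback. A failed probe would call `open_circuit` again, which is a confirmed state change and therefore the one place where `opened_at` is reset.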

Fallback

Responsibility

The fallback has one job: store the payload somewhere safe. No retry, no replay, no recovery logic. The customer owns what happens after the payload is buffered.

Built-in fallbacks

  • S3Fallback: for payloads of any size. Solves the 256KB SQS limit and 5MB IoT Core payload problem.
  • SQSFallback: for smaller payloads where queue semantics are preferred.
  • CustomFallback: bring your own handler (callable).

What the caller sees

When the fallback executes, the decorator returns a CircuitBreakerFallbackResponse (or a custom response the user defines). The caller should treat this as success: the message is safe, it just didn't reach the backend yet.

For fire-and-forget patterns (IoT devices sending telemetry), this is transparent. For request-response patterns (API calls), the caller might want to know, so the response is inspectable via served_by_fallback (see Developer Experience above).
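
To make the contract concrete, here is a minimal sketch of a custom fallback path. `FallbackResponse` is a hypothetical stand-in for CircuitBreakerFallbackResponse (only the `served_by_fallback` and `circuit_name` attributes come from this proposal), and `run_fallback` is an illustrative helper, not a proposed API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class FallbackResponse:
    """Stand-in for CircuitBreakerFallbackResponse."""

    circuit_name: str
    served_by_fallback: bool = True


def run_fallback(
    circuit_name: str,
    payload: Any,
    store: Callable[[Any], None],
) -> FallbackResponse:
    """Store the payload, then report success to the caller."""
    # If storing raises, the error propagates: the payload was NOT buffered,
    # so the caller must not be told it is safe.
    store(payload)
    return FallbackResponse(circuit_name=circuit_name)
```

In this sketch any callable that accepts the payload can act as the store, for example `buffered.append` on a list in a unit test.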

State Store

The circuit breaker needs a shared, lockable key/value store keyed by circuit name. The obvious idea is to reuse Idempotency's BasePersistenceLayer, but reading the code, it doesn't fit directly:

  • DataRecord.status is a closed enum. It raises IdempotencyInvalidStatusError on any value outside INPROGRESS / COMPLETED / EXPIRED, so we can't store OPEN / HALF_OPEN / CLOSED in it.
  • The public API is payload-keyed. save_success / save_inprogress / get_record all derive the key by hashing the event via jmespath. Our key is the circuit name, not a payload hash.
  • The conditional write is idempotency-specific. DynamoDBPersistenceLayer._put_record hardcodes a condition expression around INPROGRESS and in_progress_expiry, not the condition we need for a half-open lock.

BasePersistenceLayer is also a public extension point (customers subclass it), so reshaping it is a breaking change for them and risks destabilizing one of the most-used utilities.

Decision: dedicated persistence layer, shared patterns

We build a CircuitBreakerPersistenceLayer (its own small ABC + DynamoDBPersistenceLayer / CachePersistenceLayer implementations) that mirrors Idempotency's proven patterns without coupling to it:

  • Conditional PutItem for the half-open probe lock: the same atomic "first writer wins, others fall through" technique Idempotency uses, but with our own condition expression.
  • LRUDict from aws_lambda_powertools.shared for the local read cache. This is already generic (not idempotency-specific), so we reuse it as-is.
  • DynamoDB and Redis/Valkey backends, so the customer's choice of store matches the rest of Powertools.

A single record per circuit:

| Field | Description |
| --- | --- |
| key (PK) | Circuit name (e.g., payment-backend) |
| state | CLOSED, OPEN, HALF_OPEN |
| failure_count | Consecutive failures recorded by the env that tripped |
| opened_at | When the circuit opened (drives the recovery timeout) |
| half_open_lock | Atomic probe lock (conditional write) |
| expiry | TTL attribute, auto-expire stale records |
Future consolidation

Once both this layer and Idempotency's exist side by side, the genuinely shared base (a generic locked key/value store with a TTL cache, no status enum or payload hashing) becomes clear and can be extracted as a non-breaking refactor. We deliberately don't attempt that extraction up front: doing it before the second implementation exists is guesswork, and it would mean editing a stable public API to enable a feature that isn't built yet.

Operational Controls

Both Martin Fowler and AWS Prescriptive Guidance call these out as non-negotiable for a production circuit breaker:

  • Manual force open / force close: operators must be able to trip a circuit (e.g., to drain a backend for maintenance) or force it closed (e.g., after a confirmed fix, without waiting for the recovery timeout). Since state lives in the persistence layer, this can be done out-of-band by writing the record, so we should document the operation and consider a small CLI/helper. A forced state should be sticky (not auto-overridden by the next failure/success) until explicitly cleared.
  • Log every state transition: CLOSED→OPEN, OPEN→HALF_OPEN, HALF_OPEN→CLOSED/OPEN must be logged with the circuit name, failure count, and trigger. Wire this through Powertools Logger so it lands in structured logs automatically.
  • Listeners / hooks: mirror pybreaker's CircuitBreakerListener (on_state_change, on_failure, on_success) so customers can emit their own metrics or alerts on transitions.
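
A listener interface along these lines might look as follows. The hook names come from the bullet above; the signatures, `TransitionRecorder`, and `notify_state_change` are assumptions made for illustration and are not pybreaker's or Powertools' actual API.

```python
from typing import Iterable, List, Tuple


class CircuitBreakerListener:
    """Base class with no-op hooks; subclasses override what they need."""

    def on_state_change(self, circuit_name: str, old_state: str, new_state: str) -> None:
        pass

    def on_failure(self, circuit_name: str, exc: BaseException) -> None:
        pass

    def on_success(self, circuit_name: str) -> None:
        pass


class TransitionRecorder(CircuitBreakerListener):
    """Example listener that collects transitions, e.g. to emit metrics later."""

    def __init__(self) -> None:
        self.transitions: List[Tuple[str, str, str]] = []

    def on_state_change(self, circuit_name: str, old_state: str, new_state: str) -> None:
        self.transitions.append((circuit_name, old_state, new_state))


def notify_state_change(
    listeners: Iterable[CircuitBreakerListener],
    circuit_name: str,
    old_state: str,
    new_state: str,
) -> None:
    """Fan a transition out to every registered listener."""
    for listener in listeners:
        listener.on_state_change(circuit_name, old_state, new_state)
```

Default logging and metrics could then be shipped as ordinary listeners, which is what would let customers opt out or redirect them.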

Defaults & Decisions

  • Local cache TTL: default 5s. A longer TTL means fewer state-store reads (cheaper, faster) but a wider window where an environment acts on stale state after it changed elsewhere. We match the Parameters utility default (POWERTOOLS_PARAMETERS_MAX_AGE = 5) for consistency; it's configurable.
  • Metrics: emit on state change, via listeners. Reuse Powertools Metrics (EMF) with a default namespace, wired through the listener hooks from Operational Controls so customers can opt out or redirect. State transitions also go through Powertools Logger.

Out of scope

  • Replay/recovery: the customer handles this. We provide documentation and examples (EventBridge schedule, S3 notifications, etc.)
  • Rate limiting/throttling: different pattern, different utility
  • Retry with backoff: already exists in AWS SDK and Powertools Retry. Circuit breaker kicks in AFTER retries fail.

Potential challenges

Open Questions

  1. Failure counting: per circuit or per endpoint? Each name is its own circuit, so a function calling 3 backends gets 3 circuits and the customer picks granularity by naming. The unresolved case: one backend with multiple endpoints where only one is failing. Do we leave that to the customer (name a circuit per endpoint), or offer sub-circuit keying? Leaning toward the former for v1, but want input.

Future Considerations

  • Idempotency keys on replay: if a buffered payload is later reprocessed (replay is Out of Scope, but customers will build it), idempotency keys matter. Should the circuit breaker stamp one at buffer time so the downstream replay is safe?
  • Extracting the shared persistence base: we ship a dedicated layer now (see State Store) and consolidate with Idempotency later. Trigger to revisit: a third store backend, or the refactor surfacing naturally once both layers are in tree.

Dependencies and Integrations

No response

Alternative solutions

Acknowledgment
