RFC: Circuit Breaker with Fallback #8257

> **⚠️ To GenAI bots and contributors:** Please do not implement this feature without proper discussion first. This is a design proposal under review, not an approved spec. PRs submitted without prior discussion will be closed.

### Is this related to an existing feature request or issue?

Circuit Breaker

### Which Powertools for AWS Lambda (Python) utility does this relate to?

Other

### Summary

A circuit breaker utility that stops sending traffic to an unhealthy downstream and, instead of dropping the request, hands the payload to a *fallback* (S3, SQS, or your own callable) so nothing is lost. It keeps shared state in a dedicated persistence layer that mirrors the Idempotency utility's patterns (see State Store), keeps the failure counter in memory so a healthy circuit costs no state-store writes, and exposes an explicit half-open probe to test recovery.

The circuit breaker pattern is well-established, so we should be explicit about what existing approaches don't cover for Lambda.

- **AWS SDK retries / token buckets**: The [AWS Builders' Library](https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/) is deliberately skeptical of circuit breakers (they "introduce modal behavior that can be difficult to test") and prefers a local *token bucket* (retry budget) that throttles retries to a fixed rate. This is great for protecting a single client→service hop, but it does **not** buffer the payload: when the budget is exhausted, the request still fails and the data is lost. Our value-add is the **fallback**, not just the tripping.
- **AWS Prescriptive Guidance / Compute Blog (Step Functions + DynamoDB)**: AWS's reference implementation externalizes circuit state in a `CircuitStatus` DynamoDB table and uses **TTL-based expiry instead of a true half-open state**. The blog itself admits two trade-offs: DynamoDB TTL deletion is not instantaneous (stale OPEN records linger), and there is no gradual traffic restoration. We improve on this with an explicit half-open probe.
- **pybreaker / resilience4j**: Mature in-process breakers, but they assume a long-lived process and in-memory state. That is a poor fit for standard Lambda, where each execution environment is short-lived and state must be shared across environments.

**Where we differentiate:** (1) a managed **fallback** that buffers the payload instead of dropping it, and (2) explicit **half-open** probing rather than blind TTL expiry. If a customer only needs to protect a downstream and is fine dropping requests, they should use a retry budget instead.


### Use case

Lambda functions calling downstream services that can't scale or have outages need a way to:

1. Stop sending traffic to an unhealthy backend (protect the downstream)
2. Not lose messages when the backend is unavailable (protect the data)

Today, there's no managed circuit breaker for Lambda. Customers either build their own or let the backend get overwhelmed during incidents.

### Proposal

## Developer Experience

The common case should be a decorator. You name the circuit, point it at a state store and a fallback, and wrap the function that calls the downstream:

```python
from aws_lambda_powertools.utilities.circuit_breaker import (
    CircuitBreaker,
    CircuitBreakerConfig,
)
from aws_lambda_powertools.utilities.circuit_breaker.fallbacks import S3Fallback
from aws_lambda_powertools.utilities.circuit_breaker.persistence import DynamoDBPersistenceLayer

persistence = DynamoDBPersistenceLayer(table_name="CircuitBreakerState")

config = CircuitBreakerConfig(
    failure_threshold=5,          # consecutive failures before opening
    recovery_timeout=30,          # seconds in OPEN before a half-open probe
    # handled_exceptions defaults to (Exception,): any error counts as a failure.
    # Narrow it when only some errors signal an unhealthy downstream:
    handled_exceptions=(TimeoutError, ConnectionError),
)

breaker = CircuitBreaker(
    name="payment-backend",
    persistence_store=persistence,
    fallback=S3Fallback(bucket="payment-overflow"),
    config=config,
)

@breaker
def charge(order: dict) -> dict:
    return payment_api.charge(order)   # the protected call


def handler(event, context):
    # When the circuit is OPEN, `charge` never runs: the payload goes to S3
    # and a CircuitBreakerFallbackResponse comes back instead.
    return charge(event)
```

For callers that need to react when the payload was buffered instead of delivered, the return value is inspectable:

```python
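result = charge(event)
# On the normal path `charge` returns the downstream's dict, which has no
# `served_by_fallback` attribute, so read the flag defensively.
if getattr(result, "served_by_fallback", False):
    logger.info("payment buffered to S3", circuit=result.circuit_name)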
```

## Flow

### Circuit States

```mermaid
stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN: N consecutive failures
    OPEN --> HALF_OPEN: recovery timeout elapsed
    HALF_OPEN --> CLOSED: probe succeeds
    HALF_OPEN --> OPEN: probe fails
```

- **CLOSED**: normal operation. Requests go to the downstream. Failures are counted.
- **OPEN**: downstream is unhealthy. Requests go straight to the fallback. No traffic hits the backend.
- **HALF_OPEN**: testing recovery. One request is allowed through. If it succeeds, the circuit closes. If it fails, it reopens.
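The state diagram above can be written down as a small transition table. This is a minimal sketch, not the proposed implementation: `CircuitState`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical names, and the manual overrides described under Operational Controls are not modeled.

```python
from enum import Enum


class CircuitState(str, Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


# Edges copied from the state diagram; anything else is rejected.
ALLOWED_TRANSITIONS = {
    CircuitState.CLOSED: {CircuitState.OPEN},
    CircuitState.OPEN: {CircuitState.HALF_OPEN},
    CircuitState.HALF_OPEN: {CircuitState.CLOSED, CircuitState.OPEN},
}


def transition(current: CircuitState, target: CircuitState) -> CircuitState:
    """Return the new state, or raise if the diagram has no such edge."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting unknown edges up front means a bug cannot, for example, move a circuit from CLOSED straight to HALF_OPEN without ever opening.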

### What triggers the circuit to open?

Consecutive failures. If N requests in a row fail with a trackable exception (connection error, timeout, 5xx), the circuit opens.

Why consecutive and not time-based: it's predictable and needs no window bookkeeping. If the backend is actually down, you'll hit the threshold in a handful of invocations anyway.

Trade-off we're accepting: Martin Fowler describes, and resilience4j supports, **error-rate** thresholds (e.g., open at 50% failures over a rolling window), which catch a degraded-but-not-dead backend that a consecutive counter would miss. We start with consecutive failures for v1 and leave rate-based thresholds as a future `failure_rate_threshold` option.

**Which exceptions count.** By default, **any exception** counts as a failure. That's the least surprising behavior and what pybreaker does. But not every error means the downstream is unhealthy: a `400` is the caller's fault, a `503` is not. If those "caller errors" count toward the threshold, the circuit opens for the wrong reason. So we let the customer scope it from either side:

- `handled_exceptions` (allowlist): only these count (e.g., `(TimeoutError, ConnectionError)`). Everything else propagates normally and does **not** trip the circuit.
- `ignored_exceptions` (denylist): everything counts *except* these (e.g., ignore `ValidationError`). Handy when most exceptions do signal an unhealthy downstream and only a few are benign.

Passing both is a config error. An exception that doesn't count as a failure is simply re-raised to the caller, so the circuit breaker stays out of the way.
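The scoping rules above could be captured in one predicate. This is a sketch under assumptions: `counts_as_failure` is a hypothetical helper, and raising `ValueError` for the "both passed" case is an illustrative choice, since the RFC only says it is a config error.

```python
from typing import Optional, Tuple, Type

ExceptionTypes = Tuple[Type[BaseException], ...]


def counts_as_failure(
    exc: BaseException,
    handled_exceptions: Optional[ExceptionTypes] = None,
    ignored_exceptions: Optional[ExceptionTypes] = None,
) -> bool:
    """Decide whether `exc` should increment the consecutive-failure counter."""
    if handled_exceptions is not None and ignored_exceptions is not None:
        raise ValueError("pass handled_exceptions or ignored_exceptions, not both")
    if handled_exceptions is not None:
        # Allowlist: only the listed types count.
        return isinstance(exc, handled_exceptions)
    if ignored_exceptions is not None:
        # Denylist: every ordinary exception counts except the listed types.
        return isinstance(exc, Exception) and not isinstance(exc, ignored_exceptions)
    # Default: any exception counts, matching handled_exceptions=(Exception,).
    return isinstance(exc, Exception)
```

When the predicate returns `False`, the breaker would simply re-raise the exception without touching the counter.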

### What triggers the circuit to close?

A successful request during half-open state. After a configurable recovery timeout (e.g., 30 seconds), the circuit moves to half-open and allows exactly one request to pass through. If the downstream responds successfully, the circuit closes and normal traffic resumes.

### What happens when the circuit is open?

The fallback executes. The payload is stored somewhere safe (S3, SQS, or a custom handler). The caller receives a success response: from the caller's perspective, the message was accepted. The difference is where it landed.

## State Coordination Across Environments

Each Lambda execution environment handles one request at a time, and a function scales out to many environments. Circuit state therefore has to be shared, not in-process. The naive way to do this is "read the state and update the failure counter on every invocation", but that means a DynamoDB write on essentially every call, which adds cost (~2 WCU/call) and latency (~5-10 ms) to the happy path, where the circuit is healthy and we want it to be invisible.

We avoid that by splitting state into two things that are managed differently:

### Failure counter: local, in-memory

The count of *consecutive failures* lives in memory, per execution environment:

- **Success** → reset the local counter. No write.
- **Failure** → increment the local counter. No write until it hits the threshold.
- Only when an environment reaches N consecutive failures does it persist `OPEN` to the store.

So **writes are O(state transitions), not O(invocations)**. A circuit that stays healthy writes nothing. You only pay during an actual incident, which is exactly when you want to.

| Event | State-store writes |
|---|---|
| Healthy operation (CLOSED, no failures) | **0** |
| CLOSED → OPEN | 1 (the env that trips) |
| OPEN → HALF_OPEN | 1 (conditional write = the probe lock) |
| HALF_OPEN → CLOSED / OPEN | 1 |
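A minimal sketch of the per-environment counter described above. `LocalFailureCounter` and the `persist_open` callback are hypothetical names; the callback stands in for whatever write the persistence layer performs when the circuit trips.

```python
from typing import Callable


class LocalFailureCounter:
    """Counts consecutive failures in memory; writes only when the threshold is hit."""

    def __init__(self, failure_threshold: int, persist_open: Callable[[int], None]) -> None:
        self.failure_threshold = failure_threshold
        self.persist_open = persist_open
        self.consecutive_failures = 0

    def record_success(self) -> None:
        # A success breaks the streak. No state-store write.
        self.consecutive_failures = 0

    def record_failure(self) -> bool:
        """Return True if this failure tripped the circuit."""
        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False  # below threshold: still no write
        # Threshold reached: this is the single CLOSED -> OPEN write.
        self.persist_open(self.consecutive_failures)
        self.consecutive_failures = 0
        return True
```

Because the counter resets on every success, an intermittent error never accumulates toward the threshold, and a healthy circuit never calls `persist_open`.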

### Circuit state: persisted, cached on read

The `OPEN` / `HALF_OPEN` / `CLOSED` flag is the shared truth and lives in the store (DynamoDB or Redis/Valkey, see State Store below). To avoid a read per invocation:

- **Local cache with TTL** (reusing the `LRUDict` from `shared/`): each environment reads the shared state once every N seconds, not per call.
- Reads can be **eventually consistent** (half the cost). Tolerating state that's a few seconds stale is the same trade-off the cache already makes.
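The read path can be sketched as a TTL cache in front of the store. For brevity this sketch holds a single value in an attribute instead of the `LRUDict` the RFC plans to reuse; `CachedCircuitState`, `read_shared_state`, and the injectable `clock` are hypothetical names.

```python
import time
from typing import Callable, Optional


class CachedCircuitState:
    """Reads the shared circuit state at most once per `max_age` seconds."""

    def __init__(
        self,
        read_shared_state: Callable[[], str],
        max_age: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._read = read_shared_state
        self._max_age = max_age
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._max_age:
            # Cache miss or stale entry: one read against the shared store.
            self._value = self._read()
            self._fetched_at = now
        return self._value
```

Injecting the clock keeps the staleness window testable without sleeping.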

### The trade-off we accept

The counter is per-environment, not aggregated. With many environments and a threshold of N, the backend may absorb more than N failures before *every* environment trips. We accept this because:

- If the backend is genuinely down, each environment hits N failures within a handful of invocations anyway.
- The **first** environment to trip persists `OPEN`, and every other environment honors it on its next cached read, so one environment's detection protects the rest without each having to see N failures itself.

This is how in-process breakers like resilience4j behave per instance; the shared store turns "per instance" into "first instance protects all."

### Distributed half-open, anchored recovery

- **Half-open coordination is distributed**: when the recovery timeout expires, multiple environments may attempt the probe simultaneously. A DynamoDB conditional write elects exactly one: first wins, the rest go to fallback.
- **Recovery timeout is anchored, not sliding**: AWS Prescriptive Guidance warns that with multiple concurrent callers, the *first* failure must define the recovery window. Later failures while OPEN must not keep pushing `opened_at` forward, or the circuit never reaches half-open. We compute the half-open transition from a fixed `opened_at`, and only reset it on a confirmed state change.
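Both rules above can be illustrated with an in-memory stand-in for the state store. This is a sketch under assumptions: `InMemoryCircuitStore` and its methods are hypothetical, and a `threading.Lock` plays the role that a conditional write plays in a real backend.

```python
import threading
from dataclasses import dataclass
from typing import Dict


@dataclass
class _Record:
    state: str
    opened_at: float


class InMemoryCircuitStore:
    def __init__(self) -> None:
        self._records: Dict[str, _Record] = {}
        self._lock = threading.Lock()

    def open_circuit(self, name: str, now: float) -> bool:
        """Persist OPEN unless already OPEN, so `opened_at` stays anchored."""
        with self._lock:
            record = self._records.get(name)
            if record is not None and record.state == "OPEN":
                return False  # later failures must not slide the window
            self._records[name] = _Record(state="OPEN", opened_at=now)
            return True

    def try_acquire_probe(self, name: str, now: float, recovery_timeout: float) -> bool:
        """Atomically move OPEN -> HALF_OPEN; only the first caller wins."""
        with self._lock:
            record = self._records.get(name)
            if record is None or record.state != "OPEN":
                return False
            if now - record.opened_at < recovery_timeout:
                return False
            record.state = "HALF_OPEN"
            return True
```

A caller that gets `False` from `try_acquire_probe` goes to the fallback, exactly as if the circuit were still OPEN. A failed probe calls `open_circuit` again, which succeeds because the state is HALF_OPEN and therefore resets `opened_at` on a confirmed state change.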

## Fallback

### Responsibility

The fallback has one job: store the payload somewhere safe. No retry, no replay, no recovery logic. The customer owns what happens after the payload is buffered.

### Built-in fallbacks

- **S3Fallback**: for payloads of any size. Solves the 256KB SQS limit and 5MB IoT Core payload problem.
- **SQSFallback**: for smaller payloads where queue semantics are preferred.
- **CustomFallback**: bring your own handler (callable).

### What the caller sees

When the fallback executes, the decorator returns a `CircuitBreakerFallbackResponse` (or a custom response the user defines). The caller should treat this as success: the message is safe, it just didn't reach the backend yet.

For fire-and-forget patterns (IoT devices sending telemetry), this is transparent. For request-response patterns (API calls), the caller might want to know, so the response is inspectable via `served_by_fallback` (see Developer Experience above).
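A rough sketch of the caller-facing contract, assuming a custom fallback is any callable that takes the payload. `CircuitBreakerFallbackResponse`, `served_by_fallback`, and `circuit_name` come from this RFC; the dataclass shape and the `call_through_breaker` helper are assumptions, and half-open probing is left out.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class CircuitBreakerFallbackResponse:
    circuit_name: str
    served_by_fallback: bool = True


def call_through_breaker(
    circuit_name: str,
    state: str,
    protected: Callable[[Any], Any],
    fallback: Callable[[Any], None],
    payload: Any,
) -> Any:
    if state == "OPEN":
        # The protected call never runs; the payload is buffered instead.
        fallback(payload)
        return CircuitBreakerFallbackResponse(circuit_name=circuit_name)
    return protected(payload)


# A custom fallback can be as small as a bound method on a buffer.
buffered: list = []
response = call_through_breaker(
    "payment-backend", "OPEN", lambda order: {"status": "charged"}, buffered.append, {"order_id": 1}
)
```

Here `response.served_by_fallback` is `True` and the order sits in `buffered`; with `state="CLOSED"` the same call would return the downstream's dict untouched.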

## State Store

The circuit breaker needs a shared, lockable key/value store keyed by circuit name. The obvious idea is to reuse Idempotency's `BasePersistenceLayer`, but a read of the code shows it doesn't fit directly:

- **`DataRecord.status` is a closed enum.** It raises `IdempotencyInvalidStatusError` on any value outside `INPROGRESS` / `COMPLETED` / `EXPIRED`, so we can't store `OPEN` / `HALF_OPEN` / `CLOSED` in it.
- **The public API is payload-keyed.** `save_success` / `save_inprogress` / `get_record` all derive the key by hashing the event via jmespath. Our key is the circuit name, not a payload hash.
- **The conditional write is idempotency-specific.** `DynamoDBPersistenceLayer._put_record` hardcodes a condition expression around `INPROGRESS` and `in_progress_expiry`, not the condition we need for a half-open lock.

`BasePersistenceLayer` is also a public extension point (customers subclass it), so reshaping it is a breaking change for them and risks destabilizing one of the most-used utilities.

### Decision: dedicated persistence layer, shared patterns

We build a `CircuitBreakerPersistenceLayer` (its own small ABC + `DynamoDBPersistenceLayer` / `CachePersistenceLayer` implementations) that **mirrors** Idempotency's proven patterns without coupling to it:

- **Conditional `PutItem`** for the half-open probe lock: the same atomic "first writer wins, others fall through" technique Idempotency uses, but with our own condition expression.
- **`LRUDict`** from `aws_lambda_powertools.shared` for the local read cache. This is already generic (not idempotency-specific), so we reuse it as-is.
- DynamoDB and Redis/Valkey backends, so the customer's choice of store matches the rest of Powertools.

A single record per circuit:

| Field | Description |
|---|---|
| key (PK) | Circuit name (e.g., `payment-backend`) |
| state | `CLOSED`, `OPEN`, `HALF_OPEN` |
| failure_count | Consecutive failures recorded by the env that tripped |
| opened_at | When the circuit opened (drives the recovery timeout) |
| half_open_lock | Atomic probe lock (conditional write) |
| expiry | TTL attribute, auto-expire stale records |
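As an illustration, the record above maps naturally onto a dataclass. The field names come from the table; the types are assumptions (epoch seconds for `opened_at` and `expiry`, a boolean for `half_open_lock`), and `new_open_record` with its `record_ttl` parameter is a hypothetical helper.

```python
from dataclasses import asdict, dataclass


@dataclass
class CircuitBreakerRecord:
    key: str              # circuit name, the partition key
    state: str            # "CLOSED", "OPEN", or "HALF_OPEN"
    failure_count: int    # consecutive failures seen by the env that tripped
    opened_at: int        # epoch seconds (assumed unit)
    half_open_lock: bool  # assumed representation of the probe lock
    expiry: int           # epoch seconds, used as the TTL attribute


def new_open_record(name: str, failure_count: int, now: int, record_ttl: int) -> CircuitBreakerRecord:
    """Build the record an environment would persist when it trips the circuit."""
    return CircuitBreakerRecord(
        key=name,
        state="OPEN",
        failure_count=failure_count,
        opened_at=now,
        half_open_lock=False,
        expiry=now + record_ttl,
    )


item = asdict(new_open_record("payment-backend", 5, now=1_700_000_000, record_ttl=3600))
```

`item` is a plain dict keyed by the field names in the table, ready to be handed to whichever backend serializer is in use.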

### Future consolidation

Once both this layer and Idempotency's exist side by side, the genuinely shared base (a generic locked key/value store with a TTL cache, no status enum or payload hashing) becomes clear and can be extracted as a non-breaking refactor. We deliberately don't attempt that extraction up front: doing it before the second implementation exists is guesswork, and it would mean editing a stable public API to enable a feature that isn't built yet.

## Operational Controls

Both Martin Fowler and AWS Prescriptive Guidance call these out as non-negotiable for a production circuit breaker:

- **Manual force open / force close**: operators must be able to trip a circuit (e.g., to drain a backend for maintenance) or force it closed (e.g., after a confirmed fix, without waiting for the recovery timeout). Since state lives in the persistence layer, this can be done out-of-band by writing the record, so we should document the operation and consider a small CLI/helper. A forced state should be sticky (not auto-overridden by the next failure/success) until explicitly cleared.
- **Log every state transition**: CLOSED→OPEN, OPEN→HALF_OPEN, HALF_OPEN→CLOSED/OPEN must be logged with the circuit name, failure count, and trigger. Wire this through Powertools Logger so it lands in structured logs automatically.
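- **Listeners / hooks**: follow the model of pybreaker's `CircuitBreakerListener` and expose `on_state_change`, `on_failure`, and `on_success` hooks (pybreaker's own callbacks are named `state_change`, `failure`, and `success`) so customers can emit their own metrics or alerts on transitions.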
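The listener hooks could look roughly like this. The hook names come from the bullet above; the argument lists, the `TransitionRecorder` example, and the `notify_state_change` dispatcher are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


class CircuitBreakerListener:
    """Base class with no-op hooks; subclasses override only what they need."""

    def on_state_change(self, circuit_name: str, old_state: str, new_state: str) -> None:
        pass

    def on_failure(self, circuit_name: str, exc: BaseException) -> None:
        pass

    def on_success(self, circuit_name: str) -> None:
        pass


class TransitionRecorder(CircuitBreakerListener):
    """Example listener that keeps transitions, e.g. to emit metrics later."""

    def __init__(self) -> None:
        self.transitions: List[Tuple[str, str, str]] = []

    def on_state_change(self, circuit_name: str, old_state: str, new_state: str) -> None:
        self.transitions.append((circuit_name, old_state, new_state))


def notify_state_change(
    listeners: Iterable[CircuitBreakerListener], circuit_name: str, old_state: str, new_state: str
) -> None:
    # Called by the breaker once a transition has been persisted.
    for listener in listeners:
        listener.on_state_change(circuit_name, old_state, new_state)
```

The default logging and metrics behavior described in this RFC could be shipped as built-in listeners of the same shape, which is what makes opting out or redirecting straightforward.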

## Defaults & Decisions

- **Local cache TTL: default 5s.** A longer TTL means fewer state-store reads (cheaper, faster) but a wider window where an environment acts on stale state after it changed elsewhere. We match the Parameters utility default (`POWERTOOLS_PARAMETERS_MAX_AGE = 5`) for consistency; it's configurable.
- **Metrics: emit on state change, via listeners.** Reuse Powertools Metrics (EMF) with a default namespace, wired through the listener hooks from Operational Controls so customers can opt out or redirect. State transitions also go through Powertools Logger.


### Out of scope

- **Replay/recovery**: the customer handles this. We provide documentation and examples (EventBridge schedule, S3 notifications, etc.)
- **Rate limiting/throttling**: different pattern, different utility
- **Retry with backoff**: already exists in AWS SDK and Powertools Retry. Circuit breaker kicks in AFTER retries fail.

### Potential challenges

## Open Questions

1. **Failure counting: per circuit or per endpoint?** Each `name` is its own circuit, so a function calling 3 backends gets 3 circuits and the customer picks granularity by naming. The unresolved case: one backend with multiple endpoints where only one is failing. Do we leave that to the customer (name a circuit per endpoint), or offer sub-circuit keying? Leaning toward the former for v1, but want input.

## Future Considerations

- **Idempotency keys on replay**: if a buffered payload is later reprocessed (replay is Out of Scope, but customers will build it), idempotency keys matter. Should the circuit breaker stamp one at buffer time so the downstream replay is safe?
- **Extracting the shared persistence base**: we ship a dedicated layer now (see State Store) and consolidate with Idempotency later. Trigger to revisit: a third store backend, or the refactor surfacing naturally once both layers are in tree.

### Dependencies and Integrations

_No response_

### Alternative solutions

_No response_

### Acknowledgment

- [x] This feature request meets [Powertools for AWS Lambda (Python) Tenets](https://docs.powertools.aws.dev/lambda/python/latest/#tenets)
- [x] Should this be considered in other Powertools for AWS Lambda languages? i.e. [Java](https://github.com/aws-powertools/powertools-lambda-java/), [TypeScript](https://github.com/aws-powertools/powertools-lambda-typescript/), and [.NET](https://github.com/aws-powertools/powertools-lambda-dotnet/)
