⚠️ To GenAI bots and contributors: Please do not implement this feature without proper discussion first. This is a design proposal under review, not an approved spec. PRs submitted without prior discussion will be closed.
Is this related to an existing feature request or issue?
Circuit Breaker
Which Powertools for AWS Lambda (Python) utility does this relate to?
Other
Summary
A circuit breaker utility that stops sending traffic to an unhealthy downstream and, instead of dropping the request, hands the payload to a fallback (S3, SQS, or your own callable) so nothing is lost. It shares state through a dedicated persistence layer modeled on Idempotency's (see State Store), keeps the failure counter in memory so a healthy circuit writes nothing to the store, and exposes an explicit half-open probe to test recovery.
The circuit breaker pattern is well-established, so we should be explicit about what existing approaches don't cover for Lambda.
AWS SDK retries / token buckets: The AWS Builders' Library is deliberately skeptical of circuit breakers (they "introduce modal behavior that can be difficult to test") and prefers a local token bucket (retry budget) that throttles retries to a fixed rate. This is great for protecting a single client→service hop, but it does not buffer the payload: when the budget is exhausted, the request still fails and the data is lost. Our value-add is the fallback, not just the tripping.
AWS Prescriptive Guidance / Compute Blog (Step Functions + DynamoDB): AWS's reference implementation externalizes circuit state in a CircuitStatus DynamoDB table and uses TTL-based expiry instead of a true half-open state. The blog itself admits two trade-offs: DynamoDB TTL deletion is not instantaneous (stale OPEN records linger), and there is no gradual traffic restoration. We improve on this with an explicit half-open probe.
pybreaker / resilience4j: Mature in-process breakers, but they assume a long-lived process and in-memory state, a poor fit for standard Lambda where each environment is short-lived and state must be shared across invocations.
Where we differentiate: (1) a managed fallback that buffers the payload instead of dropping it, and (2) explicit half-open probing rather than blind TTL expiry. If a customer only needs to protect a downstream and is fine dropping requests, they should use a retry budget instead.
Use case
Lambda functions calling downstream services that can't scale or have outages need a way to:
Stop sending traffic to an unhealthy backend (protect the downstream)
Not lose messages when the backend is unavailable (protect the data)
Today, there's no managed circuit breaker for Lambda. Customers either build their own or let the backend get overwhelmed during incidents.
Proposal
Developer Experience
The common case should be a decorator. You name the circuit, point it at a state store and a fallback, and wrap the function that calls the downstream:

    from aws_lambda_powertools.utilities.circuit_breaker import (
        CircuitBreaker,
        CircuitBreakerConfig,
    )
    from aws_lambda_powertools.utilities.circuit_breaker.fallbacks import S3Fallback
    from aws_lambda_powertools.utilities.circuit_breaker.persistence import DynamoDBPersistenceLayer

    persistence = DynamoDBPersistenceLayer(table_name="CircuitBreakerState")

    config = CircuitBreakerConfig(
        failure_threshold=5,  # consecutive failures before opening
        recovery_timeout=30,  # seconds in OPEN before a half-open probe
        # handled_exceptions defaults to (Exception,): any error counts as a failure.
        # Narrow it when only some errors signal an unhealthy downstream:
        handled_exceptions=(TimeoutError, ConnectionError),
    )

    breaker = CircuitBreaker(
        name="payment-backend",
        persistence_store=persistence,
        fallback=S3Fallback(bucket="payment-overflow"),
        config=config,
    )


    @breaker
    def charge(order: dict) -> dict:
        return payment_api.charge(order)  # the protected call


    def handler(event, context):
        # When the circuit is OPEN, `charge` never runs: the payload goes to S3
        # and a CircuitBreakerFallbackResponse comes back instead.
        return charge(event)

For callers that need to react when the payload was buffered instead of delivered, the return value is inspectable:

    result = charge(event)
    if result.served_by_fallback:
        logger.info("payment buffered to S3", circuit=result.circuit_name)

CLOSED: normal operation. Requests go to the downstream. Failures are counted.
OPEN: downstream is unhealthy. Requests go straight to the fallback. No traffic hits the backend.
HALF_OPEN: testing recovery. One request is allowed through. If it succeeds, the circuit closes. If it fails, it reopens.
What triggers the circuit to open?
Consecutive failures. If N requests in a row fail with a trackable exception (connection error, timeout, 5xx), the circuit opens. We avoid sliding time windows to keep the implementation simple and predictable.
Why consecutive and not time-based: it's predictable and needs no window bookkeeping. If the backend is actually down, you'll hit the threshold in a handful of invocations anyway.
Trade-off we're accepting: Martin Fowler and resilience4j support error-rate thresholds (e.g., open at 50% failures over a rolling window), which catch a degraded-but-not-dead backend that a consecutive counter would miss. We start with consecutive failures for v1 (predictable, no window bookkeeping) and leave rate-based thresholds as a future failure_rate_threshold option.
Which exceptions count. By default, any exception counts as a failure. That's the least surprising behavior and what pybreaker does. But not every error means the downstream is unhealthy: a 400 is the caller's fault, a 503 is not. If those "caller errors" count toward the threshold, the circuit opens for the wrong reason. So we let the customer scope it from either side:
handled_exceptions (allowlist): only these count (e.g., (TimeoutError, ConnectionError)). Everything else propagates normally and does not trip the circuit.
ignored_exceptions (denylist): everything counts except these (e.g., ignore ValidationError). Handy when failures are the norm and only a few are benign.
Passing both is a config error. An exception that doesn't count as a failure is simply re-raised to the caller, so the circuit breaker stays out of the way.
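The scoping rules above can be sketched as a small policy object. This is only an illustration of the proposed semantics: `ExceptionPolicy` and `counts_as_failure` are hypothetical names, not part of any existing Powertools API.

```python
from typing import Optional, Tuple, Type

ExcTypes = Tuple[Type[BaseException], ...]


class ExceptionPolicy:
    """Hypothetical helper: decides whether an exception counts toward the threshold."""

    def __init__(
        self,
        handled_exceptions: Optional[ExcTypes] = None,
        ignored_exceptions: Optional[ExcTypes] = None,
    ) -> None:
        # Passing both an allowlist and a denylist is a config error.
        if handled_exceptions and ignored_exceptions:
            raise ValueError("Pass handled_exceptions or ignored_exceptions, not both")
        # Default allowlist: any Exception counts as a failure.
        self.handled: ExcTypes = handled_exceptions or (Exception,)
        self.ignored: ExcTypes = ignored_exceptions or ()

    def counts_as_failure(self, exc: BaseException) -> bool:
        if isinstance(exc, self.ignored):
            return False  # denylisted: re-raise without touching the counter
        return isinstance(exc, self.handled)
```

In either mode, an exception for which `counts_as_failure` returns `False` would simply be re-raised to the caller.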
What triggers the circuit to close?
A successful request during half-open state. After a configurable recovery timeout (e.g., 30 seconds), the circuit moves to half-open and allows exactly one request to pass through. If the downstream responds successfully, the circuit closes and normal traffic resumes.
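To make the open and close triggers concrete, here is a minimal single-process sketch of the three states. It deliberately ignores the shared store and the fallback; `LocalBreaker` and its method names are assumptions made for illustration, and the clock is injectable so the timeout can be exercised without sleeping.

```python
import time
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class LocalBreaker:
    """Hypothetical in-process breaker; the real utility would persist state."""

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if the call may reach the downstream."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.recovery_timeout:
            self.state = State.HALF_OPEN  # exactly one probe is let through
            return True
        return self.state is State.CLOSED

    def record_success(self) -> None:
        self.failures = 0
        self.state = State.CLOSED  # a successful probe closes the circuit

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for N in a row.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

While the state is `HALF_OPEN`, `allow_request` returns `False` for every caller after the first, which is the "one request is allowed through" behavior described above.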
What happens when the circuit is open?
The fallback executes. The payload is stored somewhere safe (S3, SQS, or a custom handler). The caller receives a success response: from the caller's perspective, the message was accepted. The difference is where it landed.
State Coordination Across Environments
Each Lambda execution environment handles one request at a time, and a function scales out to many environments. Circuit state therefore has to be shared, not in-process. The naive way to do this is "read the state and update the failure counter on every invocation", but that means a DynamoDB write on essentially every call, which adds cost (~2 WCU/call) and latency (~5-10 ms) to the happy path, where the circuit is healthy and we want it to be invisible.
We avoid that by splitting state into two things that are managed differently:
Failure counter: local, in-memory
The count of consecutive failures lives in memory, per execution environment:
Success → reset the local counter. No write.
Failure → increment the local counter. No write until it hits the threshold.
Only when an environment reaches N consecutive failures does it persist OPEN to the store.
So writes are O(state transitions), not O(invocations). A circuit that stays healthy writes nothing. You only pay during an actual incident, which is exactly when you want to.
| Transition | Writes |
| --- | --- |
| Healthy operation (CLOSED, no failures) | 0 |
| CLOSED → OPEN | 1 (the env that trips) |
| OPEN → HALF_OPEN | 1 (conditional write = the probe lock) |
| HALF_OPEN → CLOSED / OPEN | 1 |
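The write pattern can be checked with a toy model. In this sketch `CountingStore` is a hypothetical stand-in for the shared store that only counts writes, and `LocalFailureCounter` is an assumed name for the per-environment counter; neither is an existing API.

```python
class CountingStore:
    """Hypothetical stand-in for the shared store; it only counts writes."""

    def __init__(self) -> None:
        self.state = "CLOSED"
        self.writes = 0

    def put_state(self, state: str) -> None:
        self.state = state
        self.writes += 1


class LocalFailureCounter:
    """Consecutive-failure counter held in one execution environment's memory."""

    def __init__(self, store: CountingStore, failure_threshold: int = 5) -> None:
        self.store = store
        self.failure_threshold = failure_threshold
        self.failures = 0

    def on_success(self) -> None:
        self.failures = 0  # reset locally, no write

    def on_failure(self) -> None:
        self.failures += 1  # increment locally, no write yet
        if self.failures == self.failure_threshold:
            self.store.put_state("OPEN")  # the single CLOSED -> OPEN write
```

Any number of successes, or failure streaks shorter than the threshold, leave `store.writes` at zero; only the streak that reaches the threshold produces a write.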
Circuit state: persisted, cached on read
The OPEN / HALF_OPEN / CLOSED flag is the shared truth and lives in the store (DynamoDB or Redis/Valkey, see State Store below). To avoid a read per invocation:
Local cache with TTL (reusing the LRUDict from shared/): each environment reads the shared state once every N seconds, not per call.
Reads can be eventually consistent (half the cost). Tolerating state that's a few seconds stale is the same trade-off the cache already makes.
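A rough sketch of the cached read follows. It uses a plain dict rather than `LRUDict` so that it runs without Powertools installed; `CachedStateReader`, `fetch`, and `max_age` are illustrative names, and `fetch` stands in for whatever eventually consistent read the store offers.

```python
import time
from typing import Callable, Dict, Tuple


class CachedStateReader:
    """Hypothetical read-through cache: one store read per max_age seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        max_age: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age
        self._clock = clock
        self._cache: Dict[str, Tuple[str, float]] = {}  # name -> (state, fetched_at)

    def get_state(self, name: str) -> str:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[1] < self._max_age:
            return cached[0]  # fresh enough: no store read
        state = self._fetch(name)  # cache miss or stale entry: read the store
        self._cache[name] = (state, now)
        return state
```

The staleness window is bounded by `max_age`: an environment may keep acting on the previous state for at most that long after another environment changes it.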
The trade-off we accept
The counter is per-environment, not aggregated. With many environments and a threshold of N, the backend may absorb more than N failures before every environment trips. We accept this because:
If the backend is genuinely down, each environment hits N consecutive failures within its next N invocations anyway.
The first environment to trip persists OPEN, and every other environment honors it on its next cached read, so one environment's detection protects the rest without each having to see N failures itself.
This is how in-process breakers like resilience4j behave per instance; the shared store turns "per instance" into "first instance protects all."
Distributed half-open, anchored recovery
Half-open coordination is distributed: when the recovery timeout expires, multiple environments may attempt the probe simultaneously. A DynamoDB conditional write elects exactly one: first wins, the rest go to fallback.
Recovery timeout is anchored, not sliding: AWS Prescriptive Guidance warns that with multiple concurrent callers, the first failure must define the recovery window. Later failures while OPEN must not keep pushing opened_at forward, or the circuit never reaches half-open. We compute the half-open transition from a fixed opened_at, and only reset it on a confirmed state change.
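Both rules can be modeled in memory. In the sketch below a `threading.Lock` stands in for the atomicity that a conditional write would provide in the real store; `InMemoryStore`, `open_if_not_open`, and `try_acquire_probe` are hypothetical names chosen for this example.

```python
import threading
from dataclasses import dataclass
from typing import Dict


@dataclass
class CircuitRecord:
    state: str
    opened_at: float


class InMemoryStore:
    """Hypothetical store; the lock emulates an atomic conditional write."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.records: Dict[str, CircuitRecord] = {}

    def open_if_not_open(self, name: str, now: float) -> bool:
        """Persist OPEN only if the circuit is not already OPEN.

        Later failures therefore cannot push opened_at forward.
        """
        with self._lock:
            rec = self.records.get(name)
            if rec is not None and rec.state == "OPEN":
                return False  # condition failed: keep the original opened_at
            self.records[name] = CircuitRecord(state="OPEN", opened_at=now)
            return True

    def try_acquire_probe(self, name: str, now: float, recovery_timeout: float) -> bool:
        """Move OPEN -> HALF_OPEN once the timeout has elapsed; first caller wins."""
        with self._lock:
            rec = self.records.get(name)
            if rec is None or rec.state != "OPEN":
                return False  # someone else already holds the probe
            if now - rec.opened_at < recovery_timeout:
                return False  # still inside the anchored recovery window
            rec.state = "HALF_OPEN"
            return True
```

An environment that gets `False` from `try_acquire_probe` would send its request to the fallback, exactly as it does while the circuit is OPEN.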
Fallback
Responsibility
The fallback has one job: store the payload somewhere safe. No retry, no replay, no recovery logic. The customer owns what happens after the payload is buffered.
Built-in fallbacks
S3Fallback: for payloads of any size. Solves the 256KB SQS limit and 5MB IoT Core payload problem.
SQSFallback: for smaller payloads where queue semantics are preferred.
CustomFallback: bring your own handler (callable).
What the caller sees
When the fallback executes, the decorator returns a CircuitBreakerFallbackResponse (or a custom response the user defines). The caller should treat this as success: the message is safe, it just didn't reach the backend yet.
For fire-and-forget patterns (IoT devices sending telemetry), this is transparent. For request-response patterns (API calls), the caller might want to know, so the response is inspectable via served_by_fallback (see Developer Experience above).
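As a sketch of how a custom callable fallback and the inspectable response could fit together: `FallbackResponse` below is a hypothetical stand-in for `CircuitBreakerFallbackResponse`, and `call_with_fallback` is an assumed helper, not the proposed decorator itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class FallbackResponse:
    """Hypothetical stand-in for CircuitBreakerFallbackResponse."""

    circuit_name: str
    served_by_fallback: bool = True


def call_with_fallback(
    circuit_name: str,
    circuit_open: bool,
    func: Callable[[Any], Any],
    fallback: Callable[[Any], None],
    payload: Any,
) -> Any:
    if circuit_open:
        fallback(payload)  # the fallback's one job: store the payload
        return FallbackResponse(circuit_name=circuit_name)
    return func(payload)  # circuit closed: the protected call runs


# A custom fallback is just a callable; this one buffers in a list.
buffered: List[dict] = []


def buffer_payload(payload: dict) -> None:
    buffered.append(payload)
```

Because the protected function's normal return value may be any type, a caller in this sketch would check `getattr(result, "served_by_fallback", False)` rather than assume the attribute exists.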
State Store
The circuit breaker needs a shared, lockable key/value store keyed by circuit name. The obvious idea is to reuse Idempotency's BasePersistenceLayer, but reading the code, it doesn't fit directly:
DataRecord.status is a closed enum. It raises IdempotencyInvalidStatusError on any value outside INPROGRESS / COMPLETED / EXPIRED, so we can't store OPEN / HALF_OPEN / CLOSED in it.
The public API is payload-keyed. save_success / save_inprogress / get_record all derive the key by hashing the event via jmespath. Our key is the circuit name, not a payload hash.
The conditional write is idempotency-specific. DynamoDBPersistenceLayer._put_record hardcodes a condition expression around INPROGRESS and in_progress_expiry, not the condition we need for a half-open lock.
BasePersistenceLayer is also a public extension point (customers subclass it), so reshaping it is a breaking change for them and risks destabilizing one of the most-used utilities.
Decision: dedicated persistence layer, shared patterns
We build a CircuitBreakerPersistenceLayer (its own small ABC + DynamoDBPersistenceLayer / CachePersistenceLayer implementations) that mirrors Idempotency's proven patterns without coupling to it:
Conditional PutItem for the half-open probe lock: the same atomic "first writer wins, others fall through" technique Idempotency uses, but with our own condition expression.
LRUDict from aws_lambda_powertools.shared for the local read cache. This is already generic (not idempotency-specific), so we reuse it as-is.
DynamoDB and Redis/Valkey backends, so the customer's choice of store matches the rest of Powertools.
A single record per circuit:
| Field | Description |
| --- | --- |
| key (PK) | Circuit name (e.g., payment-backend) |
| state | CLOSED, OPEN, HALF_OPEN |
| failure_count | Consecutive failures recorded by the env that tripped |
| opened_at | When the circuit opened (drives the recovery timeout) |
| half_open_lock | Atomic probe lock (conditional write) |
| expiry | TTL attribute, auto-expire stale records |
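One way the record and a small store ABC might look in code is sketched below. The field names follow the table; the field types, the `StateStore` interface, and the in-memory implementation are assumptions made for illustration, not the proposed `CircuitBreakerPersistenceLayer` API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional

VALID_STATES = {"CLOSED", "OPEN", "HALF_OPEN"}


@dataclass
class CircuitRecord:
    key: str                           # circuit name (partition key)
    state: str = "CLOSED"
    failure_count: int = 0             # failures seen by the env that tripped
    opened_at: Optional[float] = None  # assumed to be epoch seconds
    half_open_lock: bool = False       # set by the winning probe
    expiry: Optional[int] = None       # assumed TTL in epoch seconds

    def __post_init__(self) -> None:
        if self.state not in VALID_STATES:
            raise ValueError(f"invalid circuit state: {self.state}")


class StateStore(ABC):
    """Hypothetical minimal interface, keyed by circuit name."""

    @abstractmethod
    def get_record(self, name: str) -> Optional[CircuitRecord]:
        """Return the record for a circuit, or None if it has none."""

    @abstractmethod
    def put_record(self, record: CircuitRecord) -> None:
        """Create or overwrite the record for a circuit."""


class InMemoryStateStore(StateStore):
    """Test double; a real backend would be DynamoDB or Redis/Valkey."""

    def __init__(self) -> None:
        self._records: Dict[str, CircuitRecord] = {}

    def get_record(self, name: str) -> Optional[CircuitRecord]:
        return self._records.get(name)

    def put_record(self, record: CircuitRecord) -> None:
        self._records[record.key] = record
```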
Future consolidation
Once both this layer and Idempotency's exist side by side, the genuinely shared base (a generic locked key/value store with a TTL cache, no status enum or payload hashing) becomes clear and can be extracted as a non-breaking refactor. We deliberately don't attempt that extraction up front: doing it before the second implementation exists is guesswork, and it would mean editing a stable public API to enable a feature that isn't built yet.
Operational Controls
Both Martin Fowler and AWS Prescriptive Guidance call these out as non-negotiable for a production circuit breaker:
Manual force open / force close: operators must be able to trip a circuit (e.g., to drain a backend for maintenance) or force it closed (e.g., after a confirmed fix, without waiting for the recovery timeout). Since state lives in the persistence layer, this can be done out-of-band by writing the record, so we should document the operation and consider a small CLI/helper. A forced state should be sticky (not auto-overridden by the next failure/success) until explicitly cleared.
Log every state transition: CLOSED→OPEN, OPEN→HALF_OPEN, HALF_OPEN→CLOSED/OPEN must be logged with the circuit name, failure count, and trigger. Wire this through Powertools Logger so it lands in structured logs automatically.
Listeners / hooks: mirror pybreaker's CircuitBreakerListener (on_state_change, on_failure, on_success) so customers can emit their own metrics or alerts on transitions.
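A minimal sketch of the hook surface, using the three hook names listed above. The method signatures, `RecordingListener`, and `change_state` are assumptions for illustration; a real listener might forward to Powertools Logger or Metrics instead of a list.

```python
from typing import List, Tuple


class CircuitBreakerListener:
    """Hypothetical base class: override only the hooks you need."""

    def on_state_change(self, circuit_name: str, old_state: str, new_state: str) -> None:
        """Called once per state transition."""

    def on_failure(self, circuit_name: str, exc: BaseException) -> None:
        """Called when a counted failure occurs."""

    def on_success(self, circuit_name: str) -> None:
        """Called when the protected call succeeds."""


class RecordingListener(CircuitBreakerListener):
    """Example listener that keeps transitions in memory."""

    def __init__(self) -> None:
        self.transitions: List[Tuple[str, str, str]] = []

    def on_state_change(self, circuit_name: str, old_state: str, new_state: str) -> None:
        self.transitions.append((circuit_name, old_state, new_state))


def change_state(
    listeners: List[CircuitBreakerListener],
    circuit_name: str,
    old_state: str,
    new_state: str,
) -> str:
    # Notify every listener, then hand back the new state to the caller.
    for listener in listeners:
        listener.on_state_change(circuit_name, old_state, new_state)
    return new_state
```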
Defaults & Decisions
Local cache TTL: default 5s. A longer TTL means fewer state-store reads (cheaper, faster) but a wider window where an environment acts on stale state after it changed elsewhere. We match the Parameters utility default (POWERTOOLS_PARAMETERS_MAX_AGE = 5) for consistency; it's configurable.
Metrics: emit on state change, via listeners. Reuse Powertools Metrics (EMF) with a default namespace, wired through the listener hooks from Operational Controls so customers can opt out or redirect. State transitions also go through Powertools Logger.
Out of scope
Replay/recovery: the customer handles this. We provide documentation and examples (EventBridge schedule, S3 notifications, etc.)
Rate limiting/throttling: different pattern, different utility
Retry with backoff: already exists in AWS SDK and Powertools Retry. Circuit breaker kicks in AFTER retries fail.
Potential challenges
Open Questions
Failure counting: per circuit or per endpoint? Each name is its own circuit, so a function calling 3 backends gets 3 circuits and the customer picks granularity by naming. The unresolved case: one backend with multiple endpoints where only one is failing. Do we leave that to the customer (name a circuit per endpoint), or offer sub-circuit keying? Leaning toward the former for v1, but want input.
Future Considerations
Idempotency keys on replay: if a buffered payload is later reprocessed (replay is Out of Scope, but customers will build it), idempotency keys matter. Should the circuit breaker stamp one at buffer time so the downstream replay is safe?
Extracting the shared persistence base: we ship a dedicated layer now (see State Store) and consolidate with Idempotency later. Trigger to revisit: a third store backend, or the refactor surfacing naturally once both layers are in tree.
Flow
Circuit States

    stateDiagram-v2
        [*] --> CLOSED
        CLOSED --> OPEN: N consecutive failures
        OPEN --> HALF_OPEN: recovery timeout elapsed
        HALF_OPEN --> CLOSED: probe succeeds
        HALF_OPEN --> OPEN: probe fails

Dependencies and Integrations
No response
Alternative solutions
Acknowledgment