Reserve half-open tests#70
Merged
Merged
Conversation
0e4f002 to
5acd1f7
Compare
5acd1f7 to
bcdb3f4
Compare
4 tasks
676b6cf to
d964178
Compare
d964178 to
1ef2c64
Compare
When a circuit becomes half-open, only one process (or thread) should run the protected block as a test execution. Other processes that observe the half-open state while a run is in flight now skip with `OpenCircuitError` and emit a `:circuit_skipped` event, matching the existing semantics for an open circuit. Exclusivity is provided by a new compare-and-set storage method, `Storage::Interface#reserve`, alongside `reserved_at` and `current_time` fields on `Status`. The reservation expires after `cool_down` so a crashed reserver auto-recovers; the worst-case behaviour is one extra test run per cool-down window when a protected block runs longer than `cool_down`. Custom storage backends must implement `#reserve`. `Memory` uses `Concurrent::Atom#compare_and_set`; `Redis` uses WATCH/MULTI/EXEC on a new `circuit:<name>:reserved_at` key; `Null` always returns true; `FaultTolerantProxy` falls open on backend error. Also includes: * `Status#can_run?` treats `locked_closed?` as an unconditional override above the reservation check, so a manually locked-closed circuit runs even with a stale reservation still in effect. * `Status` snapshots `current_time` at construction so all predicates reason about the same instant. * `Storage::Redis#close` and `#reset` clear the `reserved_at` key. * `Storage::Redis#reopen` and `#reserve` use safe navigation when serializing `previous_*_at`, fixing the first CAS against a missing key. * `docker-compose.yml` for local infra (mysql, redis, opensearch) and matching `MYSQL_HOST`/`MYSQL_USER` defaults in the spec helper. Bumps version to 0.13.0.
1ef2c64 to
57d1c48
Compare
akrend-psq
approved these changes
May 13, 2026
akrend-psq
left a comment
There was a problem hiding this comment.
Looks good. I'm excited to see this play out in prod.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reserve half-open test runs across processes so only one process executes the test block when a circuit becomes half-open. Other processes that observe the half-open state while a test run is in flight now skip with
OpenCircuitErrorand emit a:circuit_skippedevent, matching the existing semantics for an open circuit.Exclusivity is provided by a new compare-and-set storage method,
Storage::Interface#reserve, alongsidereserved_atandcurrent_timefields onStatus. The reservation expires aftercool_downso a crashed reserver auto-recovers; the worst-case behaviour is one extra test run percool_downwindow when a protected block runs longer thancool_down.Background: the half-open race
In a circuit breaker, the half-open state is meant to admit a single probe through to test whether the upstream has recovered. The pre-existing Faulty implementation didn't enforce this — any caller that observed
half_open?would run the protected block — so under concurrent load the cool-down window degenerated into a thundering herd of probes against an upstream that might still be unhealthy.Without coordination (status quo on
master)sequenceDiagram participant W1 as Worker 1 participant W2 as Worker 2 participant W3 as Worker 3 participant S as Storage participant U as Upstream (still flaky) Note over S: state=open<br/>cool_down has elapsed<br/>(reads as half_open) par W1->>S: status S-->>W1: half_open and W2->>S: status S-->>W2: half_open and W3->>S: status S-->>W3: half_open end Note over W1,W3: Each worker concludes<br/>"I am THE probe" rect rgb(255, 226, 226) par W1->>U: protected call and W2->>U: protected call and W3->>U: protected call end Note over U: Re-overloaded —<br/>circuit re-opens,<br/>cool_down restarts endThe visible symptoms in production were:
cool_downexpires (every worker probing in lock-step).:circuit_success/:circuit_failureevent counts spike by ~N on each half-open transition instead of yielding exactly one event.With reservation (this PR)
sequenceDiagram participant W1 as Worker 1 participant W2 as Worker 2 participant W3 as Worker 3 participant S as Storage participant U as Upstream Note over S: state=half_open<br/>reserved_at=nil par W1->>S: status S-->>W1: half_open, reserved_at=nil and W2->>S: status S-->>W2: half_open, reserved_at=nil and W3->>S: status S-->>W3: half_open, reserved_at=nil end par CAS race on reserved_at W1->>S: reserve(now, prev=nil) S-->>W1: true and W2->>S: reserve(now, prev=nil) S-->>W2: false and W3->>S: reserve(now, prev=nil) S-->>W3: false end rect rgb(226, 245, 226) W1->>U: protected call (the probe) U-->>W1: success or failure Note over W1: close on success<br/>re-open on failure end rect rgb(255, 243, 224) Note over W2,W3: emit circuit_skipped event<br/>raise OpenCircuitError endThe CAS itself is
WATCH/MULTI/EXEConcircuit:<name>:reserved_atin Redis andConcurrent::Atom#compare_and_setin Memory. The reservation expires aftercool_down, so a worker that crashes mid-probe doesn't strand the circuit — the next caller aftercool_downreadsreserved_atas stale and wins a fresh CAS. Worst-case behaviour is one extra probe percool_downwindow when a protected block runs longer thancool_down.Release
Releases as 0.13.0 once #78 (the 0.12.0 modernization PR) ships. The rebase target is the
modernizebranch, so the diff GitHub shows againstmasterwill narrow to just this feature once 0.12.0 merges.Storage backends
Custom
Storage::Interfaceimplementations must add a#reserve(circuit, time, previous_reserved_at)method:Memory—Concurrent::Atom#compare_and_setonreserved_at.Redis—WATCH/MULTI/EXECon a newcircuit:<name>:reserved_atkey, expiring withcircuit_ttl.Null— always returnstrue.FaultTolerantProxy— falls open on backend error.Also included
Status#can_run?treatslocked_closed?as an unconditional override above the reservation check, so a manually locked-closed circuit runs even with a stale reservation still in effect from a prior cycle.Statussnapshotscurrent_timeat construction soopen?,half_open?, andreserved?all reason about the same instant.Storage::Redis#closeand#resetclear thereserved_atkey.Storage::Redis#reopenand#reserveuse safe navigation when serializingprevious_*_at, fixing the first CAS against a missing key.docker-compose.ymlfor local infra (mysql, redis, opensearch) and matchingMYSQL_HOST/MYSQL_USERdefaults in the spec helper.Test plan
bundle exec rubocopcleanbundle exec rspecgreen (full suite against redis 4/5 via CI)spec/circuit_spec.rbcoverage for the half-open reservation path (single executor, skip event for losers)spec/storage/redis_spec.rbconcurrency example asserts exactly one reserver wins under pool contentionspec/status_spec.rbcoverage for snapshottedcurrent_timeandlocked_closed?override