Reserve half-open tests #70

Merged: justinhoward merged 1 commit into master from reserve-half-open-tests on May 14, 2026

@justinhoward justinhoward (Member) commented Mar 14, 2025

Summary

Reserve half-open test runs across processes so only one process executes the test block when a circuit becomes half-open. Other processes that observe the half-open state while a test run is in flight now skip with OpenCircuitError and emit a :circuit_skipped event, matching the existing semantics for an open circuit.

Exclusivity is provided by a new compare-and-set storage method, Storage::Interface#reserve, alongside reserved_at and current_time fields on Status. The reservation expires after cool_down so a crashed reserver auto-recovers; the worst-case behaviour is one extra test run per cool_down window when a protected block runs longer than cool_down.

Background: the half-open race

In a circuit breaker, the half-open state is meant to admit a single probe through to test whether the upstream has recovered. The pre-existing Faulty implementation didn't enforce this — any caller that observed half_open? would run the protected block — so under concurrent load the cool-down window degenerated into a thundering herd of probes against an upstream that might still be unhealthy.

Without coordination (status quo on master)

sequenceDiagram
    participant W1 as Worker 1
    participant W2 as Worker 2
    participant W3 as Worker 3
    participant S as Storage
    participant U as Upstream (still flaky)

    Note over S: state=open<br/>cool_down has elapsed<br/>(reads as half_open)

    par
        W1->>S: status
        S-->>W1: half_open
    and
        W2->>S: status
        S-->>W2: half_open
    and
        W3->>S: status
        S-->>W3: half_open
    end

    Note over W1,W3: Each worker concludes<br/>"I am THE probe"

    rect rgb(255, 226, 226)
    par
        W1->>U: protected call
    and
        W2->>U: protected call
    and
        W3->>U: protected call
    end
    Note over U: Re-overloaded —<br/>circuit re-opens,<br/>cool_down restarts
    end

The visible symptoms in production were:

  • Bursty re-failures the instant cool_down expires (every worker probing in lock-step).
  • A flaky upstream that would otherwise recover gets repeatedly knocked back over by its own probe storm.
  • :circuit_success / :circuit_failure event counts spike by roughly N (the number of concurrent workers) on each half-open transition instead of yielding exactly one event.
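
The probe storm described above can be reproduced with a small sketch. This is not Faulty's code: `UncoordinatedCircuit`, `run`, and `probe_count` are hypothetical names, and the only thing modelled is that every caller who observes the half-open state runs the protected block.

```ruby
# Hypothetical sketch of the pre-PR behaviour; not Faulty's actual code.
class UncoordinatedCircuit
  attr_reader :probe_count

  def initialize
    @state = :half_open # cool_down has already elapsed
    @probe_count = 0
    @lock = Mutex.new
  end

  # Every caller that sees half_open runs the block: nothing limits
  # the number of concurrent probes.
  def run
    return :skipped unless @state == :half_open

    @lock.synchronize { @probe_count += 1 } # the lock only guards the counter
    yield
  end
end

WORKERS = 5
circuit = UncoordinatedCircuit.new
results = Array.new(WORKERS) do
  Thread.new { circuit.run { :probed } }
end.map(&:value)

puts "probes sent upstream: #{circuit.probe_count} (workers: #{WORKERS})"
```

With five workers, all five reach the upstream, which is the N-fold event spike listed in the symptoms.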

With reservation (this PR)

sequenceDiagram
    participant W1 as Worker 1
    participant W2 as Worker 2
    participant W3 as Worker 3
    participant S as Storage
    participant U as Upstream

    Note over S: state=half_open<br/>reserved_at=nil

    par
        W1->>S: status
        S-->>W1: half_open, reserved_at=nil
    and
        W2->>S: status
        S-->>W2: half_open, reserved_at=nil
    and
        W3->>S: status
        S-->>W3: half_open, reserved_at=nil
    end

    par CAS race on reserved_at
        W1->>S: reserve(now, prev=nil)
        S-->>W1: true
    and
        W2->>S: reserve(now, prev=nil)
        S-->>W2: false
    and
        W3->>S: reserve(now, prev=nil)
        S-->>W3: false
    end

    rect rgb(226, 245, 226)
    W1->>U: protected call (the probe)
    U-->>W1: success or failure
    Note over W1: close on success<br/>re-open on failure
    end

    rect rgb(255, 243, 224)
    Note over W2,W3: emit circuit_skipped event<br/>raise OpenCircuitError
    end

The CAS itself is WATCH/MULTI/EXEC on circuit:<name>:reserved_at in Redis and Concurrent::Atom#compare_and_set in Memory. The reservation expires after cool_down, so a worker that crashes mid-probe doesn't strand the circuit — the next caller after cool_down reads reserved_at as stale and wins a fresh CAS. Worst-case behaviour is one extra probe per cool_down window when a protected block runs longer than cool_down.
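
The expiry rule can be illustrated with a single-threaded timeline. This is a minimal sketch under stated assumptions: the helper names (`reservation_live?`, `reserve`, `try_probe`), the hash used as a store, the 30-second `COOL_DOWN`, and the choice to treat a reservation as stale at exactly `reserved_at + cool_down` are all illustrative and are not taken from the PR.

```ruby
# Hypothetical helper names; the PR only states that a reservation
# expires after cool_down.
COOL_DOWN = 30 # seconds, illustrative value

# A reservation is live while less than cool_down has passed since reserved_at.
def reservation_live?(reserved_at, now, cool_down)
  !reserved_at.nil? && now < reserved_at + cool_down
end

# Single-threaded stand-in for the storage CAS: succeeds only when the
# caller's view of reserved_at matches what is stored.
def reserve(store, now, previous_reserved_at)
  return false unless store[:reserved_at] == previous_reserved_at

  store[:reserved_at] = now
  true
end

# A caller may attempt the probe only when no live reservation exists.
def try_probe(store, now)
  seen = store[:reserved_at]
  return :skipped if reservation_live?(seen, now, COOL_DOWN)

  reserve(store, now, seen) ? :probe : :skipped
end

store = { reserved_at: nil }
timeline = [
  try_probe(store, 100), # first caller wins the reservation
  try_probe(store, 110), # reservation still live, so this caller skips
  try_probe(store, 131)  # reserver presumed crashed; stale reservation is replaced
]
p timeline # => [:probe, :skipped, :probe]
```

The third call also shows where the worst case comes from: if the first probe were still running at t=131 rather than crashed, a second probe would start alongside it.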

Release

Releases as 0.13.0 once #78 (the 0.12.0 modernization PR) ships. The rebase target is the modernize branch, so the diff GitHub shows against master will narrow to just this feature once 0.12.0 merges.

Storage backends

Custom Storage::Interface implementations must add a #reserve(circuit, time, previous_reserved_at) method:

  • Memory: Concurrent::Atom#compare_and_set on reserved_at.
  • Redis: WATCH/MULTI/EXEC on a new circuit:<name>:reserved_at key, expiring with circuit_ttl.
  • Null: always returns true.
  • FaultTolerantProxy: falls open on backend error.
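
For authors of custom backends, the contract of `#reserve` might look roughly like the following. Only the method signature comes from this PR. The class name `SketchStorage`, the `reserved_at` reader, and the use of a string as the circuit key are hypothetical, and a `Mutex` stands in for `Concurrent::Atom` so the sketch needs nothing beyond the standard library.

```ruby
# Hypothetical custom backend. Only the #reserve signature comes from
# the PR description; the class name and internals are illustrative.
class SketchStorage
  def initialize
    @reserved_at = {} # circuit key => Time of the current reservation
    @lock = Mutex.new
  end

  # Compare-and-set: record `time` only if the stored value still equals
  # what the caller last observed. Returns true for the single winner.
  def reserve(circuit, time, previous_reserved_at)
    @lock.synchronize do
      return false unless @reserved_at[circuit] == previous_reserved_at

      @reserved_at[circuit] = time
      true
    end
  end

  def reserved_at(circuit)
    @lock.synchronize { @reserved_at[circuit] }
  end
end

storage = SketchStorage.new
observed = storage.reserved_at("api") # nil: nobody has reserved yet

# Eight workers race with the same observed value.
outcomes = Array.new(8) do |i|
  Thread.new { storage.reserve("api", Time.at(1_000 + i), observed) }
end.map(&:value)

puts "winners: #{outcomes.count(true)}, losers: #{outcomes.count(false)}"
```

Whichever thread reaches the lock first changes the stored value, so every later comparison against the stale `observed` value fails and exactly one caller gets `true`.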

Also included

  • Status#can_run? treats locked_closed? as an unconditional override above the reservation check, so a manually locked-closed circuit runs even with a stale reservation still in effect from a prior cycle.
  • Status snapshots current_time at construction so open?, half_open?, and reserved? all reason about the same instant.
  • Storage::Redis#close and #reset clear the reserved_at key.
  • Storage::Redis#reopen and #reserve use safe navigation when serializing previous_*_at, fixing the first CAS against a missing key.
  • docker-compose.yml for local infra (mysql, redis, opensearch) and matching MYSQL_HOST / MYSQL_USER defaults in the spec helper.
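
The first two items can be pictured with a simplified status object. This is a sketch, not the gem's `Status` class: `SketchStatus`, its `lock` and `opened_at` fields, the integer timestamps, and the exact ordering of checks after the `locked_closed?` override are assumptions made for illustration.

```ruby
# Hypothetical Status; predicate names follow the PR text, the logic
# is an assumption.
SketchStatus = Struct.new(
  :state, :lock, :opened_at, :reserved_at, :cool_down, :current_time,
  keyword_init: true
) do
  def locked_closed?
    lock == :closed
  end

  def closed?
    state == :closed
  end

  # An open circuit reads as half-open once cool_down has elapsed.
  def half_open?
    state == :open && current_time >= opened_at + cool_down
  end

  def reserved?
    !reserved_at.nil? && current_time < reserved_at + cool_down
  end

  def can_run?
    return true if locked_closed? # manual override beats any reservation
    return false if reserved?

    closed? || half_open?
  end
end

# current_time is fixed at construction, so half_open?, reserved?, and
# can_run? all evaluate against the same instant.
base = { state: :open, opened_at: 900, reserved_at: 990, cool_down: 60, current_time: 1_000 }
probing = SketchStatus.new(**base, lock: nil)
locked = SketchStatus.new(**base, lock: :closed)

puts probing.half_open? # true
puts probing.reserved?  # true
puts probing.can_run?   # false
puts locked.can_run?    # true
```

Because `current_time` is a stored field rather than a fresh clock read, a status cannot report `half_open?` at one instant and `reserved?` at another.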

Test plan

  • bundle exec rubocop clean
  • bundle exec rspec green (full suite against redis 4/5 via CI)
  • New spec/circuit_spec.rb coverage for the half-open reservation path (single executor, skip event for losers)
  • New spec/storage/redis_spec.rb concurrency example asserts exactly one reserver wins under pool contention
  • spec/status_spec.rb coverage for snapshotted current_time and locked_closed? override

@justinhoward justinhoward force-pushed the reserve-half-open-tests branch 2 times, most recently from 0e4f002 to 5acd1f7 Compare March 14, 2025 18:14
@justinhoward justinhoward force-pushed the reserve-half-open-tests branch from 5acd1f7 to bcdb3f4 Compare May 13, 2026 02:47
@justinhoward justinhoward force-pushed the reserve-half-open-tests branch 2 times, most recently from 676b6cf to d964178 Compare May 13, 2026 15:37
@justinhoward justinhoward changed the base branch from master to modernize May 13, 2026 15:41
@justinhoward justinhoward marked this pull request as ready for review May 13, 2026 15:46
@justinhoward justinhoward force-pushed the reserve-half-open-tests branch from d964178 to 1ef2c64 Compare May 13, 2026 18:20
When a circuit becomes half-open, only one process (or thread) should
run the protected block as a test execution. Other processes that
observe the half-open state while a run is in flight now skip with
`OpenCircuitError` and emit a `:circuit_skipped` event, matching the
existing semantics for an open circuit.

Exclusivity is provided by a new compare-and-set storage method,
`Storage::Interface#reserve`, alongside `reserved_at` and
`current_time` fields on `Status`. The reservation expires after
`cool_down` so a crashed reserver auto-recovers; the worst-case
behaviour is one extra test run per cool-down window when a protected
block runs longer than `cool_down`.

Custom storage backends must implement `#reserve`. `Memory` uses
`Concurrent::Atom#compare_and_set`; `Redis` uses WATCH/MULTI/EXEC on a
new `circuit:<name>:reserved_at` key; `Null` always returns true;
`FaultTolerantProxy` falls open on backend error.

Also includes:

* `Status#can_run?` treats `locked_closed?` as an unconditional
  override above the reservation check, so a manually locked-closed
  circuit runs even with a stale reservation still in effect.
* `Status` snapshots `current_time` at construction so all predicates
  reason about the same instant.
* `Storage::Redis#close` and `#reset` clear the `reserved_at` key.
* `Storage::Redis#reopen` and `#reserve` use safe navigation when
  serializing `previous_*_at`, fixing the first CAS against a missing
  key.
* `docker-compose.yml` for local infra (mysql, redis, opensearch) and
  matching `MYSQL_HOST`/`MYSQL_USER` defaults in the spec helper.

Bumps version to 0.13.0.
@justinhoward justinhoward force-pushed the reserve-half-open-tests branch from 1ef2c64 to 57d1c48 Compare May 13, 2026 19:15
Base automatically changed from modernize to master May 13, 2026 20:16

@akrend-psq akrend-psq left a comment

Looks good. I'm excited to see this play out in prod.

@justinhoward justinhoward merged commit 41d4a33 into master May 14, 2026
15 checks passed
@justinhoward justinhoward deleted the reserve-half-open-tests branch May 14, 2026 14:01