Skip to content

test(api): fix flaky ListAll cancellation tests via happens-before sync#45

Open
jklaassenjc wants to merge 1 commit into
mainfrom
juergen/kla-438-flaky-cancellation-test
Open

test(api): fix flaky ListAll cancellation tests via happens-before sync#45
jklaassenjc wants to merge 1 commit into
mainfrom
juergen/kla-438-flaky-cancellation-test

Conversation

@jklaassenjc

@jklaassenjc jklaassenjc commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Replace busy-spin + immediate cancel() in TestV1Client_ListAll_ContextCancellation and the V2 sibling with a strict happens-before sync.
  • The test server now blocks the second-page response on a cancelled channel that the test goroutine closes only after cancel() has fired — so the HTTP client is guaranteed to observe the cancellation before the second response lands.

Why

PR #41's release run failed at make test with:

--- FAIL: TestV1Client_ListAll_ContextCancellation (0.00s)
    v1_test.go:244: expected context cancellation error, got nil

A gh run rerun --failed recovered immediately, so the flake is real but rare — and the kind of flake that erodes confidence in CI ("just rerun" becomes a reflex). Same pattern existed in the V2 variant, fixed alongside per the ticket's acceptance criteria.

Test plan

  • go test ./internal/api/ -race -count=100 -run "TestV(1|2)Client_ListAll_ContextCancellation" — 100/100 pass on both
  • go test ./internal/api/ -race -count=1 — full package clean
  • go vet ./... clean

Refs KLA-438

🤖 Generated with Claude Code


Note

Low Risk
Test-only synchronization changes in v1_test.go and v2_test.go; no runtime API client logic is modified.

Overview
Fixes flaky ListAll context-cancellation tests for V1 and V2 by replacing a busy-spin on request count with channel-based happens-before ordering.

After the first page completes, the test goroutine waits on firstReqServed, calls cancel(), then closes cancelled. The mock server blocks responses for page 2+ until cancelled is closed, so pagination cannot finish successfully before cancellation is visible to the HTTP client.

Comments reference KLA-438 and document why the old pattern could pass on slow CI. No production ListAll behavior changes—test-only.

Reviewed by Cursor Bugbot for commit 5a70743. Bugbot is set up for automated code reviews on this repo. Configure here.

Pre-fix both TestV1Client_ListAll_ContextCancellation and its V2 sibling
used a busy-spin on requestCount + immediate cancel() to try to cancel
"after the first page." On a slow CI runner the second page's HTTP
request could be issued and complete before the cancel propagated, and
ListAll returned a normal result instead of context.Canceled. The PR #41
release run hit exactly this — a "just rerun" flake that recovered on
retry.

Replace the spin with a strict happens-before sync: the test server
blocks the second-page response on a `cancelled` channel that the test
goroutine closes only after cancel() has fired. The cancellation is
therefore guaranteed to be observed by the HTTP layer before the second
response lands, regardless of runner speed.

Validated with go test -race -count=100 on both tests — 100/100 pass.

Refs KLA-438

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

3 participants