test(flow-08): tighten buy-side correctness assertions #466
Smoke evidence: assertion fires exactly as designed
Ran …
The default …
Comparison vs main (same .env, same host)
Cascade after [6] (expected, derivative, not new findings)
Once [6] fails, the next four checks in flow-08 inevitably fail because they all assume a properly-funded buyer: …
All four are noise around the same root cause. Optionally we could short-circuit after [6] to keep the artifact cleaner, but I'd leave them as-is: they're useful diagnostic context if [6] passes someday and one of them still fails.
Other smoke notes
Suggested merge order
Follow-up: same 503 symptom on flow-08, root cause is Anvil staleness
Reproduced the flow-11 step [43] …
Root cause
Facilitator ( … The facilitator's …
Fix (verified)
Restarted Anvil + facilitator with a fresh fork (new base block …
The 3 remaining FAILs (steps 8/11/16) are stale-state artifacts from the first run's auths still being in the sidecar pool, not flow correctness issues.
Suggested follow-up (separate PR, not this one)
flow-10 currently reuses a running Anvil if port 8545 is bound. The reuse path is what put us in the pruned-state window. Options:
Also worth a tickbox: …
… review
Three correctness gaps surfaced by an audit against the named payment
invariants in references/live-obol-qa.md and references/paid-commerce.md:
1. Buyer-wallet invariant. flow-08 previously funded whatever wallet the
default obol-agent happened to generate at stack init — the exact
"do not fund a generated signer" anti-pattern named in the live-OBOL
QA reference. Now derives the deterministic Bob address from
.env REMOTE_SIGNER_PRIVATE_KEY (the canonical keccak-of-abi-encode
pattern used by flow-11/13/14) and asserts AGENT_WALLET == BOB_WALLET
before funding. The flow header documents the upstream pre-seed
requirement.
2. Exact balance deltas. Replaces "seller balance increased" + missing
buyer-side check with strict pre/post deltas on both sides:
post_seller - pre_seller == PAID_AMOUNT AND
pre_buyer - post_buyer == PAID_AMOUNT. Also removes a swallowed-
failure else-branch that emitted `pass` when the seller balance had
neither increased nor stayed equal (i.e. decrease was reported as
pass).
3. Decouple paid-inference correctness from model wording. The pre-
existing assertion required the model to return the verbatim string
"USDC payment smoke test passed." Replaced with a structural check:
HTTP 200 + non-empty TEXT. The verbatim match is kept as a separate
informational `pass` line. Aligns with paid-commerce.md ("do not
rely on agent wording").
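Items 2 and 3 above can be sketched roughly as follows. This is a minimal illustration, not the code in flow-08: the function names `assert_exact_deltas` and `check_paid_response` are hypothetical, the `pass`/`fail` helpers are simple stand-ins for the flow's reporting helpers, and all balances are assumed to be integers in token base units.

```shell
#!/usr/bin/env bash
# Stand-ins for the flow's reporting helpers (assumed, not the real ones).
pass() { echo "PASS: $*"; }
fail() { echo "FAIL: $*"; }

# Item 2: both deltas must equal the paid amount exactly.
# There is no catch-all branch, so a seller decrease is reported as a failure.
assert_exact_deltas() {
  local pre_seller=$1 post_seller=$2 pre_buyer=$3 post_buyer=$4 paid=$5
  local seller_delta=$(( post_seller - pre_seller ))
  local buyer_delta=$(( pre_buyer - post_buyer ))
  local rc=0
  if [ "$seller_delta" -eq "$paid" ]; then
    pass "seller delta == $paid"
  else
    fail "seller delta $seller_delta != $paid"; rc=1
  fi
  if [ "$buyer_delta" -eq "$paid" ]; then
    pass "buyer delta == $paid"
  else
    fail "buyer delta $buyer_delta != $paid"; rc=1
  fi
  return "$rc"
}

# Item 3: correctness is HTTP 200 plus non-empty text.
# The verbatim match is informational only and never fails the step.
check_paid_response() {
  local http_code=$1 text=$2 expected=$3
  if [ "$http_code" = "200" ] && [ -n "$text" ]; then
    pass "paid inference returned HTTP 200 with non-empty text"
  else
    fail "paid inference: http=$http_code text_len=${#text}"
    return 1
  fi
  if [ "$text" = "$expected" ]; then
    pass "model returned the verbatim expected string"
  fi
  return 0
}
```

Keeping the verbatim comparison outside the failure path means a model that paraphrases the prompt still yields a passing payment check.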
Secondary correctness tightenings rolled in:
- Fail-fast on empty PAID_AMOUNT from the 402 body (previously silent;
only surfaced much later at the settlement-receipt step).
- Master-key read failure now `emit_metrics; exit 1` instead of
continuing with an empty bearer token.
- x402-buyer auth-pool assertion now requires the exact expected count
(EXPECTED_AUTHS, derived from BUY_BUDGET_USDC / per-request price)
instead of the loose `remaining=[1-9]` (single-digit) pattern.
- New post-call step asserts remaining decremented by exactly 1 — the
spend-proof half of the sidecar contract.
- Anvil funding poll regex broadened from exact `^1000000000 ` to
`^[1-9][0-9]{8,} ` so a re-run with pre-existing balance doesn't
fail the poll.
The unused BUY_AUTH_COUNT=5 declaration is removed; the same value is
now derived and asserted via EXPECTED_AUTHS.
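A rough sketch of the derivation and the two pattern checks from the list above. It assumes the budget and the per-request price are integers in the same base units, and that the sidecar reports a `remaining=N` token and the balance poll returns a line starting with the integer balance followed by a space. The function names and sample lines are hypothetical.

```shell
#!/usr/bin/env bash
# Derive the expected authorization count from budget / price.
# Assumption: both values are integers in the same base units.
derive_expected_auths() {
  local budget=$1 price=$2
  local int_re='^(0|[1-9][0-9]*)$' pos_re='^[1-9][0-9]*$'
  if ! [[ "$budget" =~ $int_re && "$price" =~ $pos_re ]]; then
    echo "invalid budget or price" >&2
    return 1
  fi
  echo $(( budget / price ))   # integer division, truncates any remainder
}

# Exact-count match: remaining=5 must not match remaining=50.
has_exact_remaining() {
  local line=$1 expected=$2
  local re="remaining=${expected}([^0-9]|\$)"
  [[ "$line" =~ $re ]]
}

# Broadened funding check: any balance of nine or more digits with no
# leading zero, followed by a space.
balance_is_funded() {
  local re='^[1-9][0-9]{8,} '
  [[ "$1" =~ $re ]]
}
```

Note that `^[1-9][0-9]{8,} ` accepts any balance of at least 100000000 base units, so a re-run that starts above the original 1000000000 still satisfies the poll.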
842473e to
e35d872
Summary
A specialist audit of `flows/flow-08-buy.sh` against the named payment invariants in `.claude/skills/obol-stack-dev/references/live-obol-qa.md` and `references/paid-commerce.md` surfaced three correctness gaps. This PR addresses all three plus secondary precision issues from the same review.

1. Buyer-wallet invariant: `L211–217` (was)

flow-08 previously funded whatever wallet `obol agent wallet list obol-agent` returned. If `obol stack up` generated a random agent wallet, flow-08 happily funded that with `anvil_setStorageAt` and the test passed: exactly the "do not fund a generated signer to make the test pass" anti-pattern named in `live-obol-qa.md`.

Now derives the canonical Bob address from `.env REMOTE_SIGNER_PRIVATE_KEY` using the keccak-of-abi-encode pattern that `flow-11-dual-stack.sh` already uses (line 794), and asserts `AGENT_WALLET == BOB_WALLET` before funding.

2. Exact balance deltas: `L300–330` (was)

Old assertion was "seller balance increased" (post > pre), no buyer-side check, and a swallowed-failure else-branch at L322–323 that emitted `pass` even when seller balance decreased. Now both sides are checked strictly:

- `post_seller - pre_seller == PAID_AMOUNT`
- `pre_buyer - post_buyer == PAID_AMOUNT`

with no catch-all `pass`. Adds a `PRE_BUYER_BAL` capture next to the existing `PRE_SELLER_BAL`.

3. Decouple paid-inference correctness from verbatim model wording: `L274–281` (was)

Old check required the model to return the literal string `"USDC payment smoke test passed."`. Payment correctness should not depend on the model's instruction-following (paid-commerce.md: "Do not rely on agent wording"). Replaced with a structural assertion (HTTP 200 + non-empty `TEXT`). The verbatim match is preserved as a separate informational `pass` line.

Secondary fixes rolled in
- Fail-fast on empty `PAID_AMOUNT` parse (was silent).
- `LITELLM_MASTER_KEY` empty now `emit_metrics; exit 1` instead of continuing with empty bearer token.
- x402-buyer auth-pool assertion now requires `remaining=$EXPECTED_AUTHS` instead of loose `remaining=[1-9]`.
- New step asserts `remaining` decremented by exactly 1 after the paid call.
- Anvil funding poll regex broadened from `^1000000000` to `^[1-9][0-9]{8,}` so a re-run with pre-existing balance doesn't fail.
- `BUY_AUTH_COUNT=5` removed; now derived as `EXPECTED_AUTHS` and actually asserted.

Test plan

- `bash -n flows/flow-08-buy.sh`: syntax clean
- … (`flows/release-smoke.sh`) on spark1: currently in flight against main as `tmux qa-release-20260511-193603`. Will re-run against this branch and report.
- … `obol-agent` was created with `--private-key-file <bob-derived>`: confirm flow-08 passes end-to-end with exact deltas and the new sidecar-decrement step.
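For reference while re-running, the sidecar-decrement step mentioned in the test plan could look roughly like this. It is a sketch under assumptions: the `remaining=N` token format is taken from the PR description, while the function names and the shape of the status lines are hypothetical.

```shell
#!/usr/bin/env bash
# Extract N from the first "remaining=N" token in a status line.
parse_remaining() {
  local re='remaining=([0-9]+)'
  [[ "$1" =~ $re ]] || return 1
  echo "${BASH_REMATCH[1]}"
}

# Spend proof: the paid call must consume exactly one authorization.
# Arguments: status line captured before the call, status line after it.
assert_decremented_by_one() {
  local pre post
  pre=$(parse_remaining "$1") || { echo "FAIL: no remaining= in pre-call status"; return 1; }
  post=$(parse_remaining "$2") || { echo "FAIL: no remaining= in post-call status"; return 1; }
  # 10# forces base 10 so a value with leading zeros is not read as octal.
  if [ $(( 10#$pre - 10#$post )) -eq 1 ]; then
    echo "PASS: remaining went $pre -> $post"
  else
    echo "FAIL: remaining went $pre -> $post (expected a decrement of exactly 1)"
    return 1
  fi
}
```

A decrement of 0 suggests the call was served without spending an authorization, and a decrement of 2 or more suggests a double spend or retry, so both are treated as failures.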