items/async: fail fast on Redis outage (closes #647)#653
Conversation
`POST /api/v1/items/async` used to return 202 even when Redis was
unreachable. ioredis's default `enableOfflineQueue: true` buffers
commands client-side, so `queue.addBulk` resolved cleanly while Redis
was down and the DataLoader-batched write reported success. The
submission survived as long as the API process stayed up and Redis
came back — but the buffer is in-process, so a simultaneous Redis
outage + API restart silently dropped every submission that had been
acknowledged with 202.
Fix:
- Add a dedicated `IORedisEnqueueNoBuffer` client built with
`enableOfflineQueue: false`. Same Redis as the shared `IORedis`,
same connection wiring extracted into a `makeRedis(extraOptions)`
helper to avoid copy-pasting.
- Wire `itemSubmissionQueueBulkWrite` to the new client. The retry/
DLQ path stays on the shared `IORedis` — it's worker-internal and
buffering is fine there.
- Tighten the handler in `submitItems.ts`: on `bulkWriteResponse.error`
short-circuit with 5xx and `return`. Previously it sent 500 *and*
then fell through to a second `res.status(202).end()` call. Express
swallowed the second one as "headers already sent" so the client got
500 in practice, but the code claimed both intents.
- Register the new client in `closeSharedResourcesForShutdown` so its
connection is closed on shutdown.
With this change, `queue.addBulk` rejects immediately when Redis is
unreachable ("Stream isn't writeable and enableOfflineQueue options is
false") — verified locally against a stopped Redis container — and the
existing 5xx branch fires. The shared `IORedis` keeps its default
buffer for the BullMQ worker side, which legitimately needs to ride
through transient reconnects.
Follow-ups:
- Phase 1 of the outage integ suite (#646) currently pins the
pre-fix lossy contract for items/async. Once both #646 and this
PR land, the Redis scenarios in `outage.integ.test.ts` flip from
"lands after recovery" to "5xx + no Scylla row". Filed as part of
#646's review.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 34 minutes and 23 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Closes #647.
Summary
POST /api/v1/items/asyncused to return 202 even when Redis was unreachable. ioredis's defaultenableOfflineQueue: truebuffers commands client-side, soqueue.addBulkresolved cleanly while Redis was down and the DataLoader-batched write reported success. The submission survived as long as the API process stayed up and Redis came back — but the buffer is in-process, so a simultaneous Redis outage + API restart silently dropped every submission acknowledged with 202.Fix
Three small changes:
server/iocContainer/index.ts— extract amakeRedis(extraOptions)helper so the existing IORedis connection wiring isn't copy-pasted, then add a second factoryIORedisEnqueueNoBufferthat builds the same client withenableOfflineQueue: false. WireitemSubmissionQueueBulkWriteto the new client. The sharedIORediskeeps its default buffer for BullMQ workers (which legitimately need to ride through transient reconnects). Register the new client incloseSharedResourcesForShutdown.server/routes/items/submitItems.ts— tighten the response flow. OnbulkWriteResponse.error, short-circuit with 5xx andreturn. Before this fix the handler sent 500 and then fell through to a secondres.status(202).end()call. Express swallowed the second one as "headers already sent" so clients got 500 in practice, but the code claimed both intents.Verification
Verified locally with a standalone smoke test against the dev
docker composestack:The DLQ path (
itemSubmissionRetryQueueBulkWrite) stays on the sharedIORedis— it's worker-internal and buffering through brief reconnects is fine there.What changed in observable behaviour
addBulkcall): the new client rejects rather than buffering. Adopters who relied on the implicit buffering to ride through ms-scale blips will see more 5xx, but the trade is "honest 5xx" vs "false 202 + possible loss on restart." This is the intended contract per items/async returns 202 while Redis is unreachable (narrow data-loss on API restart) #647.Follow-ups
outage.integ.test.tsflip from "lands after recovery" to "5xx + no Scylla row." I'll do that follow-up in a rebase of Add outage integration test: HMA + Redis (phase 1 of #343) #646 after this merges.Test plan
enableOfflineQueue: falseclient rejects when Redis is stopped and recovers cleanly when restarteditems-submission.integ.test.ts(the happy-path integ test) still passes — confirms the new wiring doesn't regress submissions when Redis is healthyIORedisis the right call (rationale: worker-internal, transient reconnects acceptable)🤖 Generated with Claude Code