fix(lmi): bound Redis I/O and fail open in GlobalRateLimiter [CLD-320]#460
Open
jabra wants to merge 5 commits into
Open
fix(lmi): bound Redis I/O and fail open in GlobalRateLimiter [CLD-320]#460jabra wants to merge 5 commits into
jabra wants to merge 5 commits into
Conversation
- Pass stream_timeout/connect_timeout to RedisStorage so a stalled idle ElastiCache connection can no longer block try_acquire forever; coredis retries twice on a fresh pooled connection before surfacing the error. - Catch RedisError/StorageError/OSError in test()/hit(): fail open for the call and permanently degrade the instance to MemoryStorage after RATE_LIMITER_REDIS_MAX_FAILURES consecutive failures, with a loud warning. test()/hit() returning False (limit full) still waits, unchanged. - outbound_ip() uses trust_env=True so HTTP(S)_PROXY is honored in proxy-only egress environments.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens GlobalRateLimiter against Redis connection stalls by enforcing per-operation socket timeouts and introducing a fail-open + degradation path to in-memory limiting when Redis becomes unreachable, preventing PaperQA trajectories from freezing indefinitely.
Changes:
- Add Redis operation/connect timeouts when constructing
RedisStorageso stalled I/O cannot block forever. - Fail open on Redis/storage/socket errors and degrade to
MemoryStorageafter a configurable number of consecutive failures, with optional timed recovery probing. - Ensure outbound IP detection honors
HTTP(S)_PROXYby usingaiohttp.ClientSession(trust_env=True).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/lmi/src/lmi/rate_limiter.py | Adds Redis I/O timeouts, fail-open behavior with degradation/recovery logic, and proxy-aware outbound IP lookup. |
| packages/lmi/tests/test_rate_limiter.py | Adds coverage for fail-open behavior, degradation to in-memory, recovery probing, and RedisStorage timeout configuration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
71f51fe to
0dba96f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
PaperQA agent trajectories intermittently freeze forever.
GlobalRateLimiter.try_acquire()gates every LLM/embedding call on a Redistest()/hit()(vialimitsovercoredis). The coredis client was created with no socket timeout (stream_timeout=None), so when an idle Redis (AWS ElastiCache) connection is silently dropped, theawaitblocks indefinitely. Theacquire_timeoutonly bounds theasyncio.sleeppolling loop, not the Redis I/O, so it never fires. Confirmed via thread dumps (event loop parked),/proc/net/tcp(idle ESTABLISHED :6379 sockets), and ElastiCache metrics (Redis healthy/idle).Fix (all in
packages/lmi/src/lmi/rate_limiter.py)stream_timeout/connect_timeout(RATE_LIMITER_REDIS_OP_TIMEOUT, default 5s) toRedisStorage->coredis.Redis.from_url. coredis's default retry policy retries twice on a fresh pooled connection, so a stalled op surfaces as an error instead of hanging.RedisError/StorageError/OSError(retries exhausted = Redis unreachable),try_acquirelets the request through instead of propagating. AfterRATE_LIMITER_REDIS_MAX_FAILURES(default 3) consecutive failures the instance permanently degrades toMemoryStorage(per-process limiting) with a loudlogger.warning.test()/hit()returningFalse(limit full) still waits — unchanged.outbound_ip()usesaiohttp.ClientSession(trust_env=True)soHTTP(S)_PROXYis honored.Deliberate fail-open: the rate limiter is a guardrail, not a correctness requirement — freezing a multi-hour trajectory is worse than one un-coordinated call. Logged loudly.
Tests
Added to
packages/lmi/tests/test_rate_limiter.py: fail-open on timeout/connection error (no hang, no raise), degrade to MemoryStorage after N failures, counter reset on success, wait-loop unchanged onhit()==False, andRedisStorageconstructed with the timeouts.