Skip to content

fix(lmi): bound Redis I/O and fail open in GlobalRateLimiter [CLD-320]#460

Open
jabra wants to merge 5 commits into
fix/cifrom
fix/rate-limiter-redis-timeout
Open

fix(lmi): bound Redis I/O and fail open in GlobalRateLimiter [CLD-320]#460
jabra wants to merge 5 commits into
fix/cifrom
fix/rate-limiter-redis-timeout

Conversation

@jabra

@jabra jabra commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Problem

PaperQA agent trajectories intermittently freeze forever. GlobalRateLimiter.try_acquire() gates every LLM/embedding call on a Redis test()/hit() (via limits over coredis). The coredis client was created with no socket timeout (stream_timeout=None), so when an idle Redis (AWS ElastiCache) connection is silently dropped, the await blocks indefinitely. The acquire_timeout only bounds the asyncio.sleep polling loop, not the Redis I/O, so it never fires. Confirmed via thread dumps (event loop parked), /proc/net/tcp (idle ESTABLISHED :6379 sockets), and ElastiCache metrics (Redis healthy/idle).

Fix (all in packages/lmi/src/lmi/rate_limiter.py)

  • Bound Redis I/O: pass stream_timeout/connect_timeout (RATE_LIMITER_REDIS_OP_TIMEOUT, default 5s) to RedisStorage -> coredis.Redis.from_url. coredis's default retry policy retries twice on a fresh pooled connection, so a stalled op surfaces as an error instead of hanging.
  • Fail open: on RedisError/StorageError/OSError (retries exhausted = Redis unreachable), try_acquire lets the request through instead of propagating. After RATE_LIMITER_REDIS_MAX_FAILURES (default 3) consecutive failures the instance permanently degrades to MemoryStorage (per-process limiting) with a loud logger.warning. test()/hit() returning False (limit full) still waits — unchanged.
  • Proxy egress: outbound_ip() uses aiohttp.ClientSession(trust_env=True) so HTTP(S)_PROXY is honored.

Deliberate fail-open: the rate limiter is a guardrail, not a correctness requirement — freezing a multi-hour trajectory is worse than one un-coordinated call. Logged loudly.

Tests

Added to packages/lmi/tests/test_rate_limiter.py: fail-open on timeout/connection error (no hang, no raise), degrade to MemoryStorage after N failures, counter reset on success, wait-loop unchanged on hit()==False, and RedisStorage constructed with the timeouts.

jabra and others added 3 commits June 17, 2026 01:25
- Pass stream_timeout/connect_timeout to RedisStorage so a stalled idle
  ElastiCache connection can no longer block try_acquire forever; coredis
  retries twice on a fresh pooled connection before surfacing the error.
- Catch RedisError/StorageError/OSError in test()/hit(): fail open for the
  call and permanently degrade the instance to MemoryStorage after
  RATE_LIMITER_REDIS_MAX_FAILURES consecutive failures, with a loud warning.
  test()/hit() returning False (limit full) still waits, unchanged.
- outbound_ip() uses trust_env=True so HTTP(S)_PROXY is honored in
  proxy-only egress environments.
@maykcaldas maykcaldas marked this pull request as ready for review June 23, 2026 19:49
Copilot AI review requested due to automatic review settings June 23, 2026 19:49

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens GlobalRateLimiter against Redis connection stalls by enforcing per-operation socket timeouts and introducing a fail-open + degradation path to in-memory limiting when Redis becomes unreachable, preventing PaperQA trajectories from freezing indefinitely.

Changes:

  • Add Redis operation/connect timeouts when constructing RedisStorage so stalled I/O cannot block forever.
  • Fail open on Redis/storage/socket errors and degrade to MemoryStorage after a configurable number of consecutive failures, with optional timed recovery probing.
  • Ensure outbound IP detection honors HTTP(S)_PROXY by using aiohttp.ClientSession(trust_env=True).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
packages/lmi/src/lmi/rate_limiter.py Adds Redis I/O timeouts, fail-open behavior with degradation/recovery logic, and proxy-aware outbound IP lookup.
packages/lmi/tests/test_rate_limiter.py Adds coverage for fail-open behavior, degradation to in-memory, recovery probing, and RedisStorage timeout configuration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/lmi/src/lmi/rate_limiter.py Outdated
Comment thread packages/lmi/tests/test_rate_limiter.py
@maykcaldas maykcaldas force-pushed the fix/rate-limiter-redis-timeout branch from 71f51fe to 0dba96f Compare June 23, 2026 20:51
@maykcaldas maykcaldas changed the base branch from main to fix/ci June 23, 2026 21:42
@maykcaldas maykcaldas self-assigned this Jun 23, 2026
@maykcaldas maykcaldas changed the title fix(lmi): bound Redis I/O and fail open in GlobalRateLimiter fix(lmi): bound Redis I/O and fail open in GlobalRateLimiter [CLD-320] Jun 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants