Skip to content

feat: load-adaptive per-repo rate limiting for shared public bots #1423

Open
thomasrockhu-codecov wants to merge 2 commits into main from th/public-bot-rate-limiting

@thomasrockhu-codecov

Contributor

Summary

Stops any single public repo from monopolizing the shared dedicated-app tokens (commit_dedicated_app / pull_dedicated_app) while still letting repos burst freely when the pool is idle. Enforcement is per-request, keyed per (shared bot, repo).

  • Load-adaptive sliding cap: each shared bot's live GitHub utilization U = 1 - remaining/limit (read from X-RateLimit-* headers we already log) drives a per-repo cap. Quiet pool (U <= util_low) → generous max_share; saturated pool (U >= util_high) → small guaranteed_share; linear interpolation between. A repo whose rolling usage in the trailing window exceeds its cap is over-limit.
  • Silent drop: when enforcing and over-cap, make_http_call returns a synthetic 204 (parsed to None, pagination stops) — no TorngitRateLimitError raised. Callers get an empty result as if GitHub returned nothing.
  • Observe-first: ships inert (enabled=false); with enabled=true, enforce=false it emits metrics on every over-cap request but still sends it, so thresholds can be validated against live traffic before any drops.
  • Only shared public bots affected: installation tokens and owner PATs are untouched (matched on token_to_use["entity_name"] against the configured bots list).
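
The sliding cap described above can be sketched as follows. This is a minimal illustration, not the PR's code: `sliding_repo_share` and `is_over_cap` are hypothetical stand-ins (the real `sliding_repo_cap()` signature is not shown here), the defaults are copied from the sample configuration, and converting a share into a request count by multiplying by a budget is an assumption.

```python
def sliding_repo_share(
    utilization: float,
    guaranteed_share: float = 0.02,
    max_share: float = 0.20,
    util_low: float = 0.5,
    util_high: float = 0.9,
) -> float:
    """Return the fraction of the pool one repo may use at this utilization."""
    if utilization <= util_low:
        return max_share  # quiet pool: generous burst share
    if utilization >= util_high:
        return guaranteed_share  # saturated pool: small guaranteed share
    # Linear interpolation between the two shares.
    t = (utilization - util_low) / (util_high - util_low)
    return max_share + t * (guaranteed_share - max_share)


def is_over_cap(repo_usage: int, utilization: float, budget: int = 15000) -> bool:
    """Assumed rule: over-limit when rolling usage exceeds share * budget."""
    return repo_usage > sliding_repo_share(utilization) * budget
```

With these defaults, a repo at 500 requests in the window is under the cap while the pool is quiet (U = 0.3) but over it once the pool is saturated (U = 0.95).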

What's added

  • New shared/rate_limits/public_bot.py: config loader, sliding_repo_cap(), minute-bucket per-repo usage counters, pool-utilization read/write, and a fail-open evaluate_public_bot_request() decision helper (any Redis error → not limited).
  • Enforcement hook + pool-utilization recording in torngit/github.py make_http_call.
  • Prometheus instruments: git_provider_public_bot_over_cap{bot,repo_slug,mode}, git_provider_public_bot_requests{bot}, git_provider_public_bot_utilization{bot}.
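
To make the "minute-bucket per-repo usage counters" idea concrete, here is a small in-memory sketch. It is an assumption about the general shape only: the PR stores counters in Redis, and the class name, key layout, and methods below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Tuple


class MinuteBucketCounter:
    """In-memory stand-in for minute-bucket per-(bot, repo) counters."""

    def __init__(self, window_seconds: int = 3600) -> None:
        self.window_seconds = window_seconds
        # (bot, repo, minute index) -> number of requests in that minute
        self._buckets: Dict[Tuple[str, str, int], int] = defaultdict(int)

    def record(self, bot: str, repo: str, now: float) -> None:
        """Count one request in the bucket for the current minute."""
        self._buckets[(bot, repo, int(now // 60))] += 1

    def usage(self, bot: str, repo: str, now: float) -> int:
        """Sum the buckets that fall inside the trailing window."""
        current = int(now // 60)
        n_buckets = self.window_seconds // 60
        return sum(
            self._buckets.get((bot, repo, minute), 0)
            for minute in range(current - n_buckets + 1, current + 1)
        )
```

Bucketing by minute keeps the number of keys per repo bounded by the window length (60 buckets for a one-hour window), at the cost of minute-level rather than exact sliding-window precision.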

Configuration

github:
  public_bot_rate_limit:
    enabled: true
    enforce: false        # observe/metrics-only first; flip to true to drop
    bots: [commit, pull]
    guaranteed_share: 0.02
    max_share: 0.20
    util_low: 0.5
    util_high: 0.9
    window_seconds: 3600
    budget_fallback: 15000
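
A rough sketch of how a loader with defaults and overrides might treat this block. The dataclass and `load_config` are hypothetical names; only `enabled=false` is stated as a shipped default in this PR, so the other defaults below are simply copied from the sample above as an assumption.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional, Tuple


@dataclass(frozen=True)
class PublicBotRateLimitConfig:
    enabled: bool = False  # the feature ships inert
    enforce: bool = False  # observe-only until flipped
    bots: Tuple[str, ...] = ("commit", "pull")  # assumed default
    guaranteed_share: float = 0.02  # assumed default
    max_share: float = 0.20  # assumed default
    util_low: float = 0.5  # assumed default
    util_high: float = 0.9  # assumed default
    window_seconds: int = 3600  # assumed default
    budget_fallback: int = 15000  # assumed default


def load_config(raw: Optional[Mapping[str, Any]]) -> PublicBotRateLimitConfig:
    """Overlay known keys from the YAML mapping onto the defaults."""
    raw = raw or {}
    known = {f.name for f in fields(PublicBotRateLimitConfig)}
    overrides = {k: v for k, v in raw.items() if k in known}
    if "bots" in overrides:
        overrides["bots"] = tuple(overrides["bots"])
    return PublicBotRateLimitConfig(**overrides)
```

Passing `None` or an empty mapping yields the inert configuration; unknown keys are ignored rather than rejected in this sketch.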

Test plan

  • make test.path TEST_PATH="tests/unit/rate_limits/test_public_bot.py tests/unit/torngit/test_github_public_bot.py" from libs/shared
  • Unit tests cover sliding_repo_cap boundaries/interpolation, config loader defaults+overrides, Redis read/write fail-open paths, evaluate_public_bot_request under/over cap, and make_http_call observe (sends, mode=observe) vs enforce (drops silently → None, not counted).
  • Ship with enabled: true, enforce: false; watch git_provider_public_bot_over_cap and tune thresholds; then flip enforce: true.

Rollout

  1. Deploy inert, then set enabled: true, enforce: false (metrics only).
  2. Validate over-cap signal + audit affected endpoints for a few days.
  3. Flip enforce: true to begin dropping.
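
The three rollout stages map onto a small decision table. The sketch below is illustrative only; `Action` and `decide` are hypothetical names, and the real hook in `make_http_call` may be structured differently.

```python
from enum import Enum


class Action(Enum):
    SEND = "send"  # request goes to GitHub, nothing reported
    SEND_AND_REPORT = "observe"  # over-cap metric emitted, request still sent
    DROP = "enforce"  # over-cap metric emitted, request dropped


def decide(enabled: bool, enforce: bool, over_cap: bool) -> Action:
    """Map the two config flags and the over-cap verdict to an action."""
    if not enabled or not over_cap:
        return Action.SEND
    return Action.DROP if enforce else Action.SEND_AND_REPORT
```

The enum values mirror the `mode` label values (`observe`, `enforce`) that the over-cap metric is described as carrying.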

Made with Cursor

Stop any single public repo from monopolizing the shared dedicated-app
tokens (commit_dedicated_app / pull_dedicated_app) while letting repos burst
freely when the pool is idle.

Each shared bot's live GitHub utilization (U = 1 - remaining/limit, read from
X-RateLimit-* headers) drives a per-repo cap that slides from a generous burst
share when the pool is quiet down to a small guaranteed share when the pool is
saturated. A repo whose rolling usage in the trailing window exceeds its cap is
over-limit; over-limit requests are silently dropped (synthetic 204 -> None, no
exception) and emitted to Prometheus.
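
As a worked illustration of U = 1 - remaining/limit, the sketch below reads the two standard GitHub rate-limit headers. `pool_utilization` is a hypothetical helper, and the plain-mapping lookup is case-sensitive, whereas real HTTP clients usually expose case-insensitive headers.

```python
from typing import Mapping, Optional


def pool_utilization(headers: Mapping[str, str]) -> Optional[float]:
    """Return U = 1 - remaining/limit, or None if the headers are unusable."""
    try:
        limit = int(headers["X-RateLimit-Limit"])
        remaining = int(headers["X-RateLimit-Remaining"])
    except (KeyError, ValueError):
        return None  # headers missing or not numeric
    if limit <= 0:
        return None  # avoid division by zero on a degenerate limit
    return 1 - remaining / limit
```

For example, a response with a limit of 5000 and 1250 remaining gives U = 0.75, which sits between `util_low` and `util_high` in the sample configuration.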

- New shared/rate_limits/public_bot.py: config loader, sliding_repo_cap(),
  minute-bucket per-repo usage counters, pool-utilization read/write, and the
  fail-open evaluate_public_bot_request() decision helper.
- Enforce in torngit github make_http_call for repo calls whose token
  entity_name is a configured bot; record pool utilization after each response.
- Prometheus instruments: git_provider_public_bot_over_cap{bot,repo_slug,mode},
  git_provider_public_bot_requests{bot}, git_provider_public_bot_utilization{bot}.
- Ships inert (enabled=false); observe-first (enforce=false) before dropping.

Co-authored-by: Cursor <cursoragent@cursor.com>
@codspeed-hq

codspeed-hq Bot commented Jul 4, 2026

Merging this PR will not alter performance

✅ 9 untouched benchmarks


Comparing th/public-bot-rate-limiting (a346d98) with main (ee7a2b4) [1]

Footnotes

  1. No successful run was found on main (701e991) during the generation of this report, so ee7a2b4 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@codecov-notifications

codecov-notifications Bot commented Jul 4, 2026

Codecov Report

❌ Patch coverage is 96.33508% with 7 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| libs/shared/shared/torngit/github.py | 93.57% | 4 Missing and 3 partials ⚠️ |

@codecov

codecov Bot commented Jul 4, 2026

Codecov Report

❌ Patch coverage is 96.33508% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.90%. Comparing base (701e991) to head (a346d98).
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| libs/shared/shared/torngit/github.py | 93.57% | 4 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1423      +/-   ##
==========================================
+ Coverage   91.89%   91.90%   +0.01%     
==========================================
  Files        1325     1326       +1     
  Lines       50868    51011     +143     
  Branches     1626     1637      +11     
==========================================
+ Hits        46744    46884     +140     
  Misses       3818     3818              
- Partials      306      309       +3     
| Flag | Coverage Δ |
|---|---|
| apiunit | 94.94% <ø> (ø) |
| sharedintegration | 36.95% <32.98%> (+0.06%) ⬆️ |
| sharedunit | 85.00% <96.33%> (+0.12%) ⬆️ |

…code

- Break the monolithic make_http_call into small, well-named helpers
  (header building, host header, public-bot cap, response logging, token
  refresh, rate-limit fallback, terminal-status raising) so the retry loop
  reads top-to-bottom.
- Drop unnecessary input/Redis protection in shared/rate_limits/public_bot.py:
  remove bytes-decode helpers (int/float already accept bytes), per-function
  fail-open try/except, redundant clamping, config type coercion, and an
  unreachable degenerate-config branch.
- Centralize the one needed safety (a Redis outage must not block provider
  requests) into single fail-open guards at the call sites in github.py.
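
A single fail-open guard at a call site could look roughly like this. It is a sketch under stated assumptions: `fail_open` is a hypothetical wrapper, and catching `Exception` broadly is a simplification (a real guard might catch only Redis client errors).

```python
import logging
from typing import Callable

log = logging.getLogger(__name__)


def fail_open(check: Callable[[], bool]) -> bool:
    """Run a rate-limit check; treat any backend error as 'not limited'."""
    try:
        return check()
    except Exception:
        # A counter-store outage must not block provider requests.
        log.warning("public-bot rate-limit check failed; allowing request", exc_info=True)
        return False
```

Keeping the guard at the call site means the helpers in `public_bot.py` can raise freely, and the decision to allow the request on error is made in exactly one place.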

Co-authored-by: Cursor <cursoragent@cursor.com>