Hey! We're running ~950 schedules in production and hit a problem where midnight cron jobs fire hours late. Traced it back to `claimDueSchedule()`.
The current implementation calls `SMEMBERS` to get all schedule IDs, then loops through them client-side doing one `EVAL` per ID. Each of those is a Redis round-trip:
```
SMEMBERS schedules::index      → [id1, id2, …, idN]
EVAL CLAIM_SCHEDULE_SCRIPT id1 → nil
EVAL CLAIM_SCHEDULE_SCRIPT id2 → nil
…
EVAL CLAIM_SCHEDULE_SCRIPT idK → claimed!
```
With ~950 schedules this gets slow fast — `#dispatchDueSchedules()` calls `claimDueSchedule()` in a loop until it returns `null`, so you end up with O(N) round-trips per claim × M claims.
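For reference, the shape of the current loop looks roughly like this (a sketch, not the actual code — parameter names and the ioredis-style client are assumptions):

```javascript
// Sketch of the current client-side claim pattern. Each `await` is a
// full Redis round-trip: one SMEMBERS plus up to N EVALs per claim.
async function claimDueSchedule(redis, claimScript, now) {
  const ids = await redis.smembers('schedules::index'); // round-trip 1
  for (const id of ids) {
    // round-trips 2..N+1: try to atomically claim each schedule in turn
    const claimed = await redis.eval(claimScript, 1, id, now);
    if (claimed !== null) return claimed; // first due schedule wins
  }
  return null; // nothing due right now
}
```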
We tested this by loading ~950 schedules into a local Redis instance, all set to become due at the same moment. Before starting the worker, we snapshotted every due schedule's target `next_run_at`. Then we ran the worker (concurrency 5) and waited for all schedules to be claimed. After that we read each schedule's `last_run_at` from Redis and compared it to the snapshotted target — giving us the actual drift per schedule and the total wall time from first to last claim.
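The drift numbers below come from a small helper along these lines (a sketch; timestamps assumed to be epoch milliseconds keyed by schedule ID):

```javascript
// Compare each schedule's actual last_run_at against its snapshotted
// target next_run_at and summarize the per-schedule drift.
function driftStats(targets, actuals) {
  // targets/actuals: Map of scheduleId -> timestamp (ms)
  const drifts = [...targets.keys()].map((id) => actuals.get(id) - targets.get(id));
  const sum = drifts.reduce((a, b) => a + b, 0);
  return {
    min: Math.min(...drifts),
    avg: sum / drifts.length,
    max: Math.max(...drifts),
  };
}
```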
We ran this twice: once with the current code, and once with a modified version where we replaced the per-ID client-side loop with a single Lua script that does the full `SMEMBERS` + `HGETALL` iteration server-side inside Redis — so one `EVAL` instead of N.
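A rough sketch of that script, embedded the usual way as a string constant. The key layout (`schedules::index`, `schedules::<id>` hashes with `next_run_at`/`last_run_at` fields) is inferred from the behavior described above, so treat it as an assumption about the real schema:

```javascript
// One Lua script that scans the index and claims the first due schedule
// entirely server-side — a single round-trip regardless of N.
const CLAIM_DUE_SCHEDULE_SCRIPT = `
  local now = tonumber(ARGV[1])
  local ids = redis.call('SMEMBERS', KEYS[1])
  for _, id in ipairs(ids) do
    local key = 'schedules::' .. id
    local nextRunAt = tonumber(redis.call('HGET', key, 'next_run_at'))
    if nextRunAt and nextRunAt <= now then
      -- record the claim and clear next_run_at so a concurrent worker
      -- can't re-claim before the client writes the recalculated value
      redis.call('HSET', key, 'last_run_at', now)
      redis.call('HDEL', key, 'next_run_at')
      return redis.call('HGETALL', key)
    end
  end
  return nil
`;

async function claimDueSchedule(redis, now) {
  // single EVAL; Lua scripts run atomically, so no other claim interleaves
  return redis.eval(CLAIM_DUE_SCHEDULE_SCRIPT, 1, 'schedules::index', now);
}
```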
| Metric | Before | After (server-side Lua) |
| --- | --- | --- |
| Schedules due | 927 | 979 |
| Wall time (first → last claim) | 21.8s | 3.7s |
| Drift from target (min) | 3.8s | 839ms |
| Drift from target (avg) | 12.8s | 2.5s |
| Drift from target (max) | 25.6s | 4.6s |
This is on localhost with no network latency. In production, where each round-trip to Redis has real latency, the gap would be wider still: the old code pays N round-trips per claim vs. 2 for the new one.
The method signature doesn't change at all. Cron `nextRunAt` recalculation still happens in JS (it needs `cron-parser`), so there's one `HSET` after the claim: 2 round-trips total instead of N+1.
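The post-claim flow would look something like this (a sketch; the cron recalculation is injected as `computeNextRunAt` to keep the example self-contained — real code would call `cron-parser` there, and the real `EVAL` would return a flat `HGETALL` array rather than the parsed object assumed here):

```javascript
// Two round-trips total: one EVAL to claim server-side, one HSET to
// persist the next_run_at recalculated in JS.
async function claimAndReschedule(redis, claimScript, computeNextRunAt, now) {
  // round-trip 1: atomically claim the first due schedule
  const schedule = await redis.eval(claimScript, 1, 'schedules::index', now);
  if (schedule === null) return null; // nothing due

  // recalculate client-side — only JS has the cron parser
  const nextRunAt = computeNextRunAt(schedule.cron, now);

  // round-trip 2: persist the new target
  await redis.hset(`schedules::${schedule.id}`, 'next_run_at', nextRunAt);
  return schedule;
}
```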
Happy to open a PR if you're interested.