claimDueSchedule() scales poorly with many schedules #13

@isimisi

Hey! We're running ~950 schedules in production and hit a problem where midnight cron jobs fire hours late. We traced it back to claimDueSchedule().

The current implementation calls SMEMBERS to get all schedule IDs, then loops through them client-side doing one EVAL per ID. Each of those is a Redis round-trip:

SMEMBERS schedules::index          → [id1, id2, …, idN]
EVAL CLAIM_SCHEDULE_SCRIPT id1     → nil
EVAL CLAIM_SCHEDULE_SCRIPT id2     → nil
…
EVAL CLAIM_SCHEDULE_SCRIPT idK     → claimed!

With ~950 schedules this gets slow fast — #dispatchDueSchedules() calls claimDueSchedule() in a loop until it returns null, so you end up with O(N) round-trips per claim × M claims.
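
To put a rough number on that, here is a small model of the round-trip count. It is only a sketch, not the library's code: the function name is made up, and it assumes SMEMBERS returns IDs in a stable order, every EVAL is exactly one round-trip, and a claimed schedule is no longer due on the next scan.

```javascript
// Rough model of the per-ID claim loop described above (hypothetical helper,
// not part of the library). Counts Redis round-trips until nothing is due.
function countRoundTripsPerIdLoop(totalSchedules, dueCount) {
  // Assumption: the first `dueCount` IDs in the index are the due ones.
  const due = Array.from({ length: totalSchedules }, (_, i) => i < dueCount);
  let roundTrips = 0;

  // Mirrors one claimDueSchedule() call: SMEMBERS, then one EVAL per ID
  // until one of them claims a schedule.
  function claimOne() {
    roundTrips += 1; // SMEMBERS
    for (let i = 0; i < due.length; i++) {
      roundTrips += 1; // EVAL for this ID
      if (due[i]) {
        due[i] = false; // claimed, so it is skipped on later scans
        return i;
      }
    }
    return null; // scanned every ID, nothing due
  }

  // Mirrors the dispatch loop: keep claiming until claimOne() returns null.
  while (claimOne() !== null) {}
  return roundTrips;
}

console.log(countRoundTripsPerIdLoop(950, 950));
```

Under those assumptions, 950 schedules that are all due at once cost 453,626 round-trips in total. Concurrency spreads that across workers but does not reduce it.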

We tested this by loading ~950 schedules into a local Redis instance, all set to become due at the same moment. Before starting the worker, we snapshotted every due schedule's target next_run_at. Then we ran the worker (concurrency 5) and waited for all schedules to be claimed. After that we read each schedule's last_run_at from Redis and compared it to the snapshotted target — giving us the actual drift per schedule and the total wall time from first to last claim.
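
For reference, the drift numbers can be derived roughly like this. This is a simplified sketch rather than our exact harness: the function name and the Map-based inputs are illustrative, and timestamps are assumed to be epoch milliseconds.

```javascript
// Sketch of the drift calculation (hypothetical helper).
// targets:  Map of schedule ID -> snapshotted target next_run_at (ms)
// lastRuns: Map of schedule ID -> last_run_at read back from Redis (ms)
function summarizeDrift(targets, lastRuns) {
  const drifts = [];
  let firstClaim = Infinity;
  let lastClaim = -Infinity;

  for (const [id, target] of targets) {
    const ranAt = lastRuns.get(id);
    if (ranAt === undefined) continue; // schedule was never claimed
    drifts.push(ranAt - target);
    firstClaim = Math.min(firstClaim, ranAt);
    lastClaim = Math.max(lastClaim, ranAt);
  }
  if (drifts.length === 0) return null;

  const sum = drifts.reduce((acc, d) => acc + d, 0);
  return {
    count: drifts.length,
    minMs: drifts.reduce((a, b) => Math.min(a, b)),
    avgMs: sum / drifts.length,
    maxMs: drifts.reduce((a, b) => Math.max(a, b)),
    wallMs: lastClaim - firstClaim, // first claim to last claim
  };
}
```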

We ran this twice: once with the current code, and once with a modified version where we replaced the per-ID client-side loop with a single Lua script that does the full SMEMBERS + HGETALL iteration server-side inside Redis — so one EVAL instead of N.

| Metric | Before | After (server-side Lua) |
| --- | --- | --- |
| Schedules due | 927 | 979 |
| Wall time (first → last claim) | 21.8s | 3.7s |
| Drift from target (min) | 3.8s | 839ms |
| Drift from target (avg) | 12.8s | 2.5s |
| Drift from target (max) | 25.6s | 4.6s |

This is on localhost with no network latency. In production where each round-trip to Redis has real latency, the gap would be wider since the old code does N round-trips per claim vs 2.

The method signature doesn't change at all. Cron nextRunAt recalculation still happens in JS (it needs cron-parser), so there's one HSET after the claim: 2 round-trips total instead of N+1.
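
A minimal sketch of what that could look like, to make the shape concrete. Several things here are assumptions rather than the library's actual code: the function name, the `keyPrefix` hash-key layout, clearing `next_run_at` inside the script so another worker can't re-claim before the HSET lands, the `computeNextRunAt` callback (standing in for the cron-parser step), and a node-redis v4 style client with `eval(script, { keys, arguments })` and `hSet(key, field, value)`. The script also reads hash keys that aren't listed in KEYS, which works on a single Redis instance but is not cluster-safe.

```javascript
// Sketch only: one EVAL scans the index server-side and claims the first due schedule.
const CLAIM_FIRST_DUE_SCRIPT = `
local ids = redis.call('SMEMBERS', KEYS[1])
local now = tonumber(ARGV[1])
for _, id in ipairs(ids) do
  local key = ARGV[2] .. id
  local nextRunAt = tonumber(redis.call('HGET', key, 'next_run_at'))
  if nextRunAt and nextRunAt <= now then
    redis.call('HSET', key, 'last_run_at', ARGV[1])
    -- Assumption: drop next_run_at so no other worker claims it in the meantime
    redis.call('HDEL', key, 'next_run_at')
    return { id, redis.call('HGETALL', key) }
  end
end
return false
`;

// Hypothetical replacement body: round-trip 1 is the EVAL, round-trip 2 is the HSET.
async function claimDueScheduleSketch(client, options) {
  const { indexKey, keyPrefix, computeNextRunAt, now = Date.now() } = options;

  const reply = await client.eval(CLAIM_FIRST_DUE_SCRIPT, {
    keys: [indexKey],
    arguments: [String(now), keyPrefix],
  });
  if (reply === null) return null; // nothing due

  // HGETALL inside Lua comes back as a flat [field, value, ...] array.
  const [id, flat] = reply;
  const schedule = {};
  for (let i = 0; i < flat.length; i += 2) schedule[flat[i]] = flat[i + 1];

  // Next run time is computed in JS, then written back with a single HSET.
  const nextRunAt = String(computeNextRunAt(schedule, now));
  await client.hSet(keyPrefix + id, 'next_run_at', nextRunAt);
  return { id, ...schedule, next_run_at: nextRunAt };
}
```

If the worker dies between the EVAL and the HSET, this version leaves the schedule without a next_run_at, so a real implementation would need some recovery path for that case.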

Happy to open a PR if you're interested.

Labels: bug