Pass at least one key to script executions#49
Conversation
|
We run sentinel but getting this to work sounds good to me. I will try to add a circle ci redis "cluster" environment for this here: Line 35 in 78a4fef the previous idea is to use this "hash" in either name space or queue name:
|
|
We were previously specifying the namespace using For what it's worth, we deployed this initially just specifying the hash tag as the namespace (in |
|
I need more info to produce this locally.
|
|
Hey @taylorchu, makes sense; I'll come up with a crisp set of repro steps and update here. |
This allows Redis to route our execution to the correct node when running in a Redis Cluster.
Quick fix. We use the heartbeat middlewear which allows long-running jobs to work. However, the context is also ending up canceled which is causing other issues.
|
@jpalawaga @nyergler could you check if 016e03b fixes it? |
|
backstory: it turns out go-redis 9.0.3 fixes a bug where it might take the first non key (normal args) as routing key, so this problem is not surfaced. after upgrading to 9.0.5, I finally can reproduce this problem and fix it. |
Enqueue and PromoteJob have both used ZADD XX GT since #3 to preserve the deferral guarantee for jobs enqueued with deterministic IDs: once a schedule sits at time T in the future, a duplicate enqueue at T' < T must be a no-op, and PromoteJob must not demote a job whose score has been bumped to now + InvisibleSec by Dequeue. That is the right semantic for dedup-style jobs that may race their own re-enqueue. It is the wrong semantic for two cases that have grown up around the queue since: 1. Worker retry rescheduling. When a handler returns an error the retry middleware computes a backoff delay and calls Enqueue with score = now + delay. With GT, that score is rejected because the Dequeue invisibility mark (now + InvisibleSec, typically 60s) is greater. The configured Backoff is effectively dead for any value less than InvisibleSec; gated handlers re-run only on the InvisibleSec cadence regardless of how short the backoff is. 2. Subqueue PromoteOnAck. The subqueue middleware advances the next gated job after the prior handler Acks by calling PromoteJob on that job's ID. With GT, the score (still sitting at the InvisibleSec mark from its last dequeue/gated cycle) is also rejected and the next job continues to wait out its full invisibility window. Add a per-job AllowPromotion flag. Default false preserves today's GT semantics so dedup-deferral jobs are unaffected; setting true causes Enqueue to use plain ZADD XX so backoff can lower the score, and causes PromoteJob to use ZADD XX (without GT) so the next gated subqueue entry can be advanced. The flag is read only after the job is initially enqueued. The Enqueue Lua script splits the per-job arg list into two ZADD calls (one with gt, one without) so a mixed BulkEnqueue stays atomic. PromoteJob does an HGET to read the flag before issuing the ZADD; this adds one round-trip per promotion but keeps the API stable.
In the old code, it was possible for the wrapped function call to panic, which would cause the heartbeating go routine to never stop refreshing the call. This moves the important bits into a defer statement which will definitely be ran, allowing the heartbeater to close. This is motivated by seeing a heartbeater run indefinitely, always updating the visibility of a job and never allowing it to actually be consumed or retried.
This allows Redis to route our execution to the correct node when running against a Redis Cluster.
I encountered this issue with an on premises deployment of our application, in which our customer is using Redis Cluster. They were seeing errors with cross-slot key access in a Lua script. This surprised me, since I knew that
workprovides a mechanism for namespacing keys to avoid this very problem.And then I read the following in the
EVALdocs:When I noticed that work doesn't pass any keys to the script evaluation, I surmised that this allowed Redis to route the script to any node in the cluster.
This PR is the minimal update to fix this issue, passing a single namespaced key to the script evaluation.
A more thorough, albeit slightly more invasive, fix would be to perform all key composition outside the script. That would also have the nice advantage of not requiring the Lua & Go key creation to be kept in sync.
If you're open to that, I'll happily update my fork and this PR.