Commit 5cca5f2

fix(run-engine): route both fast-path reads through replica when enabled
Per Eric: when `useReplicaForFastPathRead` is on, both the cheap probe and the full-run fetch should hit the replica - the whole point of the flag is to take fast-path load off the writer. Replica lag is acceptable because debounce is best-effort: a stale `delayUntil` falls through to the locked path, and a stale `status` at worst returns the existing run (the same outcome the caller would see if their trigger had landed a few hundred ms earlier).

Co-Authored-By: Eric Allam <eallam@icloud.com>
1 parent 5fc14ff commit 5cca5f2

2 files changed

Lines changed: 30 additions & 46 deletions
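To make the load argument in the commit message concrete, here is a minimal TypeScript sketch that counts how many fast-path reads land on each client before and after this change. `RunRow`, `CountingReader`, `fastPathReadsBefore` and `fastPathReadsAfter` are hypothetical stand-ins (the real code calls Prisma's `taskRun.findFirst` on `prisma` or `readOnlyPrisma`), and the `tx` case is ignored.

```typescript
// Hypothetical shape of the task-run fields the fast path reads.
interface RunRow {
  status: string;
  delayUntil: Date | null;
  createdAt: Date;
}

// Hypothetical reader that records how many lookups it serves.
class CountingReader {
  calls = 0;
  private readonly rows: Map<string, RunRow>;

  constructor(rows: Map<string, RunRow>) {
    this.rows = rows;
  }

  findRun(id: string): RunRow | null {
    this.calls++;
    return this.rows.get(id) ?? null;
  }
}

// Before: only the cheap probe could move to the replica; the full-run
// fetch stayed pinned to the writer.
function fastPathReadsBefore(
  writer: CountingReader,
  replica: CountingReader,
  id: string,
  useReplica: boolean
): void {
  const probeReader = useReplica ? replica : writer;
  probeReader.findRun(id); // cheap probe
  writer.findRun(id); // full-run fetch
}

// After: both reads go through the same client.
function fastPathReadsAfter(
  writer: CountingReader,
  replica: CountingReader,
  id: string,
  useReplica: boolean
): void {
  const reader = useReplica ? replica : writer;
  reader.findRun(id); // cheap probe
  reader.findRun(id); // full-run fetch
}

const rows = new Map<string, RunRow>([
  ["run_1", { status: "DELAYED", delayUntil: new Date(2_000), createdAt: new Date(0) }],
]);

const beforeWriter = new CountingReader(rows);
const beforeReplica = new CountingReader(rows);
fastPathReadsBefore(beforeWriter, beforeReplica, "run_1", true);
console.log(beforeWriter.calls, beforeReplica.calls); // 1 1

const afterWriter = new CountingReader(rows);
const afterReplica = new CountingReader(rows);
fastPathReadsAfter(afterWriter, afterReplica, "run_1", true);
console.log(afterWriter.calls, afterReplica.calls); // 0 2
```

With the flag on, the writer serves no fast-path reads after this commit; with it off, both reads still hit the writer, as before.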

internal-packages/run-engine/src/engine/systems/debounceSystem.ts

Lines changed: 23 additions & 40 deletions
@@ -63,11 +63,8 @@ export type DebounceSystemOptions = {
    */
   fastPathSkipEnabled?: boolean;
   /**
-   * When true, route the cheap probe of the unlocked fast-path through
-   * `readOnlyPrisma` (e.g. an Aurora reader) instead of the writer. The
-   * full-run read used to construct the returned `existing` result still
-   * goes through the writer, so callers never see a run whose status has
-   * already moved out of DELAYED on the writer due to replica lag.
+   * When true, route the unlocked fast-path reads (probe + full-run fetch)
+   * through `readOnlyPrisma` (e.g. an Aurora reader) instead of the writer.
    */
   useReplicaForFastPathRead?: boolean;
 };
@@ -482,13 +479,16 @@ return 0
     tx?: PrismaClientOrTransaction;
   }): Promise<DebounceResult> {
     const prisma = tx ?? this.$.prisma;
-    // The cheap probe in the fast-path skip can run on `readOnlyPrisma` when
-    // configured. Replica lag is fine because the probe is best-effort: a
-    // stale view either falls through to the locked path or is rejected by
-    // the writer-validated re-check inside `#tryFastPathSkip`. Only divert
-    // the probe when the caller isn't inside a tx (where the read needs to
-    // see the tx's writes).
-    const probeReadPrisma =
+    // Reads in the unlocked fast-path can run on `readOnlyPrisma` when
+    // configured (e.g. an Aurora reader). Replica lag is fine: debounce is
+    // best-effort and a stale read either falls through to the locked path
+    // (when delayUntil hasn't replicated yet) or returns the existing run
+    // (when the run's status is stale). The latter is the same outcome the
+    // caller would see if their trigger had simply landed a few hundred ms
+    // earlier, which is within the natural debounce race. Only divert reads
+    // when the caller isn't inside a tx (where the read needs to see the
+    // tx's writes).
+    const fastPathReadPrisma =
       tx ?? (this.useReplicaForFastPathRead ? this.$.readOnlyPrisma : this.$.prisma);
 
     // Compute the (quantized) target delayUntil up-front, before taking any lock.
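The selection rule in the `fastPathReadPrisma` assignment reduces to a small precedence function. This is a sketch with a hypothetical `ReadClient` type and `pickFastPathReader` helper standing in for the real `PrismaClientOrTransaction` / `PrismaReplicaClient` values: a caller-supplied transaction always wins, so reads see that transaction's own writes; otherwise the flag picks the replica or the writer.

```typescript
// Hypothetical stand-in for PrismaClientOrTransaction / PrismaReplicaClient.
type ReadClient = { name: string };

// Mirrors `tx ?? (useReplicaForFastPathRead ? readOnlyPrisma : prisma)`:
// a tx takes precedence; an unset flag behaves like `false`.
function pickFastPathReader(opts: {
  tx?: ReadClient;
  writer: ReadClient;
  replica: ReadClient;
  useReplicaForFastPathRead?: boolean;
}): ReadClient {
  return opts.tx ?? (opts.useReplicaForFastPathRead ? opts.replica : opts.writer);
}

const writer: ReadClient = { name: "writer" };
const replica: ReadClient = { name: "replica" };
const tx: ReadClient = { name: "tx" };

console.log(pickFastPathReader({ writer, replica }).name); // writer
console.log(pickFastPathReader({ writer, replica, useReplicaForFastPathRead: true }).name); // replica
console.log(pickFastPathReader({ tx, writer, replica, useReplicaForFastPathRead: true }).name); // tx
```

The flag is never consulted when a `tx` is passed, which is why the replica can only ever serve reads made outside a transaction.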
@@ -507,12 +507,7 @@ return 0
         existingRunId,
         newDelayUntil,
         debounce,
-        probePrisma: probeReadPrisma,
-        // The full-run read used to construct the returned `existing` result
-        // always goes through the writer, even when the cheap probe is on a
-        // replica. Otherwise replica lag could let us return a run whose
-        // status has already moved out of DELAYED on the writer.
-        validatePrisma: prisma,
+        prisma: fastPathReadPrisma,
       });
       if (fastPathResult) {
         return fastPathResult;
@@ -588,33 +583,30 @@ return 0
    * already exceeded its max debounce duration so the locked path can return
    * `max_duration_exceeded` and let the caller create a new run.
    *
-   * The cheap probe (`probePrisma`) may be on a read replica - replica lag is
-   * fine because the monotonic-forward invariant means a stale view just falls
-   * through to the locked path. The full-run read used to construct the
-   * returned `existing` result always goes through `validatePrisma` (the
-   * writer), so callers never receive a run whose status has already moved out
-   * of DELAYED on the writer due to replica lag.
+   * `prisma` may be a read replica - replica lag is acceptable because
+   * debounce is best-effort. A stale `delayUntil` either matches reality or
+   * undershoots (we fall through to the locked path); a stale `status` at
+   * worst returns the existing run, which is the same outcome the caller
+   * would see if their trigger had landed a few hundred ms earlier.
    */
   async #tryFastPathSkip({
     existingRunId,
     newDelayUntil,
     debounce,
-    probePrisma,
-    validatePrisma,
+    prisma,
   }: {
     existingRunId: string;
     newDelayUntil: Date;
     debounce: DebounceOptions;
-    probePrisma: PrismaClientOrTransaction | PrismaReplicaClient;
-    validatePrisma: PrismaClientOrTransaction;
+    prisma: PrismaClientOrTransaction | PrismaReplicaClient;
   }): Promise<DebounceResult | null> {
     // Trailing mode with updateData still needs the lock so the data update is
     // applied; only short-circuit when there's nothing to update.
     if (debounce.mode === "trailing" && debounce.updateData) {
       return null;
     }
 
-    const probe = await probePrisma.taskRun.findFirst({
+    const probe = await prisma.taskRun.findFirst({
       where: { id: existingRunId },
       select: { status: true, delayUntil: true, createdAt: true },
     });
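The probe's own checks fall between the hunks shown here. Assuming they match the three writer re-checks this commit deletes from the full-run read (status is DELAYED with a `delayUntil`, `newDelayUntil` is not later than it, and the max-duration window still holds), a minimal sketch shows why a stale `delayUntil` is harmless: under the monotonic-forward invariant, a lagging value can only undershoot, which makes the skip condition fail and sends the caller to the locked path. `ProbeRow` and `probeAllowsSkip` are hypothetical names, and `maxDurationMs` is taken as a plain parameter.

```typescript
// Hypothetical projection matching the probe's `select`.
type ProbeRow = { status: string; delayUntil: Date | null; createdAt: Date };

// Returns true when the existing run can be returned without the lock.
// ASSUMPTION: mirrors the re-checks removed from the full-run read.
function probeAllowsSkip(
  probe: ProbeRow | null,
  newDelayUntil: Date,
  maxDurationMs: number
): boolean {
  if (!probe || probe.status !== "DELAYED" || !probe.delayUntil) return false;
  // New target is later than the current delayUntil: defer to the locked path.
  if (newDelayUntil.getTime() > probe.delayUntil.getTime()) return false;
  // Past the max debounce window: the locked path returns max_duration_exceeded.
  if (newDelayUntil.getTime() > probe.createdAt.getTime() + maxDurationMs) return false;
  return true;
}

const createdAt = new Date(0);
// The writer has delayUntil at 10s; the lagging replica still shows 5s.
const writerView: ProbeRow = { status: "DELAYED", delayUntil: new Date(10_000), createdAt };
const replicaView: ProbeRow = { status: "DELAYED", delayUntil: new Date(5_000), createdAt };
const target = new Date(8_000);

console.log(probeAllowsSkip(writerView, target, 60_000)); // true
console.log(probeAllowsSkip(replicaView, target, 60_000)); // false
```

The stale replica view costs an extra trip through the locked path but never causes a skip that the writer's view would have rejected on `delayUntil` grounds.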
@@ -640,20 +632,11 @@ return 0
       return null;
     }
 
-    // Validate against the writer before returning. Also re-checks delayUntil
-    // and the max-duration window in case the writer has moved on since the
-    // (possibly stale) probe.
-    const fullRun = await validatePrisma.taskRun.findFirst({
+    const fullRun = await prisma.taskRun.findFirst({
       where: { id: existingRunId },
       include: { associatedWaitpoint: true },
     });
-    if (!fullRun || fullRun.status !== "DELAYED" || !fullRun.delayUntil) {
-      return null;
-    }
-    if (newDelayUntil.getTime() > fullRun.delayUntil.getTime()) {
-      return null;
-    }
-    if (newDelayUntil.getTime() > fullRun.createdAt.getTime() + maxDurationMs) {
+    if (!fullRun || fullRun.status !== "DELAYED") {
       return null;
     }
 
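After this change the full-run read keeps only the status guard, and it runs on the same client as the probe. The sketch below models the stale-status case the commit message accepts, using a hypothetical `FullRun` shape and `fullRunGuard` helper; `PENDING` is only an illustrative non-DELAYED status. A lagging replica that still reports DELAYED lets the fast path return the existing run, whereas the writer's fresher view would have sent the caller to the locked path.

```typescript
// Hypothetical subset of the full-run result.
type FullRun = { id: string; status: string };

// Post-commit guard: the only check applied to the full-run read.
function fullRunGuard(fullRun: FullRun | null): FullRun | null {
  if (!fullRun || fullRun.status !== "DELAYED") return null;
  return fullRun;
}

// Illustrative lag: the writer has moved the run out of DELAYED,
// but the replica has not caught up yet.
const writerRun: FullRun = { id: "run_1", status: "PENDING" };
const replicaRun: FullRun = { id: "run_1", status: "DELAYED" };

console.log(fullRunGuard(writerRun)); // null
console.log(fullRunGuard(replicaRun)?.id); // run_1
```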
internal-packages/run-engine/src/engine/types.ts

Lines changed: 7 additions & 6 deletions
@@ -154,12 +154,13 @@ export type RunEngineOptions = {
    */
   fastPathSkipEnabled?: boolean;
   /**
-   * Whether to route the cheap probe of the unlocked fast-path through
-   * `readOnlyPrisma` (e.g. an Aurora reader) instead of the writer. The
-   * full-run read used to construct the returned `existing` result still
-   * goes through the writer, so callers never see a run whose status has
-   * already moved out of DELAYED on the writer. Replica lag at worst means
-   * a few extra callers fall through to the lock.
+   * Whether to route the unlocked fast-path reads (probe + full-run fetch)
+   * through `readOnlyPrisma` (e.g. an Aurora reader) instead of the writer.
+   * Safe because debounce is best-effort: a stale `delayUntil` falls
+   * through to the locked path (the locked path re-checks under the lock),
+   * and a stale `status` at worst returns the existing run, which is the
+   * same outcome the caller would see if their trigger had landed a few
+   * hundred ms earlier.
    *
    * Default: false.
    */
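The doc comment's safety argument rests on the locked path re-reading under the lock. A minimal sketch of that two-tier shape, with hypothetical names throughout: `debounceSketch` first consults a possibly lagging `reader`, then re-checks against `writer`. Lock acquisition is elided, and the locked-path rules here (keep the existing run, or push `delayUntil` forward) are a simplifying assumption, not the engine's actual locked-path logic.

```typescript
type Run = { status: string; delayUntil: Date };

type Outcome = {
  path: "fast" | "locked";
  result: "existing" | "extended" | "not_delayed";
};

// Hypothetical two-tier debounce: unlocked fast path on `reader`, then an
// authoritative re-check on `writer` (lock acquisition elided).
function debounceSketch(
  id: string,
  newDelayUntil: Date,
  reader: Map<string, Run>,
  writer: Map<string, Run>
): Outcome {
  // Fast path: skip only if the (possibly stale) view already covers the target.
  const seen = reader.get(id);
  if (seen && seen.status === "DELAYED" && newDelayUntil.getTime() <= seen.delayUntil.getTime()) {
    return { path: "fast", result: "existing" };
  }

  // Locked path: re-check against the writer before acting.
  const current = writer.get(id);
  if (!current || current.status !== "DELAYED") {
    return { path: "locked", result: "not_delayed" };
  }
  if (newDelayUntil.getTime() <= current.delayUntil.getTime()) {
    return { path: "locked", result: "existing" };
  }
  // ASSUMPTION: a later target pushes delayUntil forward.
  writer.set(id, { ...current, delayUntil: newDelayUntil });
  return { path: "locked", result: "extended" };
}

const writer = new Map<string, Run>([["run_1", { status: "DELAYED", delayUntil: new Date(10_000) }]]);
const laggingReplica = new Map<string, Run>([["run_1", { status: "DELAYED", delayUntil: new Date(5_000) }]]);

// Stale delayUntil on the replica: the fast path declines and the locked
// path, seeing 10s on the writer, keeps the existing run.
const out = debounceSketch("run_1", new Date(8_000), laggingReplica, writer);
console.log(out.path, out.result); // locked existing
```

In this model a stale `delayUntil` never produces a wrong result; at worst it costs one trip through the locked path.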
