Skip to content

Commit 5fc14ff

Browse files
fix(run-engine): always validate fast-path skip on the writer
Replica probe stays on `readOnlyPrisma` (when `useReplicaForFastPathRead` is on) but the full-run read used to construct the returned `existing` result is now always issued against the writer. The writer-side read also re-checks `status === DELAYED`, `delayUntil`, and the max-duration window so replica lag can't cause a trigger to silently collapse onto a run that has already moved out of DELAYED on the writer. Also clarifies the `fastPathSkipEnabled` doc to note that trailing-mode triggers carrying `updateData` always bypass the fast path. Co-Authored-By: Eric Allam <eallam@icloud.com>
1 parent 30586b4 commit 5fc14ff

2 files changed

Lines changed: 49 additions & 19 deletions

File tree

internal-packages/run-engine/src/engine/systems/debounceSystem.ts

Lines changed: 41 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -63,8 +63,11 @@ export type DebounceSystemOptions = {
6363
*/
6464
fastPathSkipEnabled?: boolean;
6565
/**
66-
* When true, route the unlocked fast-path read through `readOnlyPrisma`
67-
* (e.g. an Aurora reader) instead of the writer.
66+
* When true, route the cheap probe of the unlocked fast-path through
67+
* `readOnlyPrisma` (e.g. an Aurora reader) instead of the writer. The
68+
* full-run read used to construct the returned `existing` result still
69+
* goes through the writer, so callers never see a run whose status has
70+
* already moved out of DELAYED on the writer due to replica lag.
6871
*/
6972
useReplicaForFastPathRead?: boolean;
7073
};
@@ -479,12 +482,13 @@ return 0
479482
tx?: PrismaClientOrTransaction;
480483
}): Promise<DebounceResult> {
481484
const prisma = tx ?? this.$.prisma;
482-
// Reads that are explicitly best-effort (the fast-path skip) can run on
483-
// `readOnlyPrisma` when configured. Replica lag is fine: the monotonic-
484-
// forward invariant means a stale read just falls through to the locked
485-
// path. Only divert reads when the caller isn't inside a tx (where the
486-
// read needs to see the tx's writes).
487-
const fastPathReadPrisma =
485+
// The cheap probe in the fast-path skip can run on `readOnlyPrisma` when
486+
// configured. Replica lag is fine because the probe is best-effort: a
487+
// stale view either falls through to the locked path or is rejected by
488+
// the writer-validated re-check inside `#tryFastPathSkip`. Only divert
489+
// the probe when the caller isn't inside a tx (where the read needs to
490+
// see the tx's writes).
491+
const probeReadPrisma =
488492
tx ?? (this.useReplicaForFastPathRead ? this.$.readOnlyPrisma : this.$.prisma);
489493

490494
// Compute the (quantized) target delayUntil up-front, before taking any lock.
@@ -503,7 +507,12 @@ return 0
503507
existingRunId,
504508
newDelayUntil,
505509
debounce,
506-
prisma: fastPathReadPrisma,
510+
probePrisma: probeReadPrisma,
511+
// The full-run read used to construct the returned `existing` result
512+
// always goes through the writer, even when the cheap probe is on a
513+
// replica. Otherwise replica lag could let us return a run whose
514+
// status has already moved out of DELAYED on the writer.
515+
validatePrisma: prisma,
507516
});
508517
if (fastPathResult) {
509518
return fastPathResult;
@@ -578,25 +587,34 @@ return 0
578587
* the lock to apply their data update. Also falls through when the run has
579588
* already exceeded its max debounce duration so the locked path can return
580589
* `max_duration_exceeded` and let the caller create a new run.
590+
*
591+
* The cheap probe (`probePrisma`) may be on a read replica - replica lag is
592+
* fine because the monotonic-forward invariant means a stale view just falls
593+
* through to the locked path. The full-run read used to construct the
594+
* returned `existing` result always goes through `validatePrisma` (the
595+
* writer), so callers never receive a run whose status has already moved out
596+
* of DELAYED on the writer due to replica lag.
581597
*/
582598
async #tryFastPathSkip({
583599
existingRunId,
584600
newDelayUntil,
585601
debounce,
586-
prisma,
602+
probePrisma,
603+
validatePrisma,
587604
}: {
588605
existingRunId: string;
589606
newDelayUntil: Date;
590607
debounce: DebounceOptions;
591-
prisma: PrismaClientOrTransaction | PrismaReplicaClient;
608+
probePrisma: PrismaClientOrTransaction | PrismaReplicaClient;
609+
validatePrisma: PrismaClientOrTransaction;
592610
}): Promise<DebounceResult | null> {
593611
// Trailing mode with updateData still needs the lock so the data update is
594612
// applied; only short-circuit when there's nothing to update.
595613
if (debounce.mode === "trailing" && debounce.updateData) {
596614
return null;
597615
}
598616

599-
const probe = await prisma.taskRun.findFirst({
617+
const probe = await probePrisma.taskRun.findFirst({
600618
where: { id: existingRunId },
601619
select: { status: true, delayUntil: true, createdAt: true },
602620
});
@@ -622,11 +640,20 @@ return 0
622640
return null;
623641
}
624642

625-
const fullRun = await prisma.taskRun.findFirst({
643+
// Validate against the writer before returning. Also re-checks delayUntil
644+
// and the max-duration window in case the writer has moved on since the
645+
// (possibly stale) probe.
646+
const fullRun = await validatePrisma.taskRun.findFirst({
626647
where: { id: existingRunId },
627648
include: { associatedWaitpoint: true },
628649
});
629-
if (!fullRun || fullRun.status !== "DELAYED") {
650+
if (!fullRun || fullRun.status !== "DELAYED" || !fullRun.delayUntil) {
651+
return null;
652+
}
653+
if (newDelayUntil.getTime() > fullRun.delayUntil.getTime()) {
654+
return null;
655+
}
656+
if (newDelayUntil.getTime() > fullRun.createdAt.getTime() + maxDurationMs) {
630657
return null;
631658
}
632659

internal-packages/run-engine/src/engine/types.ts

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -146,16 +146,19 @@ export type RunEngineOptions = {
146146
/**
147147
* Whether to read the existing run's `delayUntil` outside of the redlock and
148148
* short-circuit when the new (quantized) `delayUntil` is not later than the
149-
* current one. Drops `updateData` when triggered for trailing mode.
149+
* current one. Trailing-mode triggers carrying `updateData` always bypass
150+
* this fast path and take the lock so payload/metadata/tag updates still
151+
* land on the run.
150152
*
151153
* Default: true.
152154
*/
153155
fastPathSkipEnabled?: boolean;
154156
/**
155-
* Whether to route the unlocked fast-path read of `delayUntil`/`createdAt`
156-
* through `readOnlyPrisma` (e.g. an Aurora reader) instead of the writer.
157-
* Safe because the read is best-effort and re-checked under the lock by
158-
* whichever caller is actually pushing forward; replica lag at worst means
157+
* Whether to route the cheap probe of the unlocked fast-path through
158+
* `readOnlyPrisma` (e.g. an Aurora reader) instead of the writer. The
159+
* full-run read used to construct the returned `existing` result still
160+
* goes through the writer, so callers never see a run whose status has
161+
* already moved out of DELAYED on the writer. Replica lag at worst means
159162
* a few extra callers fall through to the lock.
160163
*
161164
* Default: false.

0 commit comments

Comments
 (0)