feat(daemon): idle poll hibernation so an idle daemon stops touching DeepLake (PRD-062e)#185
chrisl10 wants to merge 2 commits into
…DeepLake (PRD-062e)
PRD-062b backed the idle poll cadence off to a ~30s ceiling, but a query every 30s still resets Activeloop's compute idle-timer, so DeepLake compute never scales to zero. This completes 062's "idle daemons must go quiet" by driving idle reads to actual zero.
After a configurable idle window the adaptive poll loop stops re-arming its timer entirely (zero DeepLake polls), and a wake() seam fired by the queue.enqueue() chokepoint resumes it. The job-queue reaper hibernates the same way once no queued or leased work remains. The summary and skillify workers, which were still hand-rolling a flat 1000ms setInterval (so they polled at ~1Hz forever), are moved onto the shared adaptive loop so they hibernate too.
- poll-backoff: clock-free idle accumulator + shouldSuspend()/onWake(); new HONEYCOMB_POLL_SUSPEND_ENABLED (default-on) and HONEYCOMB_POLL_SUSPEND_AFTER_MS (default 300000; 0 disables) knobs on the existing provider.
- poll-loop: wake() seam, suspended state, skip-re-arm on suspend, idempotent start().
- wake-bus: one tiny registry fans wake() to every loop and the reaper.
- job-queue: onEnqueue callback rings the bus; reaper idle-suspend + wakeReaper().
- assemble: build the bus, register every loop + the reaper, inject onEnqueue, thread the resolved backoff config into the summary/skillify builders.
memory_jobs stays append-only version-bumped (a cadence change only, never the write pattern). Every new behavior is flag-gated and default-safe: with backoff off the daemon is the exact pre-062 flat path (parent AC-9); with suspend off it is 062b's steady ~30s cadence. AC-named tests cover suspend/wake on the existing manual-clock fakes with no live DeepLake.
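For readers skimming the commit message, here is a minimal sketch of what a clock-free idle accumulator could look like. It is not the PR's `PollBackoff`: the class name `IdleBackoffSketch`, the config shape, the `onLease()` method, and the doubling policy are all assumptions made for illustration. Only `shouldSuspend()`, `onWake()`, and `onEmptyLease()` are names that appear in this PR's text.

```typescript
// Hypothetical sketch, not the PR's PollBackoff. The doubling policy and the
// config field names are assumed; only the method names come from the PR text.
interface BackoffSketchConfig {
  floorMs: number;        // fastest cadence (assumed name)
  ceilingMs: number;      // slowest cadence (062b describes a ~30s ceiling)
  suspendAfterMs: number; // idle window; 0 disables suspension
}

class IdleBackoffSketch {
  private readonly cfg: BackoffSketchConfig;
  private delayMs: number;
  // Sum of the un-jittered delays handed out while idle. No wall clock is read.
  private idleMs = 0;

  constructor(cfg: BackoffSketchConfig) {
    this.cfg = cfg;
    this.delayMs = cfg.floorMs;
  }

  /** An empty poll: back off, account for the new delay, and return it. */
  onEmptyLease(): number {
    this.delayMs = Math.min(this.delayMs * 2, this.cfg.ceilingMs);
    this.idleMs += this.delayMs;
    return this.delayMs;
  }

  /** Real work was leased: snap back to the floor and forget the idle time. */
  onLease(): void {
    this.reset();
  }

  /** An external wake: same reset, so the loop cannot re-suspend right away. */
  onWake(): void {
    this.reset();
  }

  /** True once the accumulated idle time reaches the configured window. */
  shouldSuspend(): boolean {
    return this.cfg.suspendAfterMs > 0 && this.idleMs >= this.cfg.suspendAfterMs;
  }

  private reset(): void {
    this.delayMs = this.cfg.floorMs;
    this.idleMs = 0;
  }
}
```

Because the accumulator only sums the delays it hands out, a test can drive it with plain method calls and no fake timers, which matches the "clock-free" description above.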
📝 Walkthrough
Adds PRD-062e idle poll hibernation: poll loops can suspend after an idle window, enqueue wakes propagate through a wake bus, and the job-queue reaper can hibernate and resume. Summary and skillify workers move onto the shared adaptive poll loop.
Changes
PRD-062e: Idle Poll Hibernation
Sequence Diagram(s)
sequenceDiagram
participant JobQueueService
participant WakeBus
participant PollLoop
participant Reaper
JobQueueService->>WakeBus: onWake after enqueue
WakeBus->>PollLoop: wake()
WakeBus->>Reaper: wakeReaper()
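The fan-out in the diagram above can be sketched as a small registry. This is an illustration under assumptions, not the contents of `wake-bus.ts`: the class name `WakeBusSketch` and the per-listener `try`/`catch` are hypothetical. The PR text confirms only that the bus fans one `wake()` out to every loop and the reaper, and that `register()` returns a cleanup function.

```typescript
// Hypothetical sketch of a wake bus; not the PR's wake-bus.ts.
type WakeListener = () => void;

class WakeBusSketch {
  private readonly listeners = new Set<WakeListener>();

  /** Register a listener and return a function that unregisters it. */
  register(listener: WakeListener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  /** Fan a single wake out to every registered listener. */
  wake(): void {
    // Copy first so a listener that unregisters itself does not disturb iteration.
    for (const listener of [...this.listeners]) {
      try {
        listener();
      } catch (err) {
        // Assumption: one failing listener should not stop the others from waking.
        console.error("wake listener failed", err);
      }
    }
  }
}
```

Keeping the unregister function returned by `register()` is what lets a shutdown path drop its listeners, which is the concern raised in the nitpick on `assemble.ts`.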
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 5 passed
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/daemon/runtime/assemble.ts (1)
2580-2584: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win
Unregister wake callbacks during shutdown.
`wakeBus.register()` returns cleanup functions, but these callbacks are retained across `shutdown()`/`start()` cycles. Store the unregister callbacks and drain them in `shutdown()` to avoid duplicate wakes and stale stopped worker references.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/daemon/runtime/assemble.ts` around lines 2580 - 2584, The wake callbacks registered in assembleRuntime via wakeBus.register() are not being cleaned up across shutdown/start cycles, which leaves stale worker references and can cause duplicate wakes. Store the unregister functions returned by each wakeBus.register() call alongside the existing wake registrations, and make shutdown() drain/unregister them before the next start(). Keep the fix localized to the wake registration block and the shutdown lifecycle handling so the worker wake handlers are recreated cleanly each cycle.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/daemon/runtime/assemble.ts`:
- Around line 1932-1935: The wake path in the runtime assembler only responds to
enqueue events, so retryable jobs scheduled by fail() with a future next_run_at
can get stuck when all loops hibernate. Update the wake wiring in assemble()
around the wakeBus configuration to also trigger a wake from the earliest
pending retry deadline, or keep the poll loop active while failed retry jobs
remain pending. Use the existing fail(), next_run_at, and wakeBus/onEnqueue
setup to locate the retry scheduling and add a timer-based wake source there.
In `@src/daemon/runtime/services/job-queue.ts`:
- Around line 495-497: The enqueue path in job-queue’s enqueue flow should treat
the wake hook as best-effort, not as part of the durable write. Update the code
around this.onEnqueue so a throw from the wake bus callback does not make
enqueue() reject after the job has already been appended; instead, catch and log
the failure while allowing the successful enqueue to complete. Use the existing
enqueue/onEnqueue path in JobQueue to locate the change and keep the hook
isolated from the persistence result.
In `@src/daemon/runtime/services/poll-loop.ts`:
- Around line 173-183: The wake() path in poll-loop should also reschedule
already-armed adaptive loops, not just fully suspended ones. Update the wake
logic in the wake() method so that after backoff.onWake() it can pull a live
timer back to the floor when the loop is backed off but not suspended, instead
of returning early; use the existing state fields this.suspended, this.backoff,
and the scheduling helper this.scheduleNext() to ensure
enqueue()->wakeBus.wake() causes the loop to poll immediately on new work.
---
Nitpick comments:
In `@src/daemon/runtime/assemble.ts`:
- Around line 2580-2584: The wake callbacks registered in assembleRuntime via
wakeBus.register() are not being cleaned up across shutdown/start cycles, which
leaves stale worker references and can cause duplicate wakes. Store the
unregister functions returned by each wakeBus.register() call alongside the
existing wake registrations, and make shutdown() drain/unregister them before
the next start(). Keep the fix localized to the wake registration block and the
shutdown lifecycle handling so the worker wake handlers are recreated cleanly
each cycle.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 03a7385b-005d-4bab-8b84-f88ebb193aea
📒 Files selected for processing (15)
- library/requirements/completed/prd-062-deeplake-compute-cost-reduction/prd-062-deeplake-compute-cost-reduction-index.md
- library/requirements/completed/prd-062-deeplake-compute-cost-reduction/prd-062e-deeplake-compute-cost-reduction-idle-hibernation.md
- src/daemon/runtime/assemble.ts
- src/daemon/runtime/pipeline/stage-worker.ts
- src/daemon/runtime/pollinating/worker.ts
- src/daemon/runtime/services/job-queue.ts
- src/daemon/runtime/services/lease-coordinator.ts
- src/daemon/runtime/services/poll-backoff.ts
- src/daemon/runtime/services/poll-loop.ts
- src/daemon/runtime/services/wake-bus.ts
- src/daemon/runtime/skillify/worker.ts
- src/daemon/runtime/summaries/job.ts
- tests/daemon/runtime/services/job-queue.test.ts
- tests/daemon/runtime/services/poll-backoff.test.ts
- tests/daemon/runtime/services/poll-loop.test.ts
- poll-loop wake(): pull a merely backed-off (not just fully suspended) loop's timer back to the floor, so the enqueue-driven wake actually picks up new work immediately instead of waiting out a stale long delay; also guard against waking a never-started loop (the deferred-under-consolidation case).
- job-queue: schedule a one-shot fleet-wake at a failed job's retry deadline, so a hibernated daemon resumes to lease the retry on time rather than waiting for unrelated activity. The wake hook is renamed onEnqueue -> onWake since it now fires on both an enqueue and a retry deadline; the pending timers are cleared on stop.
- job-queue: isolate the wake hook from the durable write. A throwing wake is caught and logged, so it never rejects an already-appended enqueue (which would invite a duplicate retry).
New tests: wake() pulls a backed-off timer to the floor; a retryable failure schedules a deadline wake; a throwing wake does not reject enqueue.
npm run ci green locally (the only failures are the pre-existing macOS realpath tests, unrelated).
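The "isolate the wake hook from the durable write" change described above can be sketched as follows. This is a simplified stand-in, not the PR's `JobQueueService`: the class `QueueSketch`, its in-memory `rows` array, the injectable `log` function, and the `"enqueue"` trigger value are assumptions. The real `enqueue()` is asynchronous; the sketch is synchronous to stay short. `fireWake` and the `"retry-deadline"` trigger are names that appear in the reviewed code.

```typescript
// Hypothetical sketch of a best-effort wake hook; not the PR's job-queue.ts.
type WakeEventSketch = { trigger: "enqueue" | "retry-deadline" };
type WakeHook = (event: WakeEventSketch) => void;

class QueueSketch {
  readonly rows: string[] = []; // stand-in for the durable append-only store
  private readonly onWake: WakeHook | undefined;
  private readonly log: (message: string) => void;

  constructor(onWake?: WakeHook, log: (message: string) => void = console.warn) {
    this.onWake = onWake;
    this.log = log;
  }

  /** Append first, then ring the wake hook; a hook failure never undoes the append. */
  enqueue(job: string): number {
    this.rows.push(job);
    this.fireWake({ trigger: "enqueue" });
    return this.rows.length;
  }

  /** Best-effort: catch and log so the caller still sees a successful enqueue. */
  private fireWake(event: WakeEventSketch): void {
    if (this.onWake === undefined) return;
    try {
      this.onWake(event);
    } catch (err) {
      this.log(`wake hook failed (${event.trigger}): ${String(err)}`);
    }
  }
}
```

The ordering is the point of the design: once the row is appended, rejecting the call would invite the caller to retry and append a duplicate, so the hook's failure is downgraded to a log line.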
Thanks @coderabbitai, all three were fair. Addressed in 23316f9:
New tests cover each: wake pulls a backed-off timer to the floor, a retryable failure schedules a deadline wake, and a throwing wake does not reject enqueue.
The retry-deadline fleet wake closes the hibernated-retry hole, making the suspend behavior safe even when the suspend window is tuned below the retry delay. Catching/logging wake-hook failures also cleanly decouples wake propagation from the durable append path, and updating
Appreciate the added test coverage and local CI confirmation. Looks good from my side. (ᵔᴥᵔ)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/daemon/runtime/services/job-queue.ts`:
- Around line 523-540: The retry-deadline wake in scheduleRetryWake is only kept
in memory and gets cleared on stop, so restart loses pending wakes for failed
jobs. Add startup/reaper reconstruction logic that scans existing JOB_FAILED
rows with a future next_run_at and re-arms their wake timers after the daemon
comes back up, reusing the existing scheduleRetryWake and fireWake flow. Make
sure the recovery path is wired into the same lifecycle that currently manages
retryWakes so pending wakes survive restarts.
In `@src/daemon/runtime/services/poll-loop.ts`:
- Around line 180-188: The wake reset is being lost when wake() runs while
tick() is already in flight, so PollLoop can re-arm from a backed-off state
after a false result. Update the wake()/finally() flow in poll-loop.ts to track
a pending wake while this.running is true, and in the tick completion path
re-apply backoff.onWake() after the outcome is recorded before scheduling the
next cycle. Use the existing wake(), onWake(), onEmptyLease(), and finally()
logic in PollLoop to locate the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 5cf6cad6-c45d-4fc0-ae74-721542f59923
📒 Files selected for processing (5)
- src/daemon/runtime/assemble.ts
- src/daemon/runtime/services/job-queue.ts
- src/daemon/runtime/services/poll-loop.ts
- tests/daemon/runtime/services/job-queue.test.ts
- tests/daemon/runtime/services/poll-loop.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/daemon/runtime/services/poll-loop.test.ts
- tests/daemon/runtime/services/job-queue.test.ts
- src/daemon/runtime/assemble.ts
```ts
/**
 * Schedule a ONE-SHOT fleet-wake at a `failed` job's retry deadline (PRD-062e). Without
 * this, once every loop hibernates no enqueue occurs at `next_run_at`, so a retryable
 * job could sit until unrelated activity wakes the bus. The timer fires once (it clears
 * itself), rings the same wake hook an enqueue does, and is cleaned up on {@link stop}.
 * A no-op when no wake hook is wired (a standalone/test queue).
 */
private scheduleRetryWake(delayMs: number): void {
  if (this.onWake === undefined) return;
  let handle: unknown;
  handle = this.clock.setTimer(() => {
    // One-shot over the interval-based clock seam: cancel before firing so it runs once.
    this.clock.clearTimer(handle);
    this.retryWakes.delete(handle);
    this.fireWake({ trigger: "retry-deadline" });
  }, Math.max(1, delayMs));
  this.retryWakes.add(handle);
}
```
🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift
Rebuild retry-deadline wakes after restart.
Line 530 schedules retry wakes only in memory, and Line 975 clears them on stop. If the daemon restarts before a persisted failed job’s next_run_at, no wake is reconstructed, so hibernated loops can sleep past the retry deadline. Recreate pending retry wakes from existing JOB_FAILED rows during startup or the reaper’s discovery sweep.
Also applies to: 967-976
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/daemon/runtime/services/job-queue.ts` around lines 523 - 540, The
retry-deadline wake in scheduleRetryWake is only kept in memory and gets cleared
on stop, so restart loses pending wakes for failed jobs. Add startup/reaper
reconstruction logic that scans existing JOB_FAILED rows with a future
next_run_at and re-arms their wake timers after the daemon comes back up,
reusing the existing scheduleRetryWake and fireWake flow. Make sure the recovery
path is wired into the same lifecycle that currently manages retryWakes so
pending wakes survive restarts.
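One way the suggested recovery could be approached is to compute, at startup, the delay until each persisted retry deadline and hand those delays to the existing scheduling path. The sketch below is an assumption-heavy illustration: the row shape `JobRowSketch`, the status strings, the millisecond `nextRunAtMs` field, and the function name `retryWakeDelays` are all hypothetical, and it assumes the append-only rows have already been collapsed to the latest version of each job.

```typescript
// Hypothetical sketch; the row shape and status values are assumed, not the PR's schema.
interface JobRowSketch {
  id: string;
  status: "queued" | "leased" | "failed" | "done";
  nextRunAtMs: number | null; // retry deadline as epoch milliseconds, if any
}

/**
 * Return one delay per failed row whose retry deadline is still in the future.
 * Past-due rows are skipped: a freshly started loop is already polling and
 * will lease them without needing a timer.
 */
function retryWakeDelays(rows: readonly JobRowSketch[], nowMs: number): number[] {
  const delays: number[] = [];
  for (const row of rows) {
    if (row.status !== "failed" || row.nextRunAtMs === null) continue;
    const delayMs = row.nextRunAtMs - nowMs;
    if (delayMs > 0) delays.push(Math.max(1, delayMs));
  }
  return delays;
}
```

Each returned delay could then be passed to `scheduleRetryWake`, so the restart path reuses the same one-shot timer and `fireWake` flow rather than adding a second wake mechanism.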
```ts
wake(): void {
  // A never-started, stopped, or flat (non-adaptive) loop has nothing to wake.
  if (!this.started || this.stopped || this.backoff === null) return;
  // Snap the cadence back to fast and clear the idle accumulator so the just-woken
  // loop polls immediately and cannot re-suspend until it goes idle again.
  this.backoff.onWake();
  // A tick is in flight: its finally() will re-arm exactly once at the reset floor,
  // so arming here too would double-arm.
  if (this.running) return;
```
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Preserve the wake reset when a tick is already running.
Line 185 resets the backoff, but if the in-flight tick() later resolves false, Line 153 calls onEmptyLease() after the wake and schedules from a backed-off state instead of the floor. Track a pending wake while running and re-apply onWake() in finally() after the tick outcome is recorded.
Proposed fix
```diff
   /** Guards against overlapping ticks on the poll loop (the workers' `running` flag). */
   private running = false;
+  private wakePending = false;
@@
       // Flat path relies on the repeating interval and does not reschedule here.
       if (backoff === null || this.stopped) return;
+      if (this.wakePending) {
+        this.wakePending = false;
+        backoff.onWake();
+      }
@@
     // A tick is in flight: its finally() will re-arm exactly once at the reset floor,
     // so arming here too would double-arm.
-    if (this.running) return;
+    if (this.running) {
+      this.wakePending = true;
+      return;
+    }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/daemon/runtime/services/poll-loop.ts` around lines 180 - 188, The wake
reset is being lost when wake() runs while tick() is already in flight, so
PollLoop can re-arm from a backed-off state after a false result. Update the
wake()/finally() flow in poll-loop.ts to track a pending wake while this.running
is true, and in the tick completion path re-apply backoff.onWake() after the
outcome is recorded before scheduling the next cycle. Use the existing wake(),
onWake(), onEmptyLease(), and finally() logic in PollLoop to locate the change.
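The race described in this comment can be reproduced in isolation with a small state machine. The sketch below is a hypothetical reduction, not `PollLoop`: the class `TickStateSketch`, its `beginTick()`/`endTick()` methods, and the doubling backoff are assumptions. It only shows why a wake that arrives mid-tick has to be remembered and re-applied after the tick's outcome is recorded.

```typescript
// Hypothetical reduction of the wake-during-tick race; not the PR's PollLoop.
class TickStateSketch {
  private readonly floorMs: number;
  private readonly ceilingMs: number;
  private delayMs: number;
  private running = false;
  private wakePending = false;

  constructor(floorMs: number, ceilingMs: number) {
    this.floorMs = floorMs;
    this.ceilingMs = ceilingMs;
    this.delayMs = floorMs;
  }

  beginTick(): void {
    this.running = true;
  }

  /** Reset to the floor now, and remember the wake if a tick is in flight. */
  wake(): void {
    this.delayMs = this.floorMs;
    if (this.running) this.wakePending = true;
  }

  /** Record the tick outcome, then re-apply any wake that arrived mid-tick. */
  endTick(leasedWork: boolean): number {
    this.running = false;
    this.delayMs = leasedWork
      ? this.floorMs
      : Math.min(this.delayMs * 2, this.ceilingMs);
    if (this.wakePending) {
      this.wakePending = false;
      this.delayMs = this.floorMs; // the wake wins over the empty-lease backoff
    }
    return this.delayMs; // the delay the caller would arm next
  }
}
```

Without the `wakePending` flag, an empty tick that finishes after the wake would double the freshly reset delay, so new work would wait one backoff step longer than the floor.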
What
Idle poll hibernation: after a configurable idle window the daemon stops polling DeepLake entirely (zero reads), so Activeloop compute can finally scale to zero, and a `wake()` seam fired by new work resumes it. This is the unfinished half of PRD-062's "Locked 1: idle daemons must go quiet" (a new `prd-062e` doc is included).
062b backed the idle cadence off to a ~30s ceiling, which was the dominant fix. But Activeloop bills compute (uptime) per hour while warm, and it scales to zero only after a sustained window with no queries: a poll every 30s keeps resetting that idle timer, so an idle daemon still pays a per-hour compute floor. 062e closes that last gap by driving idle reads to actual zero.
Why
Two structural facts made "just back off more" insufficient:
- The summary and skillify workers were still hand-rolling a flat 1000ms `setInterval`, so they polled `memory_jobs` at ~1Hz forever and would have kept compute warm on their own.
- All queue work is created through one chokepoint, `JobQueueService.enqueue()`. Firing the wake from there (rather than instrumenting each HTTP handler) covers every work-producing path with one seam and cannot drift.
Details
- `poll-backoff.ts`: `PollBackoff` gains a clock-free idle accumulator (sums its own un-jittered steps, never reads a wall clock), `shouldSuspend()`, and `onWake()`. New knobs `HONEYCOMB_POLL_SUSPEND_ENABLED` (default-on when absent; explicit `false`/`0` rolls back) and `HONEYCOMB_POLL_SUSPEND_AFTER_MS` (default 300000; `0` disables), read through the existing `envPollBackoffConfigProvider`.
- `wake()` (`poll-loop.ts`): `AdaptivePollLoop` skips the re-arm when `shouldSuspend()` is true (goes quiet), and `wake()` resets to the floor and re-arms only if it had actually suspended (so it can never double-arm). `start()` is now idempotent.
- `wake-bus.ts` (new): a tiny registry that fans a single `wake()` to every loop and the reaper. The queue rings it via an injected `onEnqueue` after a successful append.
- `job-queue.ts`: the reaper has nothing to reclaim when the queue is idle, so after a couple of consecutive sweeps that observe no queued or leased work it stops re-arming; `wakeReaper()` (registered on the bus) resumes it. A queued job counts as active so it never suspends between an enqueue and the worker leasing it, and the >1 consecutive-idle threshold absorbs a single DeepLake stale-segment under-read.
- `summaries/job.ts`, `skillify/worker.ts`: converted off their hand-rolled flat timers onto the shared `buildWorkerPollLoop`, so all four kind-workers share one cadence (backoff + hibernation) and one overlap guard. This removes duplicated timer code (jscpd stays at 0%).
- `assemble.ts`: build the bus, register every loop + the reaper, inject `onEnqueue`, and thread the resolved backoff config into the summary/skillify builders.
`memory_jobs` stays append-only version-bumped: this is a cadence change only, never the write pattern (an in-place UPDATE on this backend is provably non-deterministic per the `job-queue.ts` header). Everything is flag-gated and default-safe: with backoff off the daemon is the exact pre-062 flat path (parent AC-9); with suspend off it is 062b's steady ~30s cadence.
Recall is intentionally not a separate wake trigger. Recall reads DeepLake directly and creates no queue work, so a hibernated fleet has nothing for it to resume; any work recall might enqueue already flows through the `enqueue()` chokepoint, which strictly subsumes a recall-wake. The cold-start tradeoff is the accepted one: the first capture/recall after a long idle wakes the loop and Activeloop spins compute back up. Capture is fire-and-forget, so this is invisible to the user beyond the one-time spin-up.
Testing
`npm run ci` is green locally (typecheck, jscpd at 0%, vitest, sql audit). New AC-named tests, all on the existing manual-clock / in-memory-queue fakes (no live DeepLake):
- `poll-backoff.test.ts`: the idle accumulator trips `shouldSuspend()` at the expected empty-lease count; lease/wake reset it; the two off-switches disable it; default-on env resolution and explicit rollback.
- `poll-loop.test.ts`: the loop stops arming timers after the idle window; `wake()` re-arms at the floor and resumes; no double-arm; no-op after `stop()`; idempotent `start()`; suspend-disabled never hibernates.
- `job-queue.test.ts`: `enqueue()` fires `onEnqueue`; the reaper hibernates after idle sweeps and an enqueue's wake resumes it; suspend-disabled keeps sweeping (rollback parity).
Open questions
- …`HONEYCOMB_POLL_SUSPEND_AFTER_MS` to whatever the real suspend grace window turns out to be.
- …`node:sqlite` store) would remove them structurally for single-user mode, so DeepLake only ever sees batched capture-writes and on-demand recall-reads. That changes the team-sharing contract, so I have written it up as a separate future PRD rather than folding it in here. Happy to coordinate on whether you want it.
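The two suspend knobs listed under Details could be resolved along these lines. This is a sketch under assumptions, not the PR's `envPollBackoffConfigProvider`: the function name `resolveSuspendConfig`, the returned shape, and the fallback to the default on an unparseable value are hypothetical. The variable names, the default-on behavior, the explicit `false`/`0` rollback, the 300000 default, and "`0` disables" come from the description.

```typescript
// Hypothetical sketch of env resolution; not the PR's config provider.
interface SuspendConfigSketch {
  enabled: boolean;
  afterMs: number;
}

const DEFAULT_SUSPEND_AFTER_MS = 300_000;

function resolveSuspendConfig(env: Record<string, string | undefined>): SuspendConfigSketch {
  // Default-on when absent; only an explicit "false" or "0" rolls back.
  const rawEnabled = env["HONEYCOMB_POLL_SUSPEND_ENABLED"]?.trim().toLowerCase();
  const flagOn = !(rawEnabled === "false" || rawEnabled === "0");

  // Assumption: a missing or unparseable window falls back to the default.
  const rawAfter = env["HONEYCOMB_POLL_SUSPEND_AFTER_MS"]?.trim();
  const parsed = rawAfter === undefined || rawAfter === "" ? Number.NaN : Number(rawAfter);
  const afterMs = Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_SUSPEND_AFTER_MS;

  // A window of 0 disables suspension even when the flag is on.
  return { enabled: flagOn && afterMs > 0, afterMs };
}
```

In a Node.js process this would typically be called as `resolveSuspendConfig(process.env)`; taking the environment as a parameter keeps the function testable without mutating global state.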