feat(daemon): idle poll hibernation so an idle daemon stops touching DeepLake (PRD-062e) #185

Open
chrisl10 wants to merge 2 commits into legioncodeinc:main from chrisl10:feat/idle-poll-hibernation

@chrisl10 chrisl10 commented Jun 29, 2026

Contributor

What

Idle poll hibernation: after a configurable idle window the daemon stops polling DeepLake entirely (zero reads), so Activeloop compute can finally scale to zero, and a wake() seam fired by new work resumes it. This is the unfinished half of PRD-062's "Locked 1: idle daemons must go quiet" (a new prd-062e doc is included).

062b backed the idle cadence off to a ~30s ceiling, which was the dominant fix. But Activeloop bills compute (uptime) per hour while warm, and it scales to zero only after a sustained window with no queries: a poll every 30s keeps resetting that idle timer, so an idle daemon still pays a per-hour compute floor. 062e closes that last gap by driving idle reads to actual zero.

Why

Two structural facts made "just back off more" insufficient:

  1. There is more than one poller. The stage, pollinating, summary, and skillify workers each run a poll loop, plus the job-queue reaper sweeps every 5 minutes. For compute to reach zero, all of them must go quiet. 062b only put the stage + pollinating workers on the adaptive loop; the summary and skillify workers were still hand-rolling a flat 1000ms setInterval, so they polled memory_jobs at ~1Hz forever and would have kept compute warm on their own.
  2. The enqueue chokepoint is the right wake trigger. Every unit of background work enters through JobQueueService.enqueue(). Firing the wake from there (rather than instrumenting each HTTP handler) covers every work-producing path with one seam and cannot drift.
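
The second point can be pictured with a minimal sketch. It assumes an in-memory stand-in for the queue; `SketchQueue`, `SketchJob`, and the `kind` field are hypothetical names for illustration, and only the idea of an injected `onEnqueue` hook fired after a successful append comes from this PR.

```typescript
// Hypothetical in-memory stand-in for JobQueueService; only the wake seam is modeled.
type SketchJob = { id: number; kind: string };

class SketchQueue {
  private readonly jobs: SketchJob[] = [];
  private nextId = 1;
  private readonly onEnqueue: (() => void) | undefined;

  // The hook is injected, so the queue knows nothing about pollers or the reaper.
  constructor(onEnqueue?: () => void) {
    this.onEnqueue = onEnqueue;
  }

  enqueue(kind: string): SketchJob {
    const job: SketchJob = { id: this.nextId++, kind };
    this.jobs.push(job); // stand-in for the durable append
    if (this.onEnqueue !== undefined) this.onEnqueue(); // wake only after the append succeeded
    return job;
  }

  size(): number {
    return this.jobs.length;
  }
}

// Usage: every producer goes through enqueue(), so one hook covers them all.
let wakeCount = 0;
const sketchQueue = new SketchQueue(() => {
  wakeCount += 1;
});
sketchQueue.enqueue("summary");
sketchQueue.enqueue("skillify");
```

Because the hook lives on the single entry point, a new HTTP handler that enqueues work would get the wake without any extra instrumentation.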

Details

  • Suspend decision in the pure state machine (poll-backoff.ts): PollBackoff gains a clock-free idle accumulator (sums its own un-jittered steps, never reads a wall clock), shouldSuspend(), and onWake(). New knobs HONEYCOMB_POLL_SUSPEND_ENABLED (default-on when absent; explicit false/0 rolls back) and HONEYCOMB_POLL_SUSPEND_AFTER_MS (default 300000; 0 disables), read through the existing envPollBackoffConfigProvider.
  • The loop stops re-arming, plus wake() (poll-loop.ts): AdaptivePollLoop skips the re-arm when shouldSuspend() is true (goes quiet), and wake() resets to the floor and re-arms only if it had actually suspended (so it can never double-arm). start() is now idempotent.
  • One wake bus (wake-bus.ts, new): a tiny registry that fans a single wake() to every loop and the reaper. The queue rings it via an injected onEnqueue after a successful append.
  • The reaper hibernates too (job-queue.ts): it has nothing to reclaim when the queue is idle, so after a couple of consecutive sweeps that observe no queued or leased work it stops re-arming; wakeReaper() (registered on the bus) resumes it. A queued job counts as active so it never suspends between an enqueue and the worker leasing it, and the >1 consecutive-idle threshold absorbs a single DeepLake stale-segment under-read.
  • Unify the workers (summaries/job.ts, skillify/worker.ts): converted off their hand-rolled flat timers onto the shared buildWorkerPollLoop, so all four kind-workers share one cadence (backoff + hibernation) and one overlap guard. This removes duplicated timer code (jscpd stays at 0%).
  • Wiring (assemble.ts): build the bus, register every loop + the reaper, inject onEnqueue, and thread the resolved backoff config into the summary/skillify builders.
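
To make the clock-free idle accumulator concrete, here is a minimal sketch of a backoff state machine with `shouldSuspend()` and `onWake()`. It is not the PR's `PollBackoff`: the class name, the config field names, the doubling multiplier, and the 1000ms floor are assumptions for illustration, and jitter is left out entirely.

```typescript
// Hypothetical config shape; floor, ceiling and multiplier values are illustrative.
interface SketchBackoffConfig {
  floorMs: number;
  ceilingMs: number;
  multiplier: number;
  suspendEnabled: boolean;
  suspendAfterMs: number; // 0 disables suspension
}

class SketchPollBackoff {
  private readonly cfg: SketchBackoffConfig;
  private delayMs: number;
  private idleMs = 0; // sum of the un-jittered steps already waited; no wall clock involved

  constructor(cfg: SketchBackoffConfig) {
    this.cfg = cfg;
    this.delayMs = cfg.floorMs;
  }

  currentDelayMs(): number {
    return this.delayMs;
  }

  /** An empty poll: count the step that was just waited, then back off. */
  onEmptyLease(): void {
    this.idleMs += this.delayMs;
    this.delayMs = Math.min(this.cfg.ceilingMs, this.delayMs * this.cfg.multiplier);
  }

  /** Work was leased: back to the floor, idle time forgotten. */
  onLease(): void {
    this.reset();
  }

  /** External wake (for example an enqueue): same reset as a lease. */
  onWake(): void {
    this.reset();
  }

  shouldSuspend(): boolean {
    const { suspendEnabled, suspendAfterMs } = this.cfg;
    return suspendEnabled && suspendAfterMs > 0 && this.idleMs >= suspendAfterMs;
  }

  private reset(): void {
    this.delayMs = this.cfg.floorMs;
    this.idleMs = 0;
  }
}

/** Counts empty polls until suspension trips, or null if it never does within the limit. */
function emptyLeasesUntilSuspend(cfg: SketchBackoffConfig, limit = 1000): number | null {
  const backoff = new SketchPollBackoff(cfg);
  for (let n = 1; n <= limit; n++) {
    backoff.onEmptyLease();
    if (backoff.shouldSuspend()) return n;
  }
  return null;
}
```

Since the accumulator only sums its own steps, the empty-lease count at which suspension trips is deterministic for a given config, which is what lets a test assert it without a real clock.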

memory_jobs stays append-only and version-bumped: this PR changes the polling cadence only, never the write pattern (an in-place UPDATE on this backend is provably non-deterministic per the job-queue.ts header). Everything is flag-gated and default-safe: with backoff off, the daemon takes the exact pre-062 flat path (parent AC-9); with suspend off, it keeps 062b's steady ~30s cadence.
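
As a rough illustration of the flag semantics described in this PR (default-on when absent, explicit false/0 rolls back, a window of 0 disables), the resolution could look like the sketch below. The function name, the `SketchSuspendConfig` shape, and the fallback for unparsable values are assumptions; the real logic lives in envPollBackoffConfigProvider and may differ.

```typescript
type SketchEnv = Record<string, string | undefined>;

interface SketchSuspendConfig {
  enabled: boolean;
  afterMs: number;
}

const SKETCH_DEFAULT_SUSPEND_AFTER_MS = 300_000;

function resolveSuspendConfig(env: SketchEnv): SketchSuspendConfig {
  const rawEnabled = env["HONEYCOMB_POLL_SUSPEND_ENABLED"]?.trim().toLowerCase();
  // Default-on when absent; only an explicit "false" or "0" rolls back.
  const flagOn = !(rawEnabled === "false" || rawEnabled === "0");

  const rawAfter = env["HONEYCOMB_POLL_SUSPEND_AFTER_MS"]?.trim();
  const parsed = rawAfter === undefined || rawAfter === "" ? Number.NaN : Number(rawAfter);
  // Assumption: unparsable or negative values fall back to the default window.
  const afterMs =
    Number.isFinite(parsed) && parsed >= 0 ? parsed : SKETCH_DEFAULT_SUSPEND_AFTER_MS;

  // A window of 0 disables hibernation regardless of the flag.
  return { enabled: flagOn && afterMs > 0, afterMs };
}

// Usage: pass process.env in a Node daemon, or a plain object in tests.
const defaults = resolveSuspendConfig({});
const rolledBack = resolveSuspendConfig({ HONEYCOMB_POLL_SUSPEND_ENABLED: "false" });
```

Taking the environment as a parameter keeps the resolver pure, so both off-switches can be tested without mutating process state.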

Recall is intentionally not a separate wake trigger. Recall reads DeepLake directly and creates no queue work, so a hibernated fleet has nothing for it to resume; any work recall might enqueue already flows through the enqueue() chokepoint, which strictly subsumes a recall-wake. The cold-start tradeoff is the accepted one: the first capture/recall after a long idle wakes the loop and Activeloop spins compute back up. Capture is fire-and-forget, so this is invisible to the user beyond the one-time spin-up.

Testing

npm run ci is green locally (typecheck, jscpd at 0%, vitest, sql audit). New AC-named tests, all on the existing manual-clock / in-memory-queue fakes (no live DeepLake):

  • poll-backoff.test.ts: the idle accumulator trips shouldSuspend() at the expected empty-lease count; lease/wake reset it; the two off-switches disable it; default-on env resolution and explicit rollback.
  • poll-loop.test.ts: the loop stops arming timers after the idle window; wake() re-arms at the floor and resumes; no double-arm; no-op after stop(); idempotent start(); suspend-disabled never hibernates.
  • job-queue.test.ts: enqueue() fires onEnqueue; the reaper hibernates after idle sweeps and an enqueue's wake resumes it; suspend-disabled keeps sweeping (rollback parity).
  • The 062b AC-9 parity test and the summary/skillify suites stay green.
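
The reaper behaviour that job-queue.test.ts exercises can be pictured as a small consecutive-idle gate. This is only a sketch under assumptions: `SketchReaperGate`, its method names other than `wakeReaper()`, and the threshold of 2 are hypothetical stand-ins for whatever job-queue.ts actually does.

```typescript
class SketchReaperGate {
  private consecutiveIdleSweeps = 0;
  private suspended = false;
  private readonly idleSweepThreshold: number;

  // Assumption: two consecutive idle sweeps, so one stale under-read cannot suspend the reaper.
  constructor(idleSweepThreshold = 2) {
    this.idleSweepThreshold = idleSweepThreshold;
  }

  /** Record one sweep's observation; returns true when the reaper should re-arm. */
  afterSweep(queued: number, leased: number): boolean {
    if (queued > 0 || leased > 0) {
      // Queued work counts as active, so the gap between enqueue and lease never suspends.
      this.consecutiveIdleSweeps = 0;
    } else {
      this.consecutiveIdleSweeps += 1;
    }
    this.suspended = this.consecutiveIdleSweeps >= this.idleSweepThreshold;
    return !this.suspended;
  }

  /** Called via the wake bus: clear the idle streak and resume sweeping. */
  wakeReaper(): void {
    this.consecutiveIdleSweeps = 0;
    this.suspended = false;
  }

  isSuspended(): boolean {
    return this.suspended;
  }
}

// Usage: one idle sweep is tolerated, the second consecutive one hibernates the reaper.
const gate = new SketchReaperGate();
const rearmAfterFirstIdle = gate.afterSweep(0, 0);
const rearmAfterSecondIdle = gate.afterSweep(0, 0);
```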

Open questions

  • Confirm the Activeloop auto-suspend window. This change rests on hosted compute scaling to zero after a sustained no-query window. The strongest evidence so far is the large cost drop the fleet saw after 062b (consistent with poll frequency driving warm-compute cost, with the residual 30s poll as the last thing keeping it warm). The 062a query meter should confirm that reads go to zero at idle and resume on the next capture, and a before/after comparison of live compute-hours is the final proof. Happy to tune HONEYCOMB_POLL_SUSPEND_AFTER_MS to whatever the real suspend grace window turns out to be.
  • Default posture. Shipped default-on to match 062b's cost-fix posture; easy to flip to default-off (opt-in) if you would rather land it conservatively first.
  • Follow-up: local / pluggable single-user job-queue backing. Hibernation removes idle reads by stopping the timer; a local queue backing (the daemon already ships a node:sqlite store) would remove them structurally for single-user mode, so DeepLake only ever sees batched capture-writes and on-demand recall-reads. That changes the team-sharing contract, so I have written it up as a separate future PRD rather than folding it in here. Happy to coordinate on whether you want it.

  • I've signed the CLA (the bot will confirm)

Summary by CodeRabbit

  • New Features
    • Added idle hibernation for background polling and durable queue reaping, reducing idle DeepLake reads.
    • Introduced a wake bus so new enqueued work immediately resumes hibernated poll loops and the reaper.
    • Added configurable env controls for suspend behavior (with rollback to prior cadence when disabled).
  • Bug Fixes
    • Improved wake/reaper behavior so wake triggers are best-effort and retryable failures re-wake at the right time.
  • Refactor
    • Migrated summary and skillify continuous polling to the shared adaptive poll loop.
  • Tests
    • Expanded coverage for idle suspension, wake semantics, and configuration defaults/rollbacks.

…DeepLake (PRD-062e)

PRD-062b backed the idle poll cadence off to a ~30s ceiling, but a query every
30s still resets Activeloop's compute idle-timer, so DeepLake compute never
scales to zero. This completes 062's "idle daemons must go quiet" by driving
idle reads to actual zero.

After a configurable idle window the adaptive poll loop stops re-arming its
timer entirely (zero DeepLake polls), and a wake() seam fired by the
queue.enqueue() chokepoint resumes it. The job-queue reaper hibernates the same
way once no queued or leased work remains. The summary and skillify workers,
which were still hand-rolling a flat 1000ms setInterval (so they polled at ~1Hz
forever), are moved onto the shared adaptive loop so they hibernate too.

- poll-backoff: clock-free idle accumulator + shouldSuspend()/onWake(); new
  HONEYCOMB_POLL_SUSPEND_ENABLED (default-on) and HONEYCOMB_POLL_SUSPEND_AFTER_MS
  (default 300000; 0 disables) knobs on the existing provider.
- poll-loop: wake() seam, suspended state, skip-re-arm on suspend, idempotent start().
- wake-bus: one tiny registry fans wake() to every loop and the reaper.
- job-queue: onEnqueue callback rings the bus; reaper idle-suspend + wakeReaper().
- assemble: build the bus, register every loop + the reaper, inject onEnqueue,
  thread the resolved backoff config into the summary/skillify builders.

memory_jobs stays append-only version-bumped (a cadence change only, never the
write pattern). Every new behavior is flag-gated and default-safe: with backoff
off the daemon is the exact pre-062 flat path (parent AC-9); with suspend off it
is 062b's steady ~30s cadence. AC-named tests cover suspend/wake on the existing
manual-clock fakes with no live DeepLake.

coderabbitai Bot commented Jun 29, 2026

Walkthrough

Adds PRD-062e idle poll hibernation: poll loops can suspend after an idle window, enqueue wakes propagate through a wake bus, and the job-queue reaper can hibernate and resume. Summary and skillify workers move onto the shared adaptive poll loop.

Changes

PRD-062e: Idle Poll Hibernation

| Layer / File(s) | Summary |
| --- | --- |
| PRD docs and suspension config/state machine: `library/requirements/.../prd-062-deeplake-compute-cost-reduction-index.md`, `library/requirements/.../prd-062e-*.md`, `src/daemon/runtime/services/poll-backoff.ts`, `tests/daemon/runtime/services/poll-backoff.test.ts` | Adds the PRD-062e spec and extends poll-backoff config, runtime state, and tests with suspension flags, idle accumulation, wake reset, and suspend eligibility checks. |
| Wake bus abstraction: `src/daemon/runtime/services/wake-bus.ts` | Adds the wake-bus contracts and implementation for callback registration, wake fan-out, and per-callback error isolation. |
| PollLoop hibernation and wake(): `src/daemon/runtime/services/poll-loop.ts`, `tests/daemon/runtime/services/poll-loop.test.ts` | Extends the adaptive poll loop with start idempotency, suspended state, wake handling, and stop-time lifecycle reset, with tests covering suspension and wake behavior. |
| Worker wake seams and shared loop migration: `src/daemon/runtime/pipeline/stage-worker.ts`, `src/daemon/runtime/pollinating/worker.ts`, `src/daemon/runtime/services/lease-coordinator.ts`, `src/daemon/runtime/summaries/job.ts`, `src/daemon/runtime/skillify/worker.ts` | Adds wake() to worker and coordinator interfaces, then migrates the summary and skillify workers onto the shared PollLoop with optional backoff wiring. |
| Job queue reaper hibernation: `src/daemon/runtime/services/job-queue.ts`, `tests/daemon/runtime/services/job-queue.test.ts` | Adds queue wake hooks, retry wake scheduling, reaper suspension and resumption, and service tests for wake propagation and idle reaper behavior. |
| Daemon assembly wiring: `src/daemon/runtime/assemble.ts` | Builds the wake bus, threads backoff into summary and skillify worker construction, wires queue wake callbacks, and registers poll-loop and reaper wake handlers during startup. |

Sequence Diagram(s)

sequenceDiagram
  participant JobQueueService
  participant WakeBus
  participant PollLoop
  participant Reaper
  JobQueueService->>WakeBus: onWake after enqueue
  WakeBus->>PollLoop: wake()
  WakeBus->>Reaper: wakeReaper()
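
The fan-out in the diagram, together with the per-callback error isolation mentioned in the change table, could be sketched as follows. `SketchWakeBus` is a hypothetical stand-in for wake-bus.ts; that `register()` returns a cleanup function is stated in this review, while the failure count returned by `wake()` is an assumption added so the isolation is observable.

```typescript
type SketchWakeCallback = () => void;

class SketchWakeBus {
  private readonly callbacks = new Set<SketchWakeCallback>();

  /** Register a callback; the returned function unregisters it. */
  register(cb: SketchWakeCallback): () => void {
    this.callbacks.add(cb);
    return () => {
      this.callbacks.delete(cb);
    };
  }

  /** Fan one wake out to every callback; returns how many callbacks threw. */
  wake(): number {
    let failures = 0;
    // Snapshot first so a callback that unregisters itself cannot disturb iteration.
    for (const cb of Array.from(this.callbacks)) {
      try {
        cb();
      } catch {
        failures += 1; // isolate the failure so the remaining callbacks still run
      }
    }
    return failures;
  }
}

// Usage: a poll loop, a throwing callback, and the reaper all share one bus.
const woken: string[] = [];
const bus = new SketchWakeBus();
bus.register(() => {
  woken.push("pollLoop");
});
bus.register(() => {
  throw new Error("broken subscriber");
});
const unregisterReaper = bus.register(() => {
  woken.push("reaper");
});
const failedCallbacks = bus.wake();
```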

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 The loops were sleepy, soft, and slow,
Then wake-bus bells began to glow.
One enqueue hop, and off they run,
Until the idle reads are none.
I twitch my nose: the hutch is bright,
DeepLake sleeps well through the night.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: daemon idle poll hibernation to stop DeepLake polling. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 85.71% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/daemon/runtime/assemble.ts (1)

2580-2584: 🩺 Stability & Availability | 🔵 Trivial | ⚡ Quick win

Unregister wake callbacks during shutdown.

wakeBus.register() returns cleanup functions, but these callbacks are retained across shutdown()/start() cycles. Store the unregister callbacks and drain them in shutdown() to avoid duplicate wakes and stale stopped worker references.
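
One way to read this suggestion is the sketch below: keep every cleanup function that `register()` returns and drain the list in `shutdown()`. `MiniBus` and `SketchRuntime` are hypothetical stand-ins, not the code in assemble.ts.

```typescript
type Unregister = () => void;

// Minimal stand-in for the wake bus; only register() and wake() are modeled.
class MiniBus {
  private readonly cbs = new Set<() => void>();

  register(cb: () => void): Unregister {
    this.cbs.add(cb);
    return () => {
      this.cbs.delete(cb);
    };
  }

  wake(): void {
    Array.from(this.cbs).forEach((cb) => cb());
  }

  size(): number {
    return this.cbs.size;
  }
}

class SketchRuntime {
  wakeCount = 0;
  private readonly bus: MiniBus;
  private unregisters: Unregister[] = [];

  constructor(bus: MiniBus) {
    this.bus = bus;
  }

  start(): void {
    // Keep the cleanup function so shutdown() can remove this registration again.
    this.unregisters.push(
      this.bus.register(() => {
        this.wakeCount += 1;
      }),
    );
  }

  shutdown(): void {
    // splice(0) empties the list and returns what it held, so a second shutdown is a no-op.
    for (const off of this.unregisters.splice(0)) off();
  }
}

// Usage: a start/shutdown/start cycle leaves exactly one live registration.
const miniBus = new MiniBus();
const runtime = new SketchRuntime(miniBus);
runtime.start();
runtime.shutdown();
runtime.start();
miniBus.wake();
```

Without the drain, the second `start()` would leave two callbacks on the bus and a single wake would be counted twice.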

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/daemon/runtime/assemble.ts` around lines 2580 - 2584, The wake callbacks
registered in assembleRuntime via wakeBus.register() are not being cleaned up
across shutdown/start cycles, which leaves stale worker references and can cause
duplicate wakes. Store the unregister functions returned by each
wakeBus.register() call alongside the existing wake registrations, and make
shutdown() drain/unregister them before the next start(). Keep the fix localized
to the wake registration block and the shutdown lifecycle handling so the worker
wake handlers are recreated cleanly each cycle.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/daemon/runtime/assemble.ts`:
- Around line 1932-1935: The wake path in the runtime assembler only responds to
enqueue events, so retryable jobs scheduled by fail() with a future next_run_at
can get stuck when all loops hibernate. Update the wake wiring in assemble()
around the wakeBus configuration to also trigger a wake from the earliest
pending retry deadline, or keep the poll loop active while failed retry jobs
remain pending. Use the existing fail(), next_run_at, and wakeBus/onEnqueue
setup to locate the retry scheduling and add a timer-based wake source there.

In `@src/daemon/runtime/services/job-queue.ts`:
- Around line 495-497: The enqueue path in job-queue’s enqueue flow should treat
the wake hook as best-effort, not as part of the durable write. Update the code
around this.onEnqueue so a throw from the wake bus callback does not make
enqueue() reject after the job has already been appended; instead, catch and log
the failure while allowing the successful enqueue to complete. Use the existing
enqueue/onEnqueue path in JobQueue to locate the change and keep the hook
isolated from the persistence result.

In `@src/daemon/runtime/services/poll-loop.ts`:
- Around line 173-183: The wake() path in poll-loop should also reschedule
already-armed adaptive loops, not just fully suspended ones. Update the wake
logic in the wake() method so that after backoff.onWake() it can pull a live
timer back to the floor when the loop is backed off but not suspended, instead
of returning early; use the existing state fields this.suspended, this.backoff,
and the scheduling helper this.scheduleNext() to ensure
enqueue()->wakeBus.wake() causes the loop to poll immediately on new work.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 03a7385b-005d-4bab-8b84-f88ebb193aea

📥 Commits

Reviewing files that changed from the base of the PR and between d45d944 and 5d1e71c.

📒 Files selected for processing (15)
  • library/requirements/completed/prd-062-deeplake-compute-cost-reduction/prd-062-deeplake-compute-cost-reduction-index.md
  • library/requirements/completed/prd-062-deeplake-compute-cost-reduction/prd-062e-deeplake-compute-cost-reduction-idle-hibernation.md
  • src/daemon/runtime/assemble.ts
  • src/daemon/runtime/pipeline/stage-worker.ts
  • src/daemon/runtime/pollinating/worker.ts
  • src/daemon/runtime/services/job-queue.ts
  • src/daemon/runtime/services/lease-coordinator.ts
  • src/daemon/runtime/services/poll-backoff.ts
  • src/daemon/runtime/services/poll-loop.ts
  • src/daemon/runtime/services/wake-bus.ts
  • src/daemon/runtime/skillify/worker.ts
  • src/daemon/runtime/summaries/job.ts
  • tests/daemon/runtime/services/job-queue.test.ts
  • tests/daemon/runtime/services/poll-backoff.test.ts
  • tests/daemon/runtime/services/poll-loop.test.ts

Comment thread src/daemon/runtime/assemble.ts Outdated
Comment thread src/daemon/runtime/services/job-queue.ts Outdated
Comment thread src/daemon/runtime/services/poll-loop.ts Outdated
- poll-loop wake(): pull a merely backed-off (not just fully suspended) loop's
  timer back to the floor, so the enqueue-driven wake actually picks up new work
  immediately instead of waiting out a stale long delay; also guard against waking
  a never-started loop (the deferred-under-consolidation case).
- job-queue: schedule a one-shot fleet-wake at a failed job's retry deadline, so a
  hibernated daemon resumes to lease the retry on time rather than waiting for
  unrelated activity. The wake hook is renamed onEnqueue -> onWake since it now
  fires on both an enqueue and a retry deadline; the pending timers are cleared on stop.
- job-queue: isolate the wake hook from the durable write. A throwing wake is caught
  and logged, so it never rejects an already-appended enqueue (which would invite a
  duplicate retry).

New tests: wake() pulls a backed-off timer to the floor; a retryable failure schedules
a deadline wake; a throwing wake does not reject enqueue. npm run ci green locally
(the only failures are the pre-existing macOS realpath tests, unrelated).
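
The first bullet of this commit (a wake pulling a backed-off timer to the floor, and staying a no-op for a never-started loop) might look roughly like the sketch below. `SketchManualClock` and `SketchLoop` are hypothetical; the real AdaptivePollLoop also tracks suspension, in-flight ticks, and backoff state, all omitted here.

```typescript
type PendingTimer = { id: number; delayMs: number; fn: () => void };

// Hypothetical manual clock in the spirit of the test fakes; timers never fire on their own.
class SketchManualClock {
  private nextId = 1;
  private timers: PendingTimer[] = [];

  setTimer(fn: () => void, delayMs: number): number {
    const id = this.nextId++;
    this.timers.push({ id, delayMs, fn });
    return id;
  }

  clearTimer(id: number): void {
    this.timers = this.timers.filter((t) => t.id !== id);
  }

  pendingDelays(): number[] {
    return this.timers.map((t) => t.delayMs);
  }
}

class SketchLoop {
  private handle: number | null = null;
  private started = false;
  private readonly clock: SketchManualClock;
  private readonly floorMs: number;

  constructor(clock: SketchManualClock, floorMs: number) {
    this.clock = clock;
    this.floorMs = floorMs;
  }

  start(): void {
    if (this.started) return; // idempotent start
    this.started = true;
    this.arm(this.floorMs);
  }

  /** Test helper: simulate the loop having backed off to a long delay. */
  armBackedOff(delayMs: number): void {
    this.arm(delayMs);
  }

  wake(): void {
    if (!this.started) return; // a never-started loop has nothing to wake
    this.arm(this.floorMs); // cancels the stale long timer, then re-arms at the floor
  }

  private arm(delayMs: number): void {
    // Always cancel before arming, so at most one timer is ever pending.
    if (this.handle !== null) this.clock.clearTimer(this.handle);
    this.handle = this.clock.setTimer(() => {
      this.handle = null;
    }, delayMs);
  }
}

// Usage: a loop sitting on a 30s delay is pulled back to the 1s floor by wake().
const loopClock = new SketchManualClock();
const loop = new SketchLoop(loopClock, 1000);
loop.wake(); // not started yet: no timer is armed
const beforeStart = loopClock.pendingDelays();
loop.start();
loop.armBackedOff(30000);
loop.wake();
const afterWake = loopClock.pendingDelays();
```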
@chrisl10

Contributor Author

Thanks @coderabbitai, all three were fair. Addressed in 23316f9:

  1. Retry-deadline wake (Major). Real gap: once every loop hibernates, nothing wakes them at a failed job's next_run_at. fail() now schedules a one-shot fleet-wake at the retry deadline (cleared on stop()), so a hibernated daemon resumes to lease the retry on time. Benign under the default backoff (retries are ≤8s, well under the 5-min suspend window) but a real bug once the suspend window is tuned below the retry delay, so worth closing. Renamed the hook onEnqueue -> onWake since it now fires on both an enqueue and a retry deadline.
  2. Isolate the wake hook from the durable write (Major). Done: the wake is now best-effort and caught/logged (job.wake.failed), so a throwing bus callback can never reject an already-appended enqueue.
  3. wake() pulling an armed timer to the floor (Major). Good catch on the pre-suspend idle case. wake() now cancels the live (stale long-delay) timer and re-arms at the floor for a merely backed-off loop, not just a fully suspended one, and I added the !started guard so waking a deferred-under-consolidation loop stays a no-op.
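
Point 2 can be illustrated with a short sketch of a best-effort wake hook. It assumes an in-memory queue; `SketchDurableQueue`, the `LogFn` shape, and the `"enqueue"` trigger value are hypothetical, while the `job.wake.failed` event name and the `"retry-deadline"` trigger appear in this PR.

```typescript
type WakeEvent = { trigger: "enqueue" | "retry-deadline" };
type WakeHook = (event: WakeEvent) => void;
type LogFn = (event: string, error: unknown) => void;

class SketchDurableQueue {
  private readonly rows: string[] = [];
  private readonly log: LogFn;
  private readonly onWake: WakeHook | undefined;

  constructor(log: LogFn, onWake?: WakeHook) {
    this.log = log;
    this.onWake = onWake;
  }

  /** Appends first, then rings the wake hook; returns the row count after the append. */
  enqueue(payload: string): number {
    this.rows.push(payload); // stand-in for the durable append
    this.fireWake({ trigger: "enqueue" });
    return this.rows.length;
  }

  private fireWake(event: WakeEvent): void {
    if (this.onWake === undefined) return;
    try {
      this.onWake(event);
    } catch (error) {
      // Best-effort: the job is already appended, so a failing wake is logged, not rethrown.
      this.log("job.wake.failed", error);
    }
  }
}

// Usage: the hook throws, yet enqueue() still resolves with the appended row.
const loggedEvents: string[] = [];
const durableQueue = new SketchDurableQueue(
  (event) => {
    loggedEvents.push(event);
  },
  () => {
    throw new Error("bus callback failed");
  },
);
const rowCount = durableQueue.enqueue("capture");
```

If the throw escaped instead, a caller would see a rejected enqueue for a job that was in fact written, and a retry would append a duplicate.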

New tests cover each: wake pulls a backed-off timer to the floor, a retryable failure schedules a deadline wake, and a throwing wake does not reject enqueue. npm run ci is green locally.

coderabbitai Bot commented Jun 29, 2026

@chrisl10 Thanks for the thorough follow-up — this addresses the gaps I called out.

The retry-deadline fleet wake closes the hibernated-retry hole, making the suspend behavior safe even when the suspend window is tuned below the retry delay. Catching/logging wake-hook failures also cleanly decouples wake propagation from the durable append path, and updating wake() to pull an already-armed backed-off timer down to the floor fixes the pre-suspend latency case nicely.

Appreciate the added test coverage and local CI confirmation. Looks good from my side.

(ᵔᴥᵔ)


✏️ Learnings added
Learnt from: chrisl10
URL: https://github.com/legioncodeinc/honeycomb/pull/185

Timestamp: 2026-06-29T08:24:58.120Z
Learning: In `src/daemon/runtime/services/job-queue.ts`, the daemon wake callback was renamed from `onEnqueue` to `onWake` because the same wake pathway is used for both successful enqueue events and retry-deadline wake scheduling from `fail()`.

Learnt from: chrisl10
URL: https://github.com/legioncodeinc/honeycomb/pull/185

Timestamp: 2026-06-29T08:24:58.120Z
Learning: In `src/daemon/runtime/services/job-queue.ts`, wake propagation after a successful durable append must be best-effort: exceptions from the wake callback are caught and logged as `job.wake.failed` so they cannot cause `enqueue()` to reject after the job has already been appended.

Learnt from: chrisl10
URL: https://github.com/legioncodeinc/honeycomb/pull/185

Timestamp: 2026-06-29T08:24:58.120Z
Learning: In `src/daemon/runtime/services/poll-loop.ts`, `wake()` must not only resume fully suspended adaptive loops; it must also cancel an already-armed backed-off timer and re-arm at the floor delay, while remaining a no-op when the loop has not been started.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/daemon/runtime/services/job-queue.ts`:
- Around line 523-540: The retry-deadline wake in scheduleRetryWake is only kept
in memory and gets cleared on stop, so restart loses pending wakes for failed
jobs. Add startup/reaper reconstruction logic that scans existing JOB_FAILED
rows with a future next_run_at and re-arms their wake timers after the daemon
comes back up, reusing the existing scheduleRetryWake and fireWake flow. Make
sure the recovery path is wired into the same lifecycle that currently manages
retryWakes so pending wakes survive restarts.

In `@src/daemon/runtime/services/poll-loop.ts`:
- Around line 180-188: The wake reset is being lost when wake() runs while
tick() is already in flight, so PollLoop can re-arm from a backed-off state
after a false result. Update the wake()/finally() flow in poll-loop.ts to track
a pending wake while this.running is true, and in the tick completion path
re-apply backoff.onWake() after the outcome is recorded before scheduling the
next cycle. Use the existing wake(), onWake(), onEmptyLease(), and finally()
logic in PollLoop to locate the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 5cf6cad6-c45d-4fc0-ae74-721542f59923

📥 Commits

Reviewing files that changed from the base of the PR and between 5d1e71c and 23316f9.

📒 Files selected for processing (5)
  • src/daemon/runtime/assemble.ts
  • src/daemon/runtime/services/job-queue.ts
  • src/daemon/runtime/services/poll-loop.ts
  • tests/daemon/runtime/services/job-queue.test.ts
  • tests/daemon/runtime/services/poll-loop.test.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/daemon/runtime/services/poll-loop.test.ts
  • tests/daemon/runtime/services/job-queue.test.ts
  • src/daemon/runtime/assemble.ts

Comment on lines +523 to +540
/**
 * Schedule a ONE-SHOT fleet-wake at a `failed` job's retry deadline (PRD-062e). Without
 * this, once every loop hibernates no enqueue occurs at `next_run_at`, so a retryable
 * job could sit until unrelated activity wakes the bus. The timer fires once (it clears
 * itself), rings the same wake hook an enqueue does, and is cleaned up on {@link stop}.
 * A no-op when no wake hook is wired (a standalone/test queue).
 */
private scheduleRetryWake(delayMs: number): void {
	if (this.onWake === undefined) return;
	let handle: unknown;
	handle = this.clock.setTimer(() => {
		// One-shot over the interval-based clock seam: cancel before firing so it runs once.
		this.clock.clearTimer(handle);
		this.retryWakes.delete(handle);
		this.fireWake({ trigger: "retry-deadline" });
	}, Math.max(1, delayMs));
	this.retryWakes.add(handle);
}

🩺 Stability & Availability | 🟠 Major | 🏗️ Heavy lift

Rebuild retry-deadline wakes after restart.

Line 530 schedules retry wakes only in memory, and Line 975 clears them on stop. If the daemon restarts before a persisted failed job’s next_run_at, no wake is reconstructed, so hibernated loops can sleep past the retry deadline. Recreate pending retry wakes from existing JOB_FAILED rows during startup or the reaper’s discovery sweep.

Also applies to: 967-976
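
A rough sketch of what such a reconstruction pass could look like is below. The row shape (`SketchJobRow`, `nextRunAtMs` as epoch milliseconds) and both function names are hypothetical; only `scheduleRetryWake(delayMs)` mirrors the method quoted above, and it is passed in as a callback so the sketch stays self-contained. Skipping already-due rows assumes the start-up poll would lease them anyway.

```typescript
type SketchJobRow = {
  id: string;
  status: "queued" | "leased" | "failed" | "done";
  nextRunAtMs: number | null; // hypothetical epoch-ms form of next_run_at
};

/** Delays until each failed row's future retry deadline, in row order. */
function pendingRetryDelays(rows: SketchJobRow[], nowMs: number): number[] {
  const delays: number[] = [];
  for (const row of rows) {
    if (row.status !== "failed" || row.nextRunAtMs === null) continue;
    if (row.nextRunAtMs <= nowMs) continue; // already due (assumed to be picked up at start-up)
    delays.push(row.nextRunAtMs - nowMs);
  }
  return delays;
}

/** Re-arms one wake per pending retry; returns how many were scheduled. */
function rearmRetryWakes(
  rows: SketchJobRow[],
  nowMs: number,
  scheduleRetryWake: (delayMs: number) => void,
): number {
  const delays = pendingRetryDelays(rows, nowMs);
  delays.forEach((delayMs) => scheduleRetryWake(delayMs));
  return delays.length;
}

// Usage: only the failed row with a future deadline gets a wake after restart.
const persistedRows: SketchJobRow[] = [
  { id: "a", status: "failed", nextRunAtMs: 15000 },
  { id: "b", status: "failed", nextRunAtMs: 5000 },
  { id: "c", status: "queued", nextRunAtMs: null },
  { id: "d", status: "done", nextRunAtMs: 99999 },
];
const scheduledDelays: number[] = [];
const rearmed = rearmRetryWakes(persistedRows, 10000, (delayMs) => {
  scheduledDelays.push(delayMs);
});
```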

Comment on lines +180 to +188
wake(): void {
	// A never-started, stopped, or flat (non-adaptive) loop has nothing to wake.
	if (!this.started || this.stopped || this.backoff === null) return;
	// Snap the cadence back to fast and clear the idle accumulator so the just-woken
	// loop polls immediately and cannot re-suspend until it goes idle again.
	this.backoff.onWake();
	// A tick is in flight: its finally() will re-arm exactly once at the reset floor,
	// so arming here too would double-arm.
	if (this.running) return;

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Preserve the wake reset when a tick is already running.

Line 185 resets the backoff, but if the in-flight tick() later resolves false, Line 153 calls onEmptyLease() after the wake and schedules from a backed-off state instead of the floor. Track a pending wake while running and re-apply onWake() in finally() after the tick outcome is recorded.

Proposed fix
 	/** Guards against overlapping ticks on the poll loop (the workers' `running` flag). */
 	private running = false;
+	private wakePending = false;
@@
 				// Flat path relies on the repeating interval and does not reschedule here.
 				if (backoff === null || this.stopped) return;
+				if (this.wakePending) {
+					this.wakePending = false;
+					backoff.onWake();
+				}
@@
 		// A tick is in flight: its finally() will re-arm exactly once at the reset floor,
 		// so arming here too would double-arm.
-		if (this.running) return;
+		if (this.running) {
+			this.wakePending = true;
+			return;
+		}
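
For readers who want to see the race end to end, here is a self-contained sketch of the same pending-wake idea, separate from the diff above. `MiniBackoff`, `SketchTickGuard`, `beginTick()`, and `endTick()` are hypothetical names, and the 1000ms floor with doubling to a 30000ms ceiling is an assumed cadence.

```typescript
class MiniBackoff {
  delayMs = 1000;

  onEmptyLease(): void {
    this.delayMs = Math.min(30000, this.delayMs * 2);
  }

  onWake(): void {
    this.delayMs = 1000;
  }
}

class SketchTickGuard {
  readonly backoff = new MiniBackoff();
  private running = false;
  private wakePending = false;

  beginTick(): void {
    this.running = true;
  }

  wake(): void {
    this.backoff.onWake();
    // A tick is in flight: remember the wake so the tick's outcome cannot undo it.
    if (this.running) this.wakePending = true;
  }

  /** Completion path of a tick; returns the delay the next poll would be armed with. */
  endTick(leased: boolean): number {
    if (!leased) this.backoff.onEmptyLease(); // record the outcome first
    if (this.wakePending) {
      this.wakePending = false;
      this.backoff.onWake(); // then re-apply the wake that arrived mid-tick
    }
    this.running = false;
    return this.backoff.delayMs;
  }
}

// Usage: a wake lands while an empty tick is in flight.
const guard = new SketchTickGuard();
guard.beginTick();
const delayAfterPlainEmptyTick = guard.endTick(false);
guard.beginTick();
guard.wake();
const delayAfterWokenTick = guard.endTick(false);
```

Without the pending flag, the second tick's empty result would back off from the freshly reset floor and the loop would re-arm at a longer delay despite new work having arrived.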
