feat: watch mode — keep the index live as files change #336

@HumanBean17

feat: --watch mode — keep the index live as files change

Summary

Add a long-running java-codebase-rag watch (or init/increment --watch) that
listens for source-tree changes and updates the index incrementally in the
background, so the index stays current without the operator re-running
increment by hand.

Problem

Today the index is static between explicit increment invocations. During
active development (tight edit→search loops inside an MCP host), the index
drifts from the working tree until the user remembers to re-run increment.
For a tool whose value is "fresh, correct context for the agent," stale-by-
default is the wrong default.

Why this is more tractable than it looks

The hard parts of incremental update already exist:

  • Vectors: cocoindex @coco.fn(memo=True) skips unchanged files automatically
    (java_index_flow_lancedb.py). A single changed file ≈ a single embed + a
    small merge-insert.
  • Graph: build_ast_graph.py::incremental_rebuild (--incremental) already
    detects added/changed/removed files and their dependents (cross-file
    CALLS / HTTP_CALLS / ASYNC_CALLS edges are re-scoped), and rewrites only
    that scope. The dependent-rescope logic — usually the gnarly part of a code
    graph watcher — is done.
  • Ignore: LayeredIgnore already decides what's in/out of the index, so the
    watcher can filter noise for free.

So watch is mostly a trigger/wiring problem, not a core-algorithm problem.
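
As a rough illustration of the filtering step, the sketch below reduces a burst of raw event paths to the set worth re-indexing. The `IGNORE_GLOBS` list and `is_ignored` function are hypothetical stand-ins for LayeredIgnore, whose real API is not shown in this issue, and restricting to `.java` files is an assumption.

```python
from fnmatch import fnmatch
from typing import Callable, Iterable, List

# Hypothetical stand-in for LayeredIgnore; the real rules live in the project.
IGNORE_GLOBS = ["target/*", "build/*", "*.class"]


def is_ignored(rel_path: str) -> bool:
    """Return True when a repo-relative path matches any ignore glob."""
    return any(fnmatch(rel_path, pattern) for pattern in IGNORE_GLOBS)


def relevant_changes(
    paths: Iterable[str],
    ignored: Callable[[str], bool] = is_ignored,
) -> List[str]:
    """Keep unique, non-ignored Java sources in a stable (sorted) order."""
    return sorted({p for p in paths if p.endswith(".java") and not ignored(p)})


events = [
    "src/main/java/A.java",
    "target/generated-sources/B.java",  # build output, ignored
    "src/main/java/A.java",             # duplicate event from a double save
    "README.md",                        # not a Java source
]
changed = relevant_changes(events)
```

In a real watcher the predicate would be LayeredIgnore itself, so the watcher and the indexer agree on what is in scope.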

Proposed UX

java-codebase-rag watch --source-root <repo> [--index-dir <dir>] [--debounce 1.5s]
# runs until Ctrl-C; prints a structured event per applied update

Optional: init --watch / increment --watch flags that enter watch mode after
the initial pass.
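
The command line above could be parsed roughly as follows. This is a minimal sketch using the standard library's argparse; `parse_duration`, `build_parser`, the accepted duration units, and the `None` default for `--index-dir` are assumptions for illustration, not the project's actual CLI code.

```python
import argparse
import re

_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(ms|s|m)?$")
_UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0, None: 1.0}


def parse_duration(text: str) -> float:
    """Convert '1.5s', '250ms', '2m' or a bare number into seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise argparse.ArgumentTypeError(f"invalid duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="java-codebase-rag")
    sub = parser.add_subparsers(dest="command", required=True)
    watch = sub.add_parser("watch", help="keep the index live as files change")
    watch.add_argument("--source-root", required=True)
    watch.add_argument("--index-dir", default=None)  # assumed: resolved later
    watch.add_argument("--debounce", type=parse_duration, default=1.5)
    return parser


args = build_parser().parse_args(
    ["watch", "--source-root", "/repo", "--debounce", "1.5s"]
)
```

Parsing the debounce into float seconds at the CLI boundary keeps the rest of the watcher free of unit handling.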

Design options (the part to grill)

Option A — Watchdog → debounced increment (recommended MVP)

watchdog (FSEvents on macOS, inotify on Linux) watches the tree; events are
debounced (editors multi-save; build tools rewrite generated files); on a quiet
window we invoke the existing increment path (run_cocoindex_update +
run_incremental_graph).

  • ✅ Reuses the proven subprocess architecture (flow runs in a child cocoindex
    process today; this changes nothing about that).
  • ✅ Smallest blast radius — no new coupling to cocoindex internals.
  • ❌ Re-spawns the cocoindex child per flush → per-flush process startup +
    model-load cost (mitigated: the model stays in the OS page cache, so the
    remaining cost is the ~1s embedder warm-up, acceptable at human edit cadence).
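
The debounce step of Option A can be sketched without any watcher library. The `Debouncer` class below is a hypothetical illustration, not existing project code: it collects changed paths and releases them as one batch once no event has arrived for the quiet window. The clock is injectable so the behaviour can be checked deterministically.

```python
import time
from typing import Callable, Optional, Set


class Debouncer:
    """Collects changed paths and releases them once events go quiet."""

    def __init__(self, quiet_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._quiet = quiet_seconds
        self._clock = clock
        self._pending: Set[str] = set()
        self._last_event: Optional[float] = None

    def notify(self, path: str) -> None:
        """Record one filesystem event; this restarts the quiet window."""
        self._pending.add(path)
        self._last_event = self._clock()

    def poll(self) -> Optional[Set[str]]:
        """Return the pending batch if the quiet window elapsed, else None."""
        if self._last_event is None:
            return None
        if self._clock() - self._last_event < self._quiet:
            return None
        batch, self._pending = self._pending, set()
        self._last_event = None
        return batch


# Demonstration with a fake clock (seconds) instead of real time.
now = [0.0]
debouncer = Debouncer(quiet_seconds=1.5, clock=lambda: now[0])
debouncer.notify("src/main/java/A.java")
now[0] = 1.0
debouncer.notify("src/main/java/A.java")  # editor double-save, same file
debouncer.notify("src/main/java/B.java")
now[0] = 2.0
early = debouncer.poll()                  # only 1.0s of quiet so far
now[0] = 2.5
batch = debouncer.poll()                  # 1.5s of quiet: batch is released
```

In the real watcher, a watchdog event handler would call `notify()` and the main loop would call `poll()` and then invoke the existing increment path with the batch. Since watchdog delivers events on its own thread, a lock around the shared state would be needed there; this sketch omits it.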

Option B — cocoindex LiveComponent (proper reactive vectors)

Use cocoindex's live-component model (coco.LiveComponentOperator,
update_full / mark_ready / update / delete, watchdog-driven) so vectors
update in-process without re-spawning, paired with graph incremental_rebuild.

  • ✅ No per-flush child spawn; true incremental vector upsert.
  • ❌ Architectural mismatch. The flow currently runs as a short-lived child
    (cocoindex update …). LiveComponent requires a long-lived process
    holding the flow live — i.e. we'd host cocoindex in-process as a daemon. This
    is a real shift in how the lifecycle works, and conflicts with the current
    "spawn and wait" model in pipeline.py.
  • ❌ cocoindex's live model only covers the vectors half; the graph half
    still needs our own incremental_rebuild trigger, so we'd carry two update
    mechanisms (one cocoindex-native, one ours) that must stay coherent.

Recommendation

Ship Option A first (debounced increment). Treat Option B as a future
optimization only if per-flush spawn cost is measured to matter at real edit
cadence.

Open questions (please grill)

  1. optimize() throttling. increment ends with a serialized
    table.optimize() (full Lance compaction). On a watch loop firing per edit,
    that's unacceptable overhead. Proposal: in watch mode, skip optimize per
    flush and run it on a coarse timer (e.g. every 60s of activity, or every N
    flushes). Acceptable? Does un-compacted-with-deletes hurt query correctness,
    or only latency/disk? (Recall: query correctness does not depend on
    optimize(); only compaction and pruning do.)
  2. FTS index freshness. The FTS index is created lazily by
    ensure_text_fts_index on first search and is not refreshed as rows are
    added/removed. Under continuous watch, do deleted chunks keep appearing in
    find results, and do added chunks stay missing, until the index is
    re-created? Need to confirm whether the FTS index is append-friendly or
    needs a periodic replace=True.
  3. Process model. Watch is a long-running daemon; the MCP server is a
    short-lived stdio process per host session. Should watch be (a) a separate
    background process the user starts (simplest, decoupled), or (b) something the
    MCP server spawns/manages? Lean (a) — keeps the stdio server lean and avoids
    lifecycle coupling. Confirm.
  4. MCP server cache invalidation. Does the running MCP server (or its
    callers) cache anything that a background watch update would make stale? If
    so, watch needs a signal/heartbeat the server can observe (or the server
    must re-open tables per request).
  5. Editor/build-tool noise. target/, build/, generated sources. Do we
    rely solely on LayeredIgnore, or add a watch-specific debounce/heuristic for
    rapid burst writes (e.g. a mvn compile touching hundreds of files)?
  6. Coherence across the two stores. If vectors update but the graph rebuild
    fails (or vice-versa) mid-flush, the index is temporarily inconsistent. Is
    best-effort eventual consistency acceptable for a background freshness
    feature (vs. the strict consistency init/increment promise)?
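
For question 1, the "every N flushes or every 60s of activity" policy could look like the sketch below. `OptimizeThrottle` and its default thresholds are hypothetical; the caller would run table.optimize() only when `record_flush()` returns True.

```python
import time
from typing import Callable


class OptimizeThrottle:
    """Decides when a watch loop should pay for compaction."""

    def __init__(self, max_flushes: int = 20, max_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._max_flushes = max_flushes
        self._max_seconds = max_seconds
        self._clock = clock
        self._flushes = 0
        self._last_optimize = clock()

    def record_flush(self) -> bool:
        """Call after each applied flush; True means 'optimize now'."""
        self._flushes += 1
        elapsed = self._clock() - self._last_optimize
        if self._flushes >= self._max_flushes or elapsed >= self._max_seconds:
            self._flushes = 0
            self._last_optimize = self._clock()
            return True
        return False


# Demonstration with a fake clock: flushes at t = 1, 2, 3, 10 and 80 seconds.
now = [0.0]
throttle = OptimizeThrottle(max_flushes=3, max_seconds=60.0,
                            clock=lambda: now[0])
decisions = []
for timestamp in (1.0, 2.0, 3.0, 10.0, 80.0):
    now[0] = timestamp
    decisions.append(throttle.record_flush())
```

Because the check only runs on a flush, an idle tree never triggers compaction, which matches "60s of activity". A final optimize on shutdown would probably still be wanted so the tables are not left un-compacted.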

Non-goals

  • Real-time (<250ms) updates — human edit cadence with debounce is the target.
  • Distributed / multi-writer index access.
  • Replacing the increment command — watch layers on top of the same primitives.

Rough effort

Option A MVP: ~1 PR (watchdog integration + debounce + increment reuse +
optimize throttling + tests). Option B: multi-PR, blocked on the lifecycle
decision in Q3.
