From f727f1512f5015691a52070d9004e48d52917517 Mon Sep 17 00:00:00 2001
From: Dmitry Teryaev <doudmitry@gmail.com>
Date: Sun, 21 Jun 2026 21:35:26 +0300
Subject: [PATCH 1/2] docs(propose): init/increment perf program (bulk graph
 writes, cached ignore, mps)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Proposal-only. Profiles init at ~395s on a medium Java corpus and sequences
three measured, independent levers as four PRs:
- PR-1: bulk COPY FROM for the full rebuild path (init/reprocess) — the ~81%
  graph-write lever; init projected ~395s -> ~120s.
- PR-2: same primitive extended to the incremental path.
- PR-3: hoist LayeredIgnore to a flow-lifespan ContextKey — ~25s -> ~0s.
- PR-4: default embedding device cuda -> mps -> cpu — ~28s -> ~16s on Apple Silicon.

No ontology bump; PR-1/2/3 re-index-free; PR-4 optional re-index callout.
ANN index (parked, #337) and watch mode (#336) explicitly out of scope.

Co-Authored-By: Claude <noreply@anthropic.com>
---
 propose/active/INIT-INCREMENT-PERF-PROPOSE.md | 191 ++++++++++++++++++
 1 file changed, 191 insertions(+)
 create mode 100644 propose/active/INIT-INCREMENT-PERF-PROPOSE.md
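
Note on the PR-1 equivalence harness: the comparison it describes (node
count, per-type edge counts, identical rows) could take roughly the shape
below. This is a minimal sketch under assumptions, not the harness itself:
`graph_fingerprint` and `assert_equivalent` are hypothetical names, and the
plain dicts stand in for rows that a real harness would read back from the
per-row build and the bulk build.

```python
from collections import Counter


def graph_fingerprint(nodes, edges):
    """Summarize a graph so two builds can be compared regardless of row order."""
    return {
        "node_count": len(nodes),
        "node_rows": Counter(tuple(sorted(n.items())) for n in nodes),
        "edge_counts": Counter(e["type"] for e in edges),
        "edge_rows": Counter(tuple(sorted(e.items())) for e in edges),
    }


def assert_equivalent(old, new):
    """Raise AssertionError naming the first fingerprint section that differs."""
    for section in old:
        assert old[section] == new[section], f"graph mismatch in {section}"


# Hypothetical rows: the same graph written per-row (old) and in bulk (new).
nodes = [{"id": "A.run", "kind": "METHOD"}, {"id": "B.save", "kind": "METHOD"}]
old_edges = [
    {"type": "CALLS", "src": "A.run", "dst": "B.save", "source_file": "A.java"},
    {"type": "DECLARES", "src": "A", "dst": "A.run", "source_file": "A.java"},
]
new_edges = list(reversed(old_edges))  # a bulk load may return rows in another order

assert_equivalent(graph_fingerprint(nodes, old_edges), graph_fingerprint(nodes, new_edges))
```

Comparing multisets of full rows rather than counts alone means a wrong
property value fails the check even when every count still matches.
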
diff --git a/propose/active/INIT-INCREMENT-PERF-PROPOSE.md b/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
new file mode 100644
index 00000000..0dbb96b7
--- /dev/null
+++ b/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
@@ -0,0 +1,191 @@
+# INIT-INCREMENT-PERF-PROPOSE
+
+## Status
+Proposal — not yet implemented. Design-only; no production code in this PR.
+Scope agreed with maintainer: PR-1 rewrites the **full rebuild path only**
+(init / reprocess); the incremental path and the two smaller levers follow as
+separate PRs under this same proposal.
+
+## Problem Statement
+`init` / `increment` wall-clock is the project's stated pain point. A profiled
+`java-codebase-rag init` on a medium Java corpus (Shopizer: 1210 files → 1167
+indexed, 3879 chunks, ~32k graph edges, total **395s**) breaks down as:
+
+| phase | time | share |
+|---|---|---|
+| **LadybugDB graph write** (edges ~250s + nodes ~62s + routes/meta ~10s) | **~322s** | **~81%** |
+| cocoindex vectors (embed ~28s + LayeredIgnore-per-file ~25s + parse/enrich ~3s + LanceDB ~12s) | ~68s | ~17% |
+| optimize | ~5s | ~1% |
+
+Three independent root causes surface:
+
+1. **Per-row graph writes (the ~81% lever).** `build_ast_graph.py` writes every
+   node and edge one statement at a time via `conn.execute(query, row)` inside
+   loops — nodes at `build_ast_graph.py:3046-3093` (`_MERGE_SYMBOL` /
+   `_CREATE_SYMBOL`), edges at `3250-3398` (the `CREATE (...)` strings defined
+   `3108-3315`). 44 `conn.execute` call sites, almost all in per-row loops.
+   Measured ~7.8 ms/edge → ~250s for ~32k edges. kuzu (LadybugDB) is 1–2 orders
+   faster via `COPY FROM` bulk import.
+2. **`LayeredIgnore` rebuilt per file (a ~25s waste inside the vectors phase).**
+   `process_java_file` / `process_sql_file` / `process_yaml_file` in
+   `java_index_flow_lancedb.py` each construct `LayeredIgnore(project_root)`
+   once per file. On 1167 files that is ~25s of pure re-construction of an
+   object that is identical for the whole flow run.
+3. **Embeddings run on CPU by default (~28s) when MPS is available (~16s).**
+   `SBERT_DEVICE` is unset → the embedder defaults to CPU. On Apple Silicon MPS
+   is available and ~1.7x faster for `all-MiniLM-L6-v2`; the device resolution
+   never considers it.
+
+## Proposed Solution
+
+### PR-1 — Bulk `COPY FROM` for the full rebuild path (the big win)
+The full build assembles the entire graph in memory (`GraphTables`) and then
+writes it. That makes bulk insert a clean swap: instead of 44 per-row
+`conn.execute` loops, stage the assembled nodes and edges and load them with
+kuzu `COPY <table> FROM <source>`.
+
+- **Paths in scope:** the full rebuild used by `init` and `reprocess`
+  (`_write_nodes_impl(...)` callers at `build_ast_graph.py:824-825` and `:3103`,
+  plus the edge-emission block `3250-3398`).
+- **Paths NOT in scope for PR-1:** the incremental delete-then-emit path
+  (`_delete_file_scope`, `:673`, and the pass5/6 `Route` MERGE at `:3819-3821`).
+  Incremental touches a small scope (changed files + single-hop dependents), so
+  its per-row cost is low; converting it is a follow-up PR under this proposal.
+- **Mechanism:** stage node rows and each edge-type's rows to a bulk source, then
+  `COPY FROM`. Recommended source format is **Parquet** (see Open Question 1) —
+  `pyarrow` is already a transitive dependency via `lancedb`, and Parquet avoids
+  CSV-quoting hazards for Java FQNs / annotations / signatures.
+- **De-risk:** PR-1 begins with a ~10-line spike confirming `COPY FROM` passes
+  through the `ladybug` wrapper unchanged (Open Question 2), then proceeds to the
+  rewrite.
+- **Equivalence:** the rewritten build MUST produce a byte-for-byte equivalent
+  graph. An equivalence harness (see Tests) proves node/edge counts, GraphMeta
+  counters, and a battery of Cypher queries are identical between old and new.
+
+Expected: the ~312s graph-write phase → tens of seconds; overall `init` on this
+corpus from ~395s toward ~120s (projected; measured in PR-1).
+
+### PR-2 (follow-up) — Bulk write for the incremental path
+Refactor a shared stage→`COPY FROM` primitive out of PR-1 and apply it to the
+incremental `_delete_file_scope` → re-emit flow, preserving the pass5/6
+`MERGE (r:Route)` dedup semantics (`build_ast_graph.py:3819-3821`).
+
+### PR-3 — Cache `LayeredIgnore` as a cocoindex `ContextKey`
+Replace the three per-file `LayeredIgnore(project_root)` constructions with a
+single ignore instance built once per flow run, exposed via a cocoindex
+`ContextKey` (lifespan-scoped). The ignore decision per file is unchanged; only
+construction is hoisted. Keeps the cocoindex dependency inside
+`java_index_flow_lancedb.py` (AGENTS.md compliant). Expected ~25s → ~0s.
+
+### PR-4 — Default embedding device to MPS when available
+Extend the device resolution in `index_common.py` to `cuda → mps → cpu`
+(overridable via the existing `SBERT_DEVICE`). On Apple Silicon this cuts embed
+~28s → ~16s; on Linux servers / CI without MPS it falls back to CPU unchanged.
+Same model, same 384-dim embeddings — only the backend changes.
+
+## Scope
+- **PR-1:** replace per-row node/edge writes in the full rebuild path with
+  bulk `COPY FROM`; add equivalence harness + benchmark.
+- **PR-2:** shared bulk primitive applied to the incremental path.
+- **PR-3:** hoist `LayeredIgnore` to a flow-lifespan `ContextKey`.
+- **PR-4:** `cuda → mps → cpu` device default in `index_common.py`.
+- No new MCP tools, no new env vars (MPS reuses `SBERT_DEVICE`), no new public
+  surface.
+
+## Schema / Ontology / Re-index impact
+- **Ontology bump:** not required. No node/edge kinds, properties, or
+  enrichment semantics change. `ontology_version` stays 17.
+- **PR-1 / PR-2 re-index:** not required. The graph contents are identical
+  (proven by the equivalence harness); only the write mechanism changes. Users
+  pick up the faster path on their next `init` / `reprocess` / `increment`
+  naturally.
+- **PR-3 re-index:** not required. Same chunks, same vectors; only the ignore
+  check is faster.
+- **PR-4 re-index:** recommended (optional), not required. Switching the
+  default backend to MPS changes stored embeddings at the ~1e-5 level (different
+  kernel numerics); cosine ranking is stable, so existing CPU-built indexes keep
+  working, but a fresh `init` yields a single consistent backend. Needs a README
+  "Re-index recommended" callout.
+- **Config / tool surface:** none new.
+
+## Tests / Validation
+- **PR-1 equivalence harness (mandatory):** build the same source tree old-way
+  (per-row) and new-way (`COPY FROM`); assert identical: node count, per-type
+  edge counts, `GraphMeta` counters (via `java-codebase-rag meta` /
+  `GraphMetaOutput`), and a battery of representative Cypher queries
+  (`neighbors`, `find`, `describe`) return identical rows. Run on
+  `tests/bank-chat-system`, the call-graph smoke fixture, and one larger corpus.
+- **PR-1 benchmark:** capture `init` wall-clock before/after on the medium
+  corpus; report the graph-write phase delta.
+- **PR-2:** incremental equivalence — `increment` after a single-file change
+  yields the same graph as a full rebuild of that state (reuse the harness).
+- **PR-3:** assert the ignore object is constructed once per flow run (not per
+  file); existing flow tests unchanged; micro-benchmark confirms the ~25s drop.
+- **PR-4:** unit test that device resolution prefers mps when
+  `torch.backends.mps.is_available()` (monkeypatched), falls back to cpu
+  otherwise; embedding shape/dim unchanged.
+
+## Open Questions ([TBD])
+1. **Bulk source format** — Parquet vs CSV. Recommended: **Parquet** —
+   `pyarrow` is already present (transitive via `lancedb`), and it sidesteps
+   CSV quoting for Java FQNs / annotations / signatures. CSV is the simpler
+   fallback if Parquet proves awkward through the wrapper.
+2. **Does `COPY FROM` pass through the `ladybug` wrapper unchanged?** —
+   Recommended: confirm with a ~10-line spike as the first step of PR-1
+   (low-cost de-risk, folded into PR-1, not a separate spike PR). kuzu 0.11.3
+   supports `COPY FROM` natively; the only unknown is whether `ladybug`'s
+   `conn.execute` forwards it verbatim.
+3. **MPS-vs-CPU numerical drift (PR-4)** — re-index required or optional?
+   Recommended: **optional**; document in a README "Re-index recommended"
+   callout. Cosine ranking is stable across the ~1e-5 backend difference.
+4. **PR-3 cache vehicle** — cocoindex `ContextKey` vs a module-global?
+   Recommended: **`ContextKey`** (cocoindex-native, correct across multiple flow
+   runs / lifespans, keeps the dependency in the flow module).
+5. **Does PR-1 touch `increment`?** — No. Per agreed scope, `increment` keeps
+   its current per-row write until PR-2. PR-1 is init/reprocess only.
+
+## Out of scope
+- ANN vector index — parked (issue #337); query latency is fine today and ANN
+  would tax indexing.
+- `watch` live mode — issue #336.
+- Replacing or restructuring the cocoindex flow.
+- Changing the embedding model or dimension.
+- Parallelizing the graph analysis passes (pass1–pass6).
+- Converting the incremental write path in PR-1 (it is PR-2).
+
+## Sequencing / Follow-ups
+- **PR-1** — bulk `COPY FROM` for the full rebuild path + equivalence harness +
+  benchmark. Biggest win (~81% phase). Starts with the ladybug-pass-through
+  spike (Open Question 2).
+- **PR-2** — shared bulk primitive applied to the incremental path (preserve
+  Route-MERGE dedup).
+- **PR-3** — `LayeredIgnore` → flow-lifespan `ContextKey`.
+- **PR-4** — `cuda → mps → cpu` device default + README callout.
+- PR-3 and PR-4 are independent of PR-1/2 and of each other; they can land in
+  any order. PR-2 depends on PR-1's shared primitive.
+
+## PR body (proposal-only) template
+### What
+Adds `propose/active/INIT-INCREMENT-PERF-PROPOSE.md` describing the init /
+increment performance program: bulk `COPY FROM` graph writes (full path first),
+lifespan-cached `LayeredIgnore`, and an MPS embedding default.
+
+### Why now
+Profiling (2026-06-21) showed graph writes are ~81% of `init`; the three levers
+above are measured, independent, and unblock the project's stated init/increment
+latency pain.
+
+### Highlights
+- PR-1: bulk `COPY FROM` for the full rebuild path — projected ~312s graph write
+  → tens of seconds; `init` ~395s → ~120s on the profiled corpus.
+- PR-2: same primitive extended to the incremental path.
+- PR-3: hoist `LayeredIgnore` to a `ContextKey` — ~25s → ~0s.
+- PR-4: default embedding device `cuda → mps → cpu` — ~28s → ~16s on Apple Silicon.
+- No ontology bump; PR-1/2/3 re-index-free; PR-4 optional re-index callout.
+
+### Tests
+Proposal-only; baseline unchanged.
+
+### Out of scope
+- Implementation of any PR (PR-1…PR-4 follow).
+- ANN index (#337) and watch mode (#336).

From 0bfa5ac487ec0e825ac48ddca8aae9e100c00259 Mon Sep 17 00:00:00 2001
From: Dmitry Teryaev <doudmitry@gmail.com>
Date: Sun, 21 Jun 2026 22:14:10 +0300
Subject: [PATCH 2/2] docs(propose): apply review feedback to init/increment
 perf (drop PR-4, align format)

5-lens subagent review of the proposal found:
- PR-4 (MPS device default) was built on a false premise: the flow already
  auto-selects MPS (SBERT_DEVICE unset -> device=None -> cuda->mps->cpu), so
  the profiled init embedded on MPS (~16s), not CPU. Dropped; rationale moved
  to Out of scope.
- PR-1 mechanism corrected to in-memory pyarrow COPY FROM $param (not Parquet
  file); staging invariants made explicit (REL FROM/TO column rule, CALLS dedup
  + callee_declaring_role materialization at staging, node-before-edge order);
  atomicity note added.
- PR-3 broadened: also memoize is_ignored, not just hoist the constructor.
- Citations fixed: full-rebuild node writer is _write_nodes at :3096 (not the
  incremental MERGE path at 824-825); ~21 per-row sites in write fns (not 44);
  _CREATE_SYMBOL/_MERGE_SYMBOL at :3007-3026.

Also aligned the doc to the repo's current propose format (matches
LADYBUG-DB-MIGRATE-PROPOSE): natural-English H1, Scope with In/Out subsections,
no TL;DR, no PR-body-template section, no edit-history narration.

Co-Authored-By: Claude <noreply@anthropic.com>
---
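
Note on the PR-1 staging invariants: the REL FROM/TO column rule and the
"dedup at staging" step could be sketched as below. Everything here is an
assumption for illustration: `stage_rel_columns`, the `src`/`dst` key names,
and deduplicating on `(src, dst)` are hypothetical, and the real dedup key is
whatever `seen_calls` uses today. The returned dict of column lists is the
kind of mapping `pyarrow.table(...)` accepts, but the sketch deliberately
stays in plain Python and does not import pyarrow.

```python
def stage_rel_columns(rows, from_key, to_key, prop_keys, dedup_on=None):
    """Turn row dicts into ordered columns, FROM/TO first, dropping duplicates."""
    # Dict insertion order fixes the column order: FROM, TO, then properties.
    columns = {name: [] for name in (from_key, to_key, *prop_keys)}
    seen = set()
    for row in rows:
        if dedup_on is not None:
            identity = tuple(row[key] for key in dedup_on)
            if identity in seen:
                continue  # dedup happens here, before staging, not during the write
            seen.add(identity)
        for name, values in columns.items():
            values.append(row[name])
    return columns


# Hypothetical CALLS rows; the second one duplicates the first.
calls_rows = [
    {"src": "A.run", "dst": "B.save", "callee_declaring_role": "SERVICE", "source_file": "A.java"},
    {"src": "A.run", "dst": "B.save", "callee_declaring_role": "SERVICE", "source_file": "A.java"},
    {"src": "A.run", "dst": "C.load", "callee_declaring_role": "REPOSITORY", "source_file": "A.java"},
]
staged = stage_rel_columns(
    calls_rows,
    "src",
    "dst",
    ["callee_declaring_role", "source_file"],
    dedup_on=("src", "dst"),
)
```

Keeping column order in one helper makes it harder for a single edge type to
drift away from the FROM/TO-first shape that the REL load expects.
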
 propose/active/INIT-INCREMENT-PERF-PROPOSE.md | 233 +++++-------------
 1 file changed, 68 insertions(+), 165 deletions(-)

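Note on PR-3: the "build once, merge the spec once" idea could look roughly
like the sketch below. It is illustrative only: `IgnoreSketch` is a
hypothetical stand-in for `LayeredIgnore`, it uses plain prefix matching
rather than real ignore-file semantics, and it leaves out the cocoindex
`ContextKey` wiring because that API is not shown in this proposal.

```python
from functools import cached_property


class IgnoreSketch:
    """Hypothetical stand-in for LayeredIgnore with a one-time spec merge."""

    def __init__(self, layers):
        self.layers = layers
        self.merge_count = 0  # lets a test assert the merge ran exactly once

    @cached_property
    def merged_prefixes(self):
        # Stands in for the expensive per-file merge; runs on first access only.
        self.merge_count += 1
        return tuple(prefix for layer in self.layers for prefix in layer)

    def is_ignored(self, rel_path):
        # str.startswith accepts a tuple of candidate prefixes.
        return rel_path.startswith(self.merged_prefixes)


# One instance for the whole run, shared by every per-file check.
ignore = IgnoreSketch(layers=[["target/", "build/"], [".git/"]])
paths = ["src/Main.java", "target/classes/Main.class", ".git/config"]
kept = [path for path in paths if not ignore.is_ignored(path)]
```

The counter is there so a test can assert the merge ran once per run, which
is the property the PR-3 validation step asks for.
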
diff --git a/propose/active/INIT-INCREMENT-PERF-PROPOSE.md b/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
index 0dbb96b7..d71b240a 100644
--- a/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
+++ b/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
@@ -1,191 +1,94 @@
-# INIT-INCREMENT-PERF-PROPOSE
+# Faster init/increment — bulk graph writes + cached ignore
 
 ## Status
 Proposal — not yet implemented. Design-only; no production code in this PR.
-Scope agreed with maintainer: PR-1 rewrites the **full rebuild path only**
-(init / reprocess); the incremental path and the two smaller levers follow as
-separate PRs under this same proposal.
+
+Scope agreed with maintainer: PR-1 rewrites the **full rebuild path only** (init / reprocess); the incremental path and the ignore-cache fix follow as separate PRs under this same proposal.
 
 ## Problem Statement
-`init` / `increment` wall-clock is the project's stated pain point. A profiled
-`java-codebase-rag init` on a medium Java corpus (Shopizer: 1210 files → 1167
-indexed, 3879 chunks, ~32k graph edges, total **395s**) breaks down as:
+
+`init` / `increment` wall-clock is the project's stated pain point. A profiled `java-codebase-rag init` on a medium Java corpus (Shopizer: 1210 files → 1167 indexed, 3879 chunks, ~32k graph edges, total **395s**) breaks down as:
 
 | phase | time | share |
 |---|---|---|
-| **LadybugDB graph write** (edges ~250s + nodes ~62s + routes/meta ~10s) | **~322s** | **~81%** |
-| cocoindex vectors (embed ~28s + LayeredIgnore-per-file ~25s + parse/enrich ~3s + LanceDB ~12s) | ~68s | ~17% |
+| **LadybugDB graph write** (edges ~250s + nodes ~62s + routes ~4s write + passes ~5s analysis) | **~321s** | **~81%** |
+| cocoindex vectors (embed ~16s on MPS + LayeredIgnore-per-file ~25s + parse/enrich ~3s + LanceDB/orchestration residual) | ~68s | ~17% |
 | optimize | ~5s | ~1% |
 
-Three independent root causes surface:
-
-1. **Per-row graph writes (the ~81% lever).** `build_ast_graph.py` writes every
-   node and edge one statement at a time via `conn.execute(query, row)` inside
-   loops — nodes at `build_ast_graph.py:3046-3093` (`_MERGE_SYMBOL` /
-   `_CREATE_SYMBOL`), edges at `3250-3398` (the `CREATE (...)` strings defined
-   `3108-3315`). 44 `conn.execute` call sites, almost all in per-row loops.
-   Measured ~7.8 ms/edge → ~250s for ~32k edges. kuzu (LadybugDB) is 1–2 orders
-   faster via `COPY FROM` bulk import.
-2. **`LayeredIgnore` rebuilt per file (a ~25s waste inside the vectors phase).**
-   `process_java_file` / `process_sql_file` / `process_yaml_file` in
-   `java_index_flow_lancedb.py` each construct `LayeredIgnore(project_root)`
-   once per file. On 1167 files that is ~25s of pure re-construction of an
-   object that is identical for the whole flow run.
-3. **Embeddings run on CPU by default (~28s) when MPS is available (~16s).**
-   `SBERT_DEVICE` is unset → the embedder defaults to CPU. On Apple Silicon MPS
-   is available and ~1.7x faster for `all-MiniLM-L6-v2`; the device resolution
-   never considers it.
+Two independent root causes account for almost all of it:
+
+1. **Per-row graph writes (the ~81% lever).** `build_ast_graph.py` writes every node and edge one statement at a time via `conn.execute(query, row)` inside loops: ~21 per-row `conn.execute` sites across the full-rebuild write functions (44 is the file-wide total). Nodes go through `_write_nodes` (`build_ast_graph.py:3096`, impl `_write_nodes_impl:3029` using `_CREATE_SYMBOL`/`_MERGE_SYMBOL`, strings at `:3007-3026`), called from `write_ladybug:3893`; edges are emitted at `3250-3398`. Measured ~7.8 ms/edge → ~250s for ~32k edges. A micro-benchmark on the real Symbol schema measured a ~300× speedup: ~5.6 ms/row per-row vs ~0.018 ms/row with bulk `COPY FROM`.
+2. **`LayeredIgnore` rebuilt per file (a ~25s waste inside the vectors phase).** `process_java_file` / `process_sql_file` / `process_yaml_file` in `java_index_flow_lancedb.py` each construct `LayeredIgnore(project_root)` once per file *and* re-run `is_ignored`'s `_mega(spec)` merge per file. On 1167 files that is ~25s of work for an object + spec that are identical for the whole flow run.
+
+A third candidate — defaulting the embedding device to MPS — was investigated and rejected (see Out of scope): the flow already auto-selects MPS, so the profiled embed ran on MPS (~16s), not CPU.
 
 ## Proposed Solution
 
-### PR-1 — Bulk `COPY FROM` for the full rebuild path (the big win)
-The full build assembles the entire graph in memory (`GraphTables`) and then
-writes it. That makes bulk insert a clean swap: instead of 44 per-row
-`conn.execute` loops, stage the assembled nodes and edges and load them with
-kuzu `COPY <table> FROM <source>`.
-
-- **Paths in scope:** the full rebuild used by `init` and `reprocess`
-  (`_write_nodes_impl(...)` callers at `build_ast_graph.py:824-825` and `:3103`,
-  plus the edge-emission block `3250-3398`).
-- **Paths NOT in scope for PR-1:** the incremental delete-then-emit path
-  (`_delete_file_scope`, `:673`, and the pass5/6 `Route` MERGE at `:3819-3821`).
-  Incremental touches a small scope (changed files + single-hop dependents), so
-  its per-row cost is low; converting it is a follow-up PR under this proposal.
-- **Mechanism:** stage node rows and each edge-type's rows to a bulk source, then
-  `COPY FROM`. Recommended source format is **Parquet** (see Open Question 1) —
-  `pyarrow` is already a transitive dependency via `lancedb`, and Parquet avoids
-  CSV-quoting hazards for Java FQNs / annotations / signatures.
-- **De-risk:** PR-1 begins with a ~10-line spike confirming `COPY FROM` passes
-  through the `ladybug` wrapper unchanged (Open Question 2), then proceeds to the
-  rewrite.
-- **Equivalence:** the rewritten build MUST produce a byte-for-byte equivalent
-  graph. An equivalence harness (see Tests) proves node/edge counts, GraphMeta
-  counters, and a battery of Cypher queries are identical between old and new.
-
-Expected: the ~312s graph-write phase → tens of seconds; overall `init` on this
-corpus from ~395s toward ~120s (projected; measured in PR-1).
-
-### PR-2 (follow-up) — Bulk write for the incremental path
-Refactor a shared stage→`COPY FROM` primitive out of PR-1 and apply it to the
-incremental `_delete_file_scope` → re-emit flow, preserving the pass5/6
-`MERGE (r:Route)` dedup semantics (`build_ast_graph.py:3819-3821`).
-
-### PR-3 — Cache `LayeredIgnore` as a cocoindex `ContextKey`
-Replace the three per-file `LayeredIgnore(project_root)` constructions with a
-single ignore instance built once per flow run, exposed via a cocoindex
-`ContextKey` (lifespan-scoped). The ignore decision per file is unchanged; only
-construction is hoisted. Keeps the cocoindex dependency inside
-`java_index_flow_lancedb.py` (AGENTS.md compliant). Expected ~25s → ~0s.
-
-### PR-4 — Default embedding device to MPS when available
-Extend the device resolution in `index_common.py` to `cuda → mps → cpu`
-(overridable via the existing `SBERT_DEVICE`). On Apple Silicon this cuts embed
-~28s → ~16s; on Linux servers / CI without MPS it falls back to CPU unchanged.
-Same model, same 384-dim embeddings — only the backend changes.
+Three PRs. The graph write is by far the largest lever, so PR-1 is the priority; PR-2 and PR-3 are independent of each other and can land in any order.
+
+### PR-1 — Bulk `COPY FROM` for the full rebuild path
+
+The full build assembles the entire graph in memory (`GraphTables`, fully populated by pass1–pass6 before any `_write_*` call, `build_ast_graph.py:3914`) and then writes it. That makes bulk insert a clean swap: instead of per-row `conn.execute` loops, stage the assembled rows and load them with kuzu `COPY FROM`.
+
+**Mechanism — in-memory pyarrow.** The `ladybug` wrapper's first-class bulk path is `COPY <table> FROM $param` with an in-memory pyarrow table (`conn.execute` forwards `COPY FROM` verbatim and accepts a pyarrow param). Build the `pa.table` from the existing in-memory `*_rows` lists — zero disk I/O, native to the wrapper. Parquet-file staging is the fallback only.
+
+**Staging invariants (must hold for byte-equivalence):**
+- **REL-table column rule.** kuzu `COPY FROM` into a REL table requires the first two columns to be the FROM/TO node primary keys (the node `id`). The staging shape per table must match the `_SCHEMA_*` constants exactly.
+- **Materialize write-time work at staging.** Several values are computed *during* the current per-row writes and must be produced before staging: the CALLS dedup (`seen_calls`, `build_ast_graph.py:3282-3288`), the `callee_declaring_role` lookup, and the UNRESOLVED dedup (`seen_ucs`, `3317-3321`). Apply these to the in-memory lists, then stage the result.
+- **Node-before-edge ordering.** Stage and load all node tables before REL tables (kuzu enforces endpoint existence). The current `_write_*` call order already does this; preserve it.
+
+**Atomicity.** The current per-row path is not atomic (per-statement autocommit; a crash mid-build leaves a partial graph). `COPY FROM` raises this to per-table atomicity — an improvement, not a regression; the overall "rebuild in place" crash-safety story is unchanged.
+
+**Equivalence.** The rewritten build must produce a byte-equivalent graph, proven by an equivalence harness (see Tests): node/edge counts, GraphMeta counters, full edge property rows (incl. `source_file`, CALLS `callee_declaring_role`), and a battery of Cypher queries identical between old and new.
+
+Expected: ~321s graph-write phase → tens of seconds; overall `init` on this corpus from ~395s toward ~120s (projected; measured in PR-1).
+
+### PR-2 — Bulk write for the incremental path (follow-up)
+
+Refactor a shared stage→`COPY FROM` primitive out of PR-1 and apply it to the incremental `_delete_file_scope` → re-emit flow, preserving the pass5/6 `MERGE (r:Route)` dedup semantics (`build_ast_graph.py:3819-3821`).
+
+### PR-3 — Cached `LayeredIgnore` (+ `is_ignored` memo) as a `ContextKey`
+
+Hoist the three per-file `LayeredIgnore(project_root)` constructions (`java_index_flow_lancedb.py:351/423/471`) into a single instance built once per flow run via a cocoindex `ContextKey` (lifespan-scoped — `PROJECT_ROOT`, `EMBEDDER`, `LANCE_DB` are already `ContextKey`s in `coco_lifespan`, `:60-72`/`:287-306`). Additionally memoize `is_ignored`'s `_mega(spec)` merge (cache the merged spec, or LRU by relative path) — the per-file cost is in `is_ignored`, not just the constructor, so construction hoisting alone will not reach ~0s. Keeps the cocoindex dependency inside `java_index_flow_lancedb.py` (AGENTS.md compliant). Expected ~25s → ~0s.
 
 ## Scope
-- **PR-1:** replace per-row node/edge writes in the full rebuild path with
-  bulk `COPY FROM`; add equivalence harness + benchmark.
-- **PR-2:** shared bulk primitive applied to the incremental path.
-- **PR-3:** hoist `LayeredIgnore` to a flow-lifespan `ContextKey`.
-- **PR-4:** `cuda → mps → cpu` device default in `index_common.py`.
-- No new MCP tools, no new env vars (MPS reuses `SBERT_DEVICE`), no new public
-  surface.
 
-## Schema / Ontology / Re-index impact
-- **Ontology bump:** not required. No node/edge kinds, properties, or
-  enrichment semantics change. `ontology_version` stays 17.
-- **PR-1 / PR-2 re-index:** not required. The graph contents are identical
-  (proven by the equivalence harness); only the write mechanism changes. Users
-  pick up the faster path on their next `init` / `reprocess` / `increment`
-  naturally.
-- **PR-3 re-index:** not required. Same chunks, same vectors; only the ignore
-  check is faster.
-- **PR-4 re-index:** recommended (optional), not required. Switching the
-  default backend to MPS changes stored embeddings at the ~1e-5 level (different
-  kernel numerics); cosine ranking is stable, so existing CPU-built indexes keep
-  working, but a fresh `init` yields a single consistent backend. Needs a README
-  "Re-index recommended" callout.
-- **Config / tool surface:** none new.
+### In scope
 
-## Tests / Validation
-- **PR-1 equivalence harness (mandatory):** build the same source tree old-way
-  (per-row) and new-way (`COPY FROM`); assert identical: node count, per-type
-  edge counts, `GraphMeta` counters (via `java-codebase-rag meta` /
-  `GraphMetaOutput`), and a battery of representative Cypher queries
-  (`neighbors`, `find`, `describe`) return identical rows. Run on
-  `tests/bank-chat-system`, the call-graph smoke fixture, and one larger corpus.
-- **PR-1 benchmark:** capture `init` wall-clock before/after on the medium
-  corpus; report the graph-write phase delta.
-- **PR-2:** incremental equivalence — `increment` after a single-file change
-  yields the same graph as a full rebuild of that state (reuse the harness).
-- **PR-3:** assert the ignore object is constructed once per flow run (not per
-  file); existing flow tests unchanged; micro-benchmark confirms the ~25s drop.
-- **PR-4:** unit test that device resolution prefers mps when
-  `torch.backends.mps.is_available()` (monkeypatched), falls back to cpu
-  otherwise; embedding shape/dim unchanged.
+- **PR-1:** replace per-row node/edge writes in the full rebuild path (`write_ladybug:3893` → `_write_nodes:3096` → `_write_nodes_impl:3029`; edge emit `3250-3398`) with bulk in-memory-pyarrow `COPY FROM`; add equivalence harness + benchmark.
+- **PR-2:** shared bulk primitive applied to the incremental path (preserve Route-MERGE dedup).
+- **PR-3:** hoist `LayeredIgnore` to a flow-lifespan `ContextKey` and memoize `is_ignored`.
 
-## Open Questions ([TBD])
-1. **Bulk source format** — Parquet vs CSV. Recommended: **Parquet** —
-   `pyarrow` is already present (transitive via `lancedb`), and it sidesteps
-   CSV quoting for Java FQNs / annotations / signatures. CSV is the simpler
-   fallback if Parquet proves awkward through the wrapper.
-2. **Does `COPY FROM` pass through the `ladybug` wrapper unchanged?** —
-   Recommended: confirm with a ~10-line spike as the first step of PR-1
-   (low-cost de-risk, folded into PR-1, not a separate spike PR). kuzu 0.11.3
-   supports `COPY FROM` natively; the only unknown is whether `ladybug`'s
-   `conn.execute` forwards it verbatim.
-3. **MPS-vs-CPU numerical drift (PR-4)** — re-index required or optional?
-   Recommended: **optional**; document in a README "Re-index recommended"
-   callout. Cosine ranking is stable across the ~1e-5 backend difference.
-4. **PR-3 cache vehicle** — cocoindex `ContextKey` vs a module-global?
-   Recommended: **`ContextKey`** (cocoindex-native, correct across multiple flow
-   runs / lifespans, keeps the dependency in the flow module).
-5. **Does PR-1 touch `increment`?** — No. Per agreed scope, `increment` keeps
-   its current per-row write until PR-2. PR-1 is init/reprocess only.
-
-## Out of scope
-- ANN vector index — parked (issue #337); query latency is fine today and ANN
-  would tax indexing.
+### Out of scope
+
+- **MPS embedding default — not needed.** The init flow already auto-selects MPS on Apple Silicon: `SBERT_DEVICE` unset → `config.py:280` omits it from the subprocess env → flow constructs `SentenceTransformerEmbedder(device=None)` (`java_index_flow_lancedb.py:291`) → `SentenceTransformer(device=None)` auto-detects `cuda → mps → cpu`. On this machine `torch.backends.mps.is_available()` is true, so the profiled init ran on MPS (~16s embed). There is no CPU→MPS win to recover; an override already exists (`SBERT_DEVICE` / `--embedding-device` / YAML `embedding.device`).
+- ANN vector index — parked (issue #337); query latency is fine today and ANN would tax indexing.
 - `watch` live mode — issue #336.
 - Replacing or restructuring the cocoindex flow.
 - Changing the embedding model or dimension.
 - Parallelizing the graph analysis passes (pass1–pass6).
 - Converting the incremental write path in PR-1 (it is PR-2).
 
+## Schema / Ontology / Re-index impact
+
+- **Ontology bump:** not required. No node/edge kinds, properties, or enrichment semantics change. `ontology_version` stays 17.
+- **Re-index required:** no. PR-1/2 change only the graph write mechanism, and the equivalence harness is there to confirm that graph contents stay identical; PR-3 changes only a cache, so chunks and vectors are unchanged. Users pick up the faster path on their next `init` / `reprocess` / `increment` naturally, with no migration.
+- **Config / tool surface:** none new.
+
+## Tests / Validation
+
+1. **PR-1 equivalence harness (mandatory).** Build the same source tree old-way (per-row) and new-way (`COPY FROM`); assert identical: node count, per-type edge counts, `GraphMeta` counters (via `java-codebase-rag meta` / `GraphMetaOutput`), full property rows for a sample of N edges per type (including `source_file` and CALLS `callee_declaring_role` — proving the staging dedup/materialization is correct), and a battery of representative Cypher queries (`neighbors`, `find`, `describe`) returning identical rows. Run on `tests/bank-chat-system`, the call-graph smoke fixture, and one larger corpus.
+2. **PR-1 benchmark.** Capture `init` wall-clock before/after on the medium corpus; report the graph-write phase delta.
+3. **PR-2 incremental equivalence.** `increment` after a single-file change yields the same graph as a full rebuild of that state (reuse the harness).
+4. **PR-3.** Assert the ignore object is constructed once per flow run (not per file) and `is_ignored` is memoized; existing flow tests unchanged; micro-benchmark confirms the ~25s drop.
+
+## Open Questions ([TBD])
+
+1. Should the `GraphMeta` single-row MERGE (`build_ast_graph.py:3472-3473`) also move to bulk in PR-1, or stay per-row? — Recommended: **fold it into PR-1** (it is in the full-rebuild path; one extra small staging set).
+2. PR-3 cache vehicle — cocoindex `ContextKey` vs a module-global? — Recommended: **`ContextKey`** (cocoindex-native, lifespan-scoped, keeps the dependency in the flow module).
+
 ## Sequencing / Follow-ups
-- **PR-1** — bulk `COPY FROM` for the full rebuild path + equivalence harness +
-  benchmark. Biggest win (~81% phase). Starts with the ladybug-pass-through
-  spike (Open Question 2).
-- **PR-2** — shared bulk primitive applied to the incremental path (preserve
-  Route-MERGE dedup).
-- **PR-3** — `LayeredIgnore` → flow-lifespan `ContextKey`.
-- **PR-4** — `cuda → mps → cpu` device default + README callout.
-- PR-3 and PR-4 are independent of PR-1/2 and of each other; they can land in
-  any order. PR-2 depends on PR-1's shared primitive.
-
-## PR body (proposal-only) template
-### What
-Adds `propose/active/INIT-INCREMENT-PERF-PROPOSE.md` describing the init /
-increment performance program: bulk `COPY FROM` graph writes (full path first),
-lifespan-cached `LayeredIgnore`, and an MPS embedding default.
-
-### Why now
-Profiling (2026-06-21) showed graph writes are ~81% of `init`; the three levers
-above are measured, independent, and unblock the project's stated init/increment
-latency pain.
-
-### Highlights
-- PR-1: bulk `COPY FROM` for the full rebuild path — projected ~312s graph write
-  → tens of seconds; `init` ~395s → ~120s on the profiled corpus.
-- PR-2: same primitive extended to the incremental path.
-- PR-3: hoist `LayeredIgnore` to a `ContextKey` — ~25s → ~0s.
-- PR-4: default embedding device `cuda → mps → cpu` — ~28s → ~16s on Apple Silicon.
-- No ontology bump; PR-1/2/3 re-index-free; PR-4 optional re-index callout.
-
-### Tests
-Proposal-only; baseline unchanged.
 
-### Out of scope
-- Implementation of any PR (PR-1…PR-4 follow).
-- ANN index (#337) and watch mode (#336).
+- **PR-1** — bulk in-memory-pyarrow `COPY FROM` for the full rebuild path + equivalence harness + benchmark. Biggest win (~81% phase).
+- **PR-2** — shared bulk primitive applied to the incremental path. Depends on PR-1's primitive.
+- **PR-3** — `LayeredIgnore` + `is_ignored` → flow-lifespan `ContextKey`. Independent of PR-1/2.