From f727f1512f5015691a52070d9004e48d52917517 Mon Sep 17 00:00:00 2001
From: Dmitry Teryaev
Date: Sun, 21 Jun 2026 21:35:26 +0300
Subject: [PATCH 1/2] docs(propose): init/increment perf program (bulk graph writes, cached ignore, mps)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Proposal-only. Profiles init at ~395s on a medium Java corpus and sequences three measured, independent levers as four PRs:
- PR-1: bulk COPY FROM for the full rebuild path (init/reprocess) — the ~81% graph-write lever; init projected ~395s -> ~120s.
- PR-2: same primitive extended to the incremental path.
- PR-3: hoist LayeredIgnore to a flow-lifespan ContextKey — ~25s -> ~0s.
- PR-4: default embedding device cuda -> mps -> cpu — ~28s -> ~16s on Apple Silicon.

No ontology bump; PR-1/2/3 re-index-free; PR-4 optional re-index callout.
ANN index (parked, #337) and watch mode (#336) explicitly out of scope.

Co-Authored-By: Claude
---
 propose/active/INIT-INCREMENT-PERF-PROPOSE.md | 191 ++++++++++++++++++
 1 file changed, 191 insertions(+)
 create mode 100644 propose/active/INIT-INCREMENT-PERF-PROPOSE.md

diff --git a/propose/active/INIT-INCREMENT-PERF-PROPOSE.md b/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
new file mode 100644
index 00000000..0dbb96b7
--- /dev/null
+++ b/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
@@ -0,0 +1,191 @@
+# INIT-INCREMENT-PERF-PROPOSE
+
+## Status
+Proposal — not yet implemented. Design-only; no production code in this PR.
+Scope agreed with maintainer: PR-1 rewrites the **full rebuild path only**
+(init / reprocess); the incremental path and the two smaller levers follow as
+separate PRs under this same proposal.
+
+## Problem Statement
+`init` / `increment` wall-clock is the project's stated pain point.
A profiled +`java-codebase-rag init` on a medium Java corpus (Shopizer: 1210 files → 1167 +indexed, 3879 chunks, ~32k graph edges, total **395s**) breaks down as: + +| phase | time | share | +|---|---|---| +| **LadybugDB graph write** (edges ~250s + nodes ~62s + routes/meta ~10s) | **~322s** | **~81%** | +| cocoindex vectors (embed ~28s + LayeredIgnore-per-file ~25s + parse/enrich ~3s + LanceDB ~12s) | ~68s | ~17% | +| optimize | ~5s | ~1% | + +Three independent root causes surface: + +1. **Per-row graph writes (the ~81% lever).** `build_ast_graph.py` writes every + node and edge one statement at a time via `conn.execute(query, row)` inside + loops — nodes at `build_ast_graph.py:3046-3093` (`_MERGE_SYMBOL` / + `_CREATE_SYMBOL`), edges at `3250-3398` (the `CREATE (...)` strings defined + `3108-3315`). 44 `conn.execute` call sites, almost all in per-row loops. + Measured ~7.8 ms/edge → ~250s for ~32k edges. kuzu (LadybugDB) is 1–2 orders + faster via `COPY FROM` bulk import. +2. **`LayeredIgnore` rebuilt per file (a ~25s waste inside the vectors phase).** + `process_java_file` / `process_sql_file` / `process_yaml_file` in + `java_index_flow_lancedb.py` each construct `LayeredIgnore(project_root)` + once per file. On 1167 files that is ~25s of pure re-construction of an + object that is identical for the whole flow run. +3. **Embeddings run on CPU by default (~28s) when MPS is available (~16s).** + `SBERT_DEVICE` is unset → the embedder defaults to CPU. On Apple Silicon MPS + is available and ~1.7x faster for `all-MiniLM-L6-v2`; the device resolution + never considers it. + +## Proposed Solution + +### PR-1 — Bulk `COPY FROM` for the full rebuild path (the big win) +The full build assembles the entire graph in memory (`GraphTables`) and then +writes it. That makes bulk insert a clean swap: instead of 44 per-row +`conn.execute` loops, stage the assembled nodes and edges and load them with +kuzu `COPY FROM `. 
+ +- **Paths in scope:** the full rebuild used by `init` and `reprocess` + (`_write_nodes_impl(...)` callers at `build_ast_graph.py:824-825` and `:3103`, + plus the edge-emission block `3250-3398`). +- **Paths NOT in scope for PR-1:** the incremental delete-then-emit path + (`_delete_file_scope`, `:673`, and the pass5/6 `Route` MERGE at `:3819-3821`). + Incremental touches a small scope (changed files + single-hop dependents), so + its per-row cost is low; converting it is a follow-up PR under this proposal. +- **Mechanism:** stage node rows and each edge-type's rows to a bulk source, then + `COPY FROM`. Recommended source format is **Parquet** (see Open Question 1) — + `pyarrow` is already a transitive dependency via `lancedb`, and Parquet avoids + CSV-quoting hazards for Java FQNs / annotations / signatures. +- **De-risk:** PR-1 begins with a ~10-line spike confirming `COPY FROM` passes + through the `ladybug` wrapper unchanged (Open Question 2), then proceeds to the + rewrite. +- **Equivalence:** the rewritten build MUST produce a byte-for-byte equivalent + graph. An equivalence harness (see Tests) proves node/edge counts, GraphMeta + counters, and a battery of Cypher queries are identical between old and new. + +Expected: the ~312s graph-write phase → tens of seconds; overall `init` on this +corpus from ~395s toward ~120s (projected; measured in PR-1). + +### PR-2 (follow-up) — Bulk write for the incremental path +Refactor a shared stage→`COPY FROM` primitive out of PR-1 and apply it to the +incremental `_delete_file_scope` → re-emit flow, preserving the pass5/6 +`MERGE (r:Route)` dedup semantics (`build_ast_graph.py:3819-3821`). + +### PR-3 — Cache `LayeredIgnore` as a cocoindex `ContextKey` +Replace the three per-file `LayeredIgnore(project_root)` constructions with a +single ignore instance built once per flow run, exposed via a cocoindex +`ContextKey` (lifespan-scoped). The ignore decision per file is unchanged; only +construction is hoisted. 
Keeps the cocoindex dependency inside +`java_index_flow_lancedb.py` (AGENTS.md compliant). Expected ~25s → ~0s. + +### PR-4 — Default embedding device to MPS when available +Extend the device resolution in `index_common.py` to `cuda → mps → cpu` +(overridable via the existing `SBERT_DEVICE`). On Apple Silicon this cuts embed +~28s → ~16s; on Linux servers / CI without MPS it falls back to CPU unchanged. +Same model, same 384-dim embeddings — only the backend changes. + +## Scope +- **PR-1:** replace per-row node/edge writes in the full rebuild path with + bulk `COPY FROM`; add equivalence harness + benchmark. +- **PR-2:** shared bulk primitive applied to the incremental path. +- **PR-3:** hoist `LayeredIgnore` to a flow-lifespan `ContextKey`. +- **PR-4:** `cuda → mps → cpu` device default in `index_common.py`. +- No new MCP tools, no new env vars (MPS reuses `SBERT_DEVICE`), no new public + surface. + +## Schema / Ontology / Re-index impact +- **Ontology bump:** not required. No node/edge kinds, properties, or + enrichment semantics change. `ontology_version` stays 17. +- **PR-1 / PR-2 re-index:** not required. The graph contents are identical + (proven by the equivalence harness); only the write mechanism changes. Users + pick up the faster path on their next `init` / `reprocess` / `increment` + naturally. +- **PR-3 re-index:** not required. Same chunks, same vectors; only the ignore + check is faster. +- **PR-4 re-index:** recommended (optional), not required. Switching the + default backend to MPS changes stored embeddings at the ~1e-5 level (different + kernel numerics); cosine ranking is stable, so existing CPU-built indexes keep + working, but a fresh `init` yields a single consistent backend. Needs a README + "Re-index recommended" callout. +- **Config / tool surface:** none new. 
+ +## Tests / Validation +- **PR-1 equivalence harness (mandatory):** build the same source tree old-way + (per-row) and new-way (`COPY FROM`); assert identical: node count, per-type + edge counts, `GraphMeta` counters (via `java-codebase-rag meta` / + `GraphMetaOutput`), and a battery of representative Cypher queries + (`neighbors`, `find`, `describe`) return identical rows. Run on + `tests/bank-chat-system`, the call-graph smoke fixture, and one larger corpus. +- **PR-1 benchmark:** capture `init` wall-clock before/after on the medium + corpus; report the graph-write phase delta. +- **PR-2:** incremental equivalence — `increment` after a single-file change + yields the same graph as a full rebuild of that state (reuse the harness). +- **PR-3:** assert the ignore object is constructed once per flow run (not per + file); existing flow tests unchanged; micro-benchmark confirms the ~25s drop. +- **PR-4:** unit test that device resolution prefers mps when + `torch.backends.mps.is_available()` (monkeypatched), falls back to cpu + otherwise; embedding shape/dim unchanged. + +## Open Questions ([TBD]) +1. **Bulk source format** — Parquet vs CSV. Recommended: **Parquet** — + `pyarrow` is already present (transitive via `lancedb`), and it sidesteps + CSV quoting for Java FQNs / annotations / signatures. CSV is the simpler + fallback if Parquet proves awkward through the wrapper. +2. **Does `COPY FROM` pass through the `ladybug` wrapper unchanged?** — + Recommended: confirm with a ~10-line spike as the first step of PR-1 + (low-cost de-risk, folded into PR-1, not a separate spike PR). kuzu 0.11.3 + supports `COPY FROM` natively; the only unknown is whether `ladybug`'s + `conn.execute` forwards it verbatim. +3. **MPS-vs-CPU numerical drift (PR-4)** — re-index required or optional? + Recommended: **optional**; document in a README "Re-index recommended" + callout. Cosine ranking is stable across the ~1e-5 backend difference. +4. 
**PR-3 cache vehicle** — cocoindex `ContextKey` vs a module-global? + Recommended: **`ContextKey`** (cocoindex-native, correct across multiple flow + runs / lifespans, keeps the dependency in the flow module). +5. **Does PR-1 touch `increment`?** — No. Per agreed scope, `increment` keeps + its current per-row write until PR-2. PR-1 is init/reprocess only. + +## Out of scope +- ANN vector index — parked (issue #337); query latency is fine today and ANN + would tax indexing. +- `watch` live mode — issue #336. +- Replacing or restructuring the cocoindex flow. +- Changing the embedding model or dimension. +- Parallelizing the graph analysis passes (pass1–pass6). +- Converting the incremental write path in PR-1 (it is PR-2). + +## Sequencing / Follow-ups +- **PR-1** — bulk `COPY FROM` for the full rebuild path + equivalence harness + + benchmark. Biggest win (~81% phase). Starts with the ladybug-pass-through + spike (Open Question 2). +- **PR-2** — shared bulk primitive applied to the incremental path (preserve + Route-MERGE dedup). +- **PR-3** — `LayeredIgnore` → flow-lifespan `ContextKey`. +- **PR-4** — `cuda → mps → cpu` device default + README callout. +- PR-3 and PR-4 are independent of PR-1/2 and of each other; they can land in + any order. PR-2 depends on PR-1's shared primitive. + +## PR body (proposal-only) template +### What +Adds `propose/active/INIT-INCREMENT-PERF-PROPOSE.md` describing the init / +increment performance program: bulk `COPY FROM` graph writes (full path first), +lifespan-cached `LayeredIgnore`, and an MPS embedding default. + +### Why now +Profiling (2026-06-21) showed graph writes are ~81% of `init`; the three levers +above are measured, independent, and unblock the project's stated init/increment +latency pain. + +### Highlights +- PR-1: bulk `COPY FROM` for the full rebuild path — projected ~312s graph write + → tens of seconds; `init` ~395s → ~120s on the profiled corpus. +- PR-2: same primitive extended to the incremental path. 
+- PR-3: hoist `LayeredIgnore` to a `ContextKey` — ~25s → ~0s.
+- PR-4: default embedding device `cuda → mps → cpu` — ~28s → ~16s on Apple Silicon.
+- No ontology bump; PR-1/2/3 re-index-free; PR-4 optional re-index callout.
+
+### Tests
+Proposal-only; baseline unchanged.
+
+### Out of scope
+- Implementation of any PR (PR-1…PR-4 follow).
+- ANN index (#337) and watch mode (#336).

From 0bfa5ac487ec0e825ac48ddca8aae9e100c00259 Mon Sep 17 00:00:00 2001
From: Dmitry Teryaev
Date: Sun, 21 Jun 2026 22:14:10 +0300
Subject: [PATCH 2/2] docs(propose): apply review feedback to init/increment perf (drop PR-4, align format)

5-lens subagent review of the proposal found:
- PR-4 (MPS device default) was built on a false premise: the flow already auto-selects MPS (SBERT_DEVICE unset -> device=None -> cuda->mps->cpu), so the profiled init embedded on MPS (~16s), not CPU. Dropped; rationale moved to Out of scope.
- PR-1 mechanism corrected to in-memory pyarrow COPY FROM $param (not Parquet file); staging invariants made explicit (REL FROM/TO column rule, CALLS dedup + callee_declaring_role materialization at staging, node-before-edge order); atomicity note added.
- PR-3 broadened: also memoize is_ignored, not just hoist the constructor.
- Citations fixed: full-rebuild node writer is _write_nodes at :3096 (not the incremental MERGE path at 824-825); ~21 per-row sites in write fns (not 44); _CREATE_SYMBOL/_MERGE_SYMBOL at :3007-3026.

Also aligned the doc to the repo's current propose format (matches LADYBUG-DB-MIGRATE-PROPOSE): natural-English H1, Scope with In/Out subsections, no TL;DR, no PR-body-template section, no edit-history narration.

Co-Authored-By: Claude
---
 propose/active/INIT-INCREMENT-PERF-PROPOSE.md | 233 +++++------------------
 1 file changed, 68 insertions(+), 165 deletions(-)

diff --git a/propose/active/INIT-INCREMENT-PERF-PROPOSE.md b/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
index 0dbb96b7..d71b240a 100644
--- a/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
+++ b/propose/active/INIT-INCREMENT-PERF-PROPOSE.md
@@ -1,191 +1,94 @@
-# INIT-INCREMENT-PERF-PROPOSE
+# Faster init/increment — bulk graph writes + cached ignore
 
 ## Status
 Proposal — not yet implemented. Design-only; no production code in this PR.
-Scope agreed with maintainer: PR-1 rewrites the **full rebuild path only**
-(init / reprocess); the incremental path and the two smaller levers follow as
-separate PRs under this same proposal.
+
+Scope agreed with maintainer: PR-1 rewrites the **full rebuild path only** (init / reprocess); the incremental path and the ignore-cache fix follow as separate PRs under this same proposal.
 
 ## Problem Statement
-`init` / `increment` wall-clock is the project's stated pain point. A profiled
-`java-codebase-rag init` on a medium Java corpus (Shopizer: 1210 files → 1167
-indexed, 3879 chunks, ~32k graph edges, total **395s**) breaks down as:
+
+`init` / `increment` wall-clock is the project's stated pain point.
A profiled `java-codebase-rag init` on a medium Java corpus (Shopizer: 1210 files → 1167 indexed, 3879 chunks, ~32k graph edges, total **395s**) breaks down as: | phase | time | share | |---|---|---| -| **LadybugDB graph write** (edges ~250s + nodes ~62s + routes/meta ~10s) | **~322s** | **~81%** | -| cocoindex vectors (embed ~28s + LayeredIgnore-per-file ~25s + parse/enrich ~3s + LanceDB ~12s) | ~68s | ~17% | +| **LadybugDB graph write** (edges ~250s + nodes ~62s + routes ~4s write + passes ~5s analysis) | **~321s** | **~81%** | +| cocoindex vectors (embed ~16s on MPS + LayeredIgnore-per-file ~25s + parse/enrich ~3s + LanceDB/orchestration residual) | ~68s | ~17% | | optimize | ~5s | ~1% | -Three independent root causes surface: - -1. **Per-row graph writes (the ~81% lever).** `build_ast_graph.py` writes every - node and edge one statement at a time via `conn.execute(query, row)` inside - loops — nodes at `build_ast_graph.py:3046-3093` (`_MERGE_SYMBOL` / - `_CREATE_SYMBOL`), edges at `3250-3398` (the `CREATE (...)` strings defined - `3108-3315`). 44 `conn.execute` call sites, almost all in per-row loops. - Measured ~7.8 ms/edge → ~250s for ~32k edges. kuzu (LadybugDB) is 1–2 orders - faster via `COPY FROM` bulk import. -2. **`LayeredIgnore` rebuilt per file (a ~25s waste inside the vectors phase).** - `process_java_file` / `process_sql_file` / `process_yaml_file` in - `java_index_flow_lancedb.py` each construct `LayeredIgnore(project_root)` - once per file. On 1167 files that is ~25s of pure re-construction of an - object that is identical for the whole flow run. -3. **Embeddings run on CPU by default (~28s) when MPS is available (~16s).** - `SBERT_DEVICE` is unset → the embedder defaults to CPU. On Apple Silicon MPS - is available and ~1.7x faster for `all-MiniLM-L6-v2`; the device resolution - never considers it. +Two independent root causes account for almost all of it: + +1. 
**Per-row graph writes (the ~81% lever).** `build_ast_graph.py` writes every node and edge one statement at a time via `conn.execute(query, row)` inside loops — ~21 per-row `conn.execute` sites across the full-rebuild write functions (44 is the file-wide total). Nodes via `_write_nodes` (`build_ast_graph.py:3096`, impl `_write_nodes_impl:3029` using `_CREATE_SYMBOL`/`_MERGE_SYMBOL`, strings at `:3007-3026`), called from `write_ladybug:3893`; edges at `3250-3398`. Measured ~7.8 ms/edge → ~250s for ~32k edges. A micro-benchmark on the real Symbol schema measured ~300×: ~5.6 ms/row per-row vs ~0.018 ms/row bulk `COPY FROM`. +2. **`LayeredIgnore` rebuilt per file (a ~25s waste inside the vectors phase).** `process_java_file` / `process_sql_file` / `process_yaml_file` in `java_index_flow_lancedb.py` each construct `LayeredIgnore(project_root)` once per file *and* re-run `is_ignored`'s `_mega(spec)` merge per file. On 1167 files that is ~25s of work for an object + spec that are identical for the whole flow run. + +A third candidate — defaulting the embedding device to MPS — was investigated and rejected (see Out of scope): the flow already auto-selects MPS, so the profiled embed ran on MPS (~16s), not CPU. ## Proposed Solution -### PR-1 — Bulk `COPY FROM` for the full rebuild path (the big win) -The full build assembles the entire graph in memory (`GraphTables`) and then -writes it. That makes bulk insert a clean swap: instead of 44 per-row -`conn.execute` loops, stage the assembled nodes and edges and load them with -kuzu `COPY
FROM `. - -- **Paths in scope:** the full rebuild used by `init` and `reprocess` - (`_write_nodes_impl(...)` callers at `build_ast_graph.py:824-825` and `:3103`, - plus the edge-emission block `3250-3398`). -- **Paths NOT in scope for PR-1:** the incremental delete-then-emit path - (`_delete_file_scope`, `:673`, and the pass5/6 `Route` MERGE at `:3819-3821`). - Incremental touches a small scope (changed files + single-hop dependents), so - its per-row cost is low; converting it is a follow-up PR under this proposal. -- **Mechanism:** stage node rows and each edge-type's rows to a bulk source, then - `COPY FROM`. Recommended source format is **Parquet** (see Open Question 1) — - `pyarrow` is already a transitive dependency via `lancedb`, and Parquet avoids - CSV-quoting hazards for Java FQNs / annotations / signatures. -- **De-risk:** PR-1 begins with a ~10-line spike confirming `COPY FROM` passes - through the `ladybug` wrapper unchanged (Open Question 2), then proceeds to the - rewrite. -- **Equivalence:** the rewritten build MUST produce a byte-for-byte equivalent - graph. An equivalence harness (see Tests) proves node/edge counts, GraphMeta - counters, and a battery of Cypher queries are identical between old and new. - -Expected: the ~312s graph-write phase → tens of seconds; overall `init` on this -corpus from ~395s toward ~120s (projected; measured in PR-1). - -### PR-2 (follow-up) — Bulk write for the incremental path -Refactor a shared stage→`COPY FROM` primitive out of PR-1 and apply it to the -incremental `_delete_file_scope` → re-emit flow, preserving the pass5/6 -`MERGE (r:Route)` dedup semantics (`build_ast_graph.py:3819-3821`). - -### PR-3 — Cache `LayeredIgnore` as a cocoindex `ContextKey` -Replace the three per-file `LayeredIgnore(project_root)` constructions with a -single ignore instance built once per flow run, exposed via a cocoindex -`ContextKey` (lifespan-scoped). The ignore decision per file is unchanged; only -construction is hoisted. 
Keeps the cocoindex dependency inside -`java_index_flow_lancedb.py` (AGENTS.md compliant). Expected ~25s → ~0s. - -### PR-4 — Default embedding device to MPS when available -Extend the device resolution in `index_common.py` to `cuda → mps → cpu` -(overridable via the existing `SBERT_DEVICE`). On Apple Silicon this cuts embed -~28s → ~16s; on Linux servers / CI without MPS it falls back to CPU unchanged. -Same model, same 384-dim embeddings — only the backend changes. +Three PRs. The graph write is by far the largest lever, so PR-1 is the priority; PR-2 and PR-3 are independent of each other and can land in any order. + +### PR-1 — Bulk `COPY FROM` for the full rebuild path + +The full build assembles the entire graph in memory (`GraphTables`, fully populated by pass1–pass6 before any `_write_*` call, `build_ast_graph.py:3914`) and then writes it. That makes bulk insert a clean swap: instead of per-row `conn.execute` loops, stage the assembled rows and load them with kuzu `COPY FROM`. + +**Mechanism — in-memory pyarrow.** The `ladybug` wrapper's first-class bulk path is `COPY
FROM $param` with an in-memory pyarrow table (`conn.execute` forwards `COPY FROM` verbatim and accepts a pyarrow param). Build the `pa.table` from the existing in-memory `*_rows` lists — zero disk I/O, native to the wrapper. Parquet-file staging is the fallback only. + +**Staging invariants (must hold for byte-equivalence):** +- **REL-table column rule.** kuzu `COPY FROM` into a REL table requires the first two columns to be the FROM/TO node primary keys (the node `id`). The staging shape per table must match the `_SCHEMA_*` constants exactly. +- **Materialize write-time work at staging.** Several rows are computed *during* the current per-row writes and must be produced before staging: the CALLS dedup (`seen_calls`, `build_ast_graph.py:3282-3288`), the `callee_declaring_role` lookup, and the UNRESOLVED dedup (`seen_ucs`, `3317-3321`). Apply these to the in-memory lists, then stage the result. +- **Node-before-edge ordering.** Stage and load all node tables before REL tables (kuzu enforces endpoint existence). The current `_write_*` call order already does this; preserve it. + +**Atomicity.** The current per-row path is not atomic (per-statement autocommit; a crash mid-build leaves a partial graph). `COPY FROM` raises this to per-table atomicity — an improvement, not a regression; the overall "rebuild in place" crash-safety story is unchanged. + +**Equivalence.** The rewritten build must produce a byte-equivalent graph, proven by an equivalence harness (see Tests): node/edge counts, GraphMeta counters, full edge property rows (incl. `source_file`, CALLS `callee_declaring_role`), and a battery of Cypher queries identical between old and new. + +Expected: ~321s graph-write phase → tens of seconds; overall `init` on this corpus from ~395s toward ~120s (projected; measured in PR-1). 
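Reviewer note: the staging invariants above (dedup before staging, FROM/TO keys as the first two columns) can be sketched in plain Python. This is only an illustration under stated assumptions: `stage_rel_rows`, the column names `caller_id` / `callee_id`, and the dedup key are hypothetical stand-ins, not the names used in `build_ast_graph.py`, and the real `seen_calls` key may differ.

```python
def stage_rel_rows(rows, from_col, to_col, dedup_cols):
    """Deduplicate edge rows, then return them column-wise with the
    FROM/TO endpoint keys as the first two columns (hypothetical helper)."""
    seen = set()
    kept = []
    for row in rows:
        # The dedup that the per-row writer does at write time happens here,
        # before anything is handed to the bulk loader.
        key = tuple(row[col] for col in dedup_cols)
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)

    # Remaining property columns keep their first-seen order.
    prop_cols = []
    for row in kept:
        for name in row:
            if name not in (from_col, to_col) and name not in prop_cols:
                prop_cols.append(name)

    ordered = [from_col, to_col] + prop_cols
    return {name: [row.get(name) for row in kept] for name in ordered}


# Hypothetical CALLS rows; the second row duplicates the first.
calls_rows = [
    {"source_file": "A.java", "caller_id": "A#run()", "callee_id": "B#go()",
     "callee_declaring_role": "SERVICE"},
    {"source_file": "A.java", "caller_id": "A#run()", "callee_id": "B#go()",
     "callee_declaring_role": "SERVICE"},
    {"source_file": "A.java", "caller_id": "A#run()", "callee_id": "C#stop()",
     "callee_declaring_role": "REPOSITORY"},
]
staged = stage_rel_rows(calls_rows, "caller_id", "callee_id",
                        ("caller_id", "callee_id"))
```

A columnar dict of equal-length lists is a shape that `pyarrow.table(...)` accepts, so the hand-off to the in-memory bulk path described above would be one conversion away. Whether the real staging keeps properties in this order is something the `_SCHEMA_*` constants decide, not this sketch.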
+ +### PR-2 — Bulk write for the incremental path (follow-up) + +Refactor a shared stage→`COPY FROM` primitive out of PR-1 and apply it to the incremental `_delete_file_scope` → re-emit flow, preserving the pass5/6 `MERGE (r:Route)` dedup semantics (`build_ast_graph.py:3819-3821`). + +### PR-3 — Cached `LayeredIgnore` (+ `is_ignored` memo) as a `ContextKey` + +Hoist the three per-file `LayeredIgnore(project_root)` constructions (`java_index_flow_lancedb.py:351/423/471`) into a single instance built once per flow run via a cocoindex `ContextKey` (lifespan-scoped — `PROJECT_ROOT`, `EMBEDDER`, `LANCE_DB` are already `ContextKey`s in `coco_lifespan`, `:60-72`/`:287-306`). Additionally memoize `is_ignored`'s `_mega(spec)` merge (cache the merged spec, or LRU by relative path) — the per-file cost is in `is_ignored`, not just the constructor, so construction hoisting alone will not reach ~0s. Keeps the cocoindex dependency inside `java_index_flow_lancedb.py` (AGENTS.md compliant). Expected ~25s → ~0s. ## Scope -- **PR-1:** replace per-row node/edge writes in the full rebuild path with - bulk `COPY FROM`; add equivalence harness + benchmark. -- **PR-2:** shared bulk primitive applied to the incremental path. -- **PR-3:** hoist `LayeredIgnore` to a flow-lifespan `ContextKey`. -- **PR-4:** `cuda → mps → cpu` device default in `index_common.py`. -- No new MCP tools, no new env vars (MPS reuses `SBERT_DEVICE`), no new public - surface. -## Schema / Ontology / Re-index impact -- **Ontology bump:** not required. No node/edge kinds, properties, or - enrichment semantics change. `ontology_version` stays 17. -- **PR-1 / PR-2 re-index:** not required. The graph contents are identical - (proven by the equivalence harness); only the write mechanism changes. Users - pick up the faster path on their next `init` / `reprocess` / `increment` - naturally. -- **PR-3 re-index:** not required. Same chunks, same vectors; only the ignore - check is faster. 
-- **PR-4 re-index:** recommended (optional), not required. Switching the - default backend to MPS changes stored embeddings at the ~1e-5 level (different - kernel numerics); cosine ranking is stable, so existing CPU-built indexes keep - working, but a fresh `init` yields a single consistent backend. Needs a README - "Re-index recommended" callout. -- **Config / tool surface:** none new. +### In scope -## Tests / Validation -- **PR-1 equivalence harness (mandatory):** build the same source tree old-way - (per-row) and new-way (`COPY FROM`); assert identical: node count, per-type - edge counts, `GraphMeta` counters (via `java-codebase-rag meta` / - `GraphMetaOutput`), and a battery of representative Cypher queries - (`neighbors`, `find`, `describe`) return identical rows. Run on - `tests/bank-chat-system`, the call-graph smoke fixture, and one larger corpus. -- **PR-1 benchmark:** capture `init` wall-clock before/after on the medium - corpus; report the graph-write phase delta. -- **PR-2:** incremental equivalence — `increment` after a single-file change - yields the same graph as a full rebuild of that state (reuse the harness). -- **PR-3:** assert the ignore object is constructed once per flow run (not per - file); existing flow tests unchanged; micro-benchmark confirms the ~25s drop. -- **PR-4:** unit test that device resolution prefers mps when - `torch.backends.mps.is_available()` (monkeypatched), falls back to cpu - otherwise; embedding shape/dim unchanged. +- **PR-1:** replace per-row node/edge writes in the full rebuild path (`write_ladybug:3893` → `_write_nodes:3096` → `_write_nodes_impl:3029`; edge emit `3250-3398`) with bulk in-memory-pyarrow `COPY FROM`; add equivalence harness + benchmark. +- **PR-2:** shared bulk primitive applied to the incremental path (preserve Route-MERGE dedup). +- **PR-3:** hoist `LayeredIgnore` to a flow-lifespan `ContextKey` and memoize `is_ignored`. -## Open Questions ([TBD]) -1. **Bulk source format** — Parquet vs CSV. 
Recommended: **Parquet** — - `pyarrow` is already present (transitive via `lancedb`), and it sidesteps - CSV quoting for Java FQNs / annotations / signatures. CSV is the simpler - fallback if Parquet proves awkward through the wrapper. -2. **Does `COPY FROM` pass through the `ladybug` wrapper unchanged?** — - Recommended: confirm with a ~10-line spike as the first step of PR-1 - (low-cost de-risk, folded into PR-1, not a separate spike PR). kuzu 0.11.3 - supports `COPY FROM` natively; the only unknown is whether `ladybug`'s - `conn.execute` forwards it verbatim. -3. **MPS-vs-CPU numerical drift (PR-4)** — re-index required or optional? - Recommended: **optional**; document in a README "Re-index recommended" - callout. Cosine ranking is stable across the ~1e-5 backend difference. -4. **PR-3 cache vehicle** — cocoindex `ContextKey` vs a module-global? - Recommended: **`ContextKey`** (cocoindex-native, correct across multiple flow - runs / lifespans, keeps the dependency in the flow module). -5. **Does PR-1 touch `increment`?** — No. Per agreed scope, `increment` keeps - its current per-row write until PR-2. PR-1 is init/reprocess only. - -## Out of scope -- ANN vector index — parked (issue #337); query latency is fine today and ANN - would tax indexing. +### Out of scope + +- **MPS embedding default — not needed.** The init flow already auto-selects MPS on Apple Silicon: `SBERT_DEVICE` unset → `config.py:280` omits it from the subprocess env → flow constructs `SentenceTransformerEmbedder(device=None)` (`java_index_flow_lancedb.py:291`) → `SentenceTransformer(device=None)` auto-detects `cuda → mps → cpu`. On this machine `torch.backends.mps.is_available()` is true, so the profiled init ran on MPS (~16s embed). There is no CPU→MPS win to recover; an override already exists (`SBERT_DEVICE` / `--embedding-device` / YAML `embedding.device`). +- ANN vector index — parked (issue #337); query latency is fine today and ANN would tax indexing. - `watch` live mode — issue #336. 
- Replacing or restructuring the cocoindex flow. - Changing the embedding model or dimension. - Parallelizing the graph analysis passes (pass1–pass6). - Converting the incremental write path in PR-1 (it is PR-2). +## Schema / Ontology / Re-index impact + +- **Ontology bump:** not required. No node/edge kinds, properties, or enrichment semantics change. `ontology_version` stays 17. +- **Re-index required:** no. PR-1/2 change only the write mechanism; PR-3 changes only a cache. The graph contents are identical (proven by the equivalence harness), so users pick up the faster path on their next `init` / `reprocess` / `increment` naturally — no migration. +- **Config / tool surface:** none new. + +## Tests / Validation + +1. **PR-1 equivalence harness (mandatory).** Build the same source tree old-way (per-row) and new-way (`COPY FROM`); assert identical: node count, per-type edge counts, `GraphMeta` counters (via `java-codebase-rag meta` / `GraphMetaOutput`), full property rows for a sample of N edges per type (including `source_file` and CALLS `callee_declaring_role` — proving the staging dedup/materialization is correct), and a battery of representative Cypher queries (`neighbors`, `find`, `describe`) returning identical rows. Run on `tests/bank-chat-system`, the call-graph smoke fixture, and one larger corpus. +2. **PR-1 benchmark.** Capture `init` wall-clock before/after on the medium corpus; report the graph-write phase delta. +3. **PR-2 incremental equivalence.** `increment` after a single-file change yields the same graph as a full rebuild of that state (reuse the harness). +4. **PR-3.** Assert the ignore object is constructed once per flow run (not per file) and `is_ignored` is memoized; existing flow tests unchanged; micro-benchmark confirms the ~25s drop. + +## Open Questions ([TBD]) + +1. Should the `GraphMeta` single-row MERGE (`build_ast_graph.py:3472-3473`) also move to bulk in PR-1, or stay per-row? 
— Recommended: **fold it into PR-1** (it is in the full-rebuild path; one extra small staging set). +2. PR-3 cache vehicle — cocoindex `ContextKey` vs a module-global? — Recommended: **`ContextKey`** (cocoindex-native, lifespan-scoped, keeps the dependency in the flow module). + ## Sequencing / Follow-ups -- **PR-1** — bulk `COPY FROM` for the full rebuild path + equivalence harness + - benchmark. Biggest win (~81% phase). Starts with the ladybug-pass-through - spike (Open Question 2). -- **PR-2** — shared bulk primitive applied to the incremental path (preserve - Route-MERGE dedup). -- **PR-3** — `LayeredIgnore` → flow-lifespan `ContextKey`. -- **PR-4** — `cuda → mps → cpu` device default + README callout. -- PR-3 and PR-4 are independent of PR-1/2 and of each other; they can land in - any order. PR-2 depends on PR-1's shared primitive. - -## PR body (proposal-only) template -### What -Adds `propose/active/INIT-INCREMENT-PERF-PROPOSE.md` describing the init / -increment performance program: bulk `COPY FROM` graph writes (full path first), -lifespan-cached `LayeredIgnore`, and an MPS embedding default. - -### Why now -Profiling (2026-06-21) showed graph writes are ~81% of `init`; the three levers -above are measured, independent, and unblock the project's stated init/increment -latency pain. - -### Highlights -- PR-1: bulk `COPY FROM` for the full rebuild path — projected ~312s graph write - → tens of seconds; `init` ~395s → ~120s on the profiled corpus. -- PR-2: same primitive extended to the incremental path. -- PR-3: hoist `LayeredIgnore` to a `ContextKey` — ~25s → ~0s. -- PR-4: default embedding device `cuda → mps → cpu` — ~28s → ~16s on Apple Silicon. -- No ontology bump; PR-1/2/3 re-index-free; PR-4 optional re-index callout. - -### Tests -Proposal-only; baseline unchanged. -### Out of scope -- Implementation of any PR (PR-1…PR-4 follow). -- ANN index (#337) and watch mode (#336). 
+- **PR-1** — bulk in-memory-pyarrow `COPY FROM` for the full rebuild path + equivalence harness + benchmark. Biggest win (~81% phase).
+- **PR-2** — shared bulk primitive applied to the incremental path. Depends on PR-1's primitive.
+- **PR-3** — `LayeredIgnore` + `is_ignored` → flow-lifespan `ContextKey`. Independent of PR-1/2.