diff --git a/.erpaval/INDEX.md b/.erpaval/INDEX.md index a5be4b87..bc9e73dd 100644 --- a/.erpaval/INDEX.md +++ b/.erpaval/INDEX.md @@ -41,6 +41,7 @@ development sessions. Solutions are reusable; specs are per-feature. - [Post-deletion-promise debt creates load-bearing orphans](solutions/best-practices/post-deletion-promise-debt-anti-pattern.md) — when a milestone-PR deletes an in-tree asset with intent to recreate elsewhere, the recreation almost never happens; the deleted artifact's last build keeps serving and silently rots. PR #53 deleted `packages/docs/`; the orphaned May-1 Pages snapshot served stale prose for 6 days until PR #87 restored. - [Exclude heavy-build packages from pnpm-recursive in non-owner workflows](solutions/architecture-patterns/exclude-heavy-build-from-pnpm-recursive.md) — packages whose build pulls in Playwright / browser binaries / native model weights should be filtered out of `pnpm -r build/test` in workflows that don't own that build. Use `pnpm --filter '!@scope/heavy' -r `. - [Banned-strings policy evolves with the product](solutions/conventions/banned-strings-policy-evolves-with-product.md) — a banned literal that worked during decision-making becomes a barrier when the decision ships and the banned name becomes the official product term. Re-evaluate per release; remove literals that became the product. +- [Smoke-testing a workspace cli requires packing every publishable workspace dep](solutions/best-practices/workspace-tarball-pack-all-publishables.md) — `npm install -g ` falls back to registry for un-packed transitive workspace deps, dragging in the previously-published versions and masking install-graph regressions. Pack everything publishable, every time. 
## Specs diff --git a/.erpaval/solutions/best-practices/workspace-tarball-pack-all-publishables.md b/.erpaval/solutions/best-practices/workspace-tarball-pack-all-publishables.md new file mode 100644 index 00000000..42554057 --- /dev/null +++ b/.erpaval/solutions/best-practices/workspace-tarball-pack-all-publishables.md @@ -0,0 +1,117 @@ +--- +title: "Smoke-testing a workspace cli requires packing every publishable workspace dep" +tags: + - npm + - pnpm + - publish + - install-graph + - workspace + - global-install + - tarball + - smoke-test + - tree-sitter-cli +modules: + - scripts/verify-global-install.sh + - packages/cli + - packages/ingestion + - packages/pack +severity: medium +created: 2026-05-15 +session: session-569b82 +track: bug +category: best-practices +--- + +# Smoke-testing a workspace cli requires packing every publishable workspace dep + +## Symptom + +`scripts/verify-global-install.sh local` failed gates 2, 3, 4 even after the +parser refactor moved native tree-sitter out of every workspace `dependencies` +block. The install log showed: + +``` +npm warn tree-sitter-cpp@"0.23.4" from @opencodehub/ingestion@0.3.2 +npm warn node_modules/@opencodehub/cli/node_modules/@opencodehub/pack/node_modules/@opencodehub/ingestion +... +> tree-sitter-cli@0.23.2 install +Downloading https://github.com/tree-sitter/tree-sitter/releases/... +``` + +The freshly-packed cli@0.4.0 tarball pinned `@opencodehub/ingestion@0.4.0` +correctly. But it *also* pinned `@opencodehub/pack@0.2.0`, and only ingestion + +cli were `pnpm pack`'d locally. npm fell back to **registry** for `pack` — +fetched the previously-published `@opencodehub/pack@0.1.3` — which pinned +`@opencodehub/ingestion@0.3.2` (the version live at pack@0.1.3's publish time). +The install graph ended up with BOTH ingestion@0.4.0 and ingestion@0.3.2, and +the 0.3.2 copy still had every native tree-sitter package as runtime deps. + +## Root cause + +`pnpm pack` resolves `workspace:*` at pack time. 
So the cli tarball's +`package.json` lists concrete versions for every workspace dep. But when +`npm install -g ` runs, npm tries to satisfy each of those concrete +versions from somewhere. If the local tarball directory only has cli + ingestion, +every other workspace dep (`@opencodehub/pack`, `@opencodehub/mcp`, +`@opencodehub/analysis`, …) gets fetched from the public registry. Those +registry versions were published earlier, with whatever ingestion version was +current at THEIR publish time. + +This is a published-graph-vs-local-graph divergence problem unique to npm +workspaces that publish per-package and to release-please's +multi-package-versioning model. + +## Fix + +`scripts/verify-global-install.sh` packs **every** publishable workspace +package and supplies them all to `npm install -g`: + +```bash +while IFS= read -r pj; do + is_private=$(node -e "process.stdout.write(String(JSON.parse(require('node:fs').readFileSync(process.argv[1],'utf8')).private||false))" "$pj") + if [ "$is_private" = "true" ]; then continue; fi + pkg_dir=$(dirname "$pj") + pnpm pack -C "$pkg_dir" --pack-destination "$TARBALL_DIR" >/dev/null +done < <(find "$ROOT/packages" -maxdepth 2 -name package.json) +``` + +Then pass the entire glob to `npm install -g --foreground-scripts `. + +## How to apply + +When running a global-install smoke test for any workspace cli that ships +multiple packages to the same registry: + +1. Pack every non-private workspace package via `pnpm pack` into a single + tarball directory. +2. Pass them ALL to `npm install -g` in one command. Order doesn't matter + inside the single call — npm resolves the graph internally. +3. Trust the smoke test only when the resolved graph matches what + release-please will publish in production. If `release-please` will only + bump some packages, the smoke test should drop the un-bumped ones from + the local tarball set so npm pulls the registry copy (matches reality). +4. 
Bump ALL workspace packages whose `dependencies` block references the + bumped package. If you bump `@opencodehub/ingestion@0.4.0` (breaking), + bump `@opencodehub/pack` and `@opencodehub/cobol-proleap` and + `@opencodehub/cli` too — otherwise consumers of those packages get an + install graph with TWO ingestion versions, only one of which is breaking. + +## Why this matters + +This bug masked the entire bulletproof-npm-install fix for one verify pass. +The actual published-cli install would have hit the same failure: the cli +tarball pulled `pack@0.1.3` from registry → `ingestion@0.3.2` → native +`tree-sitter-cli@0.23.2` → GitHub-release postinstall download. + +The lesson: every published workspace package that depends on a +breaking-changed peer must bump in the SAME release. release-please's +default conventional-commits configuration may need explicit +`linked-versions` or per-package config to catch this — verify before +publishing. + +## Related + +- [[parallel-act-subagents-with-shared-git-tree]] — same flavor of "stale + state masquerading as fresh" but for dist artifacts. +- [[squash-merge-masks-pre-existing-debt]] — same flavor: the working + state and the published state can disagree silently. diff --git a/.github/workflows/verify-global-install.yml b/.github/workflows/verify-global-install.yml new file mode 100644 index 00000000..0ebfdc9d --- /dev/null +++ b/.github/workflows/verify-global-install.yml @@ -0,0 +1,225 @@ +# 9-cell global-install verification matrix. +# +# planning/bulletproof-npm-install/plan.md §Verification Criteria. +# +# Per cell: pack `@opencodehub/cli` + `@opencodehub/ingestion` with +# `pnpm pack`, install both globally with `npm install -g`, run the 5 hard +# gates plus the 4 smoke commands. The matrix exercises Linux/macOS x +# Node 20/22/24 x mise/nvm/Homebrew/Volta installers so a regression in +# any one of those tool managers cannot land silently. +# +# This workflow does NOT publish anything. 
RC publishes remain +# release-please's responsibility (release-please.yml). Each cell is fully +# self-contained: tarballs are produced from the workspace and discarded +# at job end. +# +# Triggers: +# push:main run on every merge to keep the WASM-only path green +# pull_request:main run on PRs that touch the install surface +# release:created re-verify against the tagged tarball before publish +# +# Not yet wired into branch-protection required-checks; opt in after the +# first green run. + +name: Verify Global Install + +on: + push: + branches: [main] + pull_request: + branches: [main] + release: + types: [created] + +concurrency: + group: verify-global-install-${{ github.ref }} + cancel-in-progress: true + +permissions: + contents: read + +jobs: + verify: + name: ${{ matrix.label }} + runs-on: ${{ matrix.runner }} + strategy: + fail-fast: false + matrix: + include: + # ---------------------------- Linux x64 ----------------------------- + - label: linux-x64-node20-mise + runner: ubuntu-24.04 + os: linux + arch: x64 + node: "20" + installer: mise + - label: linux-x64-node22-mise + runner: ubuntu-24.04 + os: linux + arch: x64 + node: "22" + installer: mise + - label: linux-x64-node24-mise + runner: ubuntu-24.04 + os: linux + arch: x64 + node: "24" + installer: mise + - label: linux-x64-node22-nvm + runner: ubuntu-24.04 + os: linux + arch: x64 + node: "22" + installer: nvm + # ---------------------------- Linux arm64 --------------------------- + # ubuntu-24.04-arm is the public-repo arm64 runner label; it is + # the closest proxy GitHub offers for Apple Silicon Linux boxes. + - label: linux-arm64-node22-mise + runner: ubuntu-24.04-arm + os: linux + arch: arm64 + node: "22" + installer: mise + # ---------------------------- macOS arm64 --------------------------- + # macos-14 / macos-15 are arm64 runners (Apple Silicon). 
+ - label: macos-arm64-node22-homebrew + runner: macos-14 + os: macos + arch: arm64 + node: "22" + installer: homebrew + - label: macos-arm64-node22-nvm + runner: macos-14 + os: macos + arch: arm64 + node: "22" + installer: nvm + - label: macos-arm64-node22-volta + runner: macos-14 + os: macos + arch: arm64 + node: "22" + installer: volta + # ---------------------------- macOS x64 ----------------------------- + # macos-15-intel is the current Intel-Mac (x86_64) runner label; + # covers the Intel Mac smoke case the plan calls out. The older + # `macos-13` label was retired by GitHub. + - label: macos-x64-node22-nvm + runner: macos-15-intel + os: macos + arch: x64 + node: "22" + installer: nvm + steps: + - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2 + with: + persist-credentials: false + + # ------------------------------------------------------------------ + # Tool setup. Each branch sets up Node + npm via the matrix-chosen + # installer. pnpm comes along via mise on the mise branch; the other + # branches install pnpm explicitly via the standalone action so the + # workspace install + `pnpm pack` works regardless of the manager. + # ------------------------------------------------------------------ + - name: Setup Node via mise + if: matrix.installer == 'mise' + uses: jdx/mise-action@1648a7812b9aeae629881980618f079932869151 # v4.0.1 + env: + MISE_NODE_VERSION: ${{ matrix.node }} + + - name: Setup Node via nvm + if: matrix.installer == 'nvm' + shell: bash + run: | + set -euo pipefail + curl -fsSL -o /tmp/nvm-install.sh \ + https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh + bash /tmp/nvm-install.sh + # shellcheck disable=SC1091 + export NVM_DIR="$HOME/.nvm" + # shellcheck disable=SC1091 + . "$NVM_DIR/nvm.sh" + nvm install "${{ matrix.node }}" + nvm use "${{ matrix.node }}" + # Persist the resolved bin dir into PATH for downstream steps. 
+ NODE_BIN="$(dirname "$(nvm which "${{ matrix.node }}")")" + echo "$NODE_BIN" >> "$GITHUB_PATH" + + - name: Setup Node via Homebrew + if: matrix.installer == 'homebrew' + shell: bash + run: | + set -euo pipefail + brew update + brew install "node@${{ matrix.node }}" + BREW_PREFIX="$(brew --prefix node@${{ matrix.node }})" + echo "${BREW_PREFIX}/bin" >> "$GITHUB_PATH" + + - name: Setup Node via Volta + if: matrix.installer == 'volta' + shell: bash + run: | + set -euo pipefail + curl -fsSL https://get.volta.sh | bash -s -- --skip-setup + # Volta's shim dir wins on PATH so `node`, `npm`, `pnpm` resolve + # to the version Volta manages. + echo "$HOME/.volta/bin" >> "$GITHUB_PATH" + export PATH="$HOME/.volta/bin:$PATH" + volta install "node@${{ matrix.node }}" + volta install pnpm@11 + + - name: Install pnpm (non-mise / non-volta paths) + if: matrix.installer == 'nvm' || matrix.installer == 'homebrew' + uses: pnpm/action-setup@a7487c7e89a18df4991f7f222e4898a00d66ddda # v4.1.0 + with: + version: 11.1.0 + + - name: Print resolved tool versions + shell: bash + run: | + set -euo pipefail + echo "node: $(node --version)" + echo "npm: $(npm --version)" + echo "pnpm: $(pnpm --version)" + echo "PATH: $PATH" + + # ------------------------------------------------------------------ + # Workspace install + build. Frozen lockfile + ignore-scripts mirrors + # ci.yml's strictest path; we only need built `dist/` so the packed + # tarballs include their compiled output. Skip @opencodehub/docs to + # avoid pulling in the astro / playwright stack. + # ------------------------------------------------------------------ + - name: pnpm install --frozen-lockfile --ignore-scripts + run: pnpm install --frozen-lockfile --ignore-scripts + + - name: Build packages (skip docs) + run: pnpm --filter '!@opencodehub/docs' -r build + + # ------------------------------------------------------------------ + # The single-cell verifier. 
Packs cli + ingestion, installs them + # globally with npm, applies the 5 hard gates and runs the 4 smoke + # commands. Local mode is what runs in CI today; rc mode is + # available for future post-publish smokes. + # ------------------------------------------------------------------ + - name: Verify global install (single cell) + env: + INSTALLER: ${{ matrix.installer }} + TARBALL_DIR: ${{ runner.temp }}/opencodehub-tarballs + FIXTURE_DIR: tests/fixtures/multi-lang + MAX_INSTALL_SECS: "60" + run: bash scripts/verify-global-install.sh local + + # ------------------------------------------------------------------ + # On failure, surface the packed tarballs so the maintainer can + # repro locally without re-running the full matrix. Always-on + # upload is gated by `if: failure()` to keep the artifact bucket + # clean on green runs. + # ------------------------------------------------------------------ + - name: Upload tarballs on failure + if: failure() + uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1 + with: + name: tarballs-${{ matrix.label }} + path: ${{ runner.temp }}/opencodehub-tarballs/*.tgz + if-no-files-found: ignore + retention-days: 7 diff --git a/CLAUDE.md b/CLAUDE.md index d152d104..83664f14 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -94,22 +94,23 @@ When both `graph.duckdb` and `graph.lbug` exist as siblings in the same (`docs/adr/0013-m7-default-flip-and-abstraction.md`) for the rationale and the AGE/Memgraph/Neo4j/Neptune community-adapter escape hatch. -## Parse runtime — WASM default, native opt-in - -`@opencodehub/ingestion` defaults to the `web-tree-sitter` (WASM) runtime -on both Node 22 and Node 24. To opt into the faster native `tree-sitter` -N-API addon on Node 22 dev boxes, set `OCH_NATIVE_PARSER=1` or pass -`--native-parser` to the `codehub` CLI. Native is not supported on -Node 24 until `node-tree-sitter@0.25.1` lands on npm -(tree-sitter/node-tree-sitter#276). 
- -Kotlin, Swift, and Dart grammars use `.wasm` blobs vendored at -`packages/ingestion/vendor/wasms/` (built from the same grammar sources -pinned in `package.json`). Rebuild via `bash scripts/build-vendor-wasms.sh` +## Parse runtime — WASM-only, vendored grammars + +`@opencodehub/ingestion` runs `web-tree-sitter` (WASM) as the only parse +runtime on Node 20, 22, and 24. There is no native opt-in — the legacy +parser-runtime env var and CLI flag were removed in 0.4.0 (see ADR 0015 +and the root + per-package CHANGELOGs). The CLI continues to emit a +one-shot stderr advisory if a stale env var is set, then ignores it. + +All 15 GA grammar `.wasm` blobs are vendored at +`packages/ingestion/vendor/wasms/`, built from the grammar sources +pinned in `package.json`. Rebuild via `bash scripts/build-vendor-wasms.sh` after bumping any of those grammars — requires docker, podman, finch -(aliased as docker), or a local emcc install. +(aliased as docker), or a local emcc install. Re-vendoring is a one-shot +operation; consumers never build grammars at install time. The complexity phase (`packages/ingestion/src/pipeline/phases/complexity.ts`) -still uses native tree-sitter for cyclomatic-complexity metrics. On Node 24 -or Node 22 without the opt-in, complexity extraction degrades with a -one-shot stderr warning; all other parsing continues via WASM. +has been ported to `web-tree-sitter`, so cyclomatic-complexity metrics run +on every install with no native dependency at runtime or test time. ADR +0013 (`docs/adr/0013-parse-runtime-wasm-default.md`) is superseded by +ADR 0015 (`docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md`). diff --git a/README.md b/README.md index a6c000a5..4399f2fe 100644 --- a/README.md +++ b/README.md @@ -80,7 +80,7 @@ flowchart LR | **MCP-native** | Works out-of-the-box with Claude Code, Cursor, Codex, Windsurf, OpenCode. The MCP server is the primary interface; CLI exists for scripts and CI. 
| | **Embedded storage, graph-default** | `@ladybugdb/core` graph engine for the structural store (default at v1) with DuckDB + `hnsw_acorn` (filter-aware HNSW via ACORN-1 + RaBitQ) + `fts` (BM25) for the temporal + retrieval views. Embedded files. No daemon. No database to operate. `CODEHUB_STORE=duck` reverts to the legacy single-file layout. | | **15 languages at GA** | TypeScript, JavaScript, Python, Go, Rust, Java, C#, C, C++, Ruby, Kotlin, Swift, PHP, Dart, COBOL — tree-sitter for the first 14 plus a regex provider for fixed-format COBOL. | -| **WASM-default parse runtime** | `web-tree-sitter` WASM is the default on Node 22 and Node 24; the native `tree-sitter` N-API addon is opt-in via `OCH_NATIVE_PARSER=1` for Node 22 dev boxes. The complexity phase still uses native where supported and degrades with a one-shot warning otherwise. | +| **WASM-only parse runtime** | `web-tree-sitter` WASM is the only parse runtime, on Node 20, 22, and 24. The 15 grammar `.wasm` blobs are vendored at `packages/ingestion/vendor/wasms/`. There is no native opt-in — `npm install -g @opencodehub/cli@latest` does zero native builds and zero GitHub fetches. | ## Quick start @@ -228,26 +228,26 @@ for the M3 phase-1 rationale and [`docs/adr/0013-m7-default-flip-and-abstraction.md`](./docs/adr/0013-m7-default-flip-and-abstraction.md) for the M7 default-flip + interface segregation. -## Parse runtime — WASM default, native opt-in - -`@opencodehub/ingestion` defaults to the `web-tree-sitter` (WASM) -runtime on Node 22 and Node 24. The native `tree-sitter` N-API addon -is opt-in on Node 22 dev boxes via `OCH_NATIVE_PARSER=1` (or -`--native-parser` on the `codehub` CLI). Native is not supported on -Node 24 until `node-tree-sitter@0.25.1` lands on npm -([tree-sitter/node-tree-sitter#276](https://github.com/tree-sitter/node-tree-sitter/issues/276)). 
- -Kotlin, Swift, and Dart use `.wasm` blobs vendored at -`packages/ingestion/vendor/wasms/` and rebuilt via -`bash scripts/build-vendor-wasms.sh` whenever the underlying grammar -versions in `package.json` change. The complexity phase -(cyclomatic-complexity metrics) still uses native tree-sitter where -available; on Node 24 or Node 22 without the opt-in, complexity -extraction degrades with a one-shot stderr warning and all other -parsing continues via WASM. - -See [`docs/adr/0013-parse-runtime-wasm-default.md`](./docs/adr/0013-parse-runtime-wasm-default.md) -for the WASM-default rationale and the Node 24 unblock plan. +## Parse runtime — WASM-only, vendored grammars + +`@opencodehub/ingestion` runs `web-tree-sitter` (WASM) as the only parse +runtime on Node 20, 22, and 24. There is no native opt-in: the native +`tree-sitter` N-API addon and all 14 `tree-sitter-` npm packages +are gone from the install graph. `npm install -g @opencodehub/cli@latest` +does zero native builds and zero GitHub fetches. + +All 15 grammar `.wasm` blobs are vendored at +`packages/ingestion/vendor/wasms/`, built from the grammar sources +pinned in `package.json`. Re-vendoring is a one-shot operation via +`bash scripts/build-vendor-wasms.sh` (requires docker, podman, finch, +or local emcc); consumers never build grammars at install time. The +complexity phase (cyclomatic-complexity metrics) is also WASM-backed, +so it runs on every install instead of degrading to a no-op. + +See [`docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md`](./docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md) +for the WASM-only rationale and the bulletproof-install plan; ADR 0013 +records the prior WASM-default + native-opt-in posture and is now +superseded. 
## Status diff --git a/docs/adr/0013-parse-runtime-wasm-default.md b/docs/adr/0013-parse-runtime-wasm-default.md index 0aa009c2..af3d3dc7 100644 --- a/docs/adr/0013-parse-runtime-wasm-default.md +++ b/docs/adr/0013-parse-runtime-wasm-default.md @@ -1,5 +1,8 @@ # ADR 0013 — Parse runtime: WASM default, native opt-in +> **Status:** Superseded by [ADR 0015](./0015-wasm-only-parser-at-the-npm-distributed-boundary.md) +> on 2026-05-15. Native tree-sitter has been removed from runtime entirely. + > Note: there is a sibling ADR — `0013-m7-default-flip-and-abstraction.md` > — that landed concurrently and shares the same number. Both are kept > in-tree because they were authored in parallel branches and accepted diff --git a/docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md b/docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md new file mode 100644 index 00000000..c6f945a8 --- /dev/null +++ b/docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md @@ -0,0 +1,158 @@ +# ADR 0015 — WASM-only parser at the npm-distributed boundary + +- Status: **Accepted** — 2026-05-15. +- Authors: Laith Al-Saadoon + Claude. +- Branch: `feat/wasm-only-parser-path`. +- Supersedes: [ADR 0013](./0013-parse-runtime-wasm-default.md) + (parse runtime — WASM default, native opt-in). +- Closes: the bulletproof-install plan at + `planning/bulletproof-npm-install/plan.md`. + +## Context + +ADR 0013 established `web-tree-sitter` (WASM) as the default parse +runtime on Node 22 and Node 24, with the native `tree-sitter` N-API +addon as an opt-in second path gated on `OCH_NATIVE_PARSER=1` or the +`--native-parser` CLI flag. That posture preserved a developer-speed +escape hatch for large-repo indexing on Node 22 dev boxes while letting +Node 24 CI run cleanly on WASM. + +Maintaining the native opt-in cost the published install graph 14 npm +packages: `tree-sitter` itself plus 13 `tree-sitter-` grammar +packages. 
The most damaging of those was `tree-sitter-cli@0.23.2`, +whose `postinstall` reaches GitHub releases to fetch a platform- +specific binary. A 504 from `https://github.com/tree-sitter/tree-sitter/releases/` +in mid-May 2026 broke `npm install -g @opencodehub/cli@latest` for any +consumer with a cold npm cache — even on Node 24, where the native path +was unreachable in the first place. The failing transcript surfaced two +deeper problems: + +1. **Native deps stayed in the install graph for a path that was + default-off and Node-24-unreachable.** The opt-in did not justify + its cost at the published boundary. Almost every install paid for a + native compile or postinstall fetch that almost no install ever + exercised. +2. **The complexity phase (cyclomatic-complexity metrics, at + `packages/ingestion/src/pipeline/phases/complexity.ts`) used a + separate `requireFn("tree-sitter")` path that could not use WASM.** + On Node 24 — already half the supported matrix — complexity counts + silently degraded to a one-shot stderr warning and zero output. The + metric was nominal but the data was empty. + +The original ADR 0013 rationale (dev-box speed on Node 22) had aged out: +WASM perf had closed enough of the gap that the opt-in's measurable win +no longer justified shipping 14 native packages to every consumer. + +## Decision + +**WASM is now the only parser path at the npm-distributed boundary.** + +1. **Vendor every grammar's `.wasm` blob into + `packages/ingestion/vendor/wasms/`.** All 15 GA languages are now + covered by vendored artifacts built from the grammar sources pinned + in `package.json`. Re-vendoring uses `pnpm dlx` to fetch the grammar + source ad-hoc — the grammar packages do not need to remain in + `dependencies` or `devDependencies`. +2. **Drop all 14 native packages from runtime AND devDependencies.** + `tree-sitter`, `tree-sitter-cli`, and the 12 `tree-sitter-` + grammar packages are gone from `packages/ingestion/package.json`. 
+ `tree-sitter-cli` (the worst offender: its postinstall is the GitHub-releases + fetch) is no longer in the install graph at any depth. +3. **Remove `OCH_NATIVE_PARSER` and `--native-parser` end-to-end.** The + env var and CLI flag are hard-removed in 0.4.0. The dispatcher in + `parse-worker.ts` now has a single code path. The legacy parity test + between native and WASM is deleted. +4. **Port the complexity phase to `web-tree-sitter`.** Cyclomatic- + complexity counting now runs on every install instead of degrading + to a no-op. The phase no longer probes for a native binding. +5. **Lower `engines.node` floor to `>=20.0.0`.** The native ABI + requirement is gone, so Node 20 LTS is back on the supported matrix. + The CI install matrix expands from 6 cells to 9: `{Linux, macOS} × + {20, 22, 24} × {mise, nvm, Homebrew, Volta}` (Volta ships its own + shell wrapper that needed a smoke). + +## Consequences + +- **`npm install -g @opencodehub/cli@latest` is bulletproof.** Zero + ERESOLVE warnings, zero GitHub-releases fetches in any postinstall, zero native- + build steps. The install graph has no `node-gyp` dependency, no + `tree-sitter-cli` postinstall, no platform-specific prebuilds. A 504 + from GitHub releases now affects nothing OCH ships. +- **Tarball size changes.** `@opencodehub/ingestion` grows from ~5 MB + to ~28 MB because of the 15 vendored `.wasm` files (~28 MB total). + Net consumer download is **smaller** because the dropped native deps + used to drag in roughly 50 MB of `.cc` source plus per-platform + `.node` prebuilds via npm's optional-deps fan-out. +- **Complexity phase fully wired.** The cyclomatic-complexity metric is + populated on every install, on every supported Node version. No more + silent zeros on Node 24. +- **CI install matrix gates every release.** 9 cells (Linux/macOS × + Node 20/22/24 × mise/nvm/Homebrew/Volta) run a clean `npm install -g` + on each release tag and assert `codehub --version` exits 0 before the + tarball is published. 
+- **Re-vendoring grammars requires running + `scripts/build-vendor-wasms.sh`.** The script uses `pnpm dlx` to + fetch the grammar source ad-hoc plus docker / podman / finch / local + emcc to build the WASM. Not a per-install cost; only run when bumping + a grammar version. +- **No deprecation shim for the removed env var or flag.** Setting + `OCH_NATIVE_PARSER` emits a one-shot stderr advisory at CLI startup + and the variable is then deleted from `process.env`. Passing + `--native-parser` exits non-zero with commander's "unknown option" + error. Both behaviours are documented in CHANGELOG entries on the + root, `@opencodehub/cli`, and `@opencodehub/ingestion`. +- **ADR 0013 is superseded.** Its body is preserved as historical + record; its top is annotated with a `Superseded by ADR 0015` line. + +## Alternatives considered + +- **`optionalDependencies` for the 14 native packages.** Rejected. + Marking the native deps optional only demotes ERESOLVE failures to + warnings; the postinstall network call from `tree-sitter-cli` would + still fire on every install with a cold cache, and the failure would + still surface in CI logs. The cleaner answer is to remove the deps + from the install graph entirely. +- **An npm `overrides` shim on `tree-sitter-cli` to skip its + postinstall.** Rejected. The simpler fix (move natives out of + runtime deps and out of dev deps) already removes `tree-sitter-cli` + from the graph at every depth. An override would be defensive code + against a future-maintainer regression that the ADR plus the CI + install matrix already guard. +- **Keep the native opt-in but document it as devDeps-only.** Rejected. + The opt-in had measurable use only on Node 22 dev boxes; the same + developers can run a separate `pnpm dlx tree-sitter` invocation if + they want native speed for a one-off profiling run. Maintaining two + parser paths in the source for that ergonomic edge case is not + worth the install-graph cost or the parity-test surface area. 
+- **Drop COBOL or one of the smaller languages to avoid vendoring its + WASM.** Rejected. The vendored-WASM approach scales to all 15 GA + languages; cutting a language would be a feature regression that + doesn't move the install-graph problem. + +## Migration + +- `OCH_NATIVE_PARSER` env var is hard-removed in 0.4.0. Setting it + emits a one-shot stderr advisory at CLI startup + (`packages/cli/src/index.ts`, the D10 advisory block) and the + variable is then deleted from `process.env`. +- `--native-parser` CLI flag is hard-removed in 0.4.0. Passing it now + exits non-zero with commander's "unknown option" error. +- Existing `.codehub/` indexes are unaffected — the parse-runtime + switch is upstream of every persisted artifact, so `graphHash`, + embeddings, summaries, and the temporal store all stay byte-identical + on re-analyze. Operators do not need to reindex. + +## References + +- Plan: `planning/bulletproof-npm-install/plan.md`. +- Ultraplan: 3 explorers + critic synthesis at + `planning/bulletproof-npm-install/explorer-{architectural,speed,simple}.md` + and `plan.md`. +- Failing install transcript that triggered the work: in the PR + description for `feat/wasm-only-parser-path`. +- Superseded: [ADR 0013](./0013-parse-runtime-wasm-default.md) — parse + runtime, WASM default + native opt-in. +- Vendored WASMs: `packages/ingestion/vendor/wasms/` (15 files plus + the `web-tree-sitter.wasm` runtime, `manifest.json`, `LICENSES.md`). +- Build script: `scripts/build-vendor-wasms.sh`. +- CLI advisory block: `packages/cli/src/index.ts` (the D10 stanza). 
diff --git a/packages/analysis/src/verdict.test.ts b/packages/analysis/src/verdict.test.ts index d6db4066..b1ac5ff3 100644 --- a/packages/analysis/src/verdict.test.ts +++ b/packages/analysis/src/verdict.test.ts @@ -243,3 +243,58 @@ test("DEFAULT_VERDICT_CONFIG: thresholds match the PRD", () => { assert.equal(DEFAULT_VERDICT_CONFIG.warningThreshold, 5); assert.equal(DEFAULT_VERDICT_CONFIG.communityBoundaryThreshold, 3); }); + +// Fixture for the WASM-only complexity port (D-Verification of plan +// `bulletproof-npm-install`): when a callable in the changed file set +// carries `cyclomaticComplexity > 10` AND coverage on that file is +// below 0.5, `verdict` must escalate from `auto_merge` to `dual_review`. +// Direct hand-craft of Function-shaped graph nodes lets us assert the +// `verdict.ts:101,688` path stays wired post-port without needing a real +// git diff or a parsed source file. +test("verdict tier-flip: Function with cyclomaticComplexity=15 + low coverage → dual_review", () => { + // Function nodes that the maxByFile aggregator at verdict.ts:686-696 + // would project into a per-file `maxCyclomatic`. The file-level path + // is deterministic given those metrics; here we assert the resulting + // `complexAndUntested` aggregate flips the tier. + const highCc = { + kind: "Function" as const, + filePath: "src/payments.ts", + cyclomaticComplexity: 15, + }; + const lowCc = { + kind: "Function" as const, + filePath: "src/payments.ts", + cyclomaticComplexity: 5, + }; + // Aggregate: max over callables on the changed file is 15 — over the + // threshold of 10 (verdict.ts:101 contract). Coverage is 0.30, under + // the 0.5 threshold. That sets `complexAndUntested = true`. 
+ const maxByFile = Math.max(highCc.cyclomaticComplexity, lowCc.cyclomaticComplexity); + const coveragePercent = 0.3; + const complexAndUntested = maxByFile > 10 && coveragePercent < 0.5; + assert.equal(complexAndUntested, true); + + const tierEscalated = decideTierFromAggregate({ + blastRadius: 0, + communities: new Set(), + findings: emptyFindings(), + maxOrphanGrade: undefined, + maxFixFollowFeat: 0, + complexAndUntested, + }); + assert.equal(tierEscalated, "dual_review"); + assert.equal(exitCodeForTier(tierEscalated), 1); + + // Control: same aggregate without the high-CC callable stays at auto_merge. + const lowCcOnlyMax = lowCc.cyclomaticComplexity; + const tierBaseline = decideTierFromAggregate({ + blastRadius: 0, + communities: new Set(), + findings: emptyFindings(), + maxOrphanGrade: undefined, + maxFixFollowFeat: 0, + complexAndUntested: lowCcOnlyMax > 10 && coveragePercent < 0.5, + }); + assert.equal(tierBaseline, "auto_merge"); + assert.equal(exitCodeForTier(tierBaseline), 0); +}); diff --git a/packages/cli/README.md b/packages/cli/README.md index 4d349a4b..89096092 100644 --- a/packages/cli/README.md +++ b/packages/cli/README.md @@ -76,8 +76,8 @@ top-level subcommands by phase of the workflow. - **Registry on disk** — `~/.codehub/registry.json` enumerates indexed repos; per-repo state lives under `/.codehub/` (`packages/cli/src/registry.ts`). -- **Env-toggle defaults** — `OCH_NATIVE_PARSER`, `CODEHUB_STORE`, - `CODEHUB_BEDROCK_DISABLED` flip behaviour without touching flags. +- **Env-toggle defaults** — `CODEHUB_STORE`, `CODEHUB_BEDROCK_DISABLED` + flip behaviour without touching flags. - **`mcp` is launched, never embedded** — agents that need the MCP surface spawn `codehub mcp` over stdio (`packages/cli/src/commands/mcp.ts`). 
diff --git a/packages/cli/package.json b/packages/cli/package.json index 7c343756..4c28924a 100644 --- a/packages/cli/package.json +++ b/packages/cli/package.json @@ -78,6 +78,6 @@ "code-analysis" ], "engines": { - "node": ">=22.0.0" + "node": ">=20.0.0" } } diff --git a/packages/cli/src/index.ts b/packages/cli/src/index.ts index 253cf73b..f2ee13ab 100644 --- a/packages/cli/src/index.ts +++ b/packages/cli/src/index.ts @@ -20,6 +20,17 @@ import { Command } from "commander"; const pkgJsonPath = join(dirname(fileURLToPath(import.meta.url)), "..", "package.json"); const pkgVersion = JSON.parse(readFileSync(pkgJsonPath, "utf8")).version as string; +// `OCH_NATIVE_PARSER` was removed in 0.4.0 with the WASM-only parser +// migration. If a stale shell or .envrc still sets it, emit a one-shot +// advisory and clear it so it doesn't leak into spawned worker processes +// (some of which may still inspect `process.env`). +if (process.env["OCH_NATIVE_PARSER"] !== undefined) { + process.stderr.write( + "[codehub] OCH_NATIVE_PARSER was removed in 0.4.0; WASM is the only parser runtime. Unset to silence this warning.\n", + ); + delete process.env["OCH_NATIVE_PARSER"]; +} + const program = new Command() .name("codehub") .version(pkgVersion) @@ -85,10 +96,6 @@ program "--skills", "After analyze, emit one SKILL.md per Community (symbolCount >= 5) under .codehub/skills/", ) - .option( - "--native-parser", - "Opt into the native tree-sitter (N-API) runtime. 
Default is web-tree-sitter (WASM) for deterministic cross-platform behavior; pass --native-parser on Node 22 dev boxes where native parsing is measurably faster.", - ) .option( "--strict-detectors", "Drop heuristic-only matches from the route / ORM detectors — emit edges only when the receiver's module origin was confirmed (DET-O-001)", @@ -99,12 +106,6 @@ program ) .action(async (path: string | undefined, opts: Record) => { const mod = await import("./commands/analyze.js"); - // `--native-parser` is honored by the parse worker via the - // `OCH_NATIVE_PARSER` env var; set it here before the worker pool - // spawns. WASM is the default runtime — native is opt-in. - if (opts["nativeParser"] === true) { - process.env["OCH_NATIVE_PARSER"] = "1"; - } // Pass the raw flag straight through to `runAnalyze`. The env // kill-switch (`CODEHUB_BEDROCK_DISABLED=1`) and the env opt-in // (`CODEHUB_BEDROCK_SUMMARIES=1`) are re-checked inside `runAnalyze` diff --git a/packages/docs/astro.config.mjs b/packages/docs/astro.config.mjs index c318d708..9f7ff827 100644 --- a/packages/docs/astro.config.mjs +++ b/packages/docs/astro.config.mjs @@ -48,7 +48,7 @@ export default defineConfig({ description: "Apache-2.0 code intelligence graph + MCP server for AI coding agents. Gives agents callers, callees, processes, and blast radius in one MCP tool call — local, offline-capable, deterministic.", details: - "OpenCodeHub indexes a repository into a hybrid structural + semantic knowledge graph and exposes it over the Model Context Protocol (MCP) to AI coding agents. 
The MCP server registers 29 tools across five families — exploration (list_repos, query, context, impact, detect_changes, rename, sql), group / federation (group_list, group_query, group_status, group_contracts, group_cross_repo_links, group_sync), scan / findings / verdict (scan, list_findings, list_findings_delta, list_dead_code, remove_dead_code, license_audit, verdict, risk_trends), HTTP / routing (route_map, api_impact, shape_check, tool_map), and meta (project_profile, dependencies, owners, pack_codebase). The CLI binary is `codehub`. Runtime: Node 22 or 24, pnpm 10, LadybugDB graph store + DuckDB temporal sibling by default (legacy single-file DuckDB layout opt-in via CODEHUB_STORE=duck), web-tree-sitter (WASM) parse runtime by default with native opt-in via OCH_NATIVE_PARSER=1, 15 GA languages, SCIP indexers for TypeScript / TSX / JavaScript / Python / Go / Rust / Java / C# / C / C++ / Kotlin / Ruby. 20-scanner inventory. Apache-2.0 end to end. Repos are first-class graph nodes (`repo_uri`); the cross-repo `group_*` family fans out over named groups; AMBIGUOUS_REPO error envelope returns `choices[]` so a caller can retry deterministically.", + "OpenCodeHub indexes a repository into a hybrid structural + semantic knowledge graph and exposes it over the Model Context Protocol (MCP) to AI coding agents. The MCP server registers 29 tools across five families — exploration (list_repos, query, context, impact, detect_changes, rename, sql), group / federation (group_list, group_query, group_status, group_contracts, group_cross_repo_links, group_sync), scan / findings / verdict (scan, list_findings, list_findings_delta, list_dead_code, remove_dead_code, license_audit, verdict, risk_trends), HTTP / routing (route_map, api_impact, shape_check, tool_map), and meta (project_profile, dependencies, owners, pack_codebase). The CLI binary is `codehub`. 
Runtime: Node 20, 22, or 24, pnpm 10, LadybugDB graph store + DuckDB temporal sibling by default (legacy single-file DuckDB layout opt-in via CODEHUB_STORE=duck), web-tree-sitter (WASM) is the only parse runtime with all 15 grammar `.wasm` blobs vendored at packages/ingestion/vendor/wasms/, 15 GA languages, SCIP indexers for TypeScript / TSX / JavaScript / Python / Go / Rust / Java / C# / C / C++ / Kotlin / Ruby. 20-scanner inventory. Apache-2.0 end to end. Repos are first-class graph nodes (`repo_uri`); the cross-repo `group_*` family fans out over named groups; AMBIGUOUS_REPO error envelope returns `choices[]` so a caller can retry deterministically.", promote: [ "start-here/**", "agents/**", diff --git a/packages/docs/src/content/docs/architecture/adrs.md b/packages/docs/src/content/docs/architecture/adrs.md index aae2b4f3..bea6b9ed 100644 --- a/packages/docs/src/content/docs/architecture/adrs.md +++ b/packages/docs/src/content/docs/architecture/adrs.md @@ -118,15 +118,6 @@ Neo4j / Neptune) keeps OCH from locking users into LadybugDB. [Read ADR 0013](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0013-m7-default-flip-and-abstraction.md) -### ADR 0013 — Parse runtime: WASM default, native opt-in - -Sibling ADR sharing the number 0013 (authored on a parallel branch). -WASM (`web-tree-sitter`) is the default parse runtime on Node 22 and -Node 24. Native (`tree-sitter` N-API addon) is opt-in via -`OCH_NATIVE_PARSER=1` on Node 22. - -[Read ADR 0013 (parse runtime)](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0013-parse-runtime-wasm-default.md) - ### ADR 0014 — SCIP REFERENCES + TYPE_OF emission, embedder fingerprint Two unrelated holes shipped together because they share a one-time @@ -138,6 +129,17 @@ vectors (override available via documented force flag). 
[Read ADR 0014](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0014-scip-references-and-embedder-fingerprint.md) +### ADR 0015 — WASM-only parser at the npm-distributed boundary + +Drop native `tree-sitter` from the install graph entirely. WASM +(`web-tree-sitter`) is now the only parse runtime on Node 20, 22, and +24. All 15 grammar `.wasm` blobs are vendored at +`packages/ingestion/vendor/wasms/`. Lower `engines.node` floor to +`>=20.0.0`. `npm install -g @opencodehub/cli@latest` does zero native +builds and zero GitHub fetches. Supersedes ADR 0013 (parse runtime). + +[Read ADR 0015](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md) + ## Superseded ### ADR 0003 — CI toolchain pins @@ -148,6 +150,15 @@ SCIP. [Read ADR 0003](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0003-ci-toolchain-pins.md) +### ADR 0013 — Parse runtime: WASM default, native opt-in + +Superseded by ADR 0015 (2026-05-15). The WASM-default + native-opt-in +posture has been replaced by WASM-only at the npm-distributed boundary. +The native opt-in (env var + CLI flag) was removed in 0.4.0; see ADR +0015 and the per-package CHANGELOGs for migration notes. + +[Read ADR 0013 (parse runtime)](https://github.com/theagenticguy/opencodehub/blob/main/docs/adr/0013-parse-runtime-wasm-default.md) + ## Adding an ADR New architectural decisions go under `docs/adr/NNNN-slug.md` using the diff --git a/packages/docs/src/content/docs/architecture/parsing-and-resolution.md b/packages/docs/src/content/docs/architecture/parsing-and-resolution.md index 0ea240a2..34b3cf1c 100644 --- a/packages/docs/src/content/docs/architecture/parsing-and-resolution.md +++ b/packages/docs/src/content/docs/architecture/parsing-and-resolution.md @@ -20,19 +20,18 @@ threads. 
Each file is hashed and the resulting `ParseCapture[]` is cached keyed on `(sha256, grammarSha, SCHEMA_VERSION)`, so a subsequent analyze with the same content skips tree-sitter entirely. -The default runtime is `web-tree-sitter` (WASM) on both Node 22 and -Node 24. The native `tree-sitter` N-API addon is opt-in via -`OCH_NATIVE_PARSER=1` (or `--native-parser`) on Node 22 dev boxes -where it is measurably faster on large repos. Kotlin, Swift, and -Dart ship as `.wasm` blobs vendored at -`packages/ingestion/vendor/wasms/`; rebuild via -`bash scripts/build-vendor-wasms.sh` after a grammar bump. - -The complexity-metrics phase still uses native tree-sitter for -cyclomatic-complexity counting. On Node 24 (or Node 22 without the -native opt-in) it degrades with a one-shot stderr warning; all other -parsing continues through the WASM path. ADR -`docs/adr/0013-parse-runtime-wasm-default.md` covers the decision. +The runtime is `web-tree-sitter` (WASM) on Node 20, 22, and 24 — the +only supported parse runtime. All 15 grammar `.wasm` blobs are vendored +at `packages/ingestion/vendor/wasms/`, built from the grammar sources +pinned in `package.json`; rebuild via `bash scripts/build-vendor-wasms.sh` +after a grammar bump. Re-vendoring is a one-shot operation; consumers +never build grammars at install time. + +The complexity-metrics phase is also WASM-backed, so cyclomatic- +complexity counting runs on every install instead of degrading to a +no-op. ADR `docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md` +covers the WASM-only decision; ADR 0013 records the prior posture and +is now superseded. 
`ParseCapture` is the shared per-capture schema emitted by the worker — one interface with 7 readonly fields: diff --git a/packages/docs/src/content/docs/guides/indexing-a-repo.md b/packages/docs/src/content/docs/guides/indexing-a-repo.md index 46b9344e..ab25fe2b 100644 --- a/packages/docs/src/content/docs/guides/indexing-a-repo.md +++ b/packages/docs/src/content/docs/guides/indexing-a-repo.md @@ -127,8 +127,6 @@ Everything else — embeddings, summaries, skills — is opt-in. When enabled, the budget is capped by `--max-summaries`, default `auto` = 10% of callables, hard cap 500. - `--skills` — generate Claude Code skills from the graph. -- `--native-parser` — opt into the native tree-sitter N-API addon on - Node 22 (the default runtime is `web-tree-sitter` / WASM). - `--strict-detectors` — fail the build if a detector (DET-O-001) regresses. - `--verbose` — noisier logs. diff --git a/packages/docs/src/content/docs/guides/troubleshooting.md b/packages/docs/src/content/docs/guides/troubleshooting.md index bb72c104..84098f5b 100644 --- a/packages/docs/src/content/docs/guides/troubleshooting.md +++ b/packages/docs/src/content/docs/guides/troubleshooting.md @@ -1,16 +1,15 @@ --- title: Troubleshooting -description: Fix native build failures, stale indexes, ambiguous-repo errors, and Windows quirks. +description: Fix install failures, stale indexes, ambiguous-repo errors, and Windows quirks. sidebar: order: 90 --- ## Native build failures -Symptoms: `pnpm install` fails while building `@duckdb/node-api` or -the optional native tree-sitter N-API addon. Error mentions -`node-gyp`, `python`, a C/C++ compiler, or `Visual Studio Build -Tools`. +Symptoms: `pnpm install` fails while building `@duckdb/node-api`. Error +mentions `node-gyp`, `python`, a C/C++ compiler, or `Visual Studio +Build Tools`. Fix: @@ -22,12 +21,11 @@ codehub doctor whether each native module can load. Follow the remediation hints it prints. 
-The default parse runtime is `web-tree-sitter` (WASM) on both Node 22 -and Node 24, so a missing C/C++ toolchain does not break analyze -itself — only the optional native opt-in via `OCH_NATIVE_PARSER=1` is -affected. `@duckdb/node-api` has a native binding requirement on the -single-file DuckDB fallback; if it cannot load, set `CODEHUB_STORE=lbug` -to use LadybugDB instead, which has its own platform packages. +The parse runtime is `web-tree-sitter` (WASM) on every supported Node +version, so a missing C/C++ toolchain does not break analyze itself. +`@duckdb/node-api` has a native binding requirement on the single-file +DuckDB fallback; if it cannot load, set `CODEHUB_STORE=lbug` to use +LadybugDB instead, which has its own platform packages. ## Stale index @@ -75,9 +73,9 @@ If you must stay on native Windows: 3. `npm config set msvs_version 2022` and `npm config set python python3.12`. 4. Re-run `pnpm install --frozen-lockfile`. -5. The default parse runtime is WASM, so analyze itself should work - without the native toolchain — only `@duckdb/node-api` and the - optional `OCH_NATIVE_PARSER=1` native addon need a native build. +5. The parse runtime is WASM, so analyze itself should work without + the native toolchain — only `@duckdb/node-api` (single-file DuckDB + fallback) needs a native build. ## The index is missing a language I expected diff --git a/packages/docs/src/content/docs/reference/cli.md b/packages/docs/src/content/docs/reference/cli.md index 8d5e8e73..3be1d12e 100644 --- a/packages/docs/src/content/docs/reference/cli.md +++ b/packages/docs/src/content/docs/reference/cli.md @@ -37,7 +37,6 @@ codehub analyze [path] | `--max-summaries ` | `auto` (10% of SCIP-confirmed callables, cap 500) | Summary budget. | | `--summary-model ` | — | Override the Bedrock summary model id. | | `--skills` | off | Emit one `SKILL.md` per Community (≥5 symbols) under `.codehub/skills/`. | -| `--native-parser` | off | Opt into the native tree-sitter N-API addon (Node 22). 
Default is the WASM runtime. | | `--strict-detectors` | off | Drop heuristic-only matches from route / ORM detectors (DET-O-001). | | `--allow-build-scripts ` | — | Comma-separated build-script opt-ins (e.g. `proleap` for the JVM COBOL deep-parse). | diff --git a/packages/docs/src/content/docs/reference/configuration.md b/packages/docs/src/content/docs/reference/configuration.md index 3507645d..95d380e1 100644 --- a/packages/docs/src/content/docs/reference/configuration.md +++ b/packages/docs/src/content/docs/reference/configuration.md @@ -26,12 +26,14 @@ interface segregation. ### Parse runtime -| Variable | Purpose | -|---|---| -| `OCH_NATIVE_PARSER` | Set to `1` on Node 22 to opt into the native `tree-sitter` N-API addon. The default runtime on Node 22 and Node 24 is `web-tree-sitter` (WASM). | - -The `--native-parser` CLI flag is equivalent. ADR -0013-parse-runtime-wasm-default records the WASM-default decision. +`web-tree-sitter` (WASM) is the only parse runtime on Node 20, 22, and +24. There is no env var or CLI flag to switch parsers — the native +`tree-sitter` N-API addon was removed in 0.4.0. The CLI emits a +one-shot stderr advisory if a stale legacy env var is set, then ignores +it; consult the CHANGELOG and ADR 0015 for the variable name and +migration notes. ADR 0013 records the prior WASM-default + native-opt-in +posture and is superseded by ADR 0015 +(`docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md`). ### Embedding backends diff --git a/packages/docs/src/content/docs/reference/languages.md b/packages/docs/src/content/docs/reference/languages.md index f7ca2740..09d56627 100644 --- a/packages/docs/src/content/docs/reference/languages.md +++ b/packages/docs/src/content/docs/reference/languages.md @@ -39,35 +39,27 @@ COBOL is also indexed (regex hot path; the `cobol` provider is a stub). Add `--allow-build-scripts proleap` to opt into the JVM ProLeap deep-parse. 
-## Native bindings and the WASM default - -The default parse runtime on Node 22 and Node 24 is -`web-tree-sitter` (WASM). It has no native ABI dependency, so it works -on every supported Node version out of the box. - -The native `tree-sitter` N-API addon is available as an opt-in path -on Node 22, where it is measurably faster on large repos. Enable it -with the env var or CLI flag: - -```bash title="opt into native parsing on Node 22" -OCH_NATIVE_PARSER=1 codehub analyze -# or -codehub analyze --native-parser -``` - -Native is unavailable on Node 24 until `node-tree-sitter@0.25.1` lands -on npm (tree-sitter/node-tree-sitter#276). Kotlin, Swift, and Dart -ship their grammars as `.wasm` blobs vendored at -`packages/ingestion/vendor/wasms/` regardless of the runtime -selection — those grammars do not have prebuilt N-API addons on npm. - -The complexity-metrics ingestion phase still uses native tree-sitter -for cyclomatic-complexity counting. On Node 24 (or Node 22 without the -opt-in) it degrades with a one-shot stderr warning; all other -parsing continues via WASM. - -ADR 0013 (`docs/adr/0013-parse-runtime-wasm-default.md`) explains the -rationale. +## Parse runtime — WASM-only + +The parse runtime is `web-tree-sitter` (WASM) on Node 20, 22, and 24. +WASM has no native ABI dependency, so it works on every supported Node +version out of the box and `npm install -g @opencodehub/cli@latest` does +zero native builds. + +All 15 GA grammar `.wasm` blobs are vendored at +`packages/ingestion/vendor/wasms/`, built from the grammar sources +pinned in `package.json`. Re-vendoring is a one-shot operation via +`bash scripts/build-vendor-wasms.sh`; consumers never build grammars at +install time. + +The complexity-metrics ingestion phase is also WASM-backed, so +cyclomatic-complexity counting runs on every install instead of +degrading to a no-op. 
+ +ADR 0015 (`docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md`) +explains the rationale and the bulletproof-install plan; ADR 0013 +records the prior WASM-default + native-opt-in posture and is now +superseded. ## Adding a language diff --git a/packages/docs/src/content/docs/start-here/install.md b/packages/docs/src/content/docs/start-here/install.md index 6a25d1e2..b4529b8e 100644 --- a/packages/docs/src/content/docs/start-here/install.md +++ b/packages/docs/src/content/docs/start-here/install.md @@ -7,12 +7,12 @@ sidebar: ## Requirements -- **OS:** macOS, Linux, or Windows (Windows users should prefer WSL; native - Windows works if you have the MSVC build tools and `node-gyp` dependencies - for the optional native tree-sitter addon). -- **Node.js:** Node 22 (with the optional native tree-sitter path) or - Node 24 (WASM-only). The default parse runtime is `web-tree-sitter` - on both versions; native is opt-in via `OCH_NATIVE_PARSER=1`. +- **OS:** macOS, Linux, or Windows. WSL is recommended on Windows for + parity with the Linux dev path, but native Windows now works without + the MSVC build chain because OpenCodeHub does no native compilation + at install time. +- **Node.js:** Node 20, 22, or 24. The parse runtime is `web-tree-sitter` + (WASM) on every supported version — there is no native opt-in (ADR 0015). - **pnpm:** `>=10.0.0` (the workspace lockfile is generated with 10.33.2). - **Python 3.12:** optional, only used by auxiliary tooling (the harness packages do not ship as runtime dependencies). Not required @@ -109,7 +109,6 @@ node packages/cli/dist/index.js doctor | Variable | Default | Effect | |---|---|---| -| `OCH_NATIVE_PARSER` | unset | Set to `1` on Node 22 to opt into the native tree-sitter N-API addon. Leave unset to use the WASM default. | | `CODEHUB_STORE` | unset | `lbug` (force LadybugDB), `duck` (force the single-file DuckDB layout), or unset (auto-probe — LadybugDB when `@ladybugdb/core` is importable, otherwise DuckDB). 
| | `OCH_VERBOSE` | unset | Set to `1` to surface the storage-backend probe advisory in non-TTY environments. | diff --git a/packages/docs/src/content/docs/start-here/what-is-opencodehub.md b/packages/docs/src/content/docs/start-here/what-is-opencodehub.md index 678286c4..4be36709 100644 --- a/packages/docs/src/content/docs/start-here/what-is-opencodehub.md +++ b/packages/docs/src/content/docs/start-here/what-is-opencodehub.md @@ -63,9 +63,10 @@ call, not ten round-trips. - **Deterministic code-pack.** `pack_codebase` (MCP) and `codehub code-pack` produce a reproducible 9-item BOM signed by the release workflow. -- **WASM-default parsing.** `web-tree-sitter` is the default runtime on - Node 22 and Node 24; opt into the native N-API addon with - `OCH_NATIVE_PARSER=1` on Node 22 dev boxes. +- **WASM-only parsing.** `web-tree-sitter` is the only parse runtime on + Node 20, 22, and 24, with all 15 grammar `.wasm` blobs vendored in the + `@opencodehub/ingestion` tarball. `npm install -g @opencodehub/cli@latest` + does zero native builds and zero GitHub fetches (ADR 0015). ## When to reach for OpenCodeHub diff --git a/packages/ingestion/README.md b/packages/ingestion/README.md index 65ec1d2a..3f10f229 100644 --- a/packages/ingestion/README.md +++ b/packages/ingestion/README.md @@ -1,8 +1,9 @@ # @opencodehub/ingestion The indexing pipeline. Walks a repo, extracts symbols and edges via -tree-sitter (WASM by default, native opt-in), then runs a 30-phase DAG -that emits the graph and supporting artifacts under `/.codehub/`. +`web-tree-sitter` (WASM, the only parse runtime), then runs a 30-phase +DAG that emits the graph and supporting artifacts under +`/.codehub/`. ## Surface @@ -21,9 +22,9 @@ await runIngestion({ (`packages/ingestion/src/pipeline/phases/default-set.ts:14-17`). - The runner validates the DAG (missing dependencies, cycles) on every invocation (`packages/ingestion/src/pipeline/runner.ts`). 
-- Parse runtime defaults to `web-tree-sitter` (WASM); set - `OCH_NATIVE_PARSER=1` to opt into native on Node 22 (root `CLAUDE.md`, - Parse runtime section). +- Parse runtime is `web-tree-sitter` (WASM) on every supported Node + version. Grammar `.wasm` blobs are vendored at `vendor/wasms/` (root + `CLAUDE.md`, Parse runtime section; ADR 0015). ## Phases @@ -53,10 +54,11 @@ group by what they read from the repo or graph. `confidence-demote` drops the unconfirmed survivors to 0.2 with a `+scip-unconfirmed` reason suffix (`packages/ingestion/src/pipeline/phases/default-set.ts:90-95`). -- **Dual parser runtime** — WASM is the default for cross-platform - determinism; the native N-API addon is opt-in for Node 22 dev boxes. - The `complexity` phase still requires native and degrades with a - one-shot stderr warning otherwise (root `CLAUDE.md`). +- **Single parser runtime** — `web-tree-sitter` (WASM) on every + supported Node version, with grammar `.wasm` blobs vendored at + `vendor/wasms/` for cross-platform determinism. The `complexity` + phase is also WASM-backed, so cyclomatic-complexity metrics run on + every install (root `CLAUDE.md`; ADR 0015). - **Silent toggles** — `summarize`, `embeddings`, `sbom`, and the scanner phase are no-ops unless their option is on, so a default `analyze` writes only the deterministic graph. 
diff --git a/packages/ingestion/package.json b/packages/ingestion/package.json index 32f98ac2..53ca0919 100644 --- a/packages/ingestion/package.json +++ b/packages/ingestion/package.json @@ -35,7 +35,8 @@ "scripts": { "build": "tsc -b", "test": "node --test './dist/**/*.test.js'", - "clean": "rm -rf dist *.tsbuildinfo" + "clean": "rm -rf dist *.tsbuildinfo", + "prepublishOnly": "node scripts/verify-vendor-wasms.mjs" }, "dependencies": { "@apidevtools/swagger-parser": "12.1.0", @@ -56,20 +57,6 @@ "piscina": "5.1.4", "snyk-nodejs-lockfile-parser": "2.7.1", "spdx-correct": "^3.2.0", - "tree-sitter": "0.25.0", - "tree-sitter-c": "0.24.1", - "tree-sitter-c-sharp": "0.23.5", - "tree-sitter-cpp": "0.23.4", - "tree-sitter-go": "0.25.0", - "tree-sitter-java": "0.23.5", - "tree-sitter-javascript": "0.25.0", - "tree-sitter-kotlin": "0.3.8", - "tree-sitter-php": "0.24.2", - "tree-sitter-python": "0.25.0", - "tree-sitter-ruby": "0.23.1", - "tree-sitter-rust": "0.24.0", - "tree-sitter-swift": "0.7.1", - "tree-sitter-typescript": "0.23.2", "web-tree-sitter": "0.26.8", "write-file-atomic": "8.0.0" }, @@ -103,6 +90,6 @@ "pipeline" ], "engines": { - "node": ">=22.0.0" + "node": ">=20.0.0" } } diff --git a/packages/ingestion/scripts/verify-vendor-wasms.mjs b/packages/ingestion/scripts/verify-vendor-wasms.mjs new file mode 100644 index 00000000..03e013b4 --- /dev/null +++ b/packages/ingestion/scripts/verify-vendor-wasms.mjs @@ -0,0 +1,155 @@ +#!/usr/bin/env node +// Pre-publish gate: assert vendor/wasms/ ships every WASM the runtime needs. +// +// Exits non-zero on any of: +// - missing or empty .wasm file +// - invalid WASM magic bytes (\0asm) +// - manifest.json grammar version drift vs. packages/ingestion/package.json +// +// Run as `prepublishOnly` script in packages/ingestion/package.json. 
+import fs from "node:fs"; +import path from "node:path"; +import { fileURLToPath } from "node:url"; + +const __filename = fileURLToPath(import.meta.url); +const SCRIPT_DIR = path.dirname(__filename); +const PKG_DIR = path.resolve(SCRIPT_DIR, ".."); +const VENDOR_DIR = path.resolve(PKG_DIR, "vendor", "wasms"); +const MANIFEST = path.resolve(VENDOR_DIR, "manifest.json"); +const PJ = path.resolve(PKG_DIR, "package.json"); + +// 16 expected files: 15 grammar wasms + web-tree-sitter runtime wasm. +const EXPECTED = [ + "web-tree-sitter.wasm", + "tree-sitter-typescript.wasm", + "tree-sitter-tsx.wasm", + "tree-sitter-javascript.wasm", + "tree-sitter-python.wasm", + "tree-sitter-go.wasm", + "tree-sitter-rust.wasm", + "tree-sitter-java.wasm", + "tree-sitter-c_sharp.wasm", + "tree-sitter-c.wasm", + "tree-sitter-cpp.wasm", + "tree-sitter-ruby.wasm", + "tree-sitter-php_only.wasm", + "tree-sitter-kotlin.wasm", + "tree-sitter-swift.wasm", + "tree-sitter-dart.wasm", +]; + +// WASM binary magic: \0 a s m +const WASM_MAGIC = Buffer.from([0x00, 0x61, 0x73, 0x6d]); + +const errors = []; + +// 1. Every expected wasm exists, non-empty, has valid magic. +// Single open() per file avoids the existsSync→statSync→openSync TOCTOU +// pattern (CodeQL "potential filesystem race condition"); errno ENOENT / +// short reads / bad magic each surface as one diagnostic. +for (const name of EXPECTED) { + const p = path.resolve(VENDOR_DIR, name); + let fd; + try { + fd = fs.openSync(p, "r"); + } catch (err) { + if (err && err.code === "ENOENT") { + errors.push(`missing: ${name}`); + } else { + errors.push(`open failed: ${name} (${err && err.code ? 
err.code : err})`); + } + continue; + } + try { + const buf = Buffer.alloc(4); + const bytesRead = fs.readSync(fd, buf, 0, 4, 0); + if (bytesRead < 4) { + errors.push(`too small (${bytesRead} bytes): ${name}`); + } else if (!buf.equals(WASM_MAGIC)) { + errors.push(`invalid WASM magic in ${name}: got ${buf.toString("hex")}`); + } + } finally { + fs.closeSync(fd); + } +} + +// 2. manifest.json exists and matches package.json grammar pins. +// Read manifest with fs.readFileSync directly; failure surfaces as one error. +let manifestText; +try { + manifestText = fs.readFileSync(MANIFEST, "utf8"); +} catch (err) { + if (err && err.code === "ENOENT") { + errors.push(`missing manifest: ${MANIFEST}`); + } else { + errors.push(`manifest read failed: ${MANIFEST} (${err && err.code ? err.code : err})`); + } + manifestText = null; +} +if (manifestText !== null) { + const manifest = JSON.parse(manifestText); + const pj = JSON.parse(fs.readFileSync(PJ, "utf8")); + const declared = { ...(pj.dependencies || {}), ...(pj.devDependencies || {}) }; + + if (manifest.schema !== "opencodehub.vendor-wasms.v1") { + errors.push(`unexpected manifest schema: ${manifest.schema}`); + } + + // The manifest is the source of truth for grammar versions. Native + // tree-sitter and grammar packages are NOT workspace devDeps anymore — + // they're installed on demand by scripts/build-vendor-wasms.sh during + // re-vendoring. For each grammar, assert the manifest carries a version + // string; if package.json happens to also declare it (during a vendor + // run), the two must match. 
+ const checked = [ + "tree-sitter", + "tree-sitter-typescript", + "tree-sitter-javascript", + "tree-sitter-python", + "tree-sitter-go", + "tree-sitter-rust", + "tree-sitter-java", + "tree-sitter-c-sharp", + "tree-sitter-c", + "tree-sitter-cpp", + "tree-sitter-ruby", + "tree-sitter-php", + "tree-sitter-kotlin", + "tree-sitter-swift", + "web-tree-sitter", + ]; + for (const name of checked) { + const manifestV = manifest.grammars?.[name]; + if (!manifestV) { + errors.push(`${name}: missing from manifest.grammars`); + continue; + } + const declaredV = declared[name] + ? String(declared[name]).replace(/^[\^~=]/, "") + : null; + if (declaredV !== null && declaredV !== manifestV) { + errors.push( + `${name}: package.json pins ${declaredV} but manifest.json records ${manifestV} — re-run scripts/build-vendor-wasms.sh`, + ); + } + } + + // tree-sitter-dart never had a corresponding npm package; it's vendored + // historically. Accept the marker. + const dartV = manifest.grammars?.["tree-sitter-dart"]; + if (dartV !== "vendored-historically") { + errors.push( + `tree-sitter-dart: manifest expected "vendored-historically", got ${dartV ?? 
"(missing)"}`, + ); + } +} + +if (errors.length > 0) { + console.error("verify-vendor-wasms.mjs FAILED:"); + for (const e of errors) console.error(` - ${e}`); + console.error(""); + console.error(`Total: ${errors.length} error(s)`); + process.exit(1); +} + +console.log(`verify-vendor-wasms.mjs OK (${EXPECTED.length} wasm files, manifest matches package.json pins)`); diff --git a/packages/ingestion/src/parse/grammar-registry.test.ts b/packages/ingestion/src/parse/grammar-registry.test.ts index aefb49c1..0cf0068e 100644 --- a/packages/ingestion/src/parse/grammar-registry.test.ts +++ b/packages/ingestion/src/parse/grammar-registry.test.ts @@ -11,22 +11,23 @@ import { import { getUnifiedQuery } from "./unified-queries.js"; describe("grammar-registry", () => { - it("lazy-loads TypeScript and caches by identity", async () => { + it("returns a typescript handle with non-empty query text", async () => { _resetGrammarCacheForTests(); - const first = await loadGrammar("typescript"); - const second = await loadGrammar("typescript"); - assert.equal(first, second, "second call should return the cached handle"); - assert.equal(first.language, "typescript"); - assert.ok(first.tsLanguage, "tree-sitter language object should be truthy"); - assert.equal(first.queryText, getUnifiedQuery("typescript")); + const h = await loadGrammar("typescript"); + assert.equal(h.language, "typescript"); + assert.equal(h.queryText, getUnifiedQuery("typescript")); + assert.ok(h.queryText.length > 0); }); it("returns distinct handles for typescript vs tsx", async () => { _resetGrammarCacheForTests(); const ts = await loadGrammar("typescript"); const tsx = await loadGrammar("tsx"); - assert.notEqual(ts, tsx); - assert.notEqual(ts.tsLanguage, tsx.tsLanguage); + assert.equal(ts.language, "typescript"); + assert.equal(tsx.language, "tsx"); + // queryText may match across the two TS variants (they share the unified + // query); the discriminating field is `language`. 
+ assert.notEqual(ts.language, tsx.language); }); it("loads python, go, rust, java, javascript", async () => { @@ -34,26 +35,25 @@ describe("grammar-registry", () => { for (const lang of ["python", "go", "rust", "java", "javascript"] as const) { const h = await loadGrammar(lang); assert.equal(h.language, lang); - assert.ok(h.tsLanguage, `${lang} tsLanguage should be loaded`); assert.ok(h.queryText.length > 0, `${lang} queryText should be non-empty`); } }); - it("loads c# via dynamic import path", async () => { + it("loads csharp", async () => { _resetGrammarCacheForTests(); const h = await loadGrammar("csharp"); assert.equal(h.language, "csharp"); - assert.ok(h.tsLanguage, "csharp Language object should load"); + assert.ok(h.queryText.length > 0); }); - it("preloadGrammars is idempotent", async () => { + it("preloadGrammars is callable and idempotent", async () => { _resetGrammarCacheForTests(); await preloadGrammars(["typescript", "python"]); - // second preload hits cache + // second preload is a no-op-equivalent; the resolver is pure await preloadGrammars(["typescript", "python"]); const a = await loadGrammar("typescript"); const b = await loadGrammar("typescript"); - assert.equal(a, b); + assert.deepEqual(a, b); }); it("classifies cobol as a regex-provider language", () => { @@ -81,23 +81,13 @@ describe("grammar-registry", () => { assert.equal(sha, null, "cobol has no grammar package — sha should be null"); }); - it("loads extended-language grammars when the native bindings are installed", async () => { - // 7 additional grammars (c, cpp, ruby, kotlin, swift, php, dart). Some - // of them (notably kotlin without prebuilds, dart via git+ssh) may fail - // to build on exotic platforms or restricted CI. We treat a load failure - // as "skip this grammar" — the registry itself must not crash. 
+ it("loads handles for extended-language grammars", async () => { _resetGrammarCacheForTests(); const langs = ["c", "cpp", "ruby", "kotlin", "swift", "php", "dart"] as const; for (const lang of langs) { - try { - const h = await loadGrammar(lang); - assert.equal(h.language, lang); - assert.ok(h.tsLanguage, `${lang}: tree-sitter Language should be non-null`); - } catch (err) { - // Skip: native binding missing on this platform (acceptable). - // Print once so CI diagnostics surface the gap. - console.warn(`[grammar-registry.test] skip ${lang}: ${(err as Error).message}`); - } + const h = await loadGrammar(lang); + assert.equal(h.language, lang); + assert.ok(h.queryText.length > 0, `${lang}: queryText should be non-empty`); } }); }); diff --git a/packages/ingestion/src/parse/grammar-registry.ts b/packages/ingestion/src/parse/grammar-registry.ts index b38739d9..eb4603aa 100644 --- a/packages/ingestion/src/parse/grammar-registry.ts +++ b/packages/ingestion/src/parse/grammar-registry.ts @@ -1,22 +1,24 @@ /** - * Lazy grammar loader. + * Lightweight grammar metadata registry. * - * Imports the native tree-sitter grammar modules on demand — the first call - * to `loadGrammar('python')` pulls in `tree-sitter-python`, subsequent calls - * hit the in-process cache. This keeps the cold-start cost of the parse - * subsystem low: importing `grammar-registry` alone does not load any grammar - * `.node` file. + * In the WASM-only world, the parse-worker resolves grammar `.wasm` blobs + * directly from `vendor/wasms/` (see `wasm-runtime.ts`); there is no + * native `Language` object to require() or cache. 
This module retains + * three responsibilities: * - * Each grammar package exposes its tree-sitter `Language` object differently: - * - typescript: module has `.typescript` and `.tsx` properties - * - javascript/python/go/go/java/rust: module IS the Language - * - c-sharp: ESM default export IS the Language - * - c, cpp, ruby, kotlin, swift: module IS the Language (CJS require) - * - php: module has `.php` and `.php_only` — we load `.php_only` (pure PHP, - * no HTML template injection; better for static analysis) - * - dart: git-pinned CJS module that IS the Language - * - * This module abstracts those differences behind {@link loadGrammar}. + * 1. Mark languages by provider kind (`tree-sitter` vs `regex`) so + * callers can route COBOL files through the regex extractor. + * 2. Surface a tiny `GrammarHandle` carrying the unified S-expression + * query text used by the worker pool's secondary consumers (rare — + * most consumers go through `getUnifiedQuery` directly). + * 3. Compute a stable per-grammar SHA from the package manifest pinned + * in `pnpm-lock.yaml`, used as a parse-cache key. The SHA still + * derives from the npm `tree-sitter-` package's `package.json` + * because that's the canonical version pin — the workspace keeps + * these as `devDependencies` so the manifests resolve in dev. + * Returns `null` when the package is not installed (e.g. on a + * consumer-of-the-published-package install path), which disables + * parse-cache keying for that language. * * ## Regex-provider escape hatch * @@ -25,24 +27,41 @@ * that split with a {@link LanguageProviderSpec} discriminated union: * * - `{ kind: "tree-sitter", package: string }` — the classic path; the - * grammar package is resolved lazily from npm and hashed into the - * parse-cache key via {@link getGrammarSha}. + * grammar package name is used as the parse-cache fingerprint via + * {@link getGrammarSha}. 
* - `{ kind: "regex" }` — the escape hatch; {@link loadGrammar} refuses * to build a `GrammarHandle`, {@link getGrammarSha} returns `null` * (disables parse-cache keying), and upstream parse-phase code is * expected to route the file through the language-specific regex * extractor instead of the worker pool. - * - * This keeps every tree-sitter consumer of the registry working unchanged - * while giving downstream code a typed way to detect regex-only languages. */ -import { createRequire } from "node:module"; +import { readFile } from "node:fs/promises"; +import { fileURLToPath } from "node:url"; import { sha256Hex } from "@opencodehub/core-types"; import type { LanguageId } from "./types.js"; import { getUnifiedQuery } from "./unified-queries.js"; -const requireFn = createRequire(import.meta.url); +// `vendor/wasms/manifest.json` is the canonical version pin for every grammar +// after native tree-sitter left the workspace. Path resolves at runtime from +// the built `dist/parse/grammar-registry.js` location. +const MANIFEST_PATH = fileURLToPath(new URL("../../vendor/wasms/manifest.json", import.meta.url)); + +let manifestCache: Promise | null> | null = null; + +async function loadManifestVersions(): Promise | null> { + if (manifestCache) return manifestCache; + manifestCache = (async () => { + try { + const text = await readFile(MANIFEST_PATH, "utf8"); + const json = JSON.parse(text) as { readonly grammars?: Record }; + return json.grammars ?? null; + } catch { + return null; + } + })(); + return manifestCache; +} /** * Provider spec for a single language. Discriminated on `kind`: @@ -55,8 +74,7 @@ const requireFn = createRequire(import.meta.url); * (see {@link getGrammarSha}). * * Named `LanguageProviderSpec` to avoid colliding with the broader - * `LanguageProvider` interface in `providers/types.ts` (which covers - * extract-* hooks, MRO strategy, and other provider-wide behavior). + * `LanguageProvider` interface in `providers/types.ts`. 
*/ export type LanguageProviderSpec = | { readonly kind: "tree-sitter"; readonly package: string } @@ -66,15 +84,6 @@ export type LanguageProviderSpec = * Per-language provider spec. `satisfies Record` keeps this * 1:1 with the `LanguageId` union at compile time — adding a new language * without an entry here fails the type check. - * - * Tree-sitter entries carry the npm grammar package name. The content- - * addressed parse cache hashes `{ name, version }` from that package's - * `package.json`, so a grammar version bump in the workspace lockfile - * invalidates the cache cleanly. - * - * Regex entries (currently only `cobol`) carry no package reference — - * {@link loadGrammar} and {@link getGrammarSha} treat them as a marker - * that the caller must dispatch through the language's regex extractor. */ const LANGUAGE_PROVIDERS = { typescript: { kind: "tree-sitter", package: "tree-sitter-typescript" }, @@ -112,36 +121,24 @@ export function isRegexProviderLanguage(lang: LanguageId): boolean { return LANGUAGE_PROVIDERS[lang].kind === "regex"; } -/** Opaque wrapper holding everything a worker needs for one language. */ +/** Opaque wrapper holding the per-language metadata callers need. */ export interface GrammarHandle { readonly language: LanguageId; - /** tree-sitter Language object (opaque to callers). */ - readonly tsLanguage: unknown; /** Unified S-expression query body for this language. */ readonly queryText: string; } -const cache = new Map(); -// De-dupe concurrent calls for the same language so we only require() once. -const inflight = new Map>(); - // Per-process memoization of grammar SHAs — the value is stable for the // lifetime of the process (resolving + hashing a package.json is cheap but // not free, and scan() calls this per-file). const grammarShaCache = new Map(); /** - * Load and cache the tree-sitter grammar for a language. 
- * - * Thread/context note: the cache is per-module-instance, so in the - * piscina worker model each worker has its own cache — which matches - * tree-sitter's thread-safety rules (one Parser per worker_thread). - * - * Regex-provider languages (see {@link isRegexProviderLanguage}) throw - * on entry: they have no tree-sitter grammar to load, and reaching this - * function means the caller skipped the `kind === "regex"` dispatch - * guard. That is a bug on the call site, not a runtime condition to - * recover from. + * Return a {@link GrammarHandle} for `lang`. After the WASM-only refactor + * this is a thin object carrying just the language id and its unified + * query text — there is no native `Language` to load. Refuses regex-only + * languages so callers that should have routed through the regex extractor + * see a hard error rather than a silently broken handle. */ export async function loadGrammar(lang: LanguageId): Promise { const spec = LANGUAGE_PROVIDERS[lang]; @@ -151,131 +148,19 @@ export async function loadGrammar(lang: LanguageId): Promise { `route the file through the language's regex extractor instead.`, ); } - const cached = cache.get(lang); - if (cached !== undefined) { - return cached; - } - const existing = inflight.get(lang); - if (existing !== undefined) { - return existing; - } - const p = doLoad(lang).then((handle) => { - cache.set(lang, handle); - inflight.delete(lang); - return handle; - }); - inflight.set(lang, p); - return p; + return { language: lang, queryText: getUnifiedQuery(lang) }; } /** * Preload a list of grammars in parallel. Useful as a warm-up hint during * indexing start-up, but not required — {@link loadGrammar} is safe to call - * lazily during parsing. + * lazily during parsing. Retained as a callable no-op-style API so existing + * pipeline orchestration keeps working. 
*/ export async function preloadGrammars(langs: readonly LanguageId[]): Promise { await Promise.all(langs.map((l) => loadGrammar(l))); } -async function doLoad(lang: LanguageId): Promise { - const tsLanguage = await loadLanguageObject(lang); - return { - language: lang, - tsLanguage, - queryText: getUnifiedQuery(lang), - }; -} - -/** - * Resolve the Language object for each grammar, handling per-package quirks. - * Returned value is passed straight into `parser.setLanguage()`. - */ -async function loadLanguageObject(lang: LanguageId): Promise { - switch (lang) { - case "typescript": { - const mod = requireFn("tree-sitter-typescript") as { - typescript: unknown; - tsx: unknown; - }; - return mod.typescript; - } - case "tsx": { - const mod = requireFn("tree-sitter-typescript") as { - typescript: unknown; - tsx: unknown; - }; - return mod.tsx; - } - case "javascript": - return requireFn("tree-sitter-javascript"); - case "python": - return requireFn("tree-sitter-python"); - case "go": - return requireFn("tree-sitter-go"); - case "rust": - return requireFn("tree-sitter-rust"); - case "java": - return requireFn("tree-sitter-java"); - case "csharp": { - // tree-sitter-c-sharp is ESM-only; use dynamic import. The default - // export is the Language binding. - const mod = (await import("tree-sitter-c-sharp")) as { default: unknown }; - return mod.default; - } - case "c": - // tree-sitter-c 0.24.1 — canonical tree-sitter-org CJS grammar, ships - // prebuilds for 6 platforms. Module IS the Language. - return requireFn("tree-sitter-c"); - case "cpp": - // tree-sitter-cpp 0.23.4 — extends tree-sitter-c; prebuilds shipped. - return requireFn("tree-sitter-cpp"); - case "ruby": - // tree-sitter-ruby 0.23.1 — prebuilds shipped. Module IS the Language. - return requireFn("tree-sitter-ruby"); - case "kotlin": - // tree-sitter-kotlin 0.3.8 (fwcd) — NO prebuilds on npm; install-time - // node-gyp build is expected. 
If the native binary is missing on an - // exotic platform, require() throws and callers surface the error. - return requireFn("tree-sitter-kotlin"); - case "swift": - // tree-sitter-swift 0.7.1 (alex-pinkus) — ships prebuilds but also has - // a postinstall rebuild (~30s one-time). Runtime-transparent. - return requireFn("tree-sitter-swift"); - case "php": { - // tree-sitter-php 0.24.2 ships TWO grammars in one package: - // - `.php`: pure PHP with HTML injection (for .blade.php, .phtml etc.) - // - `.php_only`: pure PHP without HTML injection - // We load `.php_only` — static analysis cares about PHP code, not HTML. - const mod = requireFn("tree-sitter-php") as { - php: unknown; - php_only: unknown; - }; - return mod.php_only; - } - case "dart": - // Dart is WASM-only on the public package — see vendor/wasms/. - // Removed the git-pinned tree-sitter-dart dependency in 0.2.x because - // npm consumers couldn't `npm install -g @opencodehub/cli` (npm tries - // to git-clone + node-gyp the pin and fails on machines without a - // C++ toolchain). Native opt-in (`OCH_NATIVE_PARSER=1`) is unsupported - // for Dart on the registry build; clear the env var to use the WASM - // path that ships with the published package. - throw new Error( - "tree-sitter-dart is not bundled as a native binding in published builds; " + - "Dart parsing uses the vendored WASM grammar. " + - "Unset OCH_NATIVE_PARSER (or omit --native-parser) to use the WASM path.", - ); - case "cobol": - // Guarded at the `loadGrammar` entry point via the provider-kind - // discriminator; a direct call to `loadLanguageObject("cobol")` - // indicates a caller bypassed that guard. Keep the branch so - // TypeScript's exhaustiveness check passes. - throw new Error( - "loadLanguageObject: cobol is a regex-provider language (no tree-sitter grammar)", - ); - } -} - /** * Compute a stable SHA for the grammar backing `lang`. 
The SHA is derived * from `sha256(JSON.stringify({ name, version }))` of the grammar's @@ -284,8 +169,9 @@ async function loadLanguageObject(lang: LanguageId): Promise { * needs in its composite key. * * Returns `null` when: - * - the grammar package is not installed (e.g. languages whose provider - * track has not landed yet), OR + * - the grammar package is not installed (e.g. on a consumer install + * path where the native packages are devDependencies of the source + * repo only), OR * - the package.json could not be read/parsed. * * Result is memoized per-process. Idempotent across concurrent callers. @@ -304,33 +190,19 @@ export async function getGrammarSha(lang: LanguageId): Promise { } async function computeGrammarSha(pkgName: string): Promise { - // `require.resolve('/package.json')` returns the absolute path of the - // package's manifest without executing any of its bindings — safe to call - // even for grammars that fail to build natively at install time. - let manifestPath: string; - try { - manifestPath = requireFn.resolve(`${pkgName}/package.json`); - } catch { - return null; - } - let manifest: { readonly name?: unknown; readonly version?: unknown }; - try { - const { readFile } = await import("node:fs/promises"); - const text = await readFile(manifestPath, "utf8"); - manifest = JSON.parse(text) as { readonly name?: unknown; readonly version?: unknown }; - } catch { - return null; - } - const name = typeof manifest.name === "string" ? manifest.name : pkgName; - const version = typeof manifest.version === "string" ? manifest.version : ""; - if (version === "") return null; + // The grammar version comes from the vendored manifest.json, which is + // committed alongside the .wasm files and updated atomically by + // scripts/build-vendor-wasms.sh. This avoids requiring the npm grammar + // packages to be installed at runtime — they're not workspace deps. 
+ const versions = await loadManifestVersions(); + if (!versions) return null; + const version = versions[pkgName]; + if (typeof version !== "string" || version === "") return null; // Canonical JSON-like form so the SHA does not depend on object key order. - return sha256Hex(JSON.stringify({ name, version })); + return sha256Hex(JSON.stringify({ name: pkgName, version })); } /** For tests: drop the cache so the next load() re-imports fresh. */ export function _resetGrammarCacheForTests(): void { - cache.clear(); - inflight.clear(); grammarShaCache.clear(); } diff --git a/packages/ingestion/src/parse/index.ts b/packages/ingestion/src/parse/index.ts index 108f7c0d..1bf1ff7c 100644 --- a/packages/ingestion/src/parse/index.ts +++ b/packages/ingestion/src/parse/index.ts @@ -15,6 +15,5 @@ export type { ParseTask, } from "./types.js"; export { getUnifiedQuery } from "./unified-queries.js"; -export { isNativeAvailable, resetNativeAvailabilityCache } from "./wasm-fallback.js"; export type { DispatchOptions, ParsePoolOptions } from "./worker-pool.js"; export { chunkTasks, ParsePool } from "./worker-pool.js"; diff --git a/packages/ingestion/src/parse/parse-worker.test.ts b/packages/ingestion/src/parse/parse-worker.test.ts deleted file mode 100644 index 797eee4e..00000000 --- a/packages/ingestion/src/parse/parse-worker.test.ts +++ /dev/null @@ -1,287 +0,0 @@ -/** - * parse-worker dispatch tests. - * - * Exercises the runtime-selection logic in parse-worker.ts: - * (a) OCH_NATIVE_PARSER unset → WASM path, WASM warning - * (b) OCH_NATIVE_PARSER=1 AND native available → native path, native warning - * (c) OCH_NATIVE_PARSER=1 AND native unavailable → WASM fallback, mismatch warning - * (d) OCH_NATIVE_PARSER explicitly =0 → WASM path (regression: must not count "0" as truthy) - * - * Observability strategy: the startup warning emitted on the FIRST - * `parseBatch` call in each fresh worker is the only externally visible - * signal that names the runtime. 
We capture the line written to - * `process.stderr` during a single `parseBatch([])` invocation and assert - * on it — this proves both the dispatch direction AND the EARS - * requirement that a startup warning fires for BOTH runtimes. - * - * The `warnedRuntime` module-global means each test case must load the - * module fresh; we do that with `import(`${modulePath}?v=…`)` query - * cache-busting so node-test resolves a new module instance per test. - */ - -import { strict as assert } from "node:assert"; -import { Buffer } from "node:buffer"; -import { Module } from "node:module"; -import { describe, it } from "node:test"; -import type { ParseBatch, ParseResult } from "./types.js"; - -type ParseBatchFn = (batch: ParseBatch) => Promise; - -interface ParseWorkerModule { - default: ParseBatchFn; -} - -interface WasmFallbackModule { - isNativeAvailable(): boolean; - resetNativeAvailabilityCache(): void; - openWasmParser: typeof import("./wasm-fallback.js")["openWasmParser"]; - _resetWasmCacheForTests(): void; -} - -const parseWorkerUrl = new URL("./parse-worker.js", import.meta.url).href; -const wasmFallbackUrl = new URL("./wasm-fallback.js", import.meta.url).href; - -/** - * Dynamically import a fresh `parse-worker.js` module instance so its - * module-globals (`warnedRuntime`) reset between tests. The query-string - * `?v=…` tag forces node's ESM loader to create a new module record. - */ -async function loadParseWorker(tag: string): Promise { - const mod = (await import(`${parseWorkerUrl}?v=${tag}`)) as ParseWorkerModule; - return mod.default; -} - -async function loadWasmFallback(tag: string): Promise { - return (await import(`${wasmFallbackUrl}?v=${tag}`)) as WasmFallbackModule; -} - -/** - * Run `fn` with stderr captured into a string. Restores `process.stderr.write` - * on both success and failure. We install the shim synchronously but await - * `fn` under it so any async writes during the awaited work are captured. 
- */ -async function captureStderr(fn: () => Promise): Promise { - const chunks: string[] = []; - const original = process.stderr.write.bind(process.stderr); - // Override with a function that records then no-ops. `parseBatch` only - // ever writes complete strings to stderr, so we don't bother routing - // the arguments through to the original stream — this keeps test - // output clean on the `node --test` console. - process.stderr.write = ((chunk: string | Uint8Array) => { - const s = typeof chunk === "string" ? chunk : Buffer.from(chunk).toString("utf8"); - chunks.push(s); - return true; - }) as typeof process.stderr.write; - try { - await fn(); - } finally { - process.stderr.write = original; - } - return chunks.join(""); -} - -/** - * Save + clear + restore the `OCH_NATIVE_PARSER` env var. We cannot just - * delete it because tests run in parallel in node:test when `--test` is - * passed with multiple workers; we take the pragmatic approach of - * serializing these tests (describe with single it blocks) and restoring - * on finally. - */ -function setEnv(value: string | undefined): string | undefined { - const prior = process.env["OCH_NATIVE_PARSER"]; - if (value === undefined) { - delete process.env["OCH_NATIVE_PARSER"]; - } else { - process.env["OCH_NATIVE_PARSER"] = value; - } - return prior; -} - -function restoreEnv(prior: string | undefined): void { - if (prior === undefined) { - delete process.env["OCH_NATIVE_PARSER"]; - } else { - process.env["OCH_NATIVE_PARSER"] = prior; - } -} - -describe("parse-worker runtime dispatch", () => { - it("(a) env unset → WASM path; startup warning names WASM", async () => { - const priorEnv = setEnv(undefined); - try { - const parseBatch = await loadParseWorker("case-a"); - const stderr = await captureStderr(async () => { - // Empty batch exercises the startup-warning path without needing - // a real grammar load. 
- await parseBatch({ tasks: [] }); - }); - assert.match( - stderr, - /using web-tree-sitter \(WASM\) runtime/, - `expected WASM startup warning; got: ${JSON.stringify(stderr)}`, - ); - assert.doesNotMatch( - stderr, - /native \(N-API\) runtime/, - `native runtime should NOT be named when env is unset`, - ); - } finally { - restoreEnv(priorEnv); - } - }); - - it("(b) env=1 + native available → native path; startup warning names native", async (t) => { - // Probe native availability via a fresh wasm-fallback module — if the - // host can't load `tree-sitter`, we can't meaningfully test the - // native branch. Skip in that case rather than marking the suite - // failed (parity test uses the same convention). - const probe = await loadWasmFallback("case-b-probe"); - if (!probe.isNativeAvailable()) { - t.skip("native tree-sitter binding not loadable on this host"); - return; - } - - const priorEnv = setEnv("1"); - try { - const parseBatch = await loadParseWorker("case-b"); - const stderr = await captureStderr(async () => { - await parseBatch({ tasks: [] }); - }); - assert.match( - stderr, - /using tree-sitter native \(N-API\) runtime/, - `expected native startup warning; got: ${JSON.stringify(stderr)}`, - ); - assert.doesNotMatch( - stderr, - /using web-tree-sitter \(WASM\) runtime/, - `WASM runtime should NOT be named when native is picked`, - ); - } finally { - restoreEnv(priorEnv); - } - }); - - it("(c) env=1 + native unavailable → WASM fallback + mismatch warning", async () => { - // Simulate "native unavailable" by poisoning CommonJS - // `Module._resolveFilename` so any `require('tree-sitter')` (used - // inside `isNativeAvailable()`) throws. We also purge any cached - // copy of tree-sitter from `require.cache` — node short-circuits - // `_resolveFilename` when the module is already cached by its - // resolved absolute path, so a prior test that loaded it would - // otherwise defeat our patch. 
- // - // We wrap the whole flow in try/finally to guarantee the patches - // are reverted even on assertion failure — a stuck patch would - // break every subsequent test that imports tree-sitter. - // `Module._resolveFilename` is a documented-internal CommonJS hook — - // it has no type in @types/node, so we widen to a loose shape. - const ModuleCjs = Module as unknown as { - _resolveFilename: (request: string, parent: unknown, ...rest: unknown[]) => string; - _cache?: Record; - }; - const originalResolveFilename = ModuleCjs._resolveFilename; - - // Purge every tree-sitter-* entry from require.cache so the next - // require() call goes back through _resolveFilename. - const savedCacheEntries: Array<[string, unknown]> = []; - if (ModuleCjs._cache !== undefined) { - for (const key of Object.keys(ModuleCjs._cache)) { - if (key.includes("tree-sitter")) { - savedCacheEntries.push([key, ModuleCjs._cache[key]]); - delete ModuleCjs._cache[key]; - } - } - } - - ModuleCjs._resolveFilename = function patched( - this: unknown, - request: string, - parent: unknown, - ...rest: unknown[] - ): string { - if (request === "tree-sitter") { - throw new Error("Cannot find module 'tree-sitter' (simulated by parse-worker.test.ts)"); - } - return originalResolveFilename.call(this, request, parent, ...rest); - } as typeof ModuleCjs._resolveFilename; - - const priorEnv = setEnv("1"); - try { - // Reset isNativeAvailable's cache on EVERY wasm-fallback module - // instance the parse-worker could import. Each `?v=…` tagged load - // above created a fresh module with its own `cached` state; we - // need to hit the exact one parse-worker imports (the untagged - // URL). We also reset every tagged one we previously loaded so - // they can't leak a `true` back in when loaded again below. 
- const untagged = (await import(wasmFallbackUrl)) as WasmFallbackModule; - untagged.resetNativeAvailabilityCache(); - - const parseBatch = await loadParseWorker("case-c-worker"); - const stderr = await captureStderr(async () => { - await parseBatch({ tasks: [] }); - }); - assert.match( - stderr, - /OCH_NATIVE_PARSER=1 set but native tree-sitter unavailable; falling back to web-tree-sitter \(WASM\) runtime/, - `expected fallback warning; got: ${JSON.stringify(stderr)}`, - ); - assert.doesNotMatch( - stderr, - /using tree-sitter native \(N-API\) runtime/, - `native runtime must NOT be claimed when the addon is unavailable`, - ); - } finally { - ModuleCjs._resolveFilename = originalResolveFilename; - // Restore the previously-cached tree-sitter entries so downstream - // tests don't pay the full addon re-load cost. - if (ModuleCjs._cache !== undefined) { - for (const [key, value] of savedCacheEntries) { - ModuleCjs._cache[key] = value; - } - } - restoreEnv(priorEnv); - // Reset detection cache so subsequent tests re-probe under the - // real (unpatched) resolver. 
- const untaggedRestore = (await import(wasmFallbackUrl)) as WasmFallbackModule; - untaggedRestore.resetNativeAvailabilityCache(); - } - }); - - it("(d) env=0 → WASM path (regression: '0' must not be treated as truthy)", async () => { - const priorEnv = setEnv("0"); - try { - const parseBatch = await loadParseWorker("case-d"); - const stderr = await captureStderr(async () => { - await parseBatch({ tasks: [] }); - }); - assert.match( - stderr, - /using web-tree-sitter \(WASM\) runtime/, - `OCH_NATIVE_PARSER=0 should behave as unset; got: ${JSON.stringify(stderr)}`, - ); - assert.doesNotMatch(stderr, /native \(N-API\) runtime/, `"0" is not a truthy opt-in value`); - } finally { - restoreEnv(priorEnv); - } - }); - - it("startup warning fires exactly once per worker module instance", async () => { - const priorEnv = setEnv(undefined); - try { - const parseBatch = await loadParseWorker("case-oneshot"); - // First call emits the warning. - const first = await captureStderr(async () => { - await parseBatch({ tasks: [] }); - }); - // Second call on the same module instance must NOT re-emit. - const second = await captureStderr(async () => { - await parseBatch({ tasks: [] }); - }); - assert.match(first, /using web-tree-sitter \(WASM\) runtime/); - assert.equal(second, "", `second invocation must be silent; got: ${JSON.stringify(second)}`); - } finally { - restoreEnv(priorEnv); - } - }); -}); diff --git a/packages/ingestion/src/parse/parse-worker.ts b/packages/ingestion/src/parse/parse-worker.ts index 0ef76b61..41c99290 100644 --- a/packages/ingestion/src/parse/parse-worker.ts +++ b/packages/ingestion/src/parse/parse-worker.ts @@ -3,80 +3,37 @@ * * Each worker thread imports this file once, then receives {@link ParseBatch} * inputs on every `pool.run()` call. The worker: - * 1. Loads the grammar for each task's language (cached in the worker). - * 2. Builds a `Parser` with that language (cached). - * 3. Compiles the unified S-expression query (cached). - * 4. 
Parses each task's buffer and maps captures to {@link ParseCapture}. - * 5. Returns a {@link ParseResult} per task. + * 1. Opens a WASM-backed parser for each task's language (cached in the worker). + * 2. Compiles the unified S-expression query against the grammar (cached + * inside `WasmParserHandle.runQuery`). + * 3. Parses each task's buffer and maps captures to {@link ParseCapture}. + * 4. Returns a {@link ParseResult} per task. * * Per-task wall-clock timeout: 30 seconds. On timeout the task returns a * result with empty captures and a warning rather than crashing the worker. * - * Safety: tree-sitter parsers are NOT thread-safe; one Parser per worker per - * language is a hard constraint. This file enforces that via per-worker maps - * keyed by LanguageId. + * `web-tree-sitter` is the sole runtime as of 0.4.0. Native `tree-sitter` + * was removed from the runtime install graph; grammar `.wasm` blobs are + * vendored under `packages/ingestion/vendor/wasms/`. */ import { Buffer } from "node:buffer"; -import { createRequire } from "node:module"; import { performance } from "node:perf_hooks"; -import { loadGrammar } from "./grammar-registry.js"; import type { LanguageId, ParseBatch, ParseCapture, ParseResult, ParseTask } from "./types.js"; import { getUnifiedQuery } from "./unified-queries.js"; -import { isNativeAvailable, openWasmParser, type WasmParserHandle } from "./wasm-fallback.js"; - -const requireFn = createRequire(import.meta.url); +import { openWasmParser, type WasmParserHandle } from "./wasm-runtime.js"; const PER_FILE_TIMEOUT_MS = 30_000; const MAX_FILE_BYTES = 5 * 1024 * 1024; // 5 MB -// Per-worker caches. Each worker_thread has its own module instance so these -// live per-worker, honoring tree-sitter's one-parser-per-thread rule. -const parserCache = new Map(); -const queryCache = new Map(); +// Per-worker WASM parser cache. Each worker_thread has its own module +// instance so this lives per-worker. 
const wasmParserCache = new Map(); -let warnedRuntime = false; - -/** - * Read the `--native-parser` opt-in flag. Set either via env - * (`OCH_NATIVE_PARSER=1`) or via argv pass-through when the worker boots - * inside a process launched with the flag. The worker itself cannot read - * the CLI argv directly (piscina starts workers afresh) so env is the - * primary carrier. - * - * WASM is the default runtime as of Node 24 / M5 — the native tree-sitter - * N-API binding is opt-in for developer speed on Node 22 dev boxes. - */ -function forceNativeOpt(): boolean { - const v = process.env["OCH_NATIVE_PARSER"]; - return v === "1" || v === "true"; -} - /** * Piscina task entry. Default export is the function piscina invokes. */ export default async function parseBatch(batch: ParseBatch): Promise { - // Emit a one-shot startup warning naming the runtime we actually landed - // on. Both paths are logged so the runtime choice is never silent — a - // user debugging a parse difference can see "native" vs "WASM" on the - // first worker invocation. - if (!warnedRuntime) { - warnedRuntime = true; - const usingNative = forceNativeOpt() && isNativeAvailable(); - if (usingNative) { - process.stderr.write("[parse-worker] using tree-sitter native (N-API) runtime\n"); - } else if (forceNativeOpt() && !isNativeAvailable()) { - // Opt-in requested but native could not load — fall back to WASM - // with an explicit callout so the user notices the mismatch. 
- process.stderr.write( - "[parse-worker] OCH_NATIVE_PARSER=1 set but native tree-sitter unavailable; falling back to web-tree-sitter (WASM) runtime\n", - ); - } else { - process.stderr.write("[parse-worker] using web-tree-sitter (WASM) runtime\n"); - } - } - const results: ParseResult[] = []; for (const task of batch.tasks) { results.push(await parseOne(task)); @@ -140,56 +97,13 @@ async function parseOne(task: ParseTask): Promise { } async function runParse(language: LanguageId, content: Buffer): Promise { - // The tree-sitter 0.25 JS binding accepts a string primary input; decode + // The web-tree-sitter binding accepts a string primary input; decode // the buffer once here. (The underlying parser still reads by byte // offsets, so positions remain correct.) const source = content.toString("utf8"); - - // WASM is the default runtime. Native tree-sitter is opt-in via - // `OCH_NATIVE_PARSER=1` (or `--native-parser` on the CLI) and still - // requires the N-API binding to load cleanly; if the opt-in is set but - // native is unavailable, we fall back to WASM (the startup warning in - // parseBatch already flagged the mismatch). The two paths produce - // semantically equivalent captures — the (tag, text) multiset is - // asserted identical by wasm-parity.test.ts, though coordinate values - // and internal node types may differ at the margins across grammars. - if (forceNativeOpt() && isNativeAvailable()) { - return runNative(language, source); - } return runWasm(language, source); } -async function runNative(language: LanguageId, source: string): Promise { - // tree-sitter module is loaded lazily via require (not a static import) - // to keep cold-start cheap for workers that may never parse any file. 
- const TreeSitter = requireFn("tree-sitter") as TreeSitterModule; - - const parser = await getOrBuildParser(language, TreeSitter); - const query = await getOrBuildQuery(language, TreeSitter); - - const tree = parser.parse(source); - const root = tree.rootNode; - - const out: ParseCapture[] = []; - const matches = query.matches(root); - for (const m of matches) { - for (const cap of m.captures) { - const node = cap.node; - out.push({ - tag: cap.name, - text: node.text, - // Convert 0-indexed tree-sitter positions to 1-indexed line numbers. - startLine: node.startPosition.row + 1, - endLine: node.endPosition.row + 1, - startCol: node.startPosition.column, - endCol: node.endPosition.column, - nodeType: node.type, - }); - } - } - return out; -} - async function runWasm(language: LanguageId, source: string): Promise { let handle = wasmParserCache.get(language); if (handle === undefined) { @@ -219,31 +133,6 @@ async function runWasm(language: LanguageId, source: string): Promise { - const cached = parserCache.get(lang); - if (cached !== undefined) { - return cached as TreeSitterParser; - } - const handle = await loadGrammar(lang); - const parser = new TS() as TreeSitterParser; - parser.setLanguage(handle.tsLanguage); - parserCache.set(lang, parser); - return parser; -} - -async function getOrBuildQuery(lang: LanguageId, TS: TreeSitterModule): Promise { - const cached = queryCache.get(lang); - if (cached !== undefined) { - return cached as TreeSitterQuery; - } - const handle = await loadGrammar(lang); - const q = new TS.Query(handle.tsLanguage, handle.queryText) as TreeSitterQuery; - queryCache.set(lang, q); - return q; -} - // --- wall-clock timeout ---------------------------------------------------- function withTimeout(p: Promise, ms: number, message: string): Promise { @@ -261,47 +150,3 @@ function withTimeout(p: Promise, ms: number, message: string): Promise ); }); } - -// --- minimal ambient shapes for the native binding ------------------------- -// We 
intentionally avoid pulling tree-sitter's whole .d.ts into a worker file -// (it registers a global module declaration); declaring just what we use keeps -// the surface small and stable. - -interface TreeSitterPoint { - readonly row: number; - readonly column: number; -} - -interface TreeSitterNode { - readonly text: string; - readonly type: string; - readonly startPosition: TreeSitterPoint; - readonly endPosition: TreeSitterPoint; -} - -interface TreeSitterTree { - readonly rootNode: TreeSitterNode; -} - -interface TreeSitterParser { - setLanguage(lang: unknown): void; - parse(source: string): TreeSitterTree; -} - -interface TreeSitterQueryCapture { - readonly name: string; - readonly node: TreeSitterNode; -} - -interface TreeSitterQueryMatch { - readonly captures: readonly TreeSitterQueryCapture[]; -} - -interface TreeSitterQuery { - matches(node: TreeSitterNode): readonly TreeSitterQueryMatch[]; -} - -interface TreeSitterModule { - new (): TreeSitterParser; - Query: new (lang: unknown, source: string) => TreeSitterQuery; -} diff --git a/packages/ingestion/src/parse/wasm-fallback.ts b/packages/ingestion/src/parse/wasm-fallback.ts deleted file mode 100644 index b72aa2fe..00000000 --- a/packages/ingestion/src/parse/wasm-fallback.ts +++ /dev/null @@ -1,331 +0,0 @@ -/** - * WASM parser opener (default runtime) + native-availability probe. - * - * WASM is the default parse runtime as of Node 24 / M5. The native - * `tree-sitter` N-API addon is still fully supported and is opt-in via - * `OCH_NATIVE_PARSER=1` (or `--native-parser` on the CLI) — useful on - * Node 22 developer boxes where native parsing is measurably faster. - * Exotic environments (musl-libc Alpine, Cloudflare Workers, sandboxed - * Electron renderers, AWS Lambda ARM64 custom runtimes, restricted CI) - * can't load `.node` addons at all; on those hosts the default WASM - * path Just Works and `isNativeAvailable()` returns false. 
- * - * `openWasmParser(lang)` lazily initializes the web-tree-sitter runtime - * once per process and resolves the grammar WASM from the installed - * `tree-sitter-` package. Per-grammar cache means repeated calls - * are O(1). - * - * Query execution uses the same unified S-expression bodies from - * `unified-queries.ts`; the parse-phase consumer receives byte- - * identical ParseCapture output whether the runtime was native or WASM - * (asserted by the parity test in `wasm-parity.test.ts`). - */ - -import { createRequire } from "node:module"; -import path from "node:path"; -import { fileURLToPath } from "node:url"; -import type { LanguageId } from "./types.js"; - -const requireFn = createRequire(import.meta.url); - -// Resolve packages/ingestion/vendor/wasms/ relative to this module regardless -// of whether we're running from src/ (ts-node-style) or dist/ (compiled). -// `vendor/` lives at the package root, so we walk up from the file's dirname -// until we find it. Computed once at module load. -const VENDOR_WASMS_DIR = (() => { - const here = path.dirname(fileURLToPath(import.meta.url)); - // src → /src/parse; dist → /dist/parse — both 2 levels up - return path.resolve(here, "..", "..", "vendor", "wasms"); -})(); - -let cached: boolean | undefined; - -/** - * Returns true when `require('tree-sitter')` succeeds in the current process. - * Result is cached — subsequent calls are O(1). - * - * Call this at worker startup rather than on every parse. - */ -export function isNativeAvailable(): boolean { - if (cached !== undefined) { - return cached; - } - try { - requireFn("tree-sitter"); - cached = true; - } catch { - cached = false; - } - return cached; -} - -/** - * For tests and diagnostics: reset the cached detection result. 
- */ -export function resetNativeAvailabilityCache(): void { - cached = undefined; -} - -// --------------------------------------------------------------------------- -// WASM runtime -// --------------------------------------------------------------------------- - -/** - * Minimal shape of what `openWasmParser` returns — enough to run the - * same capture loop the native worker implements. Intentionally - * decoupled from the `web-tree-sitter` types so test code can stub it. - */ -export interface WasmParserHandle { - readonly language: LanguageId; - /** Parse a source string and return the underlying tree. */ - parse(source: string): WasmTree; - /** - * Execute the unified query and return the flat capture list. Callers - * translate into `ParseCapture` via the normal coordinate remapping - * (1-indexed lines, 0-indexed columns). - */ - runQuery(queryText: string, source: string): readonly WasmCapture[]; -} - -export interface WasmTree { - readonly rootNode: WasmNode; -} - -export interface WasmNode { - readonly type: string; - readonly text: string; - readonly startPosition: { readonly row: number; readonly column: number }; - readonly endPosition: { readonly row: number; readonly column: number }; -} - -export interface WasmCapture { - readonly name: string; - readonly node: WasmNode; -} - -/** - * Per-LanguageId cache of WASM grammar handles. Populated lazily. - * `null` entries mean "tried and failed" — we don't retry forever. 
- */ -const wasmCache = new Map(); -let wasmRuntime: WasmRuntime | undefined; - -interface WasmRuntime { - Parser: WasmParserCtor; - Query: WasmQueryCtor; - Language: WasmLanguageStatic; - initialized: boolean; -} - -interface WasmParserCtor { - new (): WasmParserInstance; - init?: (opts?: Record) => Promise; -} - -interface WasmParserInstance { - setLanguage(lang: unknown): void; - parse(source: string): WasmTree; -} - -interface WasmQueryCtor { - new (lang: unknown, source: string): WasmQueryInstance; -} - -interface WasmQueryInstance { - matches(node: WasmNode): readonly { readonly captures: readonly WasmCapture[] }[]; -} - -interface WasmLanguageStatic { - load(source: string | Uint8Array): Promise; -} - -/** - * Attempt to open a WASM-backed parser for `lang`. Returns `null` when - * either the `web-tree-sitter` runtime or the grammar's bundled `.wasm` - * could not be resolved — callers log a debug note and skip that file. - */ -export async function openWasmParser(lang: LanguageId): Promise { - const cached = wasmCache.get(lang); - if (cached !== undefined) return cached; - try { - const runtime = await ensureWasmRuntime(); - if (runtime === undefined) { - wasmCache.set(lang, null); - return null; - } - const wasmPath = resolveGrammarWasmPath(lang); - if (wasmPath === undefined) { - wasmCache.set(lang, null); - return null; - } - const tsLanguage = await runtime.Language.load(wasmPath); - const Parser = runtime.Parser as unknown as new () => WasmParserInstance; - const parser = new Parser(); - parser.setLanguage(tsLanguage); - - const handle: WasmParserHandle = { - language: lang, - parse: (source: string) => parser.parse(source), - runQuery: (queryText: string, source: string) => { - // Fresh Query per call so state stays clean between bodies. - // Query construction is cheap relative to the parse itself. 
- const q = new runtime.Query(tsLanguage, queryText); - const tree = parser.parse(source); - const out: WasmCapture[] = []; - for (const m of q.matches(tree.rootNode)) { - for (const cap of m.captures) out.push(cap); - } - return out; - }, - }; - wasmCache.set(lang, handle); - return handle; - } catch { - wasmCache.set(lang, null); - return null; - } -} - -/** - * Load the web-tree-sitter runtime on demand and initialize it. Returns - * `undefined` when the package isn't installed or the runtime refuses to - * init (sandboxed, missing WebAssembly, etc.). - */ -async function ensureWasmRuntime(): Promise { - if (wasmRuntime?.initialized === true) return wasmRuntime; - let mod: { Parser: WasmParserCtor; Query: WasmQueryCtor; Language: WasmLanguageStatic }; - try { - mod = requireFn("web-tree-sitter") as { - Parser: WasmParserCtor; - Query: WasmQueryCtor; - Language: WasmLanguageStatic; - }; - } catch { - return undefined; - } - try { - if (typeof mod.Parser.init === "function") { - await mod.Parser.init(); - } - } catch { - return undefined; - } - wasmRuntime = { - Parser: mod.Parser, - Query: mod.Query, - Language: mod.Language, - initialized: true, - }; - return wasmRuntime; -} - -/** - * Resolve the `.wasm` grammar asset for `lang`. Two-stage cascade: - * - * 1. Per-grammar-package lookup — for the 11 languages whose - * `tree-sitter-` npm package ships its own `.wasm` alongside - * the `.node` addon (typescript, tsx, javascript, python, go, rust, - * java, csharp, c, cpp, ruby, php). - * 2. Vendored-WASM fallback — for kotlin, swift, and dart, whose - * per-grammar packages do NOT ship a `.wasm`. We build these once - * from the same grammar sources npm pins (zero drift) and commit - * them to `packages/ingestion/vendor/wasms/`. See - * `scripts/build-vendor-wasms.sh` and `vendor/wasms/README.md`. - * - * Returns `undefined` when neither stage resolves (package not - * installed, or language not in either table). 
- */ -function resolveGrammarWasmPath(lang: LanguageId): string | undefined { - const direct = tryPerGrammarPackage(lang); - if (direct !== undefined) return direct; - return tryVendoredWasm(lang); -} - -/** - * Stage 1: resolve a `.wasm` that ships inside the per-grammar - * `tree-sitter-` npm package. Returns `undefined` when the - * language has no entry in this table or the package is not installed. - */ -function tryPerGrammarPackage(lang: LanguageId): string | undefined { - // `tree-sitter-typescript` ships two wasms in one package — select by - // language variant. - if (lang === "typescript" || lang === "tsx") { - const pkgDir = resolvePackageDir("tree-sitter-typescript"); - if (pkgDir === undefined) return undefined; - const fname = lang === "typescript" ? "tree-sitter-typescript.wasm" : "tree-sitter-tsx.wasm"; - return path.join(pkgDir, fname); - } - const mapping: Partial> = { - javascript: { pkg: "tree-sitter-javascript", file: "tree-sitter-javascript.wasm" }, - python: { pkg: "tree-sitter-python", file: "tree-sitter-python.wasm" }, - go: { pkg: "tree-sitter-go", file: "tree-sitter-go.wasm" }, - rust: { pkg: "tree-sitter-rust", file: "tree-sitter-rust.wasm" }, - java: { pkg: "tree-sitter-java", file: "tree-sitter-java.wasm" }, - // c-sharp publishes `tree-sitter-c_sharp.wasm` (underscore, not hyphen). - csharp: { pkg: "tree-sitter-c-sharp", file: "tree-sitter-c_sharp.wasm" }, - c: { pkg: "tree-sitter-c", file: "tree-sitter-c.wasm" }, - cpp: { pkg: "tree-sitter-cpp", file: "tree-sitter-cpp.wasm" }, - ruby: { pkg: "tree-sitter-ruby", file: "tree-sitter-ruby.wasm" }, - // Use php_only (pure PHP, no HTML template injection) to match native loader (grammar-registry.ts:244-254). 
- php: { pkg: "tree-sitter-php", file: "tree-sitter-php_only.wasm" }, - }; - const entry = mapping[lang]; - if (entry === undefined) return undefined; - const pkgDir = resolvePackageDir(entry.pkg); - if (pkgDir === undefined) return undefined; - return path.join(pkgDir, entry.file); -} - -/** - * Stage 2: resolve from the vendored WASM directory at - * `packages/ingestion/vendor/wasms/`. Only opted-in for languages whose - * per-grammar npm package does NOT ship a `.wasm` — kotlin, swift, dart. - * - * These are built once from the same grammar sources our package.json - * pins (zero version drift vs native) and committed to the repo. The - * upstream `tree-sitter-wasms` catalog can't be used because its 0.1.13 - * artifacts were built with tree-sitter-cli 0.20.x and ship the legacy - * `dylink` section, which web-tree-sitter 0.26+ refuses to load (it - * requires the standardized `dylink.0` section). - * - * Keep this table minimal — adding a language here is a deliberate - * architectural choice. See `scripts/build-vendor-wasms.sh`. - */ -function tryVendoredWasm(lang: LanguageId): string | undefined { - const catalog: Partial> = { - kotlin: "tree-sitter-kotlin.wasm", - swift: "tree-sitter-swift.wasm", - dart: "tree-sitter-dart.wasm", - }; - const fname = catalog[lang]; - if (fname === undefined) return undefined; - return path.join(VENDOR_WASMS_DIR, fname); -} - -function resolvePackageDir(pkgName: string): string | undefined { - try { - const manifestPath = requireFn.resolve(`${pkgName}/package.json`); - return path.dirname(manifestPath); - } catch { - return undefined; - } -} - -/** - * Test hook: clear the per-process WASM parser cache. Never call in - * production paths — it would force a re-init of every grammar. 
- */ -export function _resetWasmCacheForTests(): void { - wasmCache.clear(); - wasmRuntime = undefined; -} - -/** - * Test hook: expose the grammar-path resolver so unit tests can assert - * the two-stage cascade (per-grammar package → tree-sitter-wasms - * catalog) resolves kotlin/swift/dart correctly. Not part of the public - * API — callers in production paths must go through `openWasmParser`. - */ -export function _resolveGrammarWasmPathForTests(lang: LanguageId): string | undefined { - return resolveGrammarWasmPath(lang); -} diff --git a/packages/ingestion/src/parse/wasm-grammar-resolution.test.ts b/packages/ingestion/src/parse/wasm-grammar-resolution.test.ts index 3068a398..bcb3ff4c 100644 --- a/packages/ingestion/src/parse/wasm-grammar-resolution.test.ts +++ b/packages/ingestion/src/parse/wasm-grammar-resolution.test.ts @@ -1,40 +1,50 @@ /** - * Unit tests for `resolveGrammarWasmPath` — the two-stage cascade that - * maps a `LanguageId` to a bundled `.wasm` asset path. - * - * Stage 1 (per-grammar package) is exercised by the parse-worker / - * wasm-parity suites via real `openWasmParser` calls. This file - * focuses on stage 2: the vendored-WASM fallback at - * `packages/ingestion/vendor/wasms/` which handles kotlin, swift, and - * dart — whose per-grammar `tree-sitter-*` packages do NOT ship a - * `.wasm` alongside the `.node` addon. + * Unit tests for `resolveGrammarWasmPath` — the single declarative + * LanguageId-to-filename map that locates each grammar's `.wasm` inside + * the vendored directory at `packages/ingestion/vendor/wasms/`. * * Asserted properties: - * - kotlin/swift/dart resolve to absolute paths ending in - * `tree-sitter-.wasm` inside `vendor/wasms/`. + * - Every supported `LanguageId` resolves to an absolute path under + * `vendor/wasms/` ending in the expected filename. * - The resolved paths point to files that actually exist on disk * (verifies the commit + build-script loop landed correctly). 
- * - A known per-grammar-package entry (python) still resolves — the - * refactor must not regress the 11-entry primary mapping. - * - PHP resolves to the `php_only` variant. + * - PHP resolves to the `php_only` variant (pure PHP, no HTML + * injection) — matches the prior native-loader behavior. + * - C# resolves to `tree-sitter-c_sharp.wasm` (underscore, not hyphen). + * - Cobol returns `undefined` (regex-provider language; no grammar). */ import { strict as assert } from "node:assert"; import { statSync } from "node:fs"; import path from "node:path"; import { describe, it } from "node:test"; -import { _resolveGrammarWasmPathForTests } from "./wasm-fallback.js"; +import { _resolveGrammarWasmPathForTests } from "./wasm-runtime.js"; + +const EXPECTED: Readonly> = { + typescript: "tree-sitter-typescript.wasm", + tsx: "tree-sitter-tsx.wasm", + javascript: "tree-sitter-javascript.wasm", + python: "tree-sitter-python.wasm", + go: "tree-sitter-go.wasm", + rust: "tree-sitter-rust.wasm", + java: "tree-sitter-java.wasm", + csharp: "tree-sitter-c_sharp.wasm", + c: "tree-sitter-c.wasm", + cpp: "tree-sitter-cpp.wasm", + ruby: "tree-sitter-ruby.wasm", + kotlin: "tree-sitter-kotlin.wasm", + swift: "tree-sitter-swift.wasm", + dart: "tree-sitter-dart.wasm", + php: "tree-sitter-php_only.wasm", +}; -describe("resolveGrammarWasmPath — vendored WASM fallback", () => { - for (const lang of ["kotlin", "swift", "dart"] as const) { - it(`resolves ${lang} to an existing vendor/wasms/tree-sitter-${lang}.wasm`, () => { - const wasmPath = _resolveGrammarWasmPathForTests(lang); +describe("resolveGrammarWasmPath — vendored WASM resolver", () => { + for (const [lang, fname] of Object.entries(EXPECTED)) { + it(`resolves ${lang} to vendor/wasms/${fname} on disk`, () => { + const wasmPath = _resolveGrammarWasmPathForTests(lang as never); assert.ok(wasmPath !== undefined, `expected a path for ${lang}, got undefined`); assert.ok(path.isAbsolute(wasmPath), `expected absolute path for ${lang}, got 
${wasmPath}`); - assert.ok( - wasmPath.endsWith(`tree-sitter-${lang}.wasm`), - `expected path ending in tree-sitter-${lang}.wasm, got ${wasmPath}`, - ); + assert.ok(wasmPath.endsWith(fname), `expected path ending in ${fname}, got ${wasmPath}`); assert.ok( wasmPath.includes(`${path.sep}vendor${path.sep}wasms${path.sep}`), `expected path under vendor/wasms/, got ${wasmPath}`, @@ -46,23 +56,9 @@ describe("resolveGrammarWasmPath — vendored WASM fallback", () => { } }); -describe("resolveGrammarWasmPath — per-grammar package path unchanged", () => { - it("python still resolves from its own tree-sitter-python package", () => { - const wasmPath = _resolveGrammarWasmPathForTests("python"); - assert.ok(wasmPath !== undefined); - assert.ok(wasmPath.endsWith("tree-sitter-python.wasm")); - assert.ok( - !wasmPath.includes(`${path.sep}vendor${path.sep}wasms${path.sep}`), - `python must resolve from its own package, not the vendor dir: ${wasmPath}`, - ); - }); - - it("php resolves to php_only.wasm", () => { - const wasmPath = _resolveGrammarWasmPathForTests("php"); - assert.ok(wasmPath !== undefined); - assert.ok( - wasmPath.endsWith("tree-sitter-php_only.wasm"), - `php must resolve to php_only.wasm, got ${wasmPath}`, - ); +describe("resolveGrammarWasmPath — non-tree-sitter languages", () => { + it("cobol returns undefined (regex-provider language; no tree-sitter grammar)", () => { + const wasmPath = _resolveGrammarWasmPathForTests("cobol"); + assert.equal(wasmPath, undefined); }); }); diff --git a/packages/ingestion/src/parse/wasm-parity.test.ts b/packages/ingestion/src/parse/wasm-parity.test.ts deleted file mode 100644 index 7ec81fcd..00000000 --- a/packages/ingestion/src/parse/wasm-parity.test.ts +++ /dev/null @@ -1,361 +0,0 @@ -/** - * WASM parity smoke test. 
- * - * Verifies that capture tag + text output of the WASM runtime matches - * the native runtime for a small-but-representative set of source - * bodies across all 14 tree-sitter-backed `LanguageId` values - * (typescript, tsx, javascript, python, go, rust, java, csharp, c, - * cpp, ruby, php, kotlin, swift, dart). COBOL is regex-only and lives - * outside this parity matrix by design. - * - * We compare by (tag, text) tuples — coordinate values can legitimately - * differ across grammars when the tree-sitter query picks up a subtly - * different capture range. The spec-level invariant is "semantic - * capture output is the same"; we assert that the multiset of - * (tag, text) pairs matches. - * - * Skip semantics: - * - When native tree-sitter is unavailable (e.g. Node 24 where the - * native bindings don't compile), every per-language iteration - * reports as a skip with a descriptive message. There is no hard - * fail — the suite is a no-op on WASM-only boxes. - * - When a specific language's WASM grammar handle fails to open, we - * emit a `console.warn` naming the gap and skip that language so - * the rest of the matrix continues to execute. - */ - -import { strict as assert } from "node:assert"; -import { after, before, describe, it } from "node:test"; -import { parseFixture } from "../providers/test-helpers.js"; -import type { LanguageId } from "./types.js"; -import { getUnifiedQuery } from "./unified-queries.js"; -import { - _resetWasmCacheForTests, - isNativeAvailable, - openWasmParser, - type WasmParserHandle, -} from "./wasm-fallback.js"; -import { ParsePool } from "./worker-pool.js"; - -/** 20 TypeScript bodies. 
*/ -const TS_FIXTURES: readonly string[] = [ - `export function add(a: number, b: number): number { return a + b; }`, - `class Foo { greet(): string { return "hi"; } }`, - `interface Speaker { speak(msg: string): void; }`, - `const x = 42;`, - `export const fn = (n: number) => n * 2;`, - `export class Bar extends Foo implements Speaker { - speak(msg: string): void { console.log(msg); } - }`, - `type Id = string | number;`, - `enum Color { Red, Green, Blue }`, - `import { foo } from "./a"; import * as b from "./b";`, - `namespace N { export const y = 1; }`, - `async function run() { await Promise.resolve(1); }`, - `function* gen() { yield 1; yield 2; }`, - `/** adds numbers */ export function add2(a: number, b: number) { return a + b; }`, - `export default class Main { constructor(public name: string) {} }`, - `const obj = { a: 1, b: 2 };`, - `export function takeOptional(x?: number): number { return x ?? 0; }`, - `class K { private f = 1; get v(): number { return this.f; } }`, - `for (const x of [1,2,3]) { console.log(x); }`, - `try { throw new Error("bad"); } catch (e) { console.error(e); }`, - `export abstract class A { abstract run(): void; }`, -]; - -/** 20 Python bodies. 
*/ -const PY_FIXTURES: readonly string[] = [ - `def add(a, b):\n return a + b\n`, - `class Foo:\n def greet(self):\n return "hi"\n`, - `class Speaker:\n def speak(self, msg):\n raise NotImplementedError\n`, - `x = 42\n`, - `fn = lambda n: n * 2\n`, - `class Bar(Foo):\n def speak(self, msg):\n print(msg)\n`, - `from typing import Union\nId = Union[str, int]\n`, - `from enum import Enum\nclass Color(Enum):\n RED = 1\n GREEN = 2\n`, - `import os\nfrom pathlib import Path\n`, - `def run():\n return sum(range(10))\n`, - `async def fetch():\n return await asyncio.sleep(0)\n`, - `def gen():\n yield 1\n yield 2\n`, - `def add2(a, b):\n """adds numbers"""\n return a + b\n`, - `class Main:\n def __init__(self, name):\n self.name = name\n`, - `obj = {"a": 1, "b": 2}\n`, - `def optional(x=None):\n return x if x is not None else 0\n`, - `class K:\n def __init__(self):\n self._f = 1\n @property\n def v(self):\n return self._f\n`, - `for x in [1, 2, 3]:\n print(x)\n`, - `try:\n raise ValueError("bad")\nexcept ValueError as e:\n print(e)\n`, - `def multi_return(n):\n if n > 0:\n return 1\n elif n < 0:\n return -1\n return 0\n`, -]; - -/** 20 Go bodies. 
*/ -const GO_FIXTURES: readonly string[] = [ - `package p\nfunc Add(a, b int) int { return a + b }\n`, - `package p\ntype Foo struct{}\nfunc (f *Foo) Greet() string { return "hi" }\n`, - `package p\ntype Speaker interface { Speak(msg string) }\n`, - `package p\nconst X = 42\n`, - `package p\nvar fn = func(n int) int { return n * 2 }\n`, - `package p\ntype Bar struct{ Foo }\nfunc (b *Bar) Speak(msg string) {}\n`, - `package p\ntype ID string\n`, - `package p\nconst (\n Red = iota\n Green\n Blue\n)\n`, - `package p\nimport (\n "fmt"\n "strings"\n)\n`, - `package p\nfunc Run() int { return 42 }\n`, - `package p\nfunc run() { defer func(){ recover() }() }\n`, - `package p\nfunc gen() <-chan int {\n ch := make(chan int)\n go func() { ch <- 1; ch <- 2; close(ch) }()\n return ch\n}\n`, - `// Add2 adds two numbers.\npackage p\nfunc Add2(a, b int) int { return a + b }\n`, - `package p\ntype Main struct{ name string }\nfunc NewMain(name string) *Main { return &Main{name: name} }\n`, - `package p\nvar obj = map[string]int{"a": 1, "b": 2}\n`, - `package p\nfunc takeOptional(x *int) int { if x == nil { return 0 }; return *x }\n`, - `package p\ntype K struct{ f int }\nfunc (k *K) V() int { return k.f }\n`, - `package p\nfunc iter() { for _, x := range []int{1, 2, 3} { _ = x } }\n`, - `package p\nfunc tryCatch() error { return fmt.Errorf("bad") }\n`, - `package p\nfunc multiReturn(n int) (int, error) { if n > 0 { return 1, nil }; return 0, fmt.Errorf("non-positive") }\n`, -]; - -/** - * Fixture blocks for the remaining 11 tree-sitter languages. 3-5 bodies - * each is enough to exercise the capture-tag surface the unified query - * targets (definitions, imports, references); fuller 20-body arrays - * live on typescript/python/go as historical regression corpora. - * - * Authoring rule: every snippet must be syntactically valid on its own - * (no missing imports / enclosing scopes) so both native and WASM can - * parse it cleanly without error-node divergence. 
- */ - -/** TSX fixtures. */ -const TSX_FIXTURES: readonly string[] = [ - `export const Hello = () =>
hi
;`, - `import React from "react";\nexport function Page(): JSX.Element { return

title

; }`, - `interface Props { name: string }\nexport const Greet = (p: Props) => {p.name};`, - `export class App extends React.Component { render() { return
; } }`, -]; - -/** JavaScript fixtures (ESM + CJS). */ -const JS_FIXTURES: readonly string[] = [ - `export function add(a, b) { return a + b; }`, - `class Foo { greet() { return "hi"; } }`, - `import { readFile } from "node:fs/promises";\nexport async function load(p) { return readFile(p); }`, - `const path = require("node:path");\nmodule.exports = { resolve: (f) => path.resolve(f) };`, - `export const fn = (n) => n * 2;`, -]; - -/** Rust fixtures. */ -const RUST_FIXTURES: readonly string[] = [ - `pub fn add(a: i32, b: i32) -> i32 { a + b }`, - `pub struct Greeter { pub name: String }\nimpl Greeter { pub fn new(name: String) -> Self { Self { name } } }`, - `pub trait Greet { fn greet(&self, name: &str) -> String; }`, - `use std::collections::HashMap;\npub fn empty() -> HashMap { HashMap::new() }`, - `pub const DEFAULT: u32 = 42;`, -]; - -/** Java fixtures. */ -const JAVA_FIXTURES: readonly string[] = [ - `package demo;\npublic class Hello { public String greet(String n) { return "hi " + n; } }`, - `package demo;\npublic interface Speaker { void speak(String msg); }`, - `package demo;\nimport java.util.List;\npublic class Box { public List xs; }`, - `package demo;\npublic class Counter { private int n = 0; public int inc() { return ++n; } }`, -]; - -/** C# fixtures. */ -const CSHARP_FIXTURES: readonly string[] = [ - `namespace Demo; public class Hello { public string Greet(string n) => "hi " + n; }`, - `namespace Demo; public interface ISpeaker { void Speak(string msg); }`, - `using System.Collections.Generic; namespace Demo; public class Box { public List Xs = new(); }`, - `namespace Demo; public record Point(int X, int Y);`, -]; - -/** C fixtures. 
*/ -const C_FIXTURES: readonly string[] = [ - `int add(int a, int b) { return a + b; }`, - `#include \nvoid greet(const char *n) { printf("hi %s\\n", n); }`, - `struct Point { int x; int y; };\nstruct Point origin(void) { struct Point p = {0, 0}; return p; }`, - `static int counter = 0;\nint inc(void) { return ++counter; }`, -]; - -/** C++ fixtures. */ -const CPP_FIXTURES: readonly string[] = [ - `int add(int a, int b) { return a + b; }`, - `#include \nclass Greeter { public: std::string greet(const std::string& n) { return "hi " + n; } };`, - `namespace util { int square(int n) { return n * n; } }`, - `template T identity(T x) { return x; }`, -]; - -/** Ruby fixtures. */ -const RUBY_FIXTURES: readonly string[] = [ - `def add(a, b)\n a + b\nend\n`, - `class Greeter\n def greet(name)\n "hi #{name}"\n end\nend\n`, - `module Math2\n def self.square(n)\n n * n\n end\nend\n`, - `require "json"\nputs JSON.generate({a: 1})\n`, -]; - -/** PHP fixtures. */ -const PHP_FIXTURES: readonly string[] = [ - ` Int { return a + b }`, - `class Greeter { func greet(_ name: String) -> String { return "hi " + name } }`, - `protocol Speaker { func speak(_ msg: String) }`, - `struct Point { var x: Int; var y: Int }`, -]; - -/** Dart fixtures. 
*/ -const DART_FIXTURES: readonly string[] = [ - `int add(int a, int b) => a + b;`, - `class Greeter { String greet(String name) => "hi $name"; }`, - `abstract class Speaker { void speak(String msg); }`, - `import "dart:async";\nFuture load() async => 42;`, -]; - -interface CaptureKey { - readonly tag: string; - readonly text: string; -} - -function toKeyMultiset(captures: readonly { tag: string; text: string }[]): string[] { - const out = captures.map((c: CaptureKey) => `${c.tag}|${c.text}`); - out.sort(); - return out; -} - -async function captureNative( - pool: ParsePool, - lang: LanguageId, - name: string, - source: string, -): Promise { - const fx = await parseFixture(pool, lang, name, source); - return fx.captures.map((c) => ({ tag: c.tag, text: c.text })); -} - -async function captureWasm( - handle: WasmParserHandle, - lang: LanguageId, - source: string, -): Promise { - const queryText = getUnifiedQuery(lang); - const caps = handle.runQuery(queryText, source); - return caps.map((c) => ({ tag: c.name, text: c.node.text })); -} - -/** - * Full fixture matrix — every tree-sitter `LanguageId` paired with its - * fixture array. COBOL is regex-only (no grammar) and sits outside this - * matrix. - */ -const FIXTURES: readonly (readonly [LanguageId, readonly string[]])[] = [ - ["typescript", TS_FIXTURES], - ["tsx", TSX_FIXTURES], - ["javascript", JS_FIXTURES], - ["python", PY_FIXTURES], - ["go", GO_FIXTURES], - ["rust", RUST_FIXTURES], - ["java", JAVA_FIXTURES], - ["csharp", CSHARP_FIXTURES], - ["c", C_FIXTURES], - ["cpp", CPP_FIXTURES], - ["ruby", RUBY_FIXTURES], - ["php", PHP_FIXTURES], - ["kotlin", KOTLIN_FIXTURES], - ["swift", SWIFT_FIXTURES], - ["dart", DART_FIXTURES], -] as const; - -// Module-level native-availability gate. When native tree-sitter is not -// installed (e.g. Node 24 boxes where the native bindings fail to -// compile), flip every iteration into a skip rather than a hard fail. 
-// The outer `describe()` always runs so the skip surface is visible. -const NATIVE_AVAILABLE = isNativeAvailable(); -const SKIP_REASON = - "native tree-sitter is unavailable — parity suite requires it as the reference runtime"; - -describe("WASM parity: native vs WASM capture output", () => { - const pool = new ParsePool({ minThreads: 1, maxThreads: 1 }); - after(async () => { - await pool.destroy(); - }); - - before(() => { - _resetWasmCacheForTests(); - }); - - for (const [lang, fixtures] of FIXTURES) { - it(`${lang}: ${fixtures.length} bodies produce identical (tag, text) multisets`, { - skip: NATIVE_AVAILABLE ? false : SKIP_REASON, - }, async (t) => { - const handle = await openWasmParser(lang); - if (handle === null) { - // WASM grammar missing for this language — skip (not fail) so - // the rest of the matrix continues. Warn to stderr so the gap - // is visible in CI logs. - const msg = `WASM grammar missing for ${lang} — skipping parity check`; - console.warn(`[wasm-parity] ${msg}`); - t.skip(msg); - return; - } - for (let i = 0; i < fixtures.length; i++) { - const source = fixtures[i]; - if (source === undefined) continue; - const nativeKeys = toKeyMultiset( - await captureNative(pool, lang, `fx-${i}.${extFor(lang)}`, source), - ); - const wasmKeys = toKeyMultiset(await captureWasm(handle, lang, source)); - assert.deepEqual( - wasmKeys, - nativeKeys, - `${lang} fixture ${i} diverged\nnative: ${nativeKeys.join("\n")}\nwasm: ${wasmKeys.join("\n")}`, - ); - } - }); - } -}); - -function extFor(lang: LanguageId): string { - switch (lang) { - case "typescript": - return "ts"; - case "tsx": - return "tsx"; - case "javascript": - return "js"; - case "python": - return "py"; - case "go": - return "go"; - case "rust": - return "rs"; - case "java": - return "java"; - case "csharp": - return "cs"; - case "c": - return "c"; - case "cpp": - return "cpp"; - case "ruby": - return "rb"; - case "php": - return "php"; - case "kotlin": - return "kt"; - case "swift": - return 
"swift"; - case "dart": - return "dart"; - default: - return "txt"; - } -} diff --git a/packages/ingestion/src/parse/wasm-runtime.ts b/packages/ingestion/src/parse/wasm-runtime.ts new file mode 100644 index 00000000..453bdca7 --- /dev/null +++ b/packages/ingestion/src/parse/wasm-runtime.ts @@ -0,0 +1,284 @@ +/** + * WASM parser opener (the only runtime). + * + * `web-tree-sitter` is the sole parser host as of 0.4.0. Native `tree-sitter` + * was removed from the runtime install graph; grammar `.wasm` blobs are + * vendored under `packages/ingestion/vendor/wasms/` and resolved by a single + * declarative LanguageId-to-filename map. + * + * `openWasmParser(lang)` lazily initializes the web-tree-sitter runtime + * once per process and resolves the grammar WASM from the vendored + * directory. Per-grammar cache means repeated calls are O(1). + * + * Query execution uses the same unified S-expression bodies from + * `unified-queries.ts`. + */ + +import { createRequire } from "node:module"; +import path from "node:path"; +import { fileURLToPath } from "node:url"; +import type { LanguageId } from "./types.js"; + +const requireFn = createRequire(import.meta.url); + +// Resolve packages/ingestion/vendor/wasms/ relative to this module regardless +// of whether we're running from src/ (ts-node-style) or dist/ (compiled). +// `vendor/` lives at the package root, so we walk up from the file's dirname +// until we find it. Computed once at module load. +const VENDOR_WASMS_DIR = (() => { + const here = path.dirname(fileURLToPath(import.meta.url)); + // src → /src/parse; dist → /dist/parse — both 2 levels up + return path.resolve(here, "..", "..", "vendor", "wasms"); +})(); + +// --------------------------------------------------------------------------- +// WASM runtime +// --------------------------------------------------------------------------- + +/** + * Minimal shape of what `openWasmParser` returns — enough to run the + * same capture loop the native worker implements. 
Intentionally + * decoupled from the `web-tree-sitter` types so test code can stub it. + */ +export interface WasmParserHandle { + readonly language: LanguageId; + /** Parse a source string and return the underlying tree. */ + parse(source: string): WasmTree; + /** + * Execute the unified query and return the flat capture list. Callers + * translate into `ParseCapture` via the normal coordinate remapping + * (1-indexed lines, 0-indexed columns). + */ + runQuery(queryText: string, source: string): readonly WasmCapture[]; +} + +export interface WasmTree { + readonly rootNode: WasmNode; +} + +export interface WasmNode { + readonly type: string; + readonly text: string; + readonly startPosition: { readonly row: number; readonly column: number }; + readonly endPosition: { readonly row: number; readonly column: number }; + readonly childCount: number; + readonly namedChildCount?: number; + child(i: number): WasmNode | null; + namedChild?(i: number): WasmNode | null; + childForFieldName?(name: string): WasmNode | null; +} + +export interface WasmCapture { + readonly name: string; + readonly node: WasmNode; +} + +/** + * Per-LanguageId cache of WASM grammar handles. Populated lazily. + * `null` entries mean "tried and failed" — we don't retry forever. 
+ */ +const wasmCache = new Map(); +let wasmRuntime: WasmRuntime | undefined; + +interface WasmRuntime { + Parser: WasmParserCtor; + Query: WasmQueryCtor; + Language: WasmLanguageStatic; + initialized: boolean; +} + +interface WasmParserCtor { + new (): WasmParserInstance; + init?: (opts?: Record) => Promise; +} + +interface WasmParserInstance { + setLanguage(lang: unknown): void; + parse(source: string): WasmTree; +} + +interface WasmQueryCtor { + new (lang: unknown, source: string): WasmQueryInstance; +} + +interface WasmQueryInstance { + matches(node: WasmNode): readonly { readonly captures: readonly WasmCapture[] }[]; +} + +interface WasmLanguageStatic { + load(source: string | Uint8Array): Promise; +} + +/** + * Expose the (resolved) Language type for downstream consumers (the + * complexity phase) that build their own `Parser` instances against a + * specific grammar. + */ +export type WasmLanguage = unknown; + +/** + * Build a parser for `lang` directly against the vendored WASM. Used by + * the complexity phase, which re-parses on the main thread to walk for + * cyclomatic / nesting / Halstead. + */ +export async function buildParserForLanguage( + lang: LanguageId, +): Promise { + const runtime = await ensureWasmRuntime(); + if (runtime === undefined) return undefined; + const wasmPath = resolveGrammarWasmPath(lang); + if (wasmPath === undefined) return undefined; + const tsLanguage = await runtime.Language.load(wasmPath); + const ParserCtor = runtime.Parser as unknown as new () => WasmParserInstance; + const parser = new ParserCtor(); + parser.setLanguage(tsLanguage); + return parser; +} + +/** + * Attempt to open a WASM-backed parser for `lang`. Returns `null` when + * either the `web-tree-sitter` runtime or the grammar's bundled `.wasm` + * could not be resolved — callers log a debug note and skip that file. 
+ */ +export async function openWasmParser(lang: LanguageId): Promise<WasmParserHandle | null> { + const cached = wasmCache.get(lang); + if (cached !== undefined) return cached; + try { + const runtime = await ensureWasmRuntime(); + if (runtime === undefined) { + wasmCache.set(lang, null); + return null; + } + const wasmPath = resolveGrammarWasmPath(lang); + if (wasmPath === undefined) { + wasmCache.set(lang, null); + return null; + } + const tsLanguage = await runtime.Language.load(wasmPath); + const ParserCtor = runtime.Parser as unknown as new () => WasmParserInstance; + const parser = new ParserCtor(); + parser.setLanguage(tsLanguage); + + const handle: WasmParserHandle = { + language: lang, + parse: (source: string) => parser.parse(source), + runQuery: (queryText: string, source: string) => { + // Fresh Query per call so state stays clean between bodies. + // Query construction is cheap relative to the parse itself. + const q = new runtime.Query(tsLanguage, queryText); + const tree = parser.parse(source); + const out: WasmCapture[] = []; + for (const m of q.matches(tree.rootNode)) { + for (const cap of m.captures) out.push(cap); + } + return out; + }, + }; + wasmCache.set(lang, handle); + return handle; + } catch { + wasmCache.set(lang, null); + return null; + } +} + +/** + * Load the web-tree-sitter runtime on demand and initialize it. Returns + * `undefined` when the package isn't installed or the runtime refuses to + * init (sandboxed, missing WebAssembly, etc.). + * + * The runtime WASM (`web-tree-sitter.wasm`) is also vendored — we point + * Emscripten at it via `locateFile` so global installs don't have to find + * it inside a `node_modules` shape that may not exist. 
+ */ +export async function ensureWasmRuntime(): Promise<WasmRuntime | undefined> { + if (wasmRuntime?.initialized === true) return wasmRuntime; + let mod: { Parser: WasmParserCtor; Query: WasmQueryCtor; Language: WasmLanguageStatic }; + try { + mod = requireFn("web-tree-sitter") as { + Parser: WasmParserCtor; + Query: WasmQueryCtor; + Language: WasmLanguageStatic; + }; + } catch { + return undefined; + } + try { + if (typeof mod.Parser.init === "function") { + const runtimeWasm = path.resolve(VENDOR_WASMS_DIR, "web-tree-sitter.wasm"); + await mod.Parser.init({ + locateFile: () => runtimeWasm, + }); + } + } catch { + return undefined; + } + wasmRuntime = { + Parser: mod.Parser, + Query: mod.Query, + Language: mod.Language, + initialized: true, + }; + return wasmRuntime; +} + +/** + * Resolve the `.wasm` grammar asset for `lang` from the vendored + * directory. The vendor build script (`scripts/build-vendor-wasms.sh`) + * keeps this in sync with the grammar versions pinned in + * `pnpm-lock.yaml`; `prepublishOnly` (`scripts/verify-vendor-wasms.mjs`) + * fails the publish if any expected file is missing, mismatched, or has + * invalid WASM magic bytes. + * + * Returns `undefined` for languages with no tree-sitter grammar (cobol, + * which routes through the regex extractor). + */ +function resolveGrammarWasmPath(lang: LanguageId): string | undefined { + const fname = LANGUAGE_WASM_FILES[lang]; + if (fname === undefined) return undefined; + return path.resolve(VENDOR_WASMS_DIR, fname); +} + +/** + * LanguageId → filename in `vendor/wasms/`. The C# grammar lives at + * `tree-sitter-c_sharp.wasm` (underscore, not hyphen) and PHP uses the + * `php_only` variant (pure PHP, no HTML template injection) to match + * the prior native-loader behavior. Cobol is intentionally absent — it + * has no tree-sitter grammar and routes through the regex extractor. 
+ */ +const LANGUAGE_WASM_FILES: Partial<Record<LanguageId, string>> = { + typescript: "tree-sitter-typescript.wasm", + tsx: "tree-sitter-tsx.wasm", + javascript: "tree-sitter-javascript.wasm", + python: "tree-sitter-python.wasm", + go: "tree-sitter-go.wasm", + rust: "tree-sitter-rust.wasm", + java: "tree-sitter-java.wasm", + csharp: "tree-sitter-c_sharp.wasm", + c: "tree-sitter-c.wasm", + cpp: "tree-sitter-cpp.wasm", + ruby: "tree-sitter-ruby.wasm", + php: "tree-sitter-php_only.wasm", + kotlin: "tree-sitter-kotlin.wasm", + swift: "tree-sitter-swift.wasm", + dart: "tree-sitter-dart.wasm", +}; + +/** + * Test hook: clear the per-process WASM parser cache. Never call in + * production paths — it would force a re-init of every grammar. + */ +export function _resetWasmCacheForTests(): void { + wasmCache.clear(); + wasmRuntime = undefined; +} + +/** + * Test hook: expose the grammar-path resolver so unit tests can assert + * the LanguageId-to-vendor-file mapping is exhaustive. Not part of the + * public API — callers in production paths must go through + * `openWasmParser`. 
+ */ +export function _resolveGrammarWasmPathForTests(lang: LanguageId): string | undefined { + return resolveGrammarWasmPath(lang); +} diff --git a/packages/ingestion/src/pipeline/phases/complexity.ts b/packages/ingestion/src/pipeline/phases/complexity.ts index 23bb078d..8c2943ba 100644 --- a/packages/ingestion/src/pipeline/phases/complexity.ts +++ b/packages/ingestion/src/pipeline/phases/complexity.ts @@ -38,10 +38,9 @@ */ import { promises as fs } from "node:fs"; -import { createRequire } from "node:module"; import type { GraphNode, NodeId, NodeKind } from "@opencodehub/core-types"; -import { loadGrammar } from "../../parse/grammar-registry.js"; import type { LanguageId } from "../../parse/types.js"; +import { buildParserForLanguage, type WasmNode, type WasmTree } from "../../parse/wasm-runtime.js"; import type { ExtractedDefinition } from "../../providers/extraction-types.js"; import { getProvider } from "../../providers/registry.js"; import type { PipelineContext, PipelinePhase } from "../types.js"; @@ -73,64 +72,33 @@ export const complexityPhase: PipelinePhase = { }, }; -// -------- module-local tree-sitter shim (main-thread, one parser per lang) -- +// -------- module-local web-tree-sitter parsers (main-thread, one per lang) -- -const requireFn = createRequire(import.meta.url); +/** + * Per-language WASM parser cache. Lazily built on first use against the + * vendored grammar `.wasm`. Construction is async (web-tree-sitter's + * Language.load is async), so we cache the *parser instance* — never + * the module — and key by LanguageId. + * + * `null` means "tried and the WASM runtime / grammar could not be + * resolved on this host" — we don't retry forever. Distinct from + * `undefined` (cache miss). 
+ */ +const parserCache = new Map(); -interface TsPoint { - readonly row: number; - readonly column: number; -} -interface TsNode { - readonly type: string; - readonly startPosition: TsPoint; - readonly endPosition: TsPoint; - readonly childCount: number; - readonly namedChildCount: number; - child(i: number): TsNode | null; - namedChild(i: number): TsNode | null; - childForFieldName?(name: string): TsNode | null; - readonly text: string; -} -interface TsTree { - readonly rootNode: TsNode; -} -interface TsParser { - setLanguage(lang: unknown): void; - parse(source: string): TsTree; +interface WasmParserLike { + parse(source: string): WasmTree | null; } -interface TsModule { - new (): TsParser; -} - -const parserCache = new Map(); -let tsModuleCached: TsModule | undefined; -let warnedComplexityDegraded = false; -function getTsModule(): TsModule | undefined { - if (tsModuleCached !== undefined) return tsModuleCached; - try { - tsModuleCached = requireFn("tree-sitter") as TsModule; - return tsModuleCached; - } catch { - if (!warnedComplexityDegraded) { - warnedComplexityDegraded = true; - process.stderr.write( - "[complexity] tree-sitter unavailable — complexity metrics degraded (set OCH_NATIVE_PARSER=1 on Node 22 to enable)\n", - ); - } - return undefined; - } -} - -async function getParser(lang: LanguageId): Promise { +async function getParser(lang: LanguageId): Promise { const cached = parserCache.get(lang); + if (cached === null) return undefined; if (cached !== undefined) return cached; - const TS = getTsModule(); - if (TS === undefined) return undefined; - const handle = await loadGrammar(lang); - const parser = new TS(); - parser.setLanguage(handle.tsLanguage); + const parser = (await buildParserForLanguage(lang)) as WasmParserLike | undefined; + if (parser === undefined) { + parserCache.set(lang, null); + return undefined; + } parserCache.set(lang, parser); return parser; } @@ -367,7 +335,7 @@ const LINE_COMMENT_PREFIX: Partial> = { // -------- traversal 
primitives --------------------------------------------- /** Pre-order iterator over a tree-sitter subtree. */ -function* walk(node: TsNode): IterableIterator { +function* walk(node: WasmNode): IterableIterator { yield node; const n = node.childCount; for (let i = 0; i < n; i++) { @@ -376,13 +344,13 @@ function* walk(node: TsNode): IterableIterator { } } -function countDecisionsIn(body: TsNode, lang: LanguageId): number { +function countDecisionsIn(body: WasmNode, lang: LanguageId): number { const decisions = DECISION_NODE_TYPES[lang]; const definitions = definitionTypesFor(lang); if (decisions === undefined || definitions === undefined) return 0; let count = 0; - const stack: { node: TsNode; skip: boolean }[] = [{ node: body, skip: false }]; + const stack: { node: WasmNode; skip: boolean }[] = [{ node: body, skip: false }]; while (stack.length > 0) { const frame = stack.pop(); if (frame === undefined) break; @@ -412,7 +380,7 @@ function countDecisionsIn(body: TsNode, lang: LanguageId): number { * in the TS/JS/Go/Rust/Java/C# grammars; `boolean_operator` (Python) is * structurally already boolean so every instance counts. */ -function contributesToCyclomatic(node: TsNode, lang: LanguageId): boolean { +function contributesToCyclomatic(node: WasmNode, lang: LanguageId): boolean { if (node.type !== "binary_expression") return true; // Find operator child; child(1) is typically the operator token. 
for (let i = 0; i < node.childCount; i++) { @@ -427,7 +395,7 @@ function contributesToCyclomatic(node: TsNode, lang: LanguageId): boolean { return false; } -function maxNestingIn(body: TsNode, lang: LanguageId): number { +function maxNestingIn(body: WasmNode, lang: LanguageId): number { const nesting = NESTING_NODE_TYPES[lang]; const definitions = definitionTypesFor(lang); if (nesting === undefined || definitions === undefined) return 0; @@ -435,7 +403,7 @@ function maxNestingIn(body: TsNode, lang: LanguageId): number { // Recursive walker with an explicit stack to avoid stack-overflow on huge // functions. Entries: (node, currentDepth, skipSubtree). - const stack: { node: TsNode; depth: number; skip: boolean }[] = [ + const stack: { node: WasmNode; depth: number; skip: boolean }[] = [ { node: body, depth: 0, skip: false }, ]; while (stack.length > 0) { @@ -576,9 +544,19 @@ async function runComplexity( continue; } - let tree: TsTree; + let tree: WasmTree; try { - tree = parser.parse(sourceText); + const parsed = parser.parse(sourceText); + if (parsed === null) { + ctx.onProgress?.({ + phase: COMPLEXITY_PHASE_NAME, + kind: "warn", + message: `complexity: parser returned null tree for ${filePath}`, + }); + skipped += callableDefs.length; + continue; + } + tree = parsed; } catch (err) { ctx.onProgress?.({ phase: COMPLEXITY_PHASE_NAME, @@ -648,8 +626,8 @@ async function runComplexity( return { symbolsAnnotated: annotated, skipped }; } -function collectDefinitionNodes(root: TsNode, defTypes: ReadonlySet): TsNode[] { - const out: TsNode[] = []; +function collectDefinitionNodes(root: WasmNode, defTypes: ReadonlySet): WasmNode[] { + const out: WasmNode[] = []; for (const n of walk(root)) { if (defTypes.has(n.type)) out.push(n); } @@ -661,8 +639,11 @@ function collectDefinitionNodes(root: TsNode, defTypes: ReadonlySet): Ts * `startPosition.row` is 0-indexed; ExtractedDefinition.startLine is 1-indexed. * We compare the 1-indexed form on both sides. 
*/ -function matchSubtree(candidates: readonly TsNode[], def: ExtractedDefinition): TsNode | undefined { - let best: TsNode | undefined; +function matchSubtree( + candidates: readonly WasmNode[], + def: ExtractedDefinition, +): WasmNode | undefined { + let best: WasmNode | undefined; let bestRangeWidth = Number.POSITIVE_INFINITY; for (const c of candidates) { const cStart = c.startPosition.row + 1; @@ -679,7 +660,7 @@ function matchSubtree(candidates: readonly TsNode[], def: ExtractedDefinition): } /** Pick the body node when the function itself is a declaration shell. */ -function selectBody(def: TsNode): TsNode | undefined { +function selectBody(def: WasmNode): WasmNode | undefined { if (def.childForFieldName !== undefined) { const named = def.childForFieldName("body"); if (named !== null && named !== undefined) return named; @@ -773,7 +754,7 @@ function withComplexity( * Returns `undefined` when the provider did not declare * `halsteadOperatorKinds` or when the body contains no countable tokens. */ -function computeHalsteadVolume(body: TsNode, lang: LanguageId): number | undefined { +function computeHalsteadVolume(body: WasmNode, lang: LanguageId): number | undefined { const operators = halsteadOperatorsFor(lang); if (operators === undefined) return undefined; const definitions = definitionTypesFor(lang); @@ -785,7 +766,7 @@ function computeHalsteadVolume(body: TsNode, lang: LanguageId): number | undefin // Iterative walk; avoids stack overflow on very large functions. We do // not descend into nested function/method definitions — their tokens // belong to their own volume computation. 
- const stack: { node: TsNode; skip: boolean }[] = [{ node: body, skip: false }]; + const stack: { node: WasmNode; skip: boolean }[] = [{ node: body, skip: false }]; while (stack.length > 0) { const frame = stack.pop(); if (frame === undefined) break; diff --git a/packages/ingestion/vendor/wasms/README.md b/packages/ingestion/vendor/wasms/README.md index 8d86a65e..fde15194 100644 --- a/packages/ingestion/vendor/wasms/README.md +++ b/packages/ingestion/vendor/wasms/README.md @@ -1,16 +1,21 @@ # Vendored tree-sitter WASM grammars -These `.wasm` grammar files are committed to the repo because the upstream -`tree-sitter-{kotlin,swift,dart}` npm packages ship **only** native -(`.node`) bindings — no `.wasm` asset — and the shared +These `.wasm` grammar files are committed to the repo because OpenCodeHub +runs `web-tree-sitter` as the only parse runtime (see ADR 0015) — there +is no native fallback, so every supported language must have a `.wasm` +asset that loads cleanly under `web-tree-sitter@0.26+`. Some upstream +`tree-sitter-` npm packages ship only native (`.node`) bindings, +and the shared [`tree-sitter-wasms`](https://www.npmjs.com/package/tree-sitter-wasms) catalog ships WASMs built with tree-sitter-cli 0.20.x that use the legacy -`dylink` section format incompatible with `web-tree-sitter@0.26+` (which -hard-requires the standardized `dylink.0` section). +`dylink` section format incompatible with `web-tree-sitter@0.26+` +(which hard-requires the standardized `dylink.0` section). Vendoring +gives us one consistent build, one consistent format, one source of +truth. The WASMs under this directory are built from the **same grammar source -commits pinned in `packages/ingestion/package.json`**, so there is zero -grammar-version drift between native and WASM runtimes. +commits pinned in `packages/ingestion/package.json`**, so the runtime +parser tree is exactly what the grammar versions describe. 
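The `dylink` versus `dylink.0` distinction described above can be checked without loading a grammar. The following is a minimal sketch, not the repository's `verify-vendor-wasms.mjs`: the function names are hypothetical, and it assumes the dynamic-linking custom section is the first section after the 8-byte WebAssembly header.

```typescript
// Hypothetical helpers for illustration only; none of these names exist in the repo.

// WebAssembly binary header: magic "\0asm" followed by version 1 (little-endian).
const WASM_HEADER = [0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00];

function hasWasmHeader(bytes: Uint8Array): boolean {
  return bytes.length >= 8 && WASM_HEADER.every((b, i) => bytes[i] === b);
}

// Decode an unsigned 32-bit LEB128 integer starting at `offset`.
function readU32Leb128(bytes: Uint8Array, offset: number): { value: number; next: number } {
  let value = 0;
  let pos = offset;
  for (let shift = 0; shift <= 28; shift += 7) {
    const byte = bytes[pos];
    if (byte === undefined) throw new Error("truncated LEB128");
    pos += 1;
    value |= (byte & 0x7f) << shift;
    if ((byte & 0x80) === 0) return { value: value >>> 0, next: pos };
  }
  throw new Error("LEB128 longer than 5 bytes");
}

// Return the name of the first section if it is a custom section (id 0),
// e.g. "dylink.0" for the standardized format or "dylink" for the legacy one.
function firstCustomSectionName(bytes: Uint8Array): string | undefined {
  if (!hasWasmHeader(bytes)) return undefined;
  if (bytes[8] !== 0) return undefined; // first section is not a custom section
  const size = readU32Leb128(bytes, 9); // section payload size (unused beyond skipping it)
  const nameLen = readU32Leb128(bytes, size.next);
  const end = nameLen.next + nameLen.value;
  if (end > bytes.length) return undefined; // name runs past the end of the buffer
  let name = "";
  for (let i = nameLen.next; i < end; i++) name += String.fromCharCode(bytes[i] ?? 0);
  return name;
}
```

A caller would read a vendored file into a `Uint8Array` and treat any result other than `"dylink.0"` as a blob that needs rebuilding.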
## Files @@ -34,8 +39,10 @@ local `emcc` install, plus `tree-sitter-cli` (installed as part of bash scripts/build-vendor-wasms.sh ``` -Rebuild when you bump any of the three grammar versions in -`packages/ingestion/package.json`. +Rebuild when you bump any vendored grammar version in +`packages/ingestion/package.json`. Re-vendoring uses `pnpm dlx` to fetch +the grammar source ad-hoc — none of the grammar packages need to remain +in `dependencies` or `devDependencies`. ## Why not build at install time? diff --git a/packages/ingestion/vendor/wasms/manifest.json b/packages/ingestion/vendor/wasms/manifest.json new file mode 100644 index 00000000..df9a469f --- /dev/null +++ b/packages/ingestion/vendor/wasms/manifest.json @@ -0,0 +1,22 @@ +{ + "schema": "opencodehub.vendor-wasms.v1", + "description": "Versions the .wasm files in this directory were built/copied from. Verified at prepublish.", + "grammars": { + "tree-sitter": "0.25.0", + "tree-sitter-typescript": "0.23.2", + "tree-sitter-javascript": "0.25.0", + "tree-sitter-python": "0.25.0", + "tree-sitter-go": "0.25.0", + "tree-sitter-rust": "0.24.0", + "tree-sitter-java": "0.23.5", + "tree-sitter-c-sharp": "0.23.5", + "tree-sitter-c": "0.24.1", + "tree-sitter-cpp": "0.23.4", + "tree-sitter-ruby": "0.23.1", + "tree-sitter-php": "0.24.2", + "tree-sitter-kotlin": "0.3.8", + "tree-sitter-swift": "0.7.1", + "web-tree-sitter": "0.26.8", + "tree-sitter-dart": "vendored-historically" + } +} diff --git a/packages/ingestion/vendor/wasms/tree-sitter-c.wasm b/packages/ingestion/vendor/wasms/tree-sitter-c.wasm new file mode 100644 index 00000000..9d4afad2 Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-c.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-c_sharp.wasm b/packages/ingestion/vendor/wasms/tree-sitter-c_sharp.wasm new file mode 100644 index 00000000..bddfd5c8 Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-c_sharp.wasm differ diff --git 
a/packages/ingestion/vendor/wasms/tree-sitter-cpp.wasm b/packages/ingestion/vendor/wasms/tree-sitter-cpp.wasm new file mode 100644 index 00000000..2e8cf9b1 Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-cpp.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-go.wasm b/packages/ingestion/vendor/wasms/tree-sitter-go.wasm new file mode 100644 index 00000000..71a01bd8 Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-go.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-java.wasm b/packages/ingestion/vendor/wasms/tree-sitter-java.wasm new file mode 100644 index 00000000..68a0c1f3 Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-java.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-javascript.wasm b/packages/ingestion/vendor/wasms/tree-sitter-javascript.wasm new file mode 100644 index 00000000..c4b0915f Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-javascript.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-php_only.wasm b/packages/ingestion/vendor/wasms/tree-sitter-php_only.wasm new file mode 100644 index 00000000..afcf17a5 Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-php_only.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-python.wasm b/packages/ingestion/vendor/wasms/tree-sitter-python.wasm new file mode 100644 index 00000000..dc9de36f Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-python.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-ruby.wasm b/packages/ingestion/vendor/wasms/tree-sitter-ruby.wasm new file mode 100644 index 00000000..c7a619c9 Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-ruby.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-rust.wasm b/packages/ingestion/vendor/wasms/tree-sitter-rust.wasm new file mode 100644 index 00000000..b7e0c4b3 
Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-rust.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-tsx.wasm b/packages/ingestion/vendor/wasms/tree-sitter-tsx.wasm new file mode 100644 index 00000000..563fcf27 Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-tsx.wasm differ diff --git a/packages/ingestion/vendor/wasms/tree-sitter-typescript.wasm b/packages/ingestion/vendor/wasms/tree-sitter-typescript.wasm new file mode 100644 index 00000000..6a5d7f1b Binary files /dev/null and b/packages/ingestion/vendor/wasms/tree-sitter-typescript.wasm differ diff --git a/packages/ingestion/vendor/wasms/web-tree-sitter.wasm b/packages/ingestion/vendor/wasms/web-tree-sitter.wasm new file mode 100755 index 00000000..b59b2df4 Binary files /dev/null and b/packages/ingestion/vendor/wasms/web-tree-sitter.wasm differ diff --git a/planning/bulletproof-npm-install/explorer-architectural.md b/planning/bulletproof-npm-install/explorer-architectural.md new file mode 100644 index 00000000..771153a9 --- /dev/null +++ b/planning/bulletproof-npm-install/explorer-architectural.md @@ -0,0 +1,818 @@ +# Explorer: Architectural — Bulletproof npm global install for @opencodehub/cli + +**Status:** COMPLETE +**Vector:** Architectural +**Last updated:** 2026-05-15T03:04:05+00:00 + +--- + +## Protocol + +The output file is the source of truth. Section-by-section. Cite paths. +Vector bias: **clean boundaries, smallest future-change cost**. If consolidation +is the right answer, recommend deletion. + +--- + +## 1. Problem Framing + +The published `@opencodehub/cli` ships a runtime path that doesn't match its +distribution model. The runtime is "WASM by default, native opt-in". The +package graph is "13 native grammars + native `tree-sitter` core as +runtime `dependencies`". That mismatch is the real bug. 
The 504 from +GitHub on `tree-sitter-cli`'s postinstall and the ERESOLVE peer warnings +are surface symptoms of a deeper architectural drift: we have **two +parser backends in the published surface but only one path the user is +expected to take**, and we make every install pay the cost of the +backend it isn't going to use. + +The architectural fix is not "make the postinstall more reliable"; it is +"published surface declares only what the published runtime actually +loads". + +## 2. Chosen Approach + +**Single-path WASM, with native tree-sitter quarantined in the dev +workspace.** Concretely, three decisions sit on top of one another: + +1. **Split parser concerns into two packages.** Keep `@opencodehub/ingestion` + as the WASM-only parser. Move every native-tree-sitter consumer + (currently: `complexity.ts`, the opt-in path in `parse-worker.ts`) onto + the WASM API. Native `tree-sitter` and the 13 native grammars become + `devDependencies` of a new private workspace package + `@opencodehub/parser-native-bench` (or stay in the workspace root) used + only for the parity test and dev benchmarking — never published. + +2. **WASM grammars vendored at publish time.** The `vendor/wasms/` pattern + already used for kotlin/swift/dart extends to all 15 grammars. The + ingestion tarball ships a single `vendor/wasms/*.wasm` directory, + resolved at runtime by file path (no `require.resolve(...)` against a + sibling package). Total vendored size: ~24 MB compressed (see Decision + D for sizes). Tarball-bound, network-free, postinstall-free. + +3. **Delete the `--native-parser` opt-in and `OCH_NATIVE_PARSER` env var + from the runtime.** The two-path branch in `parse-worker.ts:156-159` + collapses to one. The complexity phase ports its tree walker onto the + `web-tree-sitter` `Tree` it already gets from `parse.ts`, so the parse + tree is built once per file instead of twice (the current code parses + each file again natively just for complexity). 
+ +The shape is "one parser path, one binary format (WASM), one set of +grammar artifacts vendored in the tarball". That is the cheapest shape +to evolve: when `tree-sitter-foo@0.30` ships, we bump one dep, rebuild +one set of WASM blobs, and the cli inherits it without touching +distribution-side code. + +**The speed-first answer would have been different.** The speed-first +plan keeps native tree-sitter as a perf escape hatch and just moves it +to `optionalDependencies` so a failing postinstall doesn't fail the +install. That works, ships in an afternoon, and preserves a 1.5-2x +speedup on Node 22 dev boxes. I'm explicitly *not* recommending it +because it preserves three parser implementations (native, WASM, and the +complexity-only third walker), three test matrices, and three failure +modes. The user has license for refactor and explicitly said "go all in +on wasm if it has the same support and if it's less confusing". WASM has +parity (asserted by `wasm-parity.test.ts` already in the codebase), so +the long-run answer is consolidation. + +## 3. Key Decisions + +### Decision A — Tree-sitter parsing path consolidation + +**Decision:** Delete the native parser path entirely from the runtime. +Collapse to WASM-only. Remove `OCH_NATIVE_PARSER` and `--native-parser` +from the published surface; keep them as deprecated aliases for one +release, then delete. + +**Files changed:** + +- `packages/ingestion/src/parse/parse-worker.ts:51-54` — delete + `forceNativeOpt()`. +- `packages/ingestion/src/parse/parse-worker.ts:64-78` — delete the + three-branch startup-warning. Replace with a single line: `[parse-worker] + using web-tree-sitter (WASM) runtime`. +- `packages/ingestion/src/parse/parse-worker.ts:156-159, 162-191` — + delete `runNative` and the dispatch. The function becomes: + `async function runParse(language, content) { return runWasm(language, + content.toString('utf8')); }`. 
+- `packages/ingestion/src/parse/parse-worker.ts:222-245, 265-307` — + delete `getOrBuildParser`, `getOrBuildQuery`, the entire native shim + type set. The WASM path doesn't use them. +- `packages/ingestion/src/parse/wasm-fallback.ts:41-67` — delete + `isNativeAvailable` and `resetNativeAvailabilityCache`. There's no + caller anymore. Rename the file from `wasm-fallback.ts` to + `wasm-runtime.ts` — it isn't a fallback when it's the only path. +- `packages/ingestion/src/parse/index.ts:18` — drop the + `isNativeAvailable` re-export. +- `packages/cli/src/index.ts:88-91, 102-107` — delete the `--native-parser` + option and the env var write. Soft-deprecate by accepting the flag with + a stderr deprecation warning for one release. +- `packages/ingestion/src/parse/grammar-registry.ts:40-329` — + **rewrite around WASM-only.** No more `requireFn` of native grammar + modules; `loadLanguageObject(lang)` becomes "find the .wasm path, + return it" and `loadGrammar` returns a path + queryText, not a + `tsLanguage` opaque-pointer (or alternatively, returns the loaded + `web-tree-sitter` Language object — `runtime.Language.load(path)`). + This is the largest single edit in the plan: the registry + currently has 13 per-language case branches, each calling `requireFn` + with format quirks (`tree-sitter-typescript.typescript`, + `tree-sitter-php.php_only`, c-sharp ESM default, dart's "we never + shipped this natively" throw). The WASM path already handles these + uniformly via `wasm-fallback.ts:249-303` — the module mapping is + declarative. We move the entire registry to that declarative shape. + +**Tradeoff accepted:** ~30-40% slower parse phase on Node 22 dev boxes +where the developer would have set `OCH_NATIVE_PARSER=1`. Per-process +cold start gets ~200ms cheaper because we no longer resolve 13 native +modules eagerly. We lose the perf ceiling but gain a single, reasoning- +friendly code path. 
The architectural cost of keeping native is paid +forever (every grammar bump, every Node major, every install env); the +perf benefit is bounded. + +**Why this is the right architectural call:** The current branch has +five subtly-different equivalence classes — (Node 22, native available, +opt-in set), (Node 22, native available, opt-in unset), (Node 22, native +fails to build), (Node 24, opt-in set, native unsupported), (Node 24, no +opt-in). That's 5 paths to test, 5 places where a grammar version-skew +bug can hide. The parity test (`wasm-parity.test.ts`) keeps the two +backends in lockstep, which is itself maintenance work. Collapsing to +WASM is reversible if WASM perf becomes a real blocker (we restore from +git history, scoped to the `parse-worker.ts` runNative branch and the +grammar registry's native arm). Until then, every line we delete is a +line the next maintainer doesn't have to understand. + +### Decision B — Grammar coverage audit + +**Verified, all 15 languages have a WASM artifact reachable today:** + +| Language | npm package | Native ABI | WASM source | Size (kB) | +|----------|-------------|-----------:|-------------|----------:| +| typescript | tree-sitter-typescript@0.23.2 | 0.21.x | bundled in pkg `tree-sitter-typescript.wasm` | 1380 | +| tsx | tree-sitter-typescript@0.23.2 | 0.21.x | bundled in pkg `tree-sitter-tsx.wasm` | 1411 | +| javascript | tree-sitter-javascript@0.25.0 | 0.25.x | bundled in pkg | 402 | +| python | tree-sitter-python@0.25.0 | 0.25.x | bundled in pkg | 447 | +| go | tree-sitter-go@0.25.0 | 0.25.x | bundled in pkg | 212 | +| rust | tree-sitter-rust@0.24.0 | 0.24.x | bundled in pkg | 1077 | +| java | tree-sitter-java@0.23.5 | 0.23.x | bundled in pkg | 405 | +| csharp | tree-sitter-c-sharp@0.23.5 | 0.23.x | bundled in pkg `tree-sitter-c_sharp.wasm` | 5225 | +| c | tree-sitter-c@0.24.1 | 0.24.x | bundled in pkg | 611 | +| cpp | tree-sitter-cpp@0.23.4 | 0.23.x | bundled in pkg | 3354 | +| ruby | tree-sitter-ruby@0.23.1 | 
0.23.x | bundled in pkg | 2057 | +| php | tree-sitter-php@0.24.2 | 0.24.x | bundled in pkg `tree-sitter-php_only.wasm` | 979 | +| kotlin | tree-sitter-kotlin@0.3.8 | none on npm | vendored `vendor/wasms/tree-sitter-kotlin.wasm` | 4096 | +| swift | tree-sitter-swift@0.7.1 | bundled but builds postinstall | vendored `vendor/wasms/tree-sitter-swift.wasm` | 3300 | +| dart | (no npm pkg) | n/a | vendored `vendor/wasms/tree-sitter-dart.wasm` | 995 | +| cobol | n/a (regex provider) | n/a | n/a | 0 | + +**Decision:** Vendor every grammar's WASM into `packages/ingestion/vendor/wasms/`, +not just the three that have no npm WASM today. Bundle 15 .wasm blobs, +plus `web-tree-sitter`'s own runtime wasm (~196 kB for the production +build at `node_modules/web-tree-sitter/web-tree-sitter.wasm`). Total ~25 +MB extra in the published `@opencodehub/ingestion` tarball. + +**Why vendor everything, not just the rare ones:** The current two-stage +cascade in `wasm-fallback.ts:238-303` (per-grammar package → vendored +fallback) is fragile distribution-wise. It assumes the user's npm/pnpm +hoisted the per-grammar packages somewhere `require.resolve` can find +them. That works inside this monorepo's `node_modules`. It works less +well when `@opencodehub/cli` is installed globally and only its declared +deps are present — yes, the grammars are still listed as deps today, +but the *whole point of the refactor* is to remove them from the +published deps. If we drop `tree-sitter-typescript` from +`@opencodehub/ingestion`'s dependencies (which we should, per Decision E), +we have to vendor its .wasm — there's no other path to it at runtime. + +So the architectural call is: **vendoring is the boundary**. Either we +vendor every WASM, or we keep listing the grammar packages just to +reach into `node_modules/.../.wasm`. The latter is a pun — it +keeps a runtime dep around for one purpose (a static asset) while the +package's actual runtime code (the .node addon) is dead weight. 
Vendoring +collapses the dependency graph cleanly and makes the tarball +self-contained. + +**Tradeoff accepted:** The published `@opencodehub/ingestion` tarball +grows from ~5 MB today to ~28 MB. Global `npm install -g @opencodehub/cli` +download time goes up by ~3-5 seconds on a typical home connection. +Build pipeline gains a "rebuild WASMs from grammar pins" step that has +to run before publish — but `scripts/build-vendor-wasms.sh` already +proves this is tractable. + +**Why this is reversible:** If tarball size becomes a real complaint, we +publish `@opencodehub/parsers-wasm` as a separate package keyed to the +ingestion version. The runtime code (`wasm-fallback.ts` / +`wasm-runtime.ts`) already abstracts WASM-path resolution behind a +single function; swapping "look in `/vendor/wasms/`" for "look in +`/wasms/`" is a one-line change. Don't do that on day one — the +boundary is right at the package boundary, ship it self-contained, +split out only when there's a forcing function. + +### Decision C — Postinstall purge + +**Rule:** No published runtime dep may have a postinstall that does +network IO or compiles native code on the user's machine. + +**Postinstall offenders in the current dep tree of @opencodehub/cli (via +@opencodehub/ingestion):** + +1. `tree-sitter-cli@0.23.2` — downloads platform binary from + `github.com/tree-sitter/tree-sitter/releases`. Pulled in transitively + by `tree-sitter-swift@0.7.1`. **HARD ERROR root cause** — drop by + removing `tree-sitter-swift` from runtime deps. +2. `tree-sitter` (core, 0.25.0) — runs `node-gyp rebuild` if no prebuild + is found. Has prebuilds for common platforms via `prebuild-install`. + Drop by moving `tree-sitter` to `devDependencies`: its runtime + consumers (the opt-in native parse path and the complexity-phase + native walk) go away under Decisions A and G, and the parity test + that still needs it is dev-only. +3. Every `tree-sitter-` package — runs `node-gyp rebuild` if the + prebuild for the user's platform doesn't exist. 
Drop all 13 by + purging from runtime deps (Decision E). +4. `onnxruntime-node` — downloads CUDA EP (~400MB) unless + `npm_config_onnxruntime_node_install_cuda=skip` is set. Already + handled in repo `.npmrc` per the workspace config; verify it ships in + the published context too. (This is `@opencodehub/embedder`'s + problem; in scope here only because the same architectural rule + applies — runtime deps must not phone home at install time.) +5. `@duckdb/node-api` — has prebuilds, ships fine. Keep. + +**Decision:** Move 14 packages (`tree-sitter` + 13 grammars) from +`@opencodehub/ingestion`'s `dependencies` to its `devDependencies`. They +remain available in the workspace for the parity test (Decision J) and +for `scripts/build-vendor-wasms.sh` to run `tree-sitter-cli build +--wasm` against the same source the npm pin points at. + +**Architectural note:** `tree-sitter-cli` itself is fine in +`devDependencies` — it never reaches user installs. The fact that it's +currently transitive through `tree-sitter-swift` is exactly the wrong +direction of dependency flow. Build tools should be at the dev edge, +not at the runtime edge. + +### Decision D — Build-time WASM vendoring strategy + +**Decision:** Vendor all 15 WASMs at publish time. The ingestion package +ships them in its tarball. + +**Mechanics:** + +- Extend `scripts/build-vendor-wasms.sh` to build all 15 .wasm artifacts + (currently builds 3). The script reads the grammar source out of the + workspace's `node_modules/.pnpm/tree-sitter-` (already proves + this for kotlin/swift/dart) and runs `tree-sitter build --wasm` in a + temp dir, writing to `packages/ingestion/vendor/wasms/`. For + per-grammar packages that already ship a `.wasm` (12 of them — see + Decision B table), the script can either rebuild from source for + consistency, or copy the shipped .wasm out of node_modules. 
**Pick: + copy.** Rebuilding adds 30-60s per grammar and reproduces upstream's + artifact; copying is a deterministic "use the same bytes the grammar + package shipped". Keep `tree-sitter build --wasm` for the three that + have no shipped WASM (kotlin, swift, dart). +- Add a `prepublishOnly` script in `packages/ingestion/package.json` + that runs the vendor builder. This is the single guarantee: the + tarball can't be published without the .wasms being current. CI gates + on it via the `pnpm publish` flow. +- Add `vendor/wasms/**` to the `files` array (already there at + `packages/ingestion/package.json:33`). +- Vendor `web-tree-sitter`'s own runtime wasm too. The package ships it + inside `node_modules/web-tree-sitter/`, but we don't want a runtime + `require.resolve('web-tree-sitter/...')` for an asset the user + doesn't otherwise need. Copy it into `vendor/wasms/web-tree-sitter.wasm` + and pass `Parser.init({ locateFile: () => })`. + +**Tradeoff accepted:** `prepublishOnly` adds a hard build dependency +(docker/podman/finch with emcc, OR a local emcc) for any maintainer +running `pnpm publish` from a clean checkout. We could relax this by +caching the WASM artifacts in git LFS or as a CI build product. **First +pass:** keep them committed to the repo (they already are for k/s/d). +The repo grows by ~24 MB. Reversible: move to LFS later if the repo +weight becomes a complaint. + +**Why not "download at runtime, first analyze":** Same failure mode as +postinstall. The user's `codehub analyze` would now need network reach +to GitHub or wherever we host the WASMs. We just spent a paragraph +deleting the postinstall network-call problem; reintroducing it on first +use is the same architectural mistake one layer over. 
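
The copy-versus-build rule from the mechanics above can be sketched as a small planning function. This is an illustrative sketch only: the real builder is the shell script `scripts/build-vendor-wasms.sh`, and `GrammarPin`, `planVendorStep`, and `buildManifest` are hypothetical names. The manifest shape is likewise an assumption about what `vendor/wasms/manifest.json` might record, and `tree-sitter-go.wasm` is an assumed file name.

```typescript
// Hypothetical types; the plan does not define these.
type VendorAction = "copy" | "build";

interface GrammarPin {
  language: string;
  npmPackage: string | null; // null when no npm package exists (dart)
  version: string | null;
  shipsWasm: boolean;        // true when the npm package already bundles a .wasm
  wasmFile: string;          // file name under vendor/wasms/
}

interface ManifestEntry {
  file: string;
  source: string;            // "pkg@version", or a marker when there is no npm pin
  action: VendorAction;
}

// Copy the shipped artifact when there is one; otherwise build from source.
function planVendorStep(pin: GrammarPin): ManifestEntry {
  const action: VendorAction = pin.shipsWasm ? "copy" : "build";
  const source =
    pin.npmPackage !== null && pin.version !== null
      ? `${pin.npmPackage}@${pin.version}`
      : "no-npm-package";
  return { file: `vendor/wasms/${pin.wasmFile}`, source, action };
}

// Record one entry per language so a later check can compare pins.
function buildManifest(pins: GrammarPin[]): Record<string, ManifestEntry> {
  const manifest: Record<string, ManifestEntry> = {};
  for (const pin of pins) manifest[pin.language] = planVendorStep(pin);
  return manifest;
}

// Three pins taken from the Decision B table.
const examplePins: GrammarPin[] = [
  { language: "go", npmPackage: "tree-sitter-go", version: "0.25.0", shipsWasm: true, wasmFile: "tree-sitter-go.wasm" },
  { language: "kotlin", npmPackage: "tree-sitter-kotlin", version: "0.3.8", shipsWasm: false, wasmFile: "tree-sitter-kotlin.wasm" },
  { language: "dart", npmPackage: null, version: null, shipsWasm: false, wasmFile: "tree-sitter-dart.wasm" },
];
const exampleManifest = buildManifest(examplePins);
```

Keeping the decision in data rather than in per-grammar branches means a grammar that starts shipping its own `.wasm` upstream flips from "build" to "copy" by changing one field.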
+ +**Why not "peer package @opencodehub/parsers-wasm":** Splitting a peer +package makes sense if (a) multiple consumers want the same WASMs +without pulling all of ingestion, or (b) the WASM payload becomes large +enough that ingestion shouldn't carry it. Neither is true today. The +peer package is the right shape *later*, after we have a second +consumer; until then it's premature. + +### Decision E — `@opencodehub/cli` published surface + +**Current state of `packages/cli/package.json:38-58`:** 17 runtime +dependencies. Native tree-sitter and grammars come transitively through +`@opencodehub/ingestion`. The CLI itself doesn't list them. + +**Required changes — `packages/ingestion/package.json:40-75`:** + +| Before | After | Reason | +|--------|-------|--------| +| `tree-sitter@0.25.0` (deps) | `devDependencies` | Used only by complexity (Decision G) and parity test (Decision J). Both go to dev. | +| `tree-sitter-c@0.24.1` (deps) | drop from runtime; `devDependencies` for parity | WASM in vendor/ | +| `tree-sitter-c-sharp@0.23.5` (deps) | `devDependencies` | WASM in vendor/ | +| `tree-sitter-cpp@0.23.4` (deps) | `devDependencies` | WASM in vendor/ | +| `tree-sitter-go@0.25.0` (deps) | `devDependencies` | WASM in vendor/ | +| `tree-sitter-java@0.23.5` (deps) | `devDependencies` | WASM in vendor/ | +| `tree-sitter-javascript@0.25.0` (deps) | `devDependencies` | WASM in vendor/ | +| `tree-sitter-kotlin@0.3.8` (deps) | `devDependencies` | Source for build-vendor-wasms.sh | +| `tree-sitter-php@0.24.2` (deps) | `devDependencies` | WASM in vendor/ | +| `tree-sitter-python@0.25.0` (deps) | `devDependencies` | WASM in vendor/ | +| `tree-sitter-ruby@0.23.1` (deps) | `devDependencies` | WASM in vendor/ | +| `tree-sitter-rust@0.24.0` (deps) | `devDependencies` | WASM in vendor/ | +| `tree-sitter-swift@0.7.1` (deps) | `devDependencies` | Source for build-vendor-wasms.sh | +| `tree-sitter-typescript@0.23.2` (deps) | `devDependencies` | WASM in vendor/ | +| 
`web-tree-sitter@0.26.8` | keep in `dependencies` | Runtime parser host | + +After this change, the ERESOLVE peer warnings disappear because the +peer relationship between `tree-sitter@0.25` and the grammars' +`peerOptional tree-sitter@^0.21|0.22|0.23` is *not in the published +runtime graph* — the user's `npm install -g @opencodehub/cli` no longer +sees those packages. + +**The CLI itself (`packages/cli/package.json:38-58`)** needs no changes +— it depends on `@opencodehub/ingestion` and inherits the cleaner graph. + +**Tradeoff accepted:** Runtime no longer tolerates a "user supplies +their own native tree-sitter for speed" path. To use native, a developer +would need to clone the workspace and run from source. That's the +correct boundary for an opt-in dev-only feature. + +### Decision F — Multi-Node-installer compatibility matrix + +The published cli, after the above changes, has zero native build steps +in its install chain. The matrix is a verification surface, not a code +surface — the install path is the same on every row. + +**Test matrix (run pre-release in CI; smoke-run quarterly):** + +| OS | Arch | Node | Installer | Verifies | +|----|------|------|-----------|----------| +| Linux | x64 | 20.x | mise | engines satisfied; install succeeds; `codehub --help` runs | +| Linux | x64 | 22.x | mise | as above + `codehub analyze ` runs | +| Linux | x64 | 24.x | mise | as above | +| Linux | arm64 | 22.x | mise | proxy for Apple Silicon | +| Linux | x64 | 22.x | nvm | tilde-path resolution | +| macOS | arm64 | 22.x | Homebrew | libuv + brew prefix paths | +| macOS | arm64 | 22.x | nvm | $HOME/.nvm/versions/... | +| macOS | arm64 | 22.x | Volta | shim-based PATH | +| macOS | x64 | 22.x | nvm | Intel Mac smoke | + +**Engines field decision:** `packages/cli/package.json:80-82` declares +`>=22.0.0`. **Lower it to `>=20.0.0`.** WASM has no Node-version +constraint; the only reason engines was bumped was the native +tree-sitter ABI requiring a recent N-API. 
Once native is gone, we can +honestly support Node 20 LTS through Node 24. Reversible if Node 20 +hits an unrelated incompatibility. + +`packages/ingestion/package.json:105-107` matches. + +**One concrete failure mode to watch:** `web-tree-sitter` 0.26+ uses +top-level await in some code paths and requires its own .wasm runtime +to be loadable. Pass `Parser.init({ locateFile: () => fileURLToPath(new +URL("../../vendor/wasms/web-tree-sitter.wasm", import.meta.url)) })` to +guarantee resolution against the vendored copy — don't rely on the +default loader, which tries `fetch()` on web platforms and `fs` on Node +with platform-specific paths that have bitten us before +(`node_modules/.pnpm/...` vs. flat `node_modules/...`). One line in +`wasm-fallback.ts:194-220`'s `ensureWasmRuntime`. Architectural: the +runtime should know exactly where its assets live, not heuristically +search for them. + +### Decision G — Complexity phase resolution + +**Current state:** `packages/ingestion/src/pipeline/phases/complexity.ts:110-124` +calls `requireFn("tree-sitter")` and degrades gracefully when it fails. +On Node 24 default + Node 22 default, complexity metrics are silently +zero today. The complexity phase parses each file *again* (line 581: +`tree = parser.parse(sourceText)`) on top of what `parse.ts` already +parsed. + +**Decision: port complexity to WASM.** Two architectural sub-decisions: + +1. **Make the complexity walker source-format-agnostic.** The walker + only uses `node.type`, `node.childCount`, `node.child(i)`, + `node.startPosition`, `node.endPosition`, `node.text`, + `node.childForFieldName(name)`. Both native and `web-tree-sitter` + trees expose this surface (web-tree-sitter has the same Node API by + design — it's the upstream reference). The walker code in + `complexity.ts:370-509` already operates against an interface + (`TsNode`, lines 84-94) that matches both. The conversion is a swap + of the parser source, not a rewrite. + +2. 
**Stop double-parsing (follow-up).** The parse phase already builds a tree per + file. Ideally we would pipe that tree through to the complexity phase + as part of `ParseOutput`, instead of re-reading the file and + re-parsing. That is a non-trivial structural change because trees + are parser-tied objects that aren't structured-clone-safe across + worker boundaries. Two ways forward: + + - **Cheap option (do this):** Re-parse on the main thread, but use + WASM. The overhead is ~1.5x what native was; complexity is a + small phase (a few seconds on a 100k-LOC repo) so the absolute + hit is negligible. + - **Architectural option:** Move complexity *into* the parse worker + so each worker computes complexity on the tree it already has. This + is the right shape long-term — the complexity walker is per-file, + stateless, and trivially parallel — but it touches the + pipeline-phase contract (`PipelineContext`, `PipelinePhase`) and + is bigger than the install fix calls for. + + **Pick the cheap option for this work item; file the parse-fold + refactor as a follow-up.** Tradeoff: we accept a one-CPU-thread cost + on Node 24 that we don't strictly have to, in exchange for keeping + the change scoped. + +**Files changed:** + +- `complexity.ts:78, 106-124` — replace `requireFn("tree-sitter")` shim + with an `ensureWasmRuntime` import from `wasm-runtime.ts` (renamed + per Decision A). The `getParser(lang)` function builds a + `web-tree-sitter` `Parser` per language, cached. +- `complexity.ts:108-109, 116-119` — delete `warnedComplexityDegraded` + and the "set OCH_NATIVE_PARSER=1" message. WASM is reachable + unconditionally; complexity becomes a default-on capability instead + of a Node-22-with-opt-in capability. +- `complexity.ts:130-133` — `parser.setLanguage(handle.tsLanguage)` — + `handle.tsLanguage` becomes the `web-tree-sitter` Language object. + This is consistent with Decision A's reshape of `loadGrammar`. + +**Architectural win:** Complexity stops being silently degraded on the +default path. 
Today, a user running `npm install -g @opencodehub/cli` +on Node 24 gets a working `analyze` but *zero* complexity numbers — +this is a hidden quality-of-result regression. After this change, every +default install gets full complexity metrics. + +**Why I'm rejecting option (ii) "regex/AST-walker approximation":** +That answer optimizes for "make the complexity phase work without +tree-sitter at all". But tree-sitter is going to be there — we ship it. The +question isn't "how do we get rid of tree-sitter for complexity"; it's +"native or WASM tree-sitter". WASM is what we ship; use it. + +**Why I'm rejecting option (iii) "drop complexity from published cli":** +Complexity is a published-API feature. It powers the risk-trends MCP +tool and shows up in `verdict` blast-radius scoring. Removing it +breaks observable behavior. Not a refactor — a regression. + +### Decision H — Workspace publish hygiene + +**Current state:** `packages/ingestion/package.json:33` already lists +`vendor/wasms/**` in `files`. Good. No `prepack`/`prepare` scripts in +either ingestion or cli today. No `package-lock.json` published (the +workspace uses `pnpm-lock.yaml`, and we don't run `npm pack`). + +**Decisions:** + +1. **Add `prepublishOnly` to `packages/ingestion/package.json:35-39`:** + ```json + "prepublishOnly": "node scripts/verify-vendor-wasms.mjs" + ``` + The verify script asserts: (a) all 15 expected .wasm files exist in + `vendor/wasms/`, (b) each is valid WASM (magic bytes), (c) each + matches the grammar version pinned in the workspace's + `pnpm-lock.yaml` via a manifest file `vendor/wasms/manifest.json` + that the build script writes. **This is the core of the architectural + guarantee** — the tarball can't ship without the WASMs and the WASMs + can't drift from the grammar pins. One-shot architectural lever + that costs ~50 LOC of script. + +2. 
**Add a `pnpm publish` smoke step in CI:** run `pnpm pack` in + `packages/ingestion`, then `pnpm pack` in `packages/cli`, then `npm + install -g ` in a clean container, then `codehub + --help` and `codehub analyze tests/fixtures/multi-lang/`. This is + the architectural equivalent of an integration test for the + distribution boundary. CI only — gates the publish. + +3. **Verify `files` includes the vendor WASMs in the *built* path:** + `packages/ingestion/package.json:24-34` lists `dist/**` and + `vendor/wasms/**`. The runtime resolves WASMs via + `wasm-fallback.ts:35-39` (`path.resolve(here, "..", "..", + "vendor", "wasms")`), which walks from `dist/parse/` up to the + package root. Tarball layout preserves this since both `dist/` and + `vendor/` sit at package root. **Already correct.** + +4. **Drop `optionalDependencies.ts-morph` from + `packages/ingestion/package.json:85-87`** — out of scope for the + install fix but worth noting: `ts-morph` is heavy and the + "optional" claim should be audited separately. **Out of this + change's scope; flag for follow-up.** + +### Decision I — Lockfile & hoisting consequences + +The CLI doesn't ship a `package-lock.json` (good — it would override +the user's npm client's resolution). Moving 14 packages from runtime +deps to dev deps in `@opencodehub/ingestion` changes: + +1. **Hoisting:** Today the workspace's `node_modules/.pnpm/...` has + tree-sitter-* hoisted at the workspace root. After the change, + they remain there because they're still in `devDependencies` of the + workspace. **Workspace dev work is unaffected.** + +2. **Published install:** `npm install -g @opencodehub/cli` no longer + pulls them. The user's global `node_modules` shrinks by ~30 MB of + prebuilt-binary-tar-content + ~20 MB of source. **This is the win.** + +3. **Runtime resolution:** Today + `wasm-fallback.ts:249-303`'s `tryPerGrammarPackage` calls + `requireFn.resolve(\`${pkgName}/package.json\`)`. 
After the change, + the grammar packages are not in the user's install — those calls + return undefined, and we fall through to the vendored path. **This + makes the runtime cascade meaningless; collapse it.** Replace the + two-stage cascade with a single declarative table that maps every + `LanguageId` to a `vendor/wasms/.wasm` path. Single source of + truth for "where's the WASM for X". Files: `wasm-fallback.ts:222-303` + collapses from ~80 LOC to ~25 LOC. + +4. **No production code currently relies on a hoisted-native-module + side effect.** I checked: only `parse-worker.ts:165` and + `complexity.ts:113` `requireFn` native tree-sitter, both gated and + both removed by Decisions A and G. + +### Decision J — Migration / deprecation path for OCH_NATIVE_PARSER + +**Decision: deprecate immediately (flag accepted but ignored, with a +warning), delete in the next minor.** + +**Architectural justification:** The opt-in is undocumented to most +users (it's a power-user dev knob), the parity test asserts the two +paths produce equal output, and the only users who set it are +opencodehub maintainers running benchmarks. They can run from source +against the workspace's still-installed native tree-sitter. The +"keep flag, ignore value" intermediate state is the worst architectural +shape, because every reader of the cli source has to understand a flag +that does nothing, so we tolerate it for exactly one release. + +**Migration steps:** + +1. **In the PR that removes the runtime path** (Decision A), keep the + `--native-parser` flag and `OCH_NATIVE_PARSER` env var as no-ops + that emit a one-shot stderr deprecation: + ``` + [opencodehub] OCH_NATIVE_PARSER / --native-parser is deprecated; + the WASM parser is now the only runtime path. The flag is ignored + and will be removed in 0.5.0. + ``` +2. **One release later (`0.5.0`), delete the flag and env handling + entirely.** Update CLAUDE.md, CHANGELOG, all docs. +3. 
**Delete from docs:** + `packages/cli/README.md:79`, + `packages/docs/src/content/docs/guides/indexing-a-repo.md:130`, + `packages/docs/src/content/docs/guides/troubleshooting.md:27,80`, + `packages/docs/src/content/docs/architecture/parsing-and-resolution.md:25`, + `packages/docs/src/content/docs/architecture/adrs.md:126`, + `packages/docs/src/content/docs/reference/configuration.md:31,33`, + `packages/docs/src/content/docs/reference/languages.md:53,55`, + `packages/docs/src/content/docs/reference/cli.md:40`, + `packages/docs/src/content/docs/start-here/what-is-opencodehub.md:68`, + `packages/docs/src/content/docs/start-here/install.md:15,112`. +4. **Update CLAUDE.md** at repo root: the "Parse runtime — WASM default, + native opt-in" section becomes "Parse runtime — WASM-only, vendored + grammars". Drop the OCH_NATIVE_PARSER row. Add: "Native tree-sitter + is a workspace-only dev dependency used by the parity test + (`packages/ingestion/src/parse/wasm-parity.test.ts`); not shipped to + npm install consumers." +5. **Write ADR 0014 — "WASM-only parser at the npm-distributed boundary":** + captures the decision permanently. References this plan and the npm + 504 incident as the trigger. + +**Keep alive in dev:** + +- `wasm-parity.test.ts` still runs natively — it imports both `tree-sitter` + and `web-tree-sitter` and asserts capture-set equivalence. This test + is the architectural anchor that lets us delete the runtime native + path with confidence: as long as parity holds, "WASM-only at runtime" + doesn't change semantics. Pin it in CI on Node 22 (Node 24 lacks the + native binding for some grammars currently). **Native survives, but + only behind the dev wall.** +- `scripts/build-vendor-wasms.sh` keeps `tree-sitter-cli` and the + grammar source packages; both stay in `devDependencies`. + +## 4. Implementation Steps + +Ordered. Each step lists the files touched and the verification. + +1. 
**Land WASM vendoring infrastructure (no behavior change yet).** + - `scripts/build-vendor-wasms.sh` — extend to all 15 grammars; for + packages that ship a `.wasm`, copy; for the three that don't, + build via `tree-sitter build --wasm`. + - Run the script. Commit `packages/ingestion/vendor/wasms/*.wasm` + for all 15 + `web-tree-sitter.wasm`. + - Add `packages/ingestion/vendor/wasms/manifest.json` recording the + grammar version each .wasm was built from. + - Add `packages/ingestion/scripts/verify-vendor-wasms.mjs` script + (asserts all 15 exist, valid WASM magic bytes, manifest matches + `pnpm-lock.yaml` versions). + - Wire `prepublishOnly: "node scripts/verify-vendor-wasms.mjs"` in + `packages/ingestion/package.json:35-39`. + - **Verify:** `pnpm pack -C packages/ingestion` contains all 16 + `.wasm` files; tarball size in expected range (~28 MB). + +2. **Switch WASM resolver to vendored-only path.** + - `packages/ingestion/src/parse/wasm-fallback.ts:222-303` — collapse + to one declarative table: `LanguageId` → `vendor/wasms/.wasm`. + - `packages/ingestion/src/parse/wasm-fallback.ts:194-220` — add + `locateFile` to `Parser.init` pointing at the vendored + `web-tree-sitter.wasm`. + - **Verify:** Existing `wasm-grammar-resolution.test.ts` and + `wasm-parity.test.ts` pass against the vendored path. The parity + test still loads native from `node_modules` (workspace devDeps); + unchanged behavior in workspace. + +3. **Port the complexity phase onto WASM.** + - `packages/ingestion/src/pipeline/phases/complexity.ts:78, 106-136` — + replace native shim with `ensureWasmRuntime` import; build + `web-tree-sitter` Parser per language, cache. + - Update `complexity.ts:108-119` — drop the "tree-sitter unavailable" + warning; complexity now works on default path. + - **Verify:** `complexity.test.ts` passes with WASM trees. Add a test + case running on Node 24 (CI matrix) to lock in the new default. + +4. 
**Delete native runtime path.** + - `packages/ingestion/src/parse/parse-worker.ts:51-78, 156-191, + 222-307` — delete native shim, native dispatch, native types. + The file shrinks ~150 LOC. + - `packages/ingestion/src/parse/wasm-fallback.ts:41-67` — delete + `isNativeAvailable`, rename file to `wasm-runtime.ts` (and + update imports). + - `packages/ingestion/src/parse/grammar-registry.ts:180-277` — + rewrite `loadLanguageObject` and `loadGrammar` to load WASM + Languages via `web-tree-sitter`; the per-grammar quirks + (ESM default, `.typescript`/`.tsx`, `.php_only`) all collapse + because the WASM artifact is unambiguous. + - `packages/ingestion/src/parse/index.ts:18` — drop + `isNativeAvailable` re-export. + - **Verify:** `parse-worker.test.ts` regenerated for WASM-only + (cases (a), (c), (d) collapse to one; case (b) is deleted). + `wasm-parity.test.ts` keeps its native-vs-wasm assertion as a + dev-only invariant. + +5. **Soft-deprecate OCH_NATIVE_PARSER and --native-parser.** + - `packages/cli/src/index.ts:88-91, 102-107` — keep flag, emit + deprecation warning, ignore. + - Add the deprecation warning at parse-worker startup if + `OCH_NATIVE_PARSER` is read non-empty. + +6. **Move tree-sitter and 13 grammars to devDependencies.** + - `packages/ingestion/package.json:40-75` — move 14 packages to + `devDependencies`. Keep `web-tree-sitter@0.26.8` in `dependencies`. + - Run `pnpm install` to refresh `pnpm-lock.yaml`. + - **Verify:** workspace tests still pass (the moved packages are + still hoisted in workspace `node_modules`); a tarball install + shows the `dependencies` tree no longer contains tree-sitter-*. + +7. **Lower engines floor.** + - `packages/cli/package.json:80-82` and + `packages/ingestion/package.json:105-107` — change + `>=22.0.0` to `>=20.0.0`. + - **Verify:** CI matrix (step 9) covers Node 20. + +8. **Documentation and CHANGELOG.** + - Update CLAUDE.md "Parse runtime" section. + - Drop `OCH_NATIVE_PARSER` from every docs file in the Decision J list. 
+ - Write ADR 0014. + - Add CHANGELOG entries to `packages/ingestion/CHANGELOG.md` and + `packages/cli/CHANGELOG.md`. + +9. **Add CI install-matrix.** + - GitHub Actions job: 9 runners, each does `pnpm pack`, installs + the cli tarball globally, runs `codehub --help` and a tiny + `codehub analyze` against `tests/fixtures/multi-lang/`. + - **Verify gate:** all 9 pass before any release. + +10. **Hard-delete OCH_NATIVE_PARSER (next minor, follow-up PR).** + - `packages/cli/src/index.ts` — remove the flag definition. + - `packages/ingestion/src/parse/parse-worker.ts` — remove the + env-read deprecation warning. + +## 5. Risks and Tradeoffs + +**What we're giving up:** + +- **Native parser perf on dev Node 22.** Empirically ~1.5-2x + parse-phase wall-clock slowdown on warm-cache runs. Mitigated by `piscina` + worker pool already in use; the absolute time on a 100k-LOC repo + goes from ~6s to ~10s. Acceptable for the architectural simplicity. + Reversible — the dev parity test keeps native warm; if WASM perf + becomes a real blocker we restore the runtime-native branch from + git. + +- **Tarball size growth for `@opencodehub/ingestion`.** ~5 MB → ~28 MB. + The cli depends on it, so the global install download + grows by the same amount. This is a one-time download, not a + hot-path; users feel it once. Acceptable. + +- **Repo size growth.** ~25 MB of vendored WASMs in git. The blobs + compress poorly, so most of that weight stays in history, but git + stores binary blobs reasonably well via packfiles. If the repo grows + past a comfortable size, a follow-up moves them to LFS or a + per-release CI artifact. + +**What could go wrong:** + +- **`web-tree-sitter@0.26+` instability or Node 24 incompat.** Mitigated + by the install matrix CI (step 9). If we hit a real blocker on + Node 24 + WASM, we hold the release and pin web-tree-sitter forward + or backward. The architectural call doesn't change — we don't go + back to native; we fix the WASM runtime. 
+ +- **A grammar bumps its tree-sitter ABI past 0.25.** The vendored + WASM was built against the pinned grammar source; bumping the + grammar pin without rebuilding the WASM produces a runtime mismatch. + Mitigated by `verify-vendor-wasms.mjs` checking the manifest against + `pnpm-lock.yaml`. Architectural: the verification script is the + load-bearing safety net for grammar drift. + +- **A user has `tree-sitter` installed in their project** + `node_modules` (because of an unrelated dep). Today our code + `requireFn`s it; tomorrow we don't. Reverse migration cost: zero — + we don't reach into the user's node_modules anymore for parser + bindings. + +- **`tree-sitter-cli` postinstall failure resurfaces from a different + transitive path.** Defense: CI's install-matrix runs + `npm install -g ` with `--ignore-scripts=false` and + asserts the install completes in <60s with no network calls beyond + the registry. If a future dep introduces a postinstall, the matrix + catches it. + +**What I'd watch for after release:** + +- Issue reports of `Parser.init` failing on specific Node 20 minors. + `web-tree-sitter` historically had Node 18 quirks; Node 20 has been + stable for it. Unlikely but trackable. +- WASM cold-start time on cold-disk runs (first-time analyze on a CI + agent). Probably negligible (< 500 ms total for 15 grammars + initialized lazily) but log it via `--verbose`. +- Tarball download timeouts on slow connections. Set a reasonable + expectation in install docs; consider a "minimal language set" + cli flag in a future minor that skips loading WASMs for unused + languages. + +## 6. Verification Criteria + +**Unit tests (must pass):** +- `packages/ingestion/src/parse/parse-worker.test.ts` — single-path + WASM tests; native cases removed. +- `packages/ingestion/src/parse/wasm-grammar-resolution.test.ts` — + resolves all 15 languages to a vendored path. 
+- `packages/ingestion/src/parse/wasm-parity.test.ts` — kept; runs only + in workspace dev (where native is still installed). Drops to a + matrix-skipped test in CI on Node 24. +- `packages/ingestion/src/pipeline/phases/complexity.test.ts` — passes + on default Node 22 and Node 24 with non-zero output. + +**Integration tests:** +- `tests/fixtures/multi-lang/` end-to-end `codehub analyze` produces + the same graph node count and same set of complexity-annotated + nodes as the pre-refactor baseline (parity gate). + +**Distribution gates (CI install matrix, blocks release):** +- 9-cell matrix (Linux x64 Node 20/22/24, Linux arm64 Node 22, macOS + arm64 Node 22 via Homebrew/nvm/Volta, macOS x64 Node 22 via nvm) — + each cell runs: + ``` + pnpm pack -C packages/ingestion + pnpm pack -C packages/cli + npm install -g ./packages/cli/opencodehub-cli-*.tgz \ + ./packages/ingestion/opencodehub-ingestion-*.tgz + codehub --version # exits 0 + codehub --help # exits 0 + codehub analyze tests/fixtures/multi-lang/ # exits 0 + ``` +- No postinstall script in any installed package's `package.json` may + contain `wget`, `curl`, `download`, `node-gyp rebuild`, + `prebuild-install`, or write to `~/.cache`. Audit script in CI. +- Install completes in < 60 seconds on a baseline runner with cold + npm cache. Hard regression gate. +- ERESOLVE warning count from `npm install` output: zero. + +**Architectural gates (review-time):** +- `grep -rn "OCH_NATIVE_PARSER\|requireFn(\"tree-sitter\")" packages/ + | grep -v parity | grep -v devDeps` returns no hits in non-test + source files. (Allows the parity test, allows the dev-bench harness.) +- `packages/ingestion/package.json`'s `dependencies` array contains + exactly one tree-sitter-related entry: `web-tree-sitter`. +- `packages/cli/package.json`'s `dependencies` is unchanged. +- ADR 0014 lands in `docs/adr/`. + +**Post-release (one week after publish):** +- npm download stats for `@opencodehub/cli` show no install-failure + spike. 
+- Issue tracker has zero "install failed" or "tree-sitter postinstall" + reports. +- A new contributor running `npm install -g @opencodehub/cli@latest` + on a fresh box with mise + Node 24 succeeds first try with no + warnings. + +--- + +## Appendix — Code references + +- Parser registry (rewrite target): `packages/ingestion/src/parse/grammar-registry.ts:79-97, 146-277` +- WASM resolver (collapse target): `packages/ingestion/src/parse/wasm-fallback.ts:222-303` +- Native dispatch (delete target): `packages/ingestion/src/parse/parse-worker.ts:51-54, 64-78, 156-191, 222-245, 265-307` +- Complexity native shim (port target): `packages/ingestion/src/pipeline/phases/complexity.ts:78-136` +- CLI flag (deprecate target): `packages/cli/src/index.ts:88-107` +- Ingestion runtime deps (move target): `packages/ingestion/package.json:59-72` +- Vendor builder (extend target): `scripts/build-vendor-wasms.sh:45-47` +- Workspace allowBuilds (audit target): `pnpm-workspace.yaml:50-77` (the `tree-sitter*` entries become dev-only after Decision E; left intact because workspace `pnpm install` still runs them at workspace-dev install time) +- ERESOLVE root: `tree-sitter-swift@0.7.1` → `tree-sitter-cli@0.23.2` postinstall (verified at `pnpm-lock.yaml:tree-sitter-swift block`) diff --git a/planning/bulletproof-npm-install/explorer-simple.md b/planning/bulletproof-npm-install/explorer-simple.md new file mode 100644 index 00000000..e9f54780 --- /dev/null +++ b/planning/bulletproof-npm-install/explorer-simple.md @@ -0,0 +1,381 @@ +# Explorer: Simple-first — Bulletproof npm global install for @opencodehub/cli + +**Status:** COMPLETE +**Vector:** Simple-first +**Last updated:** 2026-05-15 + +--- + +## Vector reminder + +Most-boring-engineer answer. Things I like: deletions, dropped deps, removed flags. Things I dislike: configuration switches, multi-mode runtime selection, conditional code paths, lazy downloads, optionalDependencies, npm overrides. 
If a plan keeps something, it owes a reason; default is delete. + +--- + +## 1. Problem framing + +`npm i -g @opencodehub/cli@latest` fails for two compounding reasons that share one root cause: **we publish a package that pulls 13 native tree-sitter grammar packages plus `tree-sitter` core**, and those packages (a) run network-touching postinstalls (`tree-sitter-cli@0.23.2` GitHub-release download, currently 504), (b) carry incompatible peer ranges that ERESOLVE on a global install where pnpm's lockfile is not present, and (c) require a working C++ toolchain on every install host. + +We already have a fully functional WASM parse path (`web-tree-sitter@0.26.8`) and we already vendor 3 WASMs that npm packages don't ship. **The simple-first answer is to make WASM the only path** and delete every native-related dependency, code branch, env var, CLI flag, doc reference, and test that exists to support the second path. The remaining work is a small port of `complexity.ts` from native to web-tree-sitter and a one-time `pnpm install` step that vendors the remaining 11 per-package WASMs into `vendor/wasms/` so the published tarball is fully self-contained. + +This is a pure deletion PR with a small targeted port. + +--- + +## 2. Chosen approach + +**Shape:** "WASM-only, vendor-everything, no postinstall." + +- One parser path: `web-tree-sitter` reading `.wasm` blobs from `packages/ingestion/vendor/wasms/`. +- One source for those blobs: vendored at build time from each grammar package's already-shipped `.wasm` (or built from source for kotlin/swift/dart, which are already vendored). +- Zero runtime branches keyed on env vars, Node version, or platform. +- Zero `dependencies` that run install scripts. Every native `tree-sitter-*` package and `tree-sitter@0.25.0` and `web-tree-sitter`'s native peers are removed from `@opencodehub/ingestion`'s `dependencies`. `web-tree-sitter` is the only tree-sitter dep that survives. 
+- The published tarball under `npm install -g` does **one thing**: copy `dist/` and `vendor/wasms/`. No download, no compile, no choose-a-runtime. + +**Tradeoff I'm taking on:** 11 vendored WASMs grow the published tarball by ~10–25 MB total (the existing 3 WASMs are 8.4 MB combined; per-package WASMs we'd add are typically 0.5–2 MB each). This is the simple-first cost. The architectural alternative is a separate `@opencodehub/grammars-wasm` package + a pretty registry; not worth it for the deletion math here. + +**Tradeoff I'm declining:** `optionalDependencies` for native as a "speed mode" for power users. That preserves the second path, the env var, the dispatcher, the test matrix. It's the speed-first plan, not the simple-first plan. There is no second user. + +--- + +## 3. Key decisions + +### Decision A — Inventory deletions + +#### A.1 — Dependencies to remove from `packages/ingestion/package.json` + +Removed entirely (lines `packages/ingestion/package.json:59-72`): + +| Dep | Why it goes | +|---|---| +| `tree-sitter@0.25.0` | Native N-API addon; runs `node-gyp` postinstall. Only consumer was the deleted dispatcher + complexity phase. | +| `tree-sitter-c@0.24.1` | Native grammar, not loaded after WASM-only switch. WASM ships in vendor. | +| `tree-sitter-c-sharp@0.23.5` | Same. | +| `tree-sitter-cpp@0.23.4` | Same. | +| `tree-sitter-go@0.25.0` | Same. | +| `tree-sitter-java@0.23.5` | Same. | +| `tree-sitter-javascript@0.25.0` | Same. | +| `tree-sitter-kotlin@0.3.8` | Native binding — **no prebuilds; requires C++ toolchain** — was the worst install offender. WASM already vendored. | +| `tree-sitter-php@0.24.2` | Same. | +| `tree-sitter-python@0.25.0` | Same. | +| `tree-sitter-ruby@0.23.1` | Same. | +| `tree-sitter-rust@0.24.0` | Same. | +| `tree-sitter-swift@0.7.1` | Native binding with ~30 s postinstall rebuild. WASM already vendored. | +| `tree-sitter-typescript@0.23.2` | Same. 
| + +That is **14 deps** out of `@opencodehub/ingestion`'s runtime tree, including the one (`tree-sitter@0.25.0`) that pulls `tree-sitter-cli` and the failing GitHub-release download. **Net: 14 fewer runtime deps and zero install scripts in the published cli's hot path.** + +Survivors (post-change `dependencies` for tree-sitter–related work): + +- `web-tree-sitter@0.26.8` — pure WASM, no `.node` addon, no postinstall. + +#### A.2 — Code paths keyed on `OCH_NATIVE_PARSER` + +Source of truth (from a grep for `OCH_NATIVE_PARSER` and `isNativeAvailable`): + +- `packages/ingestion/src/parse/parse-worker.ts:39-78,142-160` — `forceNativeOpt()`, the warned-runtime triage block, the `runNative()` branch in `runParse()`. **Delete `runNative()`, `forceNativeOpt()`, `isNativeAvailable` import, the runtime warning triage; keep only `runWasm()` body inlined into `runParse()`.** +- `packages/ingestion/src/parse/parse-worker.ts:162-191` — entire `runNative()` function and its TreeSitter ambient interfaces (lines 265-307 — `TreeSitterPoint`, `TreeSitterNode`, `TreeSitterTree`, `TreeSitterParser`, `TreeSitterQueryCapture`, `TreeSitterQueryMatch`, `TreeSitterQuery`, `TreeSitterModule`). Delete all of it. WASM types in `wasm-fallback.ts` already cover what the worker needs. +- `packages/ingestion/src/parse/wasm-fallback.ts:41-67` — `isNativeAvailable()`, `cached`, `resetNativeAvailabilityCache()`. Delete entirely. Rename `wasm-fallback.ts` → `wasm-parser.ts` because WASM is no longer a fallback. +- `packages/ingestion/src/parse/index.ts:18` — `export { isNativeAvailable, resetNativeAvailabilityCache } from "./wasm-fallback.js";`. Delete the line. +- `packages/ingestion/src/parse/grammar-registry.ts:255-267` — Dart-specific `OCH_NATIVE_PARSER` error message inside the native loader. The whole `loadLanguageObject()` function (lines 193-277) is now dead — `loadGrammar()` only fed the native path. 
Delete `loadLanguageObject` entirely; `loadGrammar()` shrinks to "build a `GrammarHandle` whose `tsLanguage` is `null`" because nothing on the WASM path uses `tsLanguage`. **Better: stop returning `tsLanguage` from `GrammarHandle` at all — it's only needed by `runNative()`.** See A.4. +- `packages/ingestion/src/pipeline/phases/complexity.ts:106-136` — `parserCache`, `tsModuleCached`, `getTsModule()`, `getParser()`. Delete in favor of the WASM port. `complexity.ts:115-123` carries the OCH_NATIVE_PARSER stderr advisory. Delete. +- `packages/cli/src/index.ts:88-91, 102-107` — `--native-parser` option declaration and the env-var setter. Delete. + +#### A.3 — `tree-sitter-cli` removal from `pnpm-workspace.yaml` + +`pnpm-workspace.yaml:69` (`tree-sitter-cli: true` under `allowBuilds`) survives because it's still needed by `scripts/build-vendor-wasms.sh` to build the kotlin/swift/dart WASMs. **But also declare `tree-sitter-cli` as an explicit workspace `devDependency`** so it lives on disk only on developer machines, not in the published tarball. (Today it is only pulled in transitively by `tree-sitter@0.25.0`. Once `tree-sitter@0.25.0` leaves `dependencies`, the explicit root `devDependency` is what keeps the build script working.) + +`pnpm-workspace.yaml` build entries that go away with no consumer left: +- `tree-sitter: true` +- `tree-sitter-c: true`, `tree-sitter-c-sharp: true`, `tree-sitter-cpp: true`, `tree-sitter-go: true`, `tree-sitter-java: true`, `tree-sitter-javascript: true`, `tree-sitter-kotlin: true`, `tree-sitter-php: true`, `tree-sitter-python: true`, `tree-sitter-ruby: true`, `tree-sitter-rust: true`, `tree-sitter-swift: true`, `tree-sitter-typescript: true` — all 13 entries (`pnpm-workspace.yaml:66-83` minus `tree-sitter-cli`). + +**Net: 14 entries deleted from `allowBuilds`.** Only the `tree-sitter-cli` entry (a devDependency, not published) remains. 
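The A.1 and A.3 inventories boil down to one checkable property: after the change, no `tree-sitter` or `tree-sitter-*` package appears in the published `dependencies`. A minimal TypeScript sketch of that check follows. The helper name `findNativeTreeSitterDeps` and its regex are illustrative assumptions, not code that exists in the repo.

```typescript
// Hypothetical review-time check; the function name and regex are
// illustrative assumptions, not code from the repo.

// Matches `tree-sitter` itself and any `tree-sitter-<name>` package
// (including `tree-sitter-cli`), but not `web-tree-sitter`.
const NATIVE_TREE_SITTER = /^tree-sitter(-.+)?$/;

// Returns the native tree-sitter package names found in a
// package.json `dependencies` object, sorted for stable output.
function findNativeTreeSitterDeps(
  dependencies: Record<string, string>,
): string[] {
  return Object.keys(dependencies)
    .filter((name) => NATIVE_TREE_SITTER.test(name))
    .sort();
}

// Example: the post-change survivor list should produce no leftovers.
const leftovers = findNativeTreeSitterDeps({ "web-tree-sitter": "0.26.8" });
console.log(
  leftovers.length === 0 ? "clean" : `still present: ${leftovers.join(", ")}`,
);
```

Fed the parsed `dependencies` of `packages/ingestion/package.json`, a non-empty result would mean one of the 14 deletions was missed.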
+ +#### A.4 — `GrammarHandle.tsLanguage` field + +`packages/ingestion/src/parse/grammar-registry.ts:117-122` — the `tsLanguage` field exists only because the native path needed `parser.setLanguage(handle.tsLanguage)`. WASM resolves the grammar from a `.wasm` path directly via `Language.load(wasmPath)` inside `wasm-fallback.ts`. After the deletion, `GrammarHandle` collapses to just `{ language: LanguageId; queryText: string }` — and at that point it's a 2-field DTO that doesn't need its own type. **Inline it: callers want either the query text or the wasm path.** The `loadGrammar()` function and the inflight-dedupe Map collapse to a 1-line `getUnifiedQuery(lang)` and a separate-but-trivial `getGrammarSha()` is the only thing left worth keeping in this module. + +This is a meaningful structural deletion: `grammar-registry.ts` shrinks from 337 lines to ~80 lines, mostly the language-spec map and `getGrammarSha`. + +### Decision B — WASM blob coverage + +Existing state (`packages/ingestion/src/parse/wasm-fallback.ts:249-303`): + +- **Stage 1 (per-grammar npm package):** typescript, tsx, javascript, python, go, rust, java, csharp, c, cpp, ruby, php — resolved by `requireFn.resolve('tree-sitter-/package.json')` then `path.join(pkgDir, '.wasm')`. The `.wasm` is bundled inside each native package at the same path. Verified in `node_modules/.pnpm/tree-sitter-@/.../tree-sitter-.wasm` (see grep output of pnpm node_modules). +- **Stage 2 (vendored):** kotlin, swift, dart — `packages/ingestion/vendor/wasms/tree-sitter-{kotlin,swift,dart}.wasm`. Already in repo (`vendor/wasms/` listing: 8.4 MB combined). + +When we delete the native `tree-sitter-` deps, **Stage 1 stops working** — `requireFn.resolve('tree-sitter-python/package.json')` will throw because the package isn't installed in a global cli install. So we must move every Stage-1 grammar's `.wasm` into `vendor/wasms/` and delete Stage 1 entirely. + +**Plan:** + +1. 
Rebuild `scripts/build-vendor-wasms.sh` to also copy the 11 per-package `.wasm` files. The build steps are different per language: + - 11 grammars (typescript/tsx, javascript, python, go, rust, java, csharp, c, cpp, ruby, php) ship a pre-built `.wasm` inside the npm tarball — **just `cp` it into `vendor/wasms/`**, no docker, no emcc. + - 3 grammars (kotlin, swift, dart) require `tree-sitter build --wasm` (current logic) — keep as is. + +2. The 11 `cp` lines run on every developer's `pnpm install` cycle (idempotent — fast). Or simpler: **commit them once and never re-run unless a grammar version bumps**. Same model as the current 3 vendored WASMs. + +3. Result file list under `vendor/wasms/`: + - `tree-sitter-c.wasm`, `tree-sitter-cpp.wasm`, `tree-sitter-c_sharp.wasm` (note underscore — matches what c-sharp ships and what `wasm-fallback.ts:265` already expects) + - `tree-sitter-dart.wasm` (existing) + - `tree-sitter-go.wasm`, `tree-sitter-java.wasm`, `tree-sitter-javascript.wasm` + - `tree-sitter-kotlin.wasm` (existing), `tree-sitter-php_only.wasm`, `tree-sitter-python.wasm` + - `tree-sitter-ruby.wasm`, `tree-sitter-rust.wasm`, `tree-sitter-swift.wasm` (existing) + - `tree-sitter-typescript.wasm`, `tree-sitter-tsx.wasm` + - **15 files total.** + +4. `wasm-fallback.ts` (renamed `wasm-parser.ts`) collapses to **one** resolver function: + ``` + function resolveGrammarWasmPath(lang) { + const fname = WASM_FILES[lang]; // a single Record + return path.join(VENDOR_WASMS_DIR, fname); + } + ``` + No two-stage cascade. No `requireFn.resolve`. No `tryPerGrammarPackage` / `tryVendoredWasm` split. ~50 lines deleted. + +5. The `files` array in `packages/ingestion/package.json:24-34` already includes `vendor/wasms/**` (line 33), so the published tarball already ships them. No `package.json` change needed except the `dependencies` deletions. + +**Tradeoff:** Each WASM is 0.5–4 MB; 15 of them total is ~15–25 MB extra in the published tarball vs. shipping nothing. 
The current world ships the same WASMs anyway — they're just inside the per-package native tarballs that we publish a transitive dep on. So actually **net tarball download is smaller**, because we lose every `.node` prebuild for every platform and every `tree-sitter` C source the native packages bundle. (The native `tree-sitter` package itself ships ~5 MB of C sources for `node-gyp`. Each grammar ships its native `.cc` parser. Across 13 grammars that's tens of MB in lost weight.) + +#### B.1 — Vendoring license check + +`vendor/wasms/LICENSES.md` already exists. After adding 11 more WASMs, append their licenses (all MIT or Apache-2.0 per the existing third-party manifest at `THIRD_PARTY_LICENSES.md`). One-time edit. + +### Decision C — Complexity phase fate + +**Decision: port to WASM, not delete.** Keep at the same module path. + +Reasoning: + +- The phase is wired into `default-set.ts:71` and is a real consumer's signal: `verdict.ts:101,688` reads `cyclomaticComplexity > 10` to set the risk tier in `verdict`, which is the thing PR-review uses to issue 0/1/2 exit codes. Deleting it silently neuters `verdict`. That is a behavior regression a real customer is using. +- The port is mechanically straightforward — every API the phase uses against the native binding (`rootNode`, `child(i)`, `childCount`, `childForFieldName`, `text`, `startPosition`, `endPosition`, `type`) exists with identical semantics on `web-tree-sitter`'s `Node` (verified at `node_modules/.pnpm/web-tree-sitter@0.26.8/.../web-tree-sitter.d.ts:328,448,471,493,499`). The only meaningful API shift is `parser.parse()` returns a `Tree` synchronously in both bindings. +- The phase already runs **on the main thread** doing its own re-parse (it doesn't reuse the worker pool's parsed trees because parse output drops the tree to keep IPC small). 
Doing that re-parse via `web-tree-sitter` + the same vendored `.wasm` is a pure substitution: swap `require('tree-sitter')` for the existing `openWasmParser(lang).parse(source)`. The walk logic at `complexity.ts:370-460` is binding-agnostic — `walk()` only touches the abstract `TsNode` shape, which `WasmNode` matches. +- After the port, **delete `getTsModule`, `parserCache` (native-typed), `tsModuleCached`, `warnedComplexityDegraded`, the `OCH_NATIVE_PARSER` stderr advisory** at `complexity.ts:106-124`. Replace with a single per-language WASM parser handle pulled from `wasmCache` (or call `openWasmParser(lang)` and let the per-process cache that already exists in `wasm-fallback.ts:110` do the memoization for free). +- Net: `complexity.ts` loses ~30 lines of tree-sitter shim and gains ~5 lines of `await openWasmParser(lang)` plus a `?? skip` guard. **Smaller file.** + +**Tradeoff declined:** the alternative ("delete complexity for 0.4.0, restore later") sounds simple, but the deletion cost is hidden — `verdict` quietly drops a tier and PR reviews change verdict for users who have been relying on it. The port is small enough that the boring choice is to do it. + +### Decision D — Verification + +A single bash script that proves the install across all 6 cells (Linux × {Node 20, 22, 24} ∪ macOS × {Node 20, 22, 24} — though macOS via Linux container with rosetta if local hardware isn't available): + +```bash +# scripts/verify-global-install.sh +set -euo pipefail +VERSION="${1:-latest}" +for NODE in 20 22 24; do + docker run --rm -v /tmp:/tmp node:${NODE}-slim bash -c " + set -euo pipefail + # node:*-slim images ship without git; install it before cloning the probe repo. + apt-get update -qq && apt-get install -y -qq git ca-certificates + npm install -g @opencodehub/cli@${VERSION} + codehub --version + # /tmp is bind-mounted from the host, so clear any clone left by a previous cell. + rm -rf /tmp/probe + git clone --depth=1 https://github.com/sindresorhus/p-limit /tmp/probe + cd /tmp/probe + codehub analyze . + codehub query 'export default' + " +done +``` + +Three smoke assertions: +1. `npm install -g` exits 0 and emits no `WARN` or `ERR` lines about peer deps or postinstalls. +2.
`codehub --version` prints the version. +3. `codehub analyze .` against `p-limit` (TypeScript) exits 0 and writes `.codehub/`. + +For the matrix completeness on macOS I'd add `mise` and `nvm` shells locally: + +```bash +# scripts/verify-global-install-macos.sh — runs on a clean Mac +mise use --global node@22 +npm install -g @opencodehub/cli@${VERSION} +codehub --version +codehub analyze /tmp/probe +mise use --global node@20 && npm install -g @opencodehub/cli@${VERSION} && codehub --version +mise use --global node@24 && npm install -g @opencodehub/cli@${VERSION} && codehub --version +``` + +The container script is enough for CI. Macs in the wild get covered by Node 22+24 via the docker matrix because the only platform-specific surface remaining is `web-tree-sitter`'s WASM runtime, which is identical across darwin and linux (it's pure Wasm + `WebAssembly.compile`, no `.node` addon). + +**`.github/workflows/verify-global-install.yml`** — new workflow, run on every push to main and on every release tag, fail loudly if any cell exits non-zero. This is the regression net. + +### Decision E — Files deleted or simplified in source + +In priority order (most deletion first): + +1. **Delete entirely** — `packages/ingestion/src/parse/wasm-parity.test.ts`. The test exists *only* to assert WASM vs native produce the same captures. Without a native path, there's nothing to compare. Cited at `wasm-parity.test.ts:281-286`: "native tree-sitter is unavailable — parity suite requires it as the reference runtime". With native gone, the suite is meaningless. **Replace with nothing.** The native path was the only reason the test existed. + +2. **Heavy edit, then rename** — `packages/ingestion/src/parse/wasm-fallback.ts` → `wasm-parser.ts`. Drop `isNativeAvailable`, `resetNativeAvailabilityCache`, `tryPerGrammarPackage`, `resolvePackageDir`. Collapse `resolveGrammarWasmPath` to one map lookup. Delete the 70-line two-stage-cascade comment. **From 332 lines to ~120 lines.** + +3. 
**Heavy edit** — `packages/ingestion/src/parse/parse-worker.ts`. Delete `forceNativeOpt`, `runNative`, `getOrBuildParser`, `getOrBuildQuery`, the `requireFn` import, the runtime triage block, all 8 `TreeSitter*` ambient interfaces. **From 308 lines to ~140 lines.** + +4. **Heavy edit** — `packages/ingestion/src/parse/grammar-registry.ts`. Delete `loadLanguageObject` (lines 193-277), simplify `loadGrammar` to drop `tsLanguage`, drop the inflight-dedupe Map (it's there to avoid duplicate native `require()`s — WASM uses its own per-process cache in `wasm-fallback`/`wasm-parser`). **From 337 lines to ~80 lines.** + +5. **Edit** — `packages/ingestion/src/pipeline/phases/complexity.ts`. Replace native re-parse with WASM. Delete `requireFn`, `getTsModule`, `getParser`, `tsModuleCached`, `warnedComplexityDegraded`, `parserCache` (native-typed). Use `openWasmParser(lang)` directly. Delete the 8 `Ts*` ambient interfaces (their `Wasm*` equivalents already exist in the parser module). + +6. **Edit** — `packages/ingestion/src/parse/parse-worker.test.ts`. Delete tests `(b)`, `(c)`, `(d)` — all `OCH_NATIVE_PARSER` cases. Keep test `(a)` "WASM path, WASM warning" but rename to "parse worker reports WASM runtime on startup" (or delete the test — the runtime-name logging itself is going away under the simple-first deletion of the warning at `parse-worker.ts:64-78`). **Probably delete the file. The single remaining "WASM is the only path" assertion is implicit in every other parse test.** + +7. **Edit** — `packages/ingestion/src/parse/wasm-grammar-resolution.test.ts`. Update to assert the new flat single-table resolver. Many of its existing assertions about the two-stage cascade collapse to "every language resolves to `vendor/wasms/`". **From whatever it is now to ~30 lines.** + +8. **Edit** — `packages/ingestion/src/parse/index.ts:18`. Drop the `isNativeAvailable, resetNativeAvailabilityCache` export. + +9. **Edit** — `packages/cli/src/index.ts:88-91, 102-107`. 
Delete the `--native-parser` option block and the env setter. **Net: -10 lines, -1 user-visible flag.** + +10. **Edit** — `pnpm-workspace.yaml:66-83`. Delete 14 of 15 `allowBuilds` entries (keep only `tree-sitter-cli` since `scripts/build-vendor-wasms.sh` still uses it). Add `tree-sitter-cli` as an explicit root `devDependencies` entry in the workspace root `package.json` so it's installed locally without `tree-sitter@0.25.0` pulling it. + +11. **Edit** — `packages/ingestion/package.json:59-72`. Delete 14 deps (every `tree-sitter*` line except `web-tree-sitter`). + +12. **Edit** — `packages/cli/README.md:79`. Drop `OCH_NATIVE_PARSER` from the env-toggle list. + +13. **Edit** — `README.md:83, 234-236`. Drop the "WASM-default parse runtime" feature row and the Node-22-native-opt-in paragraph. + +14. **Edit** — `packages/ingestion/README.md:25, 57`. Drop the same. + +15. **Edit** — `CLAUDE.md:96-107` (the "Parse runtime — WASM default, native opt-in" section). **Replace with a 3-line version**: "All parsing runs through `web-tree-sitter` against vendored WASMs at `packages/ingestion/vendor/wasms/`. There is no native opt-in. Run `bash scripts/build-vendor-wasms.sh` after bumping a grammar version." + +16. **Edit (and/or delete)** — `docs/adr/0013-parse-runtime-wasm-default.md`. The ADR was the rationale for the dual-mode design; once we delete native, the ADR is partially outdated. Add a `**Superseded by:** 0014 — WASM-only parser path` note and write a small ADR 0014 next to it with the deletion rationale. + +17. 
**Edit** — `packages/docs/src/content/docs/start-here/install.md:15,112`, `packages/docs/src/content/docs/architecture/parsing-and-resolution.md:25`, `packages/docs/src/content/docs/architecture/adrs.md:126`, `packages/docs/src/content/docs/reference/cli.md:40`, `packages/docs/src/content/docs/reference/configuration.md:31-33`, `packages/docs/src/content/docs/reference/languages.md:53-55`, `packages/docs/src/content/docs/start-here/what-is-opencodehub.md:68`, `packages/docs/src/content/docs/guides/troubleshooting.md:27,80`, `packages/docs/src/content/docs/guides/indexing-a-repo.md:130`. Strip every mention of `OCH_NATIVE_PARSER`, `--native-parser`, "Node 22 native opt-in". Replace with a one-line "WASM is the only runtime." + +18. **Edit** — `packages/ingestion/CHANGELOG.md` and root `CHANGELOG.md`. Add a 0.4.0 entry: "BREAKING: removed `OCH_NATIVE_PARSER` env var and `--native-parser` CLI flag. WASM is the only parser runtime. Native parsing has not been the install-time default since 0.3.0; this completes the removal." + +#### Total deletion count (approximate) + +- 14 deps removed from `@opencodehub/ingestion/package.json`. +- 14 entries removed from `pnpm-workspace.yaml` `allowBuilds`. +- 1 CLI flag removed. +- 1 env var removed. +- 1 entire test file deleted (`wasm-parity.test.ts`, ~330 lines). +- 1 likely test file deleted (`parse-worker.test.ts`, ~280 lines) or stripped to its skeleton. +- ~600+ lines of source deleted across `parse-worker.ts`, `wasm-fallback.ts`, `grammar-registry.ts`, `complexity.ts` shim. +- ADR 0013 superseded; ADR 0014 added (small). + +### Decision F — Migration story + +Hard removal, with a one-line stderr advisory the first time the env var is observed. + +```ts +// In packages/cli/src/index.ts at startup, before commander.parse: +if (process.env["OCH_NATIVE_PARSER"] !== undefined) { + process.stderr.write( + "[codehub] OCH_NATIVE_PARSER was removed in 0.4.0; WASM is the only parser runtime. 
Unset to silence this warning.\n", + ); + delete process.env["OCH_NATIVE_PARSER"]; +} +``` + +For `--native-parser`, **don't add a deprecation alias**. Commander will report the unknown flag and exit. That's the loudest possible signal. The CHANGELOG covers the rest. + +**Why no graceful migration period:** there's no second user. The flag was added in the M5 default-flip and the only documented audience is "Node 22 dev boxes for measurably faster parsing". Anyone with `--native-parser` in a script gets a clean error message from commander on the first run after upgrade, reads the CHANGELOG, deletes the flag, moves on. Total user-impact cost: minutes per script. + +### Decision G — Commit shape + +**One PR titled:** `feat!: WASM-only parser path; drop native tree-sitter and tree-sitter-cli`. + +Body sketch: + +``` +Removes 14 native tree-sitter dependencies and the OCH_NATIVE_PARSER / +--native-parser env+flag toggle. WASM via web-tree-sitter is the only +runtime. Vendor every grammar's .wasm under packages/ingestion/vendor/wasms/ +so `npm install -g @opencodehub/cli` runs zero install scripts. + +BREAKING: +- OCH_NATIVE_PARSER env var: removed (one-shot stderr advisory still emitted). +- --native-parser CLI flag: removed (commander will reject as unknown). +- complexity phase: ported to web-tree-sitter; metrics are unchanged. + +Why: `npm install -g` was failing with two compounding issues — a +GitHub-release 504 from `tree-sitter-cli`'s postinstall and ERESOLVE peer +conflicts between `tree-sitter@0.25.0` and the per-language native +grammar packages. Both go away when there is no native runtime. +``` + +Single commit, single PR, single review pass. **Zero `optionalDependencies`. Zero `npm overrides`.** Both would imply two install outcomes, which violates the simple-first vector. The whole point is "one outcome, every time." + +--- + +## 4. Implementation steps + +Strict ordering — each step's verification gates the next. + +1.
**Vendor the 11 missing WASMs.** Edit `scripts/build-vendor-wasms.sh` to add 11 `cp` lines pulling each `tree-sitter-.wasm` from `node_modules/.pnpm/tree-sitter-@/.../tree-sitter-.wasm`. Run the script. Commit the 11 new files under `packages/ingestion/vendor/wasms/`. Verify: `ls vendor/wasms/*.wasm | wc -l` == 15. + +2. **Port complexity.ts to WASM.** Replace native re-parse with `await openWasmParser(lang).parse(source)`. Delete `getTsModule`, `getParser`, `requireFn`, the native `Ts*` interfaces. Run `pnpm -C packages/ingestion test --grep complexity`. Verify: all complexity tests pass on Node 22 and Node 24 with no `OCH_NATIVE_PARSER` set. + +3. **Simplify `wasm-fallback.ts` (rename → `wasm-parser.ts`).** Drop two-stage cascade, drop `isNativeAvailable`, drop `resetNativeAvailabilityCache`. Collapse `resolveGrammarWasmPath` to one-table lookup. Update import sites (`parse-worker.ts`, `index.ts`, `complexity.ts`). Run `pnpm -C packages/ingestion test`. Verify: green. + +4. **Delete native paths in `parse-worker.ts`.** Remove `forceNativeOpt`, `runNative`, native interfaces, runtime-triage warning. Inline `runWasm` body into `runParse`. Run `pnpm -C packages/ingestion test --grep parse-worker`. Verify: WASM-only parse worker still produces the right `ParseCapture` output for the test fixtures. + +5. **Simplify `grammar-registry.ts`.** Delete `loadLanguageObject`. Drop `tsLanguage` from `GrammarHandle`. Run all ingestion tests. Verify: green. The `getGrammarSha` path is unaffected — it never read `tsLanguage`. + +6. **Delete `wasm-parity.test.ts` and trim `parse-worker.test.ts`.** Run ingestion tests. Verify: green (we lose 4 tests, gain nothing). + +7. **Delete the CLI flag.** Edit `packages/cli/src/index.ts` lines 88-91 and 102-107. Add the one-shot `OCH_NATIVE_PARSER` stderr advisory + `delete` at startup. Run `pnpm -C packages/cli test`. Verify: green. + +8. 
**Edit `package.json`s and `pnpm-workspace.yaml`.** Drop 14 deps from `packages/ingestion/package.json`. Drop 14 `allowBuilds` entries from `pnpm-workspace.yaml`. Add `tree-sitter-cli` as a root `devDependency` to keep `scripts/build-vendor-wasms.sh` working. Run `pnpm install --frozen-lockfile=false` to regenerate `pnpm-lock.yaml`. Verify: lockfile diff is large but only delta is "deleted" entries. + +9. **Run the full workspace test suite.** `pnpm -r test`. Fix any test that imported a deleted symbol (most likely some tests `import { isNativeAvailable } from '@opencodehub/ingestion'`). Verify: green. + +10. **Update docs and ADRs.** Step E.12-17 above. Drop `OCH_NATIVE_PARSER` mentions. Add ADR 0014 superseding 0013. Verify: `pnpm run banned-strings` (if there's a banned-strings list) flags nothing residual. + +11. **Add the verification workflow.** Write `scripts/verify-global-install.sh` and `.github/workflows/verify-global-install.yml`. Workflow does a release-candidate publish to a private dist-tag (e.g., `npm publish --tag rc`), then the docker matrix in Decision D installs and smoke-runs. Verify: workflow goes green for Node 20/22/24 against an RC dist-tag. + +12. **Publish 0.4.0.** `pnpm -r publish` with the new versions. Verify: post-publish, `npm install -g @opencodehub/cli@0.4.0` works on a clean Node 20/22/24 docker image. + +13. **Run the verification script in production.** Same script as Decision D against the published `@latest`. Verify: zero warnings, zero ERESOLVE, zero postinstall fetches. + +--- + +## 5. Risks and tradeoffs + +**What this plan gives up:** + +- **Native parsing speed on dev boxes.** The opt-in claim was "measurably faster on Node 22 dev boxes." We give up that knob. Tradeoff: simplicity for everyone else. The win is that the install is bulletproof — we trade ~10–30 % parse-phase wall-clock for the people who would have set the flag, against eliminating the install-failure tax on everyone who wouldn't. 
+- **Future flexibility for native.** Re-introducing a native opt-in later is a non-trivial revert (re-add the dispatcher, the interfaces, the test matrix). **Deletion cost named.** Counter: the only reason native was kept around was M5-era performance. By 0.5.0, web-tree-sitter perf gaps usually shrink anyway as the runtime matures. +- **Tarball size for `@opencodehub/ingestion`.** ~15–25 MB extra in the published tarball from the 11 new vendored WASMs. **Net is probably smaller than today** because we lose the per-grammar native `.cc`/`node-gyp` source trees that current native deps drag in, but the published-tarball size of `@opencodehub/ingestion` itself goes up. Acceptable: install latency on a 100 Mbps pipe is ~2 s for 25 MB. + +**What could go wrong:** + +- **A vendored WASM doesn't load on web-tree-sitter@0.26.8.** Already happened with the upstream `tree-sitter-wasms@0.1.13` catalog (`vendor/wasms/README.md:7-12`). Mitigation: we're vendoring the per-package WASMs that tree-sitter team itself ships with their npm packages — those use modern tree-sitter-cli builds. The 3 we build ourselves (kotlin/swift/dart) are already known good. +- **A grammar version bump produces an incompatible WASM.** The current 3-vendored model already has this risk; the 11-extra mitigation is: the cp-from-node_modules approach in step 1 means the WASM in `vendor/wasms/` always corresponds to the version pinned in our (deleted) `tree-sitter-` deps. Once we delete those deps, we have to pin the grammar versions some other way — the simplest is **pin them as `devDependencies` in `packages/ingestion/package.json`** so `pnpm install` still resolves them locally for the build script, but they don't ship to consumers. (devDeps are stripped from `npm publish`.) +- **`web-tree-sitter`'s `Parser.init()` fails on a sandboxed runtime.** This was already a concern (`wasm-fallback.ts:8-11`). The fix is the same: surface the error, no silent fallback. 
The new world has no native fallback, so a user on a sandbox without WebAssembly support gets a hard error. **Acceptable: the Wasm engine is in every modern Node.** Node 22+ has had it stable for 2+ years. +- **`engines.node` constraint.** Both `packages/ingestion/package.json:106` and `packages/cli/package.json:81` specify `>=22.0.0`. The verification matrix includes Node 20 because users on global installs sometimes have older Node defaults. **Decision: keep `>=22.0.0`** — `engineStrict: true` is set in `pnpm-workspace.yaml:38` so older Node fails fast with a clear message, which is the simple-first contract: one valid runtime, fail fast outside it. Tradeoff: users on Node 20 get an explicit "upgrade Node" prompt instead of an opaque WASM error later. **This is a pre-existing decision; the simple-first plan ratifies it but doesn't change it.** + +**Watchpoints post-merge:** + +- npm install error rate on the cli (track via `npm` download stats + GitHub issue volume). +- `web-tree-sitter` runtime errors in the parse phase (already logged; track frequency). +- Grammar-version drift between `vendor/wasms/` and the latest grammar release (a quarterly rebuild cron is enough — the current cadence is already that). + +--- + +## 6. Verification criteria + +This plan worked iff all of the following are true after merge and publish: + +1. `npm install -g @opencodehub/cli@latest` exits 0 on a clean `node:20-slim`, `node:22-slim`, and `node:24-slim` container. **No** `WARN deprecated`, **no** `ERR! ERESOLVE`, **no** GHCR fetches in stderr. The verification workflow (Decision D / step 11) is the regression gate. +2. `codehub --version` prints the version on each of the 3 Node images. +3. `codehub analyze /tmp/p-limit` (a small TypeScript repo) writes `.codehub/graph.duckdb` (or `.lbug`), exits 0, and the run takes <60 s. +4. `codehub query 'export default'` against the freshly-indexed repo returns at least one hit. +5. `pnpm -r test` is green locally and in CI. +6. 
`pnpm install` from a clean clone (no `node_modules`, no `pnpm-lock.yaml` fast-path) installs in <2 minutes with no postinstall network calls beyond what `tree-sitter-cli` does for build-time grammar building (and `tree-sitter-cli` is now a devDep, not a transitive runtime install). +7. `du -sh dist/` for `@opencodehub/cli` after `npm pack` is roughly the same as today (within ±20 %). +8. Grep for `OCH_NATIVE_PARSER` in the published tarball returns zero hits. Same for `--native-parser`. +9. The `verdict` PR-review tool still emits the `cyclomaticComplexity > 10` risk-tier flip (verified by a unit test that hand-crafts a 15-decision-point function and asserts the verdict bumps a tier). + +When (1) (2) (3) (8) all pass, the install is bulletproof in the scope of this PR. + +--- + +## Appendix — quick deletion ledger + +Smallest surface remaining post-change: + +- 1 parser binding (`web-tree-sitter`). +- 1 wasm directory (`vendor/wasms/`). +- 1 build script (`scripts/build-vendor-wasms.sh`). +- 1 CLI flag removed (`--native-parser`). +- 1 env var removed (`OCH_NATIVE_PARSER`). +- 0 `optionalDependencies` introduced. +- 0 `overrides` introduced. +- 0 runtime branches keyed on platform/Node version/env var. + +That's the simple-first signature. diff --git a/planning/bulletproof-npm-install/explorer-speed.md b/planning/bulletproof-npm-install/explorer-speed.md new file mode 100644 index 00000000..c5abb44f --- /dev/null +++ b/planning/bulletproof-npm-install/explorer-speed.md @@ -0,0 +1,185 @@ +# Explorer: Speed-first — Bulletproof npm global install for @opencodehub/cli + +**Status:** COMPLETE +**Vector:** Speed-first +**Last updated:** 2026-05-15T03:04:05+00:00 + +--- + +## Protocol + + +Your output file is the single source of truth for your plan. Edit it as each decision crystallizes, before moving to the next one. + + +--- + +## 1. 
Problem Framing + +`npm install -g @opencodehub/cli@latest` exits non-zero because (a) `tree-sitter-swift@0.7.1` runtime-depends on `tree-sitter-cli@0.23.2` whose postinstall pulls a binary from a GitHub release URL that 504s, and (b) seven `tree-sitter-*` native grammar packages declare `peerOptional tree-sitter@^0.21|^0.22.x` while `@opencodehub/ingestion` ships `tree-sitter@0.25.0`, producing ERESOLVE noise. Runtime is already WASM-by-default; native is opt-in. Therefore the native grammar packages are non-essential for a working `npm install -g` — except as carriers of the per-grammar `.wasm` blob (verified: 11 of 13 grammars ship `.wasm` inside the tarball; kotlin/swift/dart do not, and we already vendor those at `packages/ingestion/vendor/wasms/`). + +## 2. Chosen Approach + +**One commit, one publish.** Move every native grammar package out of `dependencies` in `packages/ingestion/package.json` into `optionalDependencies`, and pin a problematic transitive `tree-sitter-cli` to a noop via `overrides` for belt-and-suspenders safety. Bump the patch version, publish, smoke-test with `npm pack` + `npm install -g ./`. **Do not touch parser-runtime code, do not refactor, do not restructure tarballs.** The `requireFn.resolve('${pkg}/package.json')` cascade in `wasm-fallback.ts:307` Just Works whether the optional package is installed or not — when present, the `.wasm` is found; when absent, the language is unsupported on that host (acceptable per the user preference "go all in on wasm"). + +Shape: **package.json surgery + npm overrides + smoke test**. No new modules, no new abstractions, no behavior change for the runtime path. + +## 3. 
Key Decisions + +### Decision A — Move all `tree-sitter-*` native grammar packages from `dependencies` to `optionalDependencies` in `packages/ingestion/package.json` + +**What.** Move these 14 keys (`packages/ingestion/package.json:59-72`) out of `dependencies` into the existing `optionalDependencies` block (which already holds `ts-morph`, see `packages/ingestion/package.json:85-87`): + +- `tree-sitter@0.25.0` +- `tree-sitter-c@0.24.1` +- `tree-sitter-c-sharp@0.23.5` +- `tree-sitter-cpp@0.23.4` +- `tree-sitter-go@0.25.0` +- `tree-sitter-java@0.23.5` +- `tree-sitter-javascript@0.25.0` +- `tree-sitter-kotlin@0.3.8` +- `tree-sitter-php@0.24.2` +- `tree-sitter-python@0.25.0` +- `tree-sitter-ruby@0.23.1` +- `tree-sitter-rust@0.24.0` +- `tree-sitter-swift@0.7.1` ← the postinstall offender (it pulls `tree-sitter-cli` per `pnpm-lock.yaml` lines around 11937) +- `tree-sitter-typescript@0.23.2` + +Keep `web-tree-sitter@0.26.8` in `dependencies` (the runtime entrypoint per `wasm-fallback.ts:198`). + +**Why this is the smallest possible diff.** npm tolerates failures inside `optionalDependencies` and never fails the parent install because of them. ERESOLVE peer conflicts on optional deps are demoted to warnings rather than hard errors. The runtime (`grammar-registry.ts:194-254`, `wasm-fallback.ts:249-277`) calls `require()` lazily, gated by `OCH_NATIVE_PARSER`; on the WASM default path it only needs `requireFn.resolve('${pkg}/package.json')` to find the `.wasm` blob, which works for whichever optional packages happened to install successfully on the host. + +**Tradeoff.** On hosts where the optional grammar package failed to install (e.g. `tree-sitter-kotlin` requires a C++ toolchain the host doesn't have), that language degrades to "unsupported on this install" with a `requireFn.resolve` returning undefined → upstream caller sees `undefined`.
Acceptable: kotlin/swift/dart are already covered by vendored WASMs (the vendored fallback at `wasm-fallback.ts:294-303` runs after the per-grammar lookup), and the other ten languages ship prebuilds for common platforms (Linux x64/arm64, macOS arm64/x64, Windows x64) and rarely fail. Native opt-in via `OCH_NATIVE_PARSER=1` is a developer-mode feature; if a developer wants it, they can `npm i tree-sitter tree-sitter-python` etc. directly. Per user preference: "go all in on wasm if it has the same support" — it does. + +### Decision B — Add an `overrides` entry that defangs `tree-sitter-cli` for belt-and-suspenders + +**What.** Add to `packages/ingestion/package.json`: + +```jsonc +"overrides": { + "tree-sitter-cli": "npm:@socketregistry/empty-package@*" +} +``` + +…and likewise to `packages/cli/package.json` (the published-tarball root for `npm install -g`). npm reads `overrides` from the install root's `package.json`, including for global installs. + +**Why.** Even if the user passes `--include=optional` or `optionalDependencies` somehow doesn't suppress the failure on a particular npm version, this guarantees `tree-sitter-cli` resolves to an empty noop with no postinstall network call. `@socketregistry/empty-package` is a maintained empty-package shim from Socket — exactly the established pattern for nuking unwanted transitive deps. Cost: 1 extra dep at install time, no runtime impact (the override only fires inside the `tree-sitter-swift` install tree, which is already optional). + +**Tradeoff.** This means even if a developer opts into `OCH_NATIVE_PARSER=1` and Swift, `tree-sitter-cli` won't be available. That's fine — `tree-sitter-cli` is a dev tool for grammar authors, not required for runtime parsing. None of `grammar-registry.ts:240-243` calls into it. + +**Cheaper-to-reverse alternative considered:** rely solely on `optionalDependencies` and skip the override. 
Reject — costs nothing to add and gives us a hard guarantee in case any consumer's npm config (e.g. `--include=optional` or pnpm's `optional=true`) attempts to install the optional package anyway. Ship-today over correct-forever. + +### Decision C — Keep the `web-tree-sitter` dependency, do NOT touch parser code, do NOT refactor + +**What.** No edits to `packages/ingestion/src/parse/grammar-registry.ts` or `wasm-fallback.ts`. The WASM resolver already has the right two-stage cascade (per-grammar package → vendored). The native loader at `grammar-registry.ts:194-254` only runs under `OCH_NATIVE_PARSER=1`; if a `require('tree-sitter-foo')` throws because the package wasn't installed, the existing error handling surfaces a clean message. + +**Why.** Per problem statement: "Don't refactor parser layers. Don't write new abstractions." The user's CLAUDE.md already documents WASM-default + native-opt-in. The plan is a package.json move, not a code change. + +**Tradeoff.** A purer plan would delete the native loader paths entirely (since the user said "go all in on wasm"). Rejecting that for this iteration — purely a deletion, has zero blast radius, but adds review surface and risks breaking the `OCH_NATIVE_PARSER=1` developer affordance documented in CLAUDE.md. **Deferred to follow-up.** + +### Decision D — Bump version, do not change the publish surface + +**What.** Bump `packages/cli/package.json` from `0.3.0` → `0.3.1` (or `0.4.0` if you'd rather signal "install path changed"). Bump `packages/ingestion/package.json` from `0.3.2` → `0.3.3`. Tag normally; the existing release-please workflow (referenced in `f3c30f7 chore: release main`) handles versions. + +**Why.** Patch-bump is appropriate: the public API doesn't change, only install behavior. Skip `0.4.0` unless you want consumer-facing release notes that say "switched to WASM-only on global install." 
+ +**Tradeoff.** If the optional-deps move is too disruptive on some platform we haven't tested, a `0.3.1` deprecation needs another patch. Acceptable. + +### Decision E — Do not ship an `.npmrc` inside the package + +**What.** Skip the `.npmrc`-in-tarball idea. + +**Why.** `npm install -g ` does not read an `.npmrc` from inside the package being installed; it reads the user's `~/.npmrc`, the project `.npmrc`, and `npm config`. Shipping one would be cargo-cult. The user-side warnings ("Unknown user config 'store-dir'", "package-import-method") come from the user's `~/.npmrc` having pnpm-only options — not our problem to fix from our tarball. **Document in the README that those warnings are benign and originate from the user's pnpm config bleeding into npm.** + +**Tradeoff.** If we wanted to suppress the user's pnpm-side warnings, we'd need to teach `codehub init` to also write a `~/.npmrc` shim — out of scope for "ship today." + +## 4. Implementation Steps + +Each step lists the file, the change, and the verification. + +1. **Edit `packages/ingestion/package.json`.** Move every `tree-sitter*` (including `tree-sitter` core) from `dependencies` into `optionalDependencies`. Keep `web-tree-sitter@0.26.8` in `dependencies`. Verify with `jq '.dependencies | keys | .[] | select(startswith("tree-sitter"))' packages/ingestion/package.json` returning only `web-tree-sitter`. + +2. **Edit `packages/ingestion/package.json` and `packages/cli/package.json`.** Add an `overrides` block: + + ```jsonc + "overrides": { + "tree-sitter-cli": "npm:@socketregistry/empty-package@*" + } + ``` + + The cli package is the install root for `npm install -g`, so it's the one npm actually reads for overrides. Add to both for defense-in-depth (in case anyone consumes ingestion directly as a non-workspace dep later). + +3. **Edit `pnpm-workspace.yaml`.** Mirror the override in the pnpm `overrides` block (lines 5–27) so monorepo dev installs match the published shape: + + ```yaml + overrides: + # ... 
existing ... + tree-sitter-cli: "npm:@socketregistry/empty-package@*" + ``` + + Optional but recommended — keeps dev/CI install behavior identical to the published artifact. + +4. **Bump versions.** `packages/cli/package.json` → `0.3.1`, `packages/ingestion/package.json` → `0.3.3`. Skip if release-please owns versioning. + +5. **Run `pnpm install`** at the repo root to refresh `pnpm-lock.yaml`. Verify: + - `tree-sitter-cli` no longer appears under `tree-sitter-swift` resolved deps (or resolves to `@socketregistry/empty-package`). + - The native grammar packages still resolve (they're optional, so pnpm still installs them in dev — that's fine; we're testing the published-shape behavior separately). + +6. **Run `pnpm -r build`** then `pnpm -r test` to confirm no test breakage. Pay attention to `packages/ingestion/src/parse/wasm-parity.test.ts` — it runs both runtimes; the native side may now resolve grammars from the optional install which is fine. + +7. **Smoke test the published shape.** From a clean directory: + ```sh + cd /tmp && rm -rf och-smoke && mkdir och-smoke && cd och-smoke + npm pack /efs/lalsaado/workplace/opencodehub/packages/cli + # also pack the workspace deps: + for pkg in ingestion analysis core-types embedder mcp pack policy sarif scanners search storage wiki; do + npm pack /efs/lalsaado/workplace/opencodehub/packages/$pkg + done + # install the cli tarball, with workspace deps as file-refs (or publish to a local verdaccio): + npm install -g ./opencodehub-cli-0.3.1.tgz + echo $? # MUST be 0 + codehub --version + ``` + + Run on: + - **Linux Node 22** (mise-managed) — primary target. + - **Linux Node 24** — verify WASM-only path still works (CLAUDE.md notes native is unsupported on Node 24). + - **macOS arm64 Node 22** (Homebrew or mise) — secondary target. + + If node_modules isn't fully resolvable due to workspace deps, push a `0.3.1-rc.1` tag to npm's dist-tag and run `npm install -g @opencodehub/cli@rc` from a throwaway machine. 
Spend the API tokens; this is the gate. + +8. **Smoke test `codehub analyze`.** Inside a small repo (e.g. the cli's own dist), run `codehub analyze .` and confirm parsing succeeds for the languages whose native grammars happened to install (most). Confirm kotlin/swift/dart still parse via the vendored WASM (`vendor/wasms/`). If any language errors out, that's a known degradation — log it and move on. + +9. **Publish.** `pnpm -r publish --access public` (or merge to main and let the release-please bot tag). Confirm the published tarball at `npm view @opencodehub/cli@0.3.1 dependencies` shows zero `tree-sitter-*` keys (only `web-tree-sitter` and the workspace siblings). + +10. **Update README** (`packages/cli/README.md` if it exists, otherwise the repo root README) with one paragraph: "Native grammars are optional. The default runtime is WASM. Set `OCH_NATIVE_PARSER=1` if you want native parsing and have already installed the grammar packages by hand." Note the user-side `~/.npmrc` `store-dir`/`package-import-method` warnings are benign pnpm-config bleed. + +## 5. Risks and Tradeoffs + +- **Risk: optional-dep failures cascade.** On a host where `tree-sitter-kotlin` (no prebuilds, requires C++ toolchain) fails to install, npm prints a long warning trail but still exits 0. That's a UX wart, not a blocker. Mitigation: README mentions it. +- **Risk: ERESOLVE warnings persist.** Optional deps still emit peer-resolution warnings (`tree-sitter-cpp` peerOptional `tree-sitter@^0.21.x` vs our `^0.25.x`). npm v9+ no longer hard-errors on peerOptional mismatches; it warns. Acceptable. If a user runs `npm install -g --strict-peer-deps` they'll still fail — that's their choice. +- **Risk: empty-package override breaks if a future grammar bumps `tree-sitter-cli` API usage.** Impact zero — `tree-sitter-cli` is only used in postinstall scripts, never imported by runtime code in any of the grammar packages. 
Verified by inspecting `tree-sitter-swift@0.7.1`'s `package.json` (the `install` script uses `node-gyp-build`, not `tree-sitter-cli`). +- **Risk: a downstream consumer of `@opencodehub/ingestion` (not via the cli) expects `tree-sitter-c` to be a regular dependency.** Mitigation: that's an internal-monorepo concern; all our consumers are workspace siblings. Document in CHANGELOG that ingestion's native grammars moved to optional. +- **Risk: regression on `complexity.ts` phase.** Per CLAUDE.md, complexity uses native tree-sitter and "degrades gracefully." Confirmed by the existing one-shot stderr warning. No code change needed. +- **Tradeoff accepted:** native parser support remains the same on a development box (devs can `npm i` the optionals or run `pnpm install` in the monorepo) but degrades on a fresh `npm install -g` box. Aligns with the user's stated "WASM-only is fine" preference. +- **Deferred to follow-up:** rip out the native parser code paths entirely; teach `codehub init` to fix the user's `~/.npmrc` pnpm-warning bleed; collapse the `tree-sitter-*` optional list into a smaller curated set; rebuild kotlin/swift/dart vendored WASMs from current grammar versions. + +## 6. Verification Criteria + +The plan worked iff: + +1. **Hard exit code:** `npm install -g ./opencodehub-cli-0.3.1.tgz` exits 0 on Linux Node 22 + Linux Node 24 + macOS arm64 Node 22. Tested in fresh shell, fresh `npm prefix`, no cached `~/.npm` entries from previous runs. + +2. **No GitHub-release postinstall network calls.** `npm install -g --foreground-scripts ./opencodehub-cli-0.3.1.tgz 2>&1 | grep -i "github.com.*releases"` returns nothing. + +3. **No ERESOLVE blocker.** `npm install -g ./opencodehub-cli-0.3.1.tgz 2>&1 | grep -E "ERESOLVE|peer dep"` should show only warnings, never errors. Exit 0 confirms no hard fail. + +4. **Runtime smoke test.** `codehub --version` prints the version. 
`codehub analyze /tmp/some-repo` parses TypeScript / Python / JavaScript files (the prebuild-shipping languages) without falling back to "language not supported." + +5. **Vendored WASMs still load.** A fixture file in Kotlin or Swift parses correctly, which confirms the vendored fallback path works when the optional native package failed to install. + +6. **No test regressions.** `pnpm -r test` passes pre- and post-edit. `wasm-parity.test.ts` in particular continues to pass. + +7. **Published artifact inspection.** `npm view @opencodehub/cli@0.3.1 dependencies` lists no `tree-sitter-*` keys. `npm view @opencodehub/ingestion@0.3.3 optionalDependencies` lists all 14. + +If all seven pass, ship and close. The 0.4.0 architectural cleanup (delete native loader code) lives on a future branch. diff --git a/planning/bulletproof-npm-install/plan.md b/planning/bulletproof-npm-install/plan.md new file mode 100644 index 00000000..5d68c6f1 --- /dev/null +++ b/planning/bulletproof-npm-install/plan.md @@ -0,0 +1,313 @@ +# Plan: Bulletproof npm global install for @opencodehub/cli + +**Status:** COMPLETE +**Last updated:** 2026-05-15T03:04:05+00:00 +**Explorers:** +- planning/bulletproof-npm-install/explorer-architectural.md +- planning/bulletproof-npm-install/explorer-speed.md +- planning/bulletproof-npm-install/explorer-simple.md + +--- + +## Problem + +`npm install -g @opencodehub/cli@latest` exits non-zero on multi-Node-installer systems (mise, nvm, Homebrew, Volta, corepack). It fails across Node 20, 22, and 24 on Linux and macOS. Two compounding issues share one root cause: the published cli pulls 13 native `tree-sitter-*` grammar packages plus `tree-sitter@0.25.0` core through `@opencodehub/ingestion`. + +The first issue is a hard error. `tree-sitter-swift@0.7.1` runtime-depends on `tree-sitter-cli@0.23.2`. That package's postinstall fetches a platform binary from `github.com/tree-sitter/tree-sitter/releases`. The fetch currently 504s and `npm install` aborts with a non-zero exit.
+ +The second issue is the noise floor. Each of the 13 native grammars carries `peerOptional tree-sitter@^0.21|0.22|0.23` while ingestion ships `tree-sitter@0.25`. The mismatch produces ERESOLVE peer-dep warnings on every install. + +The runtime is already WASM-by-default per `CLAUDE.md`. Native is opt-in via `OCH_NATIVE_PARSER`. So the fix is to make the published surface match the published runtime: ship WASM-only, vendor every grammar's `.wasm` blob into the ingestion tarball, and quarantine native tree-sitter to dev workspace dependencies. + +### npm vs pnpm install mechanics + +`npm install` reads `dependencies` and `optionalDependencies` from the install root's `package.json` and runs `preinstall`, `install`, and `postinstall` lifecycle scripts on every package in the resolved graph. The user can suppress those via `--ignore-scripts`, but our published cli should not require that flag. The `overrides` field in the install-root `package.json` lets npm rewrite any transitive dep, including swapping it for a no-op shim. + +pnpm differs in two ways. It reads its own `pnpm.overrides` block from `pnpm-workspace.yaml` (or the workspace root `package.json`). And it gates lifecycle-script execution behind `pnpm-workspace.yaml`'s `onlyBuiltDependencies` (a.k.a. `allowBuilds`). + +The cli is published to npm. The install root for `npm install -g @opencodehub/cli` is the cli's own `package.json`. The right architectural move is to delete the failing transitive deps at the source. The `overrides` field is then cheap insurance against any future grammar package re-introducing `tree-sitter-cli`. + +## Chosen Approach + +Shape: WASM-only at the publish boundary. Vendor every grammar. Belt-and-suspenders override on `tree-sitter-cli`. One PR. One bumped major-ish version (0.4.0). One publish. + +Six anchors hold the plan together. + +First, single parser path. We delete `OCH_NATIVE_PARSER` and `--native-parser` from the runtime. 
`web-tree-sitter` reading vendored `.wasm` blobs becomes the only path. All three explorers converge on this. The user's stated preference, "go all in on wasm if it has the same support and if it's less confusing", is the deciding factor. + +Second, vendor all 15 WASM grammars into `packages/ingestion/vendor/wasms/`. The ingestion tarball ships them. Runtime resolves by file path against the package root. There is no `require.resolve` cascade. (Source: Architectural plus Simple. Both reject the two-stage cascade once natives leave runtime deps.) + +Third, quarantine native tree-sitter to `devDependencies`. Native `tree-sitter@0.25.0`, the 13 native grammars, and `tree-sitter-cli` survive in the workspace for the parity test, complexity comparison runs, and `scripts/build-vendor-wasms.sh`. They never enter the published cli's install graph. (Source: Architectural Decision E.) + +Fourth, port the complexity phase to web-tree-sitter. `verdict.ts:101,688` consumes `cyclomaticComplexity > 10` to set risk tiers, so deletion is a regression. The port is mechanical. Every API the walker uses (`rootNode`, `child`, `childCount`, `childForFieldName`, `text`, `startPosition`, `endPosition`, `type`) exists with identical semantics on `web-tree-sitter`'s `Node`. (Source: Simple Decision C, anchored on `verdict.ts:101,688` as a real consumer.) + +Fifth, an npm `overrides` shim on `tree-sitter-cli`. We add `"tree-sitter-cli": "npm:npm-empty-package@1.0.0"` to `packages/cli/package.json` and `packages/ingestion/package.json`. We also mirror the entry into `pnpm.overrides` in `pnpm-workspace.yaml`. Even after we move natives to devDeps, this guarantees no future grammar bump can re-introduce `tree-sitter-cli` into a published install graph. The runtime cost is zero. (Source: Speed Decision B, the safety net the other two missed.) + +Sixth, lower the `engines.node` floor to `>=20.0.0`. WASM has no Node-version constraint. Native ABI was the only reason the floor was 22+. 
(Source: Architectural Decision F.) + +A 9-cell CI install matrix gates each release. The cells span Linux x64 Node 20/22/24 (mise, nvm), Linux arm64 Node 22, macOS arm64 Node 22 via Homebrew/nvm/Volta, and macOS x64 Node 22 via nvm. (Source: Architectural Decision F plus Simple Decision D's docker harness.) + +### Where the explorers diverged, and how this plan resolved it + +The Speed-first plan kept native as `optionalDependencies` so we could ship today without touching parser code. The Architectural and Simple plans both delete the second path now. This plan goes WASM-only now. Three reasons drove that call. The user said "less confusing" matters. The parity test (`wasm-parity.test.ts`) already proves WASM produces equivalent output. And `optionalDependencies` does not solve the ERESOLVE warnings; it only demotes them. The user asked for zero warnings, not "demoted". + +We still adopt Speed's `overrides` shim. Even with native deps gone from runtime today, a future maintainer accidentally re-adding one is a single `pnpm-lock.yaml` regression away. A grammar transitively re-introducing `tree-sitter-cli` is the same risk. The override is the architectural guardrail that survives whoever reads this plan in 18 months. + +The major axis the explorers agreed on: the architectural fix is "the published surface declares only what the published runtime actually loads" (Architectural §1). All three explorers identify the same root cause and the same mechanical inventory of files. Architectural §3-A and Simple §A.2 cite the same line ranges in `parse-worker.ts` for deletion. Both cite the same complexity-port mechanism. Both vendor 15 WASMs. This is not a coin flip. Three independent agents converged on the same target shape, including the same 600+ LOC deletion ledger. + +## Decisions + +### D1. Parser runtime: WASM-only, native deleted from runtime + +**Call.** Delete the native parse path from runtime code. `web-tree-sitter` is the only runtime parser host. 
`OCH_NATIVE_PARSER` and `--native-parser` are hard-removed in 0.4.0 with no soft-deprecation window. + +**Source.** Simple Decision F (hard removal because there is no second user; the M5-era opt-in audience is opencodehub maintainers who can run from source). Reinforced by Architectural Decision A (5 equivalence classes of native-vs-WASM behavior collapse to 1). + +**Reason.** The parity test (`packages/ingestion/src/parse/wasm-parity.test.ts`) already asserts capture-set equivalence between native and WASM. The user explicitly chose WASM. Keeping a deprecation alias adds review surface for zero user-value. + +**Tradeoff.** Anyone running a script with `--native-parser` gets a clean commander error on first run after upgrade. CHANGELOG and stderr advisory cover the path forward. We rejected Architectural's "soft-deprecate for one release" because the flag's only documented audience is workspace developers, not npm-install consumers. + +### D2. Grammar artifacts: vendor all 15 WASMs + +**Call.** Vendor all 15 grammar `.wasm` files into `packages/ingestion/vendor/wasms/` plus `web-tree-sitter`'s own runtime `.wasm`. Resolver collapses to a single declarative `LanguageId → vendor/wasms/.wasm` map. No `require.resolve` cascade. + +**Source.** Architectural Decision B plus Simple Decision B (both converge on file-path-based lookup; Architectural is the named source for "vendoring is the boundary"). + +**Reason.** Once native grammar packages leave `dependencies`, the existing two-stage cascade in `wasm-fallback.ts:249-303` has no per-package source to find. Stage 1 becomes dead code. The runtime should know exactly where its assets live, not heuristically `require.resolve` into a `node_modules` shape that may not exist on a global install. 
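
As a rough illustration of the single declarative map described in D2, the sketch below shows one way a file-path resolver could look. It is not the actual ingestion code: the `LanguageId` members, the `.wasm` file names, and the `(packageRoot, lang)` signature are all assumptions made for the example; only the `vendor/wasms/` directory and the idea of a one-table lookup with no `require.resolve` cascade come from the plan.

```typescript
import { join } from "node:path";

// Hypothetical subset of language ids; the real LanguageId union is defined
// in the ingestion package and covers all 15 grammars.
type LanguageId = "typescript" | "tsx" | "python" | "kotlin";

// One declarative table: language id -> vendored file name.
// The file names are illustrative assumptions, not the repo's actual names.
const VENDORED_WASM: Readonly<Record<LanguageId, string>> = {
  typescript: "tree-sitter-typescript.wasm",
  tsx: "tree-sitter-tsx.wasm",
  python: "tree-sitter-python.wasm",
  kotlin: "tree-sitter-kotlin.wasm",
};

// Type guard so untrusted strings (CLI input, file-extension lookups) are
// checked against the table's own keys only, not inherited properties.
function isLanguageId(value: string): value is LanguageId {
  return Object.prototype.hasOwnProperty.call(VENDORED_WASM, value);
}

// Resolve by file path against the package root. No node_modules probing:
// an unknown language is a hard error rather than a silent fallback.
function resolveGrammarWasmPath(packageRoot: string, lang: string): string {
  if (!isLanguageId(lang)) {
    throw new Error(`no vendored grammar for language: ${lang}`);
  }
  return join(packageRoot, "vendor", "wasms", VENDORED_WASM[lang]);
}
```

With a table like this, adding a grammar is a one-line change plus one vendored file, and a missing entry fails loudly at lookup time instead of depending on whatever `node_modules` layout a global install happens to produce.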
+ +**Mechanics.** For 12 grammars that ship a `.wasm` inside their npm tarball (typescript/tsx, javascript, python, go, rust, java, csharp, c, cpp, ruby, php), copy from `node_modules/.pnpm/tree-sitter-@/...` (Source: Simple §B.1 step 1; `cp` is faster and reproduces upstream's artifact exactly). For the 3 that don't (kotlin, swift, dart), use `tree-sitter build --wasm` (existing logic in `scripts/build-vendor-wasms.sh`). + +**Tradeoff.** Tarball size grows from ~5 MB to ~28 MB for `@opencodehub/ingestion`. Net global-install download is **smaller** than today because the existing native deps drag in `.cc` source plus `.node` prebuilds for every platform (often ~50 MB+ across all 13 grammars). + +### D3. Complexity phase: port to WASM, do not delete + +**Call.** Port `packages/ingestion/src/pipeline/phases/complexity.ts` from native `requireFn("tree-sitter")` to `web-tree-sitter` against the vendored WASM. Stay on the main thread. Re-parse via WASM. Absorb the ~1.5x parse cost. + +**Source.** Simple Decision C, anchored on the fact that `verdict.ts:101,688` consumes `cyclomaticComplexity > 10` to set risk tiers. Deletion would be an observable regression for `verdict` users. Architectural Decision G converges on the same port. + +**Reason.** The walker code at `complexity.ts:370-509` already operates against a `TsNode` interface that matches both bindings. `web-tree-sitter`'s Node API is the upstream reference. Semantics are identical. Architectural's "move complexity into the parse worker" idea is correct long-term but bigger than this PR. We file it as a follow-up. + +**Architectural win.** Complexity is silently zero today on Node 24 default and Node 22 default (no `OCH_NATIVE_PARSER` set). After this port, every default install gets full complexity metrics. We fix a hidden quality-of-result regression along the way. (Source: Architectural Decision G's framing.) + +### D4. 
Published `dependencies`: native packages move to `devDependencies` + +**Call.** Move 14 packages (`tree-sitter@0.25.0` plus 13 `tree-sitter-` grammars) from `packages/ingestion/package.json` `dependencies` (lines 59-72) to `devDependencies`. Add `tree-sitter-cli` as an explicit root `devDependency` so `scripts/build-vendor-wasms.sh` keeps working after `tree-sitter@0.25.0` stops pulling it transitively. Keep `web-tree-sitter@0.26.8` in `dependencies`. + +**Source.** Architectural Decision E (the table at lines 312-325 is the operational specification). Simple Decision A.1 converges on the same 14 deps. + +**Reason.** This single change resolves both surface failures simultaneously. (a) `tree-sitter-swift` is no longer in the runtime install graph, so its `tree-sitter-cli` postinstall never runs. The GHCR 504 disappears. (b) The peer relationship between `tree-sitter@0.25` and the grammars' `peerOptional tree-sitter@^0.21|0.22|0.23` is no longer in the published runtime graph. Zero ERESOLVE warnings remain. + +**Why not `optionalDependencies` (Speed Decision A's path).** `optionalDependencies` demotes ERESOLVE to warnings. It does not eliminate them. The user asked for "fix all ERESOLVE peer warnings". The only way to zero them is to remove the peer relationship entirely from the install graph. + +### D5. ~~Belt-and-suspenders: `tree-sitter-cli` override~~ — DROPPED + +User decision (2026-05-15): drop the override. D4 already removes `tree-sitter-swift` (the sole consumer of `tree-sitter-cli` in `pnpm-lock.yaml`) from the runtime install graph. The override was Speed Decision B's belt-and-suspenders shim against a future maintainer regression. The durable answer is the ADR (D10) plus the CI install matrix (D-Verification) catching any regression at PR time, not a supply-chain shim. + +### D6. 
Resolver collapse: single declarative table + +**Call.** Rewrite `packages/ingestion/src/parse/wasm-fallback.ts:222-303` from a two-stage cascade (`tryPerGrammarPackage` → `tryVendoredWasm`) to one declarative `Record` map plus a single `path.resolve(VENDOR_WASMS_DIR, fname)` call. Rename the file `wasm-fallback.ts` → `wasm-runtime.ts`. (It isn't a fallback when it's the only path.) Add `Parser.init({ locateFile: () => fileURLToPath(new URL("../../vendor/wasms/web-tree-sitter.wasm", import.meta.url)) })` so the runtime WASM resolves against the vendored copy. + +**Source.** Architectural Decision F's `locateFile` insight plus Simple Decision E.2 collapse-to-flat-table directive (332 → ~120 lines). + +**Reason.** Per Architectural Decision I §3, `tryPerGrammarPackage` returns undefined for every language after D4. Stage 1 is unreachable code. A single source of truth for "where's the WASM for X" is mechanically simpler to test, debug, and update. + +### D7. `engines.node`: lower to `>=20.0.0` + +**Call.** Change `packages/cli/package.json:80-82` and `packages/ingestion/package.json:105-107` from `>=22.0.0` to `>=20.0.0`. + +**Source.** Architectural Decision F. + +**Reason.** The 22+ floor was added because of native tree-sitter ABI requirements. Once native is gone from runtime, WASM has no Node-version dependency. `web-tree-sitter` runs on Node 18+. Node 20 is current LTS. Restricting to 22+ is unnecessarily aggressive. + +**Tradeoff.** Simple §340 argued to keep `>=22.0.0` for a "fail fast outside the supported runtime" contract. We rejected. Node 20 is current LTS, and the only reason to exclude it was a constraint we're deleting. + +### D8. Version bump: 0.4.0 (semver-major-ish behavior change) + +**Call.** Bump `packages/cli/package.json` and `packages/ingestion/package.json` to **0.4.0**. Use a `feat!:` conventional commit so release-please tags it as a breaking change. + +**Source.** Architectural §2 plus Simple Decision G converge. 
Reject Speed Decision D's 0.3.1 patch bump. + +**Reason.** This release removes a documented CLI flag (`--native-parser`) and a documented env var (`OCH_NATIVE_PARSER`). It changes the ingestion tarball layout. It lowers the engines floor. That is semver-breaking behavior. A patch bump understates the change to anyone watching dist-tags. + +### D9. Workspace publish hygiene: `prepublishOnly` verification gate + +**Call.** Add `prepublishOnly: "node scripts/verify-vendor-wasms.mjs"` to `packages/ingestion/package.json:35-39`. The verify script asserts three things: (a) all 15 expected `.wasm` files exist in `vendor/wasms/`; (b) each has valid WASM magic bytes; (c) each matches the grammar version pinned in `pnpm-lock.yaml` via a `vendor/wasms/manifest.json` written by the build script. + +**Source.** Architectural Decision H §1. + +**Reason.** This is the durable safety net for grammar drift. About 50 LOC of script prevents an entire class of silent regression where a maintainer bumps a grammar version but forgets to rebuild the vendored WASM. + +### D10. Migration messaging: stderr advisory plus commander unknown-flag + +**Call.** At cli startup, if `process.env["OCH_NATIVE_PARSER"]` is set, emit one stderr line: `[codehub] OCH_NATIVE_PARSER was removed in 0.4.0; WASM is the only parser runtime. Unset to silence this warning.` Then `delete process.env["OCH_NATIVE_PARSER"]`. For `--native-parser`, do NOT add a deprecation alias. Let commander reject as unknown. + +**Source.** Simple Decision F. + +**Reason.** The loudest possible signal is a hard error. Anyone with the flag in a script reads the CHANGELOG, deletes the flag, and moves on. Total user-impact cost is minutes per script. The only audience for the env var is opencodehub maintainers who already track the repo. + +## Implementation Order + +One PR titled `feat!: WASM-only parser path; drop native tree-sitter and tree-sitter-cli`. Each step gates the next via the listed verification. + +1. 
**Vendor the 11 missing WASMs (no behavior change yet).** + - Edit `scripts/build-vendor-wasms.sh` to add 11 `cp` lines pulling each `tree-sitter-.wasm` from `node_modules/.pnpm/tree-sitter-@/.../tree-sitter-.wasm` (Source: Simple §B). Keep the existing `tree-sitter build --wasm` logic for kotlin, swift, dart. + - Also vendor `web-tree-sitter`'s own runtime wasm to `packages/ingestion/vendor/wasms/web-tree-sitter.wasm` (Source: Architectural Decision D). + - Run the script. Commit 11 new `.wasm` files plus `web-tree-sitter.wasm` plus a new `packages/ingestion/vendor/wasms/manifest.json` recording the grammar version each `.wasm` was built against (Source: Architectural Decision D). + - Add `packages/ingestion/scripts/verify-vendor-wasms.mjs`. The script asserts all 15 grammars exist, checks valid WASM magic bytes, and confirms the manifest matches `pnpm-lock.yaml` versions (Source: Architectural Decision H §1). + - Wire `prepublishOnly: "node scripts/verify-vendor-wasms.mjs"` into `packages/ingestion/package.json:35-39`. + - **Verify.** `ls packages/ingestion/vendor/wasms/*.wasm | wc -l` returns 16 (15 grammars plus 1 web-tree-sitter runtime). `node packages/ingestion/scripts/verify-vendor-wasms.mjs` exits 0. `pnpm pack -C packages/ingestion` produces a tarball ~28 MB containing all 16 `.wasm` files. + +2. **Switch the WASM resolver to vendored-only path (still backward-compatible).** + - Rewrite `packages/ingestion/src/parse/wasm-fallback.ts:222-303` to one declarative `Record` map plus `path.resolve(VENDOR_WASMS_DIR, fname)`. Delete `tryPerGrammarPackage`, `tryVendoredWasm`, `resolvePackageDir`, and the 70-line two-stage-cascade comment. + - Rename `wasm-fallback.ts` → `wasm-runtime.ts`. Update import sites (`parse-worker.ts`, `index.ts`, `complexity.ts`, `grammar-registry.ts`). + - Add `Parser.init({ locateFile: () => fileURLToPath(new URL("../../vendor/wasms/web-tree-sitter.wasm", import.meta.url)) })` to `ensureWasmRuntime` (Source: Architectural Decision F). 
+ - **Verify.** `pnpm -C packages/ingestion test --grep wasm-grammar-resolution` passes. `pnpm -C packages/ingestion test --grep wasm-parity` passes. The parity test still loads native from workspace `node_modules`, so dev behavior is unchanged. + +3. **Port `complexity.ts` from native to WASM.** + - `packages/ingestion/src/pipeline/phases/complexity.ts:78, 106-136`. Replace the `requireFn("tree-sitter")` shim with an `ensureWasmRuntime` import. Build per-language `web-tree-sitter` Parser, cached. + - Delete `getTsModule`, `parserCache` (native-typed), `tsModuleCached`, `warnedComplexityDegraded`, and the `OCH_NATIVE_PARSER` stderr advisory at `complexity.ts:108-119`. + - Delete the 8 `Ts*` ambient interfaces. Reuse `WasmNode` types from `wasm-runtime.ts`. + - **Verify.** `pnpm -C packages/ingestion test --grep complexity` passes on Node 20, 22, and 24 with no env vars set. Hand-craft a 15-decision-point function in a fixture and assert `verdict` bumps a tier on it (Source: Simple Verification §9; this confirms the `verdict.ts:101,688` consumer still works). + +4. **Delete the native parser path from runtime code.** + - `packages/ingestion/src/parse/parse-worker.ts:51-78, 156-191, 222-307`. Delete `forceNativeOpt`, `runNative`, `getOrBuildParser`, `getOrBuildQuery`, the runtime-triage warning, and all 8 `TreeSitter*` ambient interfaces. File shrinks from 308 to ~140 lines. + - `packages/ingestion/src/parse/wasm-runtime.ts` (new name from step 2). Delete `isNativeAvailable`, `cached`, `resetNativeAvailabilityCache`. + - `packages/ingestion/src/parse/index.ts:18`. Drop the `isNativeAvailable, resetNativeAvailabilityCache` re-export. + - `packages/ingestion/src/parse/grammar-registry.ts:193-277`. Delete `loadLanguageObject`. Drop `tsLanguage` from `GrammarHandle` (`grammar-registry.ts:117-122`). Inline `loadGrammar` into a 1-line query-text fetcher. Drop the inflight-dedupe Map; it existed to avoid duplicate native `require()` calls. 
File shrinks from 337 to ~80 lines. + - **Verify.** `pnpm -C packages/ingestion test --grep parse-worker` passes. Cases (b), (c), and (d), the `OCH_NATIVE_PARSER` cases, are deleted. The remaining tests prove WASM-only parse output matches the existing ParseCapture fixtures. + +5. **Delete `wasm-parity.test.ts` and trim `parse-worker.test.ts`.** + - `packages/ingestion/src/parse/wasm-parity.test.ts` exists only to assert WASM-vs-native equivalence. Without a runtime native path the test references nothing that ships. Delete it entirely (~330 lines). We keep native `tree-sitter` in workspace `devDependencies` for the build script. The parity assertion is no longer needed because parity has already shipped (Source: Simple Decision E §1; Architectural Decision J keeps the test, Simple deletes it; we picked Simple because the test's prior architectural justification, "anchor that lets us delete native with confidence", is satisfied by this PR landing successfully). + - `packages/ingestion/src/parse/parse-worker.test.ts`. Delete cases (b), (c), (d). If only one trivial case remains, delete the file. + - **Verify.** `pnpm -C packages/ingestion test` passes. Lost test count matches the deletion ledger. + +6. **Soft-clean the CLI flag and env var.** + - `packages/cli/src/index.ts:88-91, 102-107`. Delete the `--native-parser` option declaration and the env-var setter (Source: Simple §A.2). + - At cli startup (before `commander.parse`), add the one-shot stderr advisory plus `delete process.env["OCH_NATIVE_PARSER"]` from D10. + - **Verify.** `pnpm -C packages/cli test` passes. `node packages/cli/dist/index.js --native-parser foo` exits non-zero with commander's "unknown option" error. + +7. **Move 14 native deps out of runtime; add `tree-sitter-cli` as devDep.** + - `packages/ingestion/package.json:59-72`. Move `tree-sitter@0.25.0` and the 13 `tree-sitter-` keys from `dependencies` to `devDependencies`. Keep `web-tree-sitter@0.26.8` in `dependencies`. 
+ - Workspace root `package.json`. Add `tree-sitter-cli` as a `devDependencies` entry so `scripts/build-vendor-wasms.sh` keeps working after `tree-sitter@0.25.0` stops pulling it transitively. + - `pnpm-workspace.yaml:66-83`. Delete 15 tree-sitter `allowBuilds` entries (`tree-sitter`, all 13 grammars, plus `tree-sitter-dart`). Keep `tree-sitter-cli: true` for the build-vendor-wasms script (Source: Simple §A.3). + - Run `pnpm install` to refresh `pnpm-lock.yaml`. Verify `tree-sitter-cli` no longer appears as a transitive of any runtime dep. + - **Verify.** `jq '.dependencies | keys | .[] | select(startswith("tree-sitter"))' packages/ingestion/package.json` prints nothing: `web-tree-sitter` stays in `dependencies` but does not match the `tree-sitter` prefix. `pnpm -r build && pnpm -r test` is green. + +8. **Lower engines floor.** + - `packages/cli/package.json:80-82` and `packages/ingestion/package.json:105-107`. Change `>=22.0.0` to `>=20.0.0`. + - **Verify.** `node --version` on a Node 20 install plus `pnpm pack && npm install -g ` succeeds and `codehub --version` runs. + +9. **Documentation, ADR 0015, CHANGELOG.** + - Update `CLAUDE.md` "Parse runtime — WASM default, native opt-in" section (lines 96-107) to "Parse runtime — WASM-only, vendored grammars". Drop the `OCH_NATIVE_PARSER` row.
+ - Strip `OCH_NATIVE_PARSER` and `--native-parser` from the 11 docs files Architectural Decision J §3 enumerates: `packages/cli/README.md:79`, `packages/docs/src/content/docs/guides/indexing-a-repo.md:130`, `packages/docs/src/content/docs/guides/troubleshooting.md:27,80`, `packages/docs/src/content/docs/architecture/parsing-and-resolution.md:25`, `packages/docs/src/content/docs/architecture/adrs.md:126`, `packages/docs/src/content/docs/reference/configuration.md:31,33`, `packages/docs/src/content/docs/reference/languages.md:53,55`, `packages/docs/src/content/docs/reference/cli.md:40`, `packages/docs/src/content/docs/start-here/what-is-opencodehub.md:68`, `packages/docs/src/content/docs/start-here/install.md:15,112`, root `README.md:83,234-236`, `packages/ingestion/README.md:25,57`. + - Mark ADR 0013 superseded. Write `docs/adr/0015-wasm-only-parser-at-the-npm-distributed-boundary.md` with the install-failure trigger and the deletion ledger. (ADR 0015 already exists at HEAD covering scip-references-and-embedder-fingerprint, so this plan uses 0015 as next available.) + - Add CHANGELOG entries to `packages/ingestion/CHANGELOG.md`, `packages/cli/CHANGELOG.md`, and root `CHANGELOG.md` describing the breaking change. + - **Verify.** `rg -n 'OCH_NATIVE_PARSER|--native-parser' packages/ docs/` returns hits only inside CHANGELOG entries. + +10. **Bump versions to 0.4.0.** + - `packages/cli/package.json` and `packages/ingestion/package.json` to `0.4.0`. Conventional commit: `feat!: WASM-only parser path; drop native tree-sitter from runtime`. Release-please will tag accordingly. + - **Verify.** `pnpm -r build && pnpm -r test` is green at the bumped version. + +11. **Add the CI install-matrix workflow.** + - New file `.github/workflows/verify-global-install.yml` runs the 9-cell matrix from D-Verification on every push to main and every release tag. + - New file `scripts/verify-global-install.sh` (Source: Simple Decision D's docker harness). 
It publishes an RC dist-tag, then for each `(os, node, installer)` cell runs `npm install -g @opencodehub/cli@rc` in a clean shell, asserts exit 0, runs `codehub --version`, and runs `codehub analyze tests/fixtures/multi-lang/`. + - **Verify.** Workflow goes green for all 9 cells against an RC tag before promoting to `latest`. + +12. **Local smoke test + open PR (do NOT publish from this session).** + - Run `scripts/verify-global-install.sh local` against a locally-packed tarball. + - Push branch and open PR. Maintainer merges; release-please tags 0.4.0; CI matrix gates the publish. + - **Verify.** Zero `WARN`, zero `ERR! ERESOLVE`, zero GHCR fetches in stderr. Install completes in <60 s on a baseline runner with cold npm cache. + +## Risks + +The three load-bearing tradeoffs are the axes where the explorers diverged most. + +The first is "ship WASM-only now" versus the `optionalDependencies` bandaid (Speed-first). We resolved in favor of now. `optionalDependencies` does not zero out ERESOLVE warnings; it only demotes them, and the user asked for zero. Keeping a second runtime path forever costs more than the work the WASM-only path requires. Five equivalence classes, parity test maintenance, doc surface, support questions add up. Mitigation if WASM perf becomes a real blocker: the runtime-native branch in `parse-worker.ts` lives in git history. The diff is scoped enough to restore behind a flag in about a day of work. + +The second is whether to delete `wasm-parity.test.ts` (Simple) or keep it as a dev-only invariant (Architectural). We resolved in favor of delete. Architectural's argument was that the parity test is the architectural anchor that lets us delete the runtime native path with confidence. That anchor's job ends when this PR lands. Once native is gone from runtime, parity is no longer needed because there's no runtime path to protect against drift. Keeping it costs a workspace devDep on 14 native packages forever, plus a Node-22-only test gate in CI. 
The decision is reversible. If a future bug suggests WASM-vs-native semantic drift, restore the test from git. + +The third is the `>=20.0.0` engines floor (Architectural) versus keeping `>=22.0.0` (Simple). We resolved in favor of `>=20.0.0`. Node 20 is current LTS through 2026. The 22+ floor was added because of a constraint we're deleting. The risk is that some `web-tree-sitter@0.26+` incompatibility with Node 20 surfaces. The Linux x64 Node 20 cell of the install matrix catches it. + +Other risks deserve mention. + +Tarball size grows. `@opencodehub/ingestion` goes from ~5 MB to ~28 MB published. Net global-install download is **smaller** because the existing native deps drag in `~50 MB+` of `.cc` source plus `.node` prebuilds across all 13 grammars. Acceptable. If repo size becomes a complaint, a follow-up moves vendored WASMs to git LFS. + +A grammar version bump could produce an incompatible WASM. The `verify-vendor-wasms.mjs` script (D9) checks the manifest against `pnpm-lock.yaml`. Without the script, a maintainer who bumps `tree-sitter-foo` in `devDependencies` without re-running `scripts/build-vendor-wasms.sh` would silently ship an old WASM. With the script, the `prepublishOnly` hook fails loudly. + +`web-tree-sitter@0.26+` may have Node 24 quirks. The Linux x64 Node 24 cell of the install matrix catches them. If one turns out to be a real blocker, hold the release and pin `web-tree-sitter` forward or backward. Do not restore native. + +A `tree-sitter-cli` postinstall failure could resurface from a different transitive path. D5 dropped the `overrides` shim, so the guardrail is the CI install matrix: its audit script asserts no postinstall network calls beyond the registry. + +A user's `~/.npmrc` may have pnpm-only options bleeding into npm. Speed Decision E identified this as out-of-scope. The warnings are benign.
The README adds one paragraph noting that `Unknown user config 'store-dir'` and `package-import-method` warnings originate from the user's pnpm config and are safe to ignore. + +Rollback story. If 0.4.0 fails in the wild, `npm dist-tag add @opencodehub/cli@0.3.0 latest` rolls back globally. Users with `0.4.0` installed re-run `npm install -g @opencodehub/cli@0.3.0`. The 0.4.0 tarball stays on the registry but loses the `latest` tag. + +## Verification Criteria + +The 9-cell install matrix runs on every release tag. All cells must exit 0 before `latest` is promoted (Source: Architectural Decision F plus Simple Decision D harness): + +| OS | Arch | Node | Installer | Verifies | +|----|------|------|-----------|----------| +| Linux | x64 | 20.x | mise | engines satisfied; install succeeds; `codehub --help` runs | +| Linux | x64 | 22.x | mise | plus `codehub analyze ` runs | +| Linux | x64 | 24.x | mise | WASM-only Node 24 path | +| Linux | x64 | 22.x | nvm | tilde-path resolution | +| Linux | arm64 | 22.x | mise | proxy for Apple Silicon | +| macOS | arm64 | 22.x | Homebrew | libuv plus brew prefix paths | +| macOS | arm64 | 22.x | nvm | `$HOME/.nvm/versions/...` | +| macOS | arm64 | 22.x | Volta | shim-based PATH | +| macOS | x64 | 22.x | nvm | Intel Mac smoke | + +Each cell runs (Source: Architectural §6 plus Simple Decision D): +```sh +pnpm pack -C packages/ingestion +pnpm pack -C packages/cli +npm install -g ./packages/cli/opencodehub-cli-*.tgz \ + ./packages/ingestion/opencodehub-ingestion-*.tgz +test $? -eq 0 # hard exit-code gate +codehub --version # exits 0 +codehub --help # exits 0 +codehub analyze tests/fixtures/multi-lang/ # exits 0 +codehub query 'export default' # at least one hit +``` + +Each cell enforces 5 hard gates. + +1. `npm install -g` exits 0. No `ERR! ERESOLVE`. No `npm ERR!` of any kind. +2. `npm install -g --foreground-scripts 2>&1 | grep -iE "github\.com.*releases|tree-sitter-cli"` returns nothing. 
Zero GHCR postinstall fetches in the install graph. +3. `npm install -g 2>&1 | grep -E "ERESOLVE|peer dep"` returns nothing. Zero ERESOLVE warnings, not "demoted", zero. +4. Install completes in **under 60 s** on a baseline runner with cold npm cache. Hard regression gate. +5. Audit script. No `package.json` in the resolved install graph contains `wget`, `curl`, `download`, `node-gyp rebuild`, or `prebuild-install` in any lifecycle script (Source: Architectural §6 distribution gates). + +Unit and integration tests must pass before bump. + +- `packages/ingestion/src/parse/parse-worker.test.ts`. Single-path WASM tests. Native cases removed. +- `packages/ingestion/src/parse/wasm-grammar-resolution.test.ts`. Every `LanguageId` resolves to a `vendor/wasms/.wasm` path that exists on disk. +- `packages/ingestion/src/pipeline/phases/complexity.test.ts`. Passes on Node 20, 22, and 24 with non-zero `cyclomaticComplexity` output. +- New: `packages/analysis/src/verdict.test.ts` (or equivalent). Hand-crafts a 15-decision-point function and asserts `verdict` bumps a tier (Source: Simple Verification §9; this is the assertion that the `complexity → verdict` pipeline still works after the WASM port). +- `tests/fixtures/multi-lang/` end-to-end `codehub analyze` produces the same graph node count as the pre-refactor baseline (parity gate). + +Architectural review-time gates. + +- `rg -n "OCH_NATIVE_PARSER" packages/ docs/` returns hits only inside CHANGELOG entries and ADR 0015. +- `rg -n 'requireFn\("tree-sitter"\)' packages/` returns no hits in non-test source files. +- `jq '.dependencies | keys | .[] | select(startswith("tree-sitter"))' packages/ingestion/package.json` prints nothing: `web-tree-sitter` stays in `dependencies` but does not match the `tree-sitter` prefix. +- `packages/cli/package.json` `dependencies` is unchanged, and no `overrides` block is added (D5 dropped it). +- ADR 0015 lands in `docs/adr/`. +- `npm view @opencodehub/cli@0.4.0 dependencies` lists no `tree-sitter-*` keys (only `web-tree-sitter` plus workspace siblings) (Source: Speed Verification §7).
+- `npm pack && tar tzf opencodehub-ingestion-0.4.0.tgz | grep wasm | wc -l` returns 16 (15 grammar WASMs plus web-tree-sitter runtime WASM). + +Post-release watchpoints, one week after publish. + +- npm download stats for `@opencodehub/cli` show no install-failure spike. +- Issue tracker has zero "install failed" or "tree-sitter postinstall" reports. +- `web-tree-sitter` runtime errors logged in the parse phase, frequency unchanged from 0.3.x baseline. +- A new contributor running `npm install -g @opencodehub/cli@latest` on a fresh box with mise plus Node 24 succeeds first try with no warnings. + +## Convergence Notes + +All three explorers independently identified the same root cause (native tree-sitter in published runtime deps) and the same target shape (WASM-only at the publish boundary, vendored grammars). Architectural and Simple converged to the byte on the deletion ledger. They cite the same line ranges in `parse-worker.ts:51-78, 156-191, 222-307`, the same `complexity.ts:106-136` port, the same `wasm-fallback.ts:222-303` collapse, the same 14-package devDep migration, and the same 15-WASM vendor inventory. Three independent agents producing the same target shape is a high-confidence signal that the architectural call is correct. + +Where they diverged, the resolutions stand. Speed-first kept native via `optionalDependencies` to ship today. We rejected that because it does not zero ERESOLVE and preserves two paths forever. Architectural kept `wasm-parity.test.ts` as a dev-only invariant. We rejected that in favor of Simple's delete because parity's job ends when this PR lands. Simple kept the `>=22.0.0` engines floor for a fail-fast contract. We rejected that in favor of Architectural's `>=20.0.0` because Node 20 LTS is current and the constraint was native-ABI-driven. + +The composed plan takes Architectural's vendor-everything boundary and Simple's deletion ledger and complexity port. Speed's `tree-sitter-cli` `overrides` shim was dropped (D5); the ADR and the CI install matrix serve as the permanent guardrail instead.
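As an illustration of the D9 publish gate described earlier in this plan, here is a hedged sketch of two of its checks: valid WASM magic bytes and manifest-versus-lockfile drift. It is not the actual `verify-vendor-wasms.mjs`. The `Manifest` shape and the `hasWasmMagic` and `staleGrammars` names are assumptions made for the example, and the real script would also have to read the files from `vendor/wasms/` and extract pinned versions from `pnpm-lock.yaml`.

```typescript
// Every WebAssembly binary starts with the four bytes "\0asm".
const WASM_MAGIC: readonly number[] = [0x00, 0x61, 0x73, 0x6d];

// True when the buffer begins with the WebAssembly magic preamble.
function hasWasmMagic(bytes: Uint8Array): boolean {
  return (
    bytes.length >= WASM_MAGIC.length &&
    WASM_MAGIC.every((expected, i) => bytes[i] === expected)
  );
}

// Hypothetical manifest shape: grammar package name -> version the vendored
// `.wasm` was built from. The plan names the file (`manifest.json`) but does
// not specify its layout.
type Manifest = Record<string, string>;

// Returns the grammars whose vendored version is missing from the manifest or
// differs from the pinned version, sorted for stable error output.
function staleGrammars(manifest: Manifest, pinned: Record<string, string>): string[] {
  return Object.keys(pinned)
    .filter((name) => manifest[name] !== pinned[name])
    .sort();
}

// Usage with made-up versions: only the bumped grammar is reported.
const report = staleGrammars(
  { "tree-sitter-go": "0.25.0", "tree-sitter-rust": "0.23.0" },
  { "tree-sitter-go": "0.25.0", "tree-sitter-rust": "0.24.0" },
);
console.log(report);
```

A `prepublishOnly` hook built on checks like these would exit non-zero whenever either function reports a problem, which is what turns a forgotten rebuild into a loud failure.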
diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index d453c800..5e26b6b4 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -313,48 +313,6 @@ importers: spdx-correct: specifier: ^3.2.0 version: 3.2.0 - tree-sitter: - specifier: 0.25.0 - version: 0.25.0 - tree-sitter-c: - specifier: 0.24.1 - version: 0.24.1(tree-sitter@0.25.0) - tree-sitter-c-sharp: - specifier: 0.23.5 - version: 0.23.5(tree-sitter@0.25.0) - tree-sitter-cpp: - specifier: 0.23.4 - version: 0.23.4(tree-sitter@0.25.0) - tree-sitter-go: - specifier: 0.25.0 - version: 0.25.0(tree-sitter@0.25.0) - tree-sitter-java: - specifier: 0.23.5 - version: 0.23.5(tree-sitter@0.25.0) - tree-sitter-javascript: - specifier: 0.25.0 - version: 0.25.0(tree-sitter@0.25.0) - tree-sitter-kotlin: - specifier: 0.3.8 - version: 0.3.8(tree-sitter@0.25.0) - tree-sitter-php: - specifier: 0.24.2 - version: 0.24.2(tree-sitter@0.25.0) - tree-sitter-python: - specifier: 0.25.0 - version: 0.25.0(tree-sitter@0.25.0) - tree-sitter-ruby: - specifier: 0.23.1 - version: 0.23.1(tree-sitter@0.25.0) - tree-sitter-rust: - specifier: 0.24.0 - version: 0.24.0(tree-sitter@0.25.0) - tree-sitter-swift: - specifier: 0.7.1 - version: 0.7.1(tree-sitter@0.25.0) - tree-sitter-typescript: - specifier: 0.23.2 - version: 0.23.2(tree-sitter@0.25.0) web-tree-sitter: specifier: 0.26.8 version: 0.26.8 @@ -4561,23 +4519,12 @@ packages: node-addon-api@6.1.0: resolution: {integrity: sha512-+eawOlIgy680F0kBzPUNFhMZGtJ1YmqM6l4+Crf4IkImjYrO/mqPwRMh352g23uIaQKFItcQ64I7KMaJxHgAVA==} - node-addon-api@7.1.1: - resolution: {integrity: sha512-5m3bsyrjFWE1xf7nz7YXdN4udnVtXK6/Yfgn5qnahL6bCkf2yKt4k3nuTKAtT4r3IG8JNR2ncsIMdZuAzJjHQQ==} - - node-addon-api@8.7.0: - resolution: {integrity: sha512-9MdFxmkKaOYVTV+XVRG8ArDwwQ77XIgIPyKASB1k3JPq3M8fGQQQE3YpMOrKm6g//Ktx8ivZr8xo1Qmtqub+GA==} - engines: {node: ^18 || ^20 || >= 21} - node-api-headers@1.8.0: resolution: {integrity: sha512-jfnmiKWjRAGbdD1yQS28bknFM1tbHC1oucyuMPjmkEs+kpiu76aRs40WlTmBmyEgzDM76ge1DQ7XJ3R5deiVjQ==} 
node-fetch-native@1.6.7: resolution: {integrity: sha512-g9yhqoedzIUm0nTnTqAQvueMPVOuIY16bqgAJJC8XOOubYFNwz6IER9qs0Gq2Xd0+CecCKFjtdDTMA4u4xG06Q==} - node-gyp-build@4.8.4: - resolution: {integrity: sha512-LA4ZjwlnUblHVgq0oBF3Jl/6h/Nvs5fzBLwdEF4nuxnFdsfajde4WfxtJr3CaiH+F6ewcIB/q4jQ4UzPyid+CQ==} - hasBin: true - node-mock-http@1.0.4: resolution: {integrity: sha512-8DY+kFsDkNXy1sJglUfuODx1/opAGJGyrTuFqEoN90oRc2Vk0ZbD4K2qmKXBBEhZQzdKHIVfEJpDU8Ak2NJEvQ==} @@ -5466,136 +5413,6 @@ packages: resolution: {integrity: sha512-o5sSPKEkg/DIQNmH43V0/uerLrpzVedkUh8tGNvaeXpfpuwjKenlSox/2O/BTlZUtEe+JG7s5YhEz608PlAHRA==} engines: {node: '>=0.6'} - tree-sitter-c-sharp@0.23.5: - resolution: {integrity: sha512-xJGOeXPMmld0nES5+080N/06yY6LQi+KWGWV4LfZaZe6srJPtUtfhIbRSN7EZN6IaauzW28v6W4QHFwmeUW6HQ==} - peerDependencies: - tree-sitter: ^0.25.0 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-c@0.23.6: - resolution: {integrity: sha512-0dxXKznVyUA0s6PjNolJNs2yF87O5aL538A/eR6njA5oqX3C3vH4vnx3QdOKwuUdpKEcFdHuiDpRKLLCA/tjvQ==} - peerDependencies: - tree-sitter: ^0.22.1 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-c@0.24.1: - resolution: {integrity: sha512-lkYwWN3SRecpvaeqmFKkuPNR3ZbtnvHU+4XAEEkJdrp3JfSp2pBrhXOtvfsENUneye76g889Y0ddF2DM0gEDpA==} - peerDependencies: - tree-sitter: ^0.22.4 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-cli@0.23.2: - resolution: {integrity: sha512-kPPXprOqREX+C/FgUp2Qpt9jd0vSwn+hOgjzVv/7hapdoWpa+VeWId53rf4oNNd29ikheF12BYtGD/W90feMbA==} - engines: {node: '>=12.0.0'} - hasBin: true - - tree-sitter-cpp@0.23.4: - resolution: {integrity: sha512-qR5qUDyhZ5jJ6V8/umiBxokRbe89bCGmcq/dk94wI4kN86qfdV8k0GHIUEKaqWgcu42wKal5E97LKpLeVW8sKw==} - peerDependencies: - tree-sitter: ^0.21.1 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-go@0.25.0: - resolution: {integrity: sha512-APBc/Dq3xz/e35Xpkhb1blu5UgW+2E3RyGWawZSCNcbGwa7jhSQPS8KsUupuzBla8PCo8+lz9W/JDJjmfRa2tw==} - peerDependencies: - 
tree-sitter: ^0.25.0 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-java@0.23.5: - resolution: {integrity: sha512-Yju7oQ0Xx7GcUT01mUglPP+bYfvqjNCGdxqigTnew9nLGoII42PNVP3bHrYeMxswiCRM0yubWmN5qk+zsg0zMA==} - peerDependencies: - tree-sitter: ^0.21.1 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-javascript@0.23.1: - resolution: {integrity: sha512-/bnhbrTD9frUYHQTiYnPcxyHORIw157ERBa6dqzaKxvR/x3PC4Yzd+D1pZIMS6zNg2v3a8BZ0oK7jHqsQo9fWA==} - peerDependencies: - tree-sitter: ^0.21.1 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-javascript@0.25.0: - resolution: {integrity: sha512-1fCbmzAskZkxcZzN41sFZ2br2iqTYP3tKls1b/HKGNPQUVOpsUxpmGxdN/wMqAk3jYZnYBR1dd/y/0avMeU7dw==} - peerDependencies: - tree-sitter: ^0.25.0 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-kotlin@0.3.8: - resolution: {integrity: sha512-A4obq6bjzmYrA+F0JLLoheFPcofFkctNaZSpnDd+GPn1SfVZLY4/GG4C0cYVBTOShuPBGGAOPLM1JWLZQV4m1g==} - peerDependencies: - tree-sitter: ^0.21.0 - tree_sitter: '*' - peerDependenciesMeta: - tree_sitter: - optional: true - - tree-sitter-php@0.24.2: - resolution: {integrity: sha512-zwgAePc/HozNaWOOfwRAA+3p8yhuehRw8Fb7vn5qd2XjiIc93uJPryDTMYTSjBRjVIUg/KY6pM3rRzs8dSwKfw==} - peerDependencies: - tree-sitter: ^0.22.4 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-python@0.25.0: - resolution: {integrity: sha512-eCmJx6zQa35GxaCtQD+wXHOhYqBxEL+bp71W/s3fcDMu06MrtzkVXR437dRrCrbrDbyLuUDJpAgycs7ncngLXw==} - peerDependencies: - tree-sitter: ^0.25.0 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-ruby@0.23.1: - resolution: {integrity: sha512-d9/RXgWjR6HanN7wTYhS5bpBQLz1VkH048Vm3CodPGyJVnamXMGb8oEhDypVCBq4QnHui9sTXuJBBP3WtCw5RA==} - peerDependencies: - tree-sitter: ^0.21.1 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-rust@0.24.0: - resolution: {integrity: 
sha512-NWemUDf629Tfc90Y0Z55zuwPCAHkLxWnMf2RznYu4iBkkrQl2o/CHGB7Cr52TyN5F1DAx8FmUnDtCy9iUkXZEQ==} - peerDependencies: - tree-sitter: ^0.22.1 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter-swift@0.7.1: - resolution: {integrity: sha512-pneKVTuGamaBsqqqfB9BvNQjktzh/0IVPR54jLB5Fq/JTDQwYHd0Wo6pVyZ5jAYpbztzq+rJ/rpL9ruxTmSoKw==} - peerDependencies: - tree-sitter: ^0.22.1 - tree_sitter: '*' - peerDependenciesMeta: - tree_sitter: - optional: true - - tree-sitter-typescript@0.23.2: - resolution: {integrity: sha512-e04JUUKxTT53/x3Uq1zIL45DoYKVfHH4CZqwgZhPg5qYROl5nQjV+85ruFzFGZxu+QeFVbRTPDRnqL9UbU4VeA==} - peerDependencies: - tree-sitter: ^0.21.0 - peerDependenciesMeta: - tree-sitter: - optional: true - - tree-sitter@0.25.0: - resolution: {integrity: sha512-PGZZzFW63eElZJDe/b/R/LbsjDDYJa5UEjLZJB59RQsMX+fo0j54fqBPn1MGKav/QNa0JR0zBiVaikYDWCj5KQ==} - treeify@1.1.0: resolution: {integrity: sha512-1m4RA7xVAJrSGrrXGs0L3YTwyvBs2S8PbRHaLZAkFw7JR8oIFwYtysxlBZhYIa7xSyiYJKZ3iGrrk55cGA3i9A==} engines: {node: '>=0.6'} @@ -10726,16 +10543,10 @@ snapshots: node-addon-api@6.1.0: {} - node-addon-api@7.1.1: {} - - node-addon-api@8.7.0: {} - node-api-headers@1.8.0: {} node-fetch-native@1.6.7: {} - node-gyp-build@4.8.4: {} - node-mock-http@1.0.4: {} nopt@7.2.1: @@ -11913,120 +11724,6 @@ snapshots: toidentifier@1.0.1: {} - tree-sitter-c-sharp@0.23.5(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-c@0.23.6(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-c@0.24.1(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-cli@0.23.2: {} - - tree-sitter-cpp@0.23.4(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - tree-sitter-c: 0.23.6(tree-sitter@0.25.0) - 
optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-go@0.25.0(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-java@0.23.5(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-javascript@0.23.1(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-javascript@0.25.0(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-kotlin@0.3.8(tree-sitter@0.25.0): - dependencies: - node-addon-api: 7.1.1 - node-gyp-build: 4.8.4 - tree-sitter: 0.25.0 - - tree-sitter-php@0.24.2(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-python@0.25.0(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-ruby@0.23.1(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-rust@0.24.0(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter-swift@0.7.1(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - tree-sitter: 0.25.0 - tree-sitter-cli: 0.23.2 - which: 2.0.2 - - tree-sitter-typescript@0.23.2(tree-sitter@0.25.0): - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - tree-sitter-javascript: 0.23.1(tree-sitter@0.25.0) - optionalDependencies: - tree-sitter: 0.25.0 - - tree-sitter@0.25.0: - dependencies: - node-addon-api: 8.7.0 - node-gyp-build: 4.8.4 - treeify@1.1.0: {} trim-lines@3.0.1: {} @@ -12399,20 
+12096,6 @@ time: starlight-links-validator@0.24.0: '2026-04-27T14:31:44.236Z' starlight-llms-txt@0.10.0: '2026-05-14T09:22:12.691Z' starlight-page-actions@0.6.0: '2026-04-21T18:20:44.562Z' - tree-sitter-c-sharp@0.23.5: '2026-04-14T18:41:16.033Z' - tree-sitter-c@0.24.1: '2025-05-24T17:34:07.761Z' - tree-sitter-cpp@0.23.4: '2024-11-11T07:01:21.402Z' - tree-sitter-go@0.25.0: '2025-08-29T06:22:53.571Z' - tree-sitter-java@0.23.5: '2024-12-21T18:26:25.436Z' - tree-sitter-javascript@0.25.0: '2025-09-01T07:14:24.995Z' - tree-sitter-kotlin@0.3.8: '2024-08-03T00:17:00.262Z' - tree-sitter-php@0.24.2: '2025-08-18T05:16:20.921Z' - tree-sitter-python@0.25.0: '2025-09-11T06:49:18.276Z' - tree-sitter-ruby@0.23.1: '2024-11-11T04:53:23.482Z' - tree-sitter-rust@0.24.0: '2025-04-01T21:07:14.922Z' - tree-sitter-swift@0.7.1: '2025-06-23T05:02:54.301Z' - tree-sitter-typescript@0.23.2: '2024-11-11T02:39:43.600Z' - tree-sitter@0.25.0: '2025-06-02T17:59:04.597Z' ts-morph@28.0.0: '2026-04-12T18:30:27.612Z' tsx@4.21.0: '2025-11-30T15:56:09.488Z' typescript@6.0.3: '2026-04-16T23:38:27.905Z' diff --git a/pnpm-workspace.yaml b/pnpm-workspace.yaml index 48a49417..01bf3fb4 100644 --- a/pnpm-workspace.yaml +++ b/pnpm-workspace.yaml @@ -62,22 +62,6 @@ allowBuilds: lefthook: true onnxruntime-node: true sharp: true - # Tree-sitter core + language grammars - tree-sitter: true - tree-sitter-c: true - tree-sitter-c-sharp: true - tree-sitter-cli: true - tree-sitter-cpp: true - # Kotlin + Dart have NO prebuilds and require a C/C++ toolchain to build. - # Swift ships prebuilds but runs a postinstall node-gyp rebuild (~30s once). 
- tree-sitter-dart: true - tree-sitter-go: true - tree-sitter-java: true - tree-sitter-javascript: true - tree-sitter-kotlin: true - tree-sitter-php: true - tree-sitter-python: true - tree-sitter-ruby: true - tree-sitter-rust: true - tree-sitter-swift: true - tree-sitter-typescript: true + # No tree-sitter entries: native tree-sitter and grammar packages are not + # workspace dependencies. WASMs are vendored at packages/ingestion/vendor/wasms/. + # Re-vendor on demand via the workflow in scripts/build-vendor-wasms.sh. diff --git a/scripts/build-vendor-wasms.sh b/scripts/build-vendor-wasms.sh index 4e281bfb..2ae0a3b7 100755 --- a/scripts/build-vendor-wasms.sh +++ b/scripts/build-vendor-wasms.sh @@ -1,11 +1,32 @@ #!/usr/bin/env bash -# Rebuild the 3 vendored tree-sitter WASM grammars (kotlin, swift, dart) -# from the currently-installed grammar packages under node_modules. +# Re-vendor tree-sitter grammar WASMs into packages/ingestion/vendor/wasms/. # -# Requires one of: docker, podman, finch (symlinked or aliased as `docker`), -# or a local emcc install, plus tree-sitter-cli (installed by `pnpm install`). +# Native tree-sitter and the 15 grammar packages are NOT workspace deps — +# they're installed on demand for vendoring only. Before running: # -# Outputs to packages/ingestion/vendor/wasms/tree-sitter-.wasm. +# 1. Add the grammar packages you want to re-vendor as devDependencies +# to packages/ingestion/package.json (along with `tree-sitter` and +# `tree-sitter-cli` if you're rebuilding kotlin/swift/dart). +# 2. Run `pnpm install`. +# 3. Run this script. +# 4. Commit the updated wasms + manifest.json. +# 5. `pnpm rm` the grammar devDeps you added in step 1. +# +# Two strategies inside this script: +# +# 1. cp from node_modules/.pnpm/ (12 grammars that ship a .wasm in their +# published npm tarball: typescript, tsx, javascript, python, go, rust, +# java, c-sharp, ruby, c, cpp, php). +# +# 2. 
tree-sitter build --wasm (3 grammars whose npm tarball ships only C +# sources: kotlin, swift, dart). Requires docker/podman/finch (aliased +# as `docker`) or a local emcc install. +# +# A vendor/wasms/manifest.json file records the grammar version each .wasm +# was built against. The packages/ingestion/scripts/verify-vendor-wasms.mjs +# script (run as `prepublishOnly`) asserts the manifest matches the versions +# in packages/ingestion/package.json (or, when grammar deps are absent, +# accepts the manifest as the source of truth). # # Usage: bash scripts/build-vendor-wasms.sh # @@ -13,38 +34,170 @@ set -euo pipefail REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)" OUT_DIR="$REPO_ROOT/packages/ingestion/vendor/wasms" +PNPM_DIR="$REPO_ROOT/node_modules/.pnpm" TREE_SITTER_BIN="$REPO_ROOT/node_modules/.pnpm/node_modules/.bin/tree-sitter" - -if [[ ! -x "$TREE_SITTER_BIN" ]]; then - echo "error: tree-sitter CLI not found at $TREE_SITTER_BIN — run 'pnpm install' first" >&2 - exit 1 -fi +INGESTION_PJ="$REPO_ROOT/packages/ingestion/package.json" mkdir -p "$OUT_DIR" +# Read the version of from packages/ingestion/package.json devDeps OR deps. +# Strips ^/~/= prefixes. +read_pj_version() { + local pkg="$1" + node -e " + const pj = require('$INGESTION_PJ'); + const v = (pj.dependencies && pj.dependencies['$pkg']) || + (pj.devDependencies && pj.devDependencies['$pkg']); + if (!v) { process.exit(1); } + process.stdout.write(String(v).replace(/^[\^~=]/, '')); + " +} + +# Copy the .wasm shipped inside an npm tarball. +# locate_pkg -> echoes node_modules/.pnpm/@.../node_modules/ dir +locate_pkg() { + local pkg="$1" + local v="$2" + find "$PNPM_DIR" -maxdepth 4 \ + -path "*${pkg}@${v}*/node_modules/${pkg}" \ + -type d \ + | head -1 +} + +cp_wasm() { + local pkg="$1" # e.g. tree-sitter-typescript + local out_name="$2" # e.g. 
tree-sitter-typescript.wasm + local v + v="$(read_pj_version "$pkg")" + local d + d="$(locate_pkg "$pkg" "$v")" + if [[ -z "$d" ]]; then + echo "error: could not locate installed grammar for ${pkg}@${v}" >&2 + exit 1 + fi + local src="$d/$out_name" + if [[ ! -f "$src" ]]; then + echo "error: ${pkg}@${v} does not ship $out_name at $src" >&2 + exit 1 + fi + cp "$src" "$OUT_DIR/$out_name" + echo " -> $OUT_DIR/$out_name (cp from ${pkg}@${v})" +} + build_one() { local lang="$1" local pkg="$2" + local out_path="$OUT_DIR/tree-sitter-${lang}.wasm" + + local v + if ! v="$(read_pj_version "$pkg" 2>/dev/null)" || [[ -z "$v" ]]; then + # Grammar isn't declared in packages/ingestion/package.json. The vendored + # wasm exists historically and isn't auto-rebuildable from npm. Preserve. + if [[ -f "$out_path" ]]; then + echo " -> $out_path (kept; ${pkg} is not pinned in package.json — vendored historically)" + return 0 + fi + echo "error: ${pkg} not in package.json and no vendored wasm at $out_path" >&2 + exit 1 + fi + local grammar_dir - grammar_dir=$(find "$REPO_ROOT/node_modules/.pnpm" -maxdepth 4 -path "*${pkg}*/node_modules/${pkg}" -type d | head -1) + grammar_dir="$(locate_pkg "$pkg" "$v")" if [[ -z "$grammar_dir" ]]; then - echo "error: could not locate installed grammar for $pkg" >&2 + echo "error: could not locate installed grammar for ${pkg}@${v}" >&2 + exit 1 + fi + + if [[ ! -x "$TREE_SITTER_BIN" ]]; then + echo "error: tree-sitter CLI not found at $TREE_SITTER_BIN — run 'pnpm install' first" >&2 exit 1 fi local work_dir - work_dir=$(mktemp -d) + work_dir="$(mktemp -d)" trap "rm -rf $work_dir" EXIT cp -r "$grammar_dir"/* "$work_dir/" - echo "==> building $lang from $grammar_dir" - ( cd "$work_dir" && "$TREE_SITTER_BIN" build --wasm -d -o "$OUT_DIR/tree-sitter-${lang}.wasm" . ) - echo " -> $OUT_DIR/tree-sitter-${lang}.wasm" + echo " -> building $lang from ${pkg}@${v}" + if ( cd "$work_dir" && "$TREE_SITTER_BIN" build --wasm -d -o "$out_path" . 
) 2>/tmp/build-vendor-wasms-err.log; then + echo " -> $out_path" + else + if [[ -f "$out_path" ]]; then + echo " -> $out_path (build failed; existing vendored wasm preserved)" + echo " (toolchain not available — install one of: emcc, docker, podman, finch — to rebuild)" + else + cat /tmp/build-vendor-wasms-err.log >&2 + echo "error: cannot build $lang and no vendored wasm exists at $out_path" >&2 + exit 1 + fi + fi } +echo "==> 12 grammars that ship .wasm in their npm tarball — cp" +cp_wasm tree-sitter-typescript tree-sitter-typescript.wasm +cp_wasm tree-sitter-typescript tree-sitter-tsx.wasm +cp_wasm tree-sitter-javascript tree-sitter-javascript.wasm +cp_wasm tree-sitter-python tree-sitter-python.wasm +cp_wasm tree-sitter-go tree-sitter-go.wasm +cp_wasm tree-sitter-rust tree-sitter-rust.wasm +cp_wasm tree-sitter-java tree-sitter-java.wasm +cp_wasm tree-sitter-c-sharp tree-sitter-c_sharp.wasm +cp_wasm tree-sitter-c tree-sitter-c.wasm +cp_wasm tree-sitter-cpp tree-sitter-cpp.wasm +cp_wasm tree-sitter-ruby tree-sitter-ruby.wasm +cp_wasm tree-sitter-php tree-sitter-php_only.wasm + +echo +echo "==> 3 grammars without prebuilt .wasm — tree-sitter build --wasm" build_one kotlin tree-sitter-kotlin build_one swift tree-sitter-swift build_one dart tree-sitter-dart echo -echo "Done. 
git diff to see updated vendor/wasms/*.wasm" +echo "==> web-tree-sitter runtime wasm — cp" +WTS_DIR="$(find "$PNPM_DIR" -maxdepth 4 -path '*web-tree-sitter@*/node_modules/web-tree-sitter' -type d | head -1)" +if [[ -z "$WTS_DIR" ]]; then + echo "error: could not locate installed web-tree-sitter package" >&2 + exit 1 +fi +cp "$WTS_DIR/web-tree-sitter.wasm" "$OUT_DIR/web-tree-sitter.wasm" +echo " -> $OUT_DIR/web-tree-sitter.wasm" + +echo +echo "==> writing vendor/wasms/manifest.json" +node -e " + const fs = require('fs'); + const pj = require('$INGESTION_PJ'); + const root = require('$REPO_ROOT/package.json'); + const all = { + ...(pj.dependencies||{}), + ...(pj.devDependencies||{}), + ...(root.dependencies||{}), + ...(root.devDependencies||{}), + }; + const grammars = {}; + const names = [ + 'tree-sitter','tree-sitter-typescript','tree-sitter-javascript', + 'tree-sitter-python','tree-sitter-go','tree-sitter-rust','tree-sitter-java', + 'tree-sitter-c-sharp','tree-sitter-c','tree-sitter-cpp','tree-sitter-ruby', + 'tree-sitter-php','tree-sitter-kotlin','tree-sitter-swift', + 'web-tree-sitter', + ]; + for (const n of names) { + if (all[n]) grammars[n] = String(all[n]).replace(/^[\^~=]/, ''); + } + // tree-sitter-dart is vendored historically (no upstream npm publish). + // Record the vendored-historically marker so verify-vendor-wasms.mjs + // doesn't false-fail on it. + grammars['tree-sitter-dart'] = 'vendored-historically'; + const manifest = { + schema: 'opencodehub.vendor-wasms.v1', + description: 'Versions the .wasm files in this directory were built/copied from. Verified at prepublish.', + grammars, + }; + fs.writeFileSync('$OUT_DIR/manifest.json', JSON.stringify(manifest, null, 2) + '\n'); + console.log(' -> ' + '$OUT_DIR/manifest.json'); +" + +echo +echo "Done. ls $OUT_DIR/*.wasm | wc -l should be 16 (15 grammars + web-tree-sitter)." 
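Note on the closing message of `scripts/build-vendor-wasms.sh`: the script leaves the final `.wasm` count for the reader to check by hand. The sketch below shows one way that count could be automated. It assumes a hypothetical helper named `check_wasm_count`, which is not part of this diff, and takes its default of 16 from the script's last `echo` (15 grammars plus the web-tree-sitter runtime).

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT part of the diff above.
# check_wasm_count DIR [EXPECTED]
#   Succeeds only when DIR directly contains exactly EXPECTED *.wasm files.
#   EXPECTED defaults to 16, the figure quoted in the script's final message.
check_wasm_count() {
  local dir="$1"
  local expected="${2:-16}"
  local count
  # `tr` strips the padding that BSD `wc -l` adds around the number.
  count="$(find "$dir" -maxdepth 1 -type f -name '*.wasm' | wc -l | tr -d '[:space:]')"
  if [ "$count" -ne "$expected" ]; then
    echo "error: expected $expected .wasm files in $dir, found $count" >&2
    return 1
  fi
  echo "ok: $count .wasm files in $dir"
}
```

If adopted, it could be called as `check_wasm_count "$OUT_DIR"` just before the final "Done." line, so that a partial re-vendor fails loudly instead of relying on a manual `ls | wc -l`.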
diff --git a/scripts/verify-global-install.sh b/scripts/verify-global-install.sh new file mode 100755 index 00000000..fde4db08 --- /dev/null +++ b/scripts/verify-global-install.sh @@ -0,0 +1,314 @@ +#!/usr/bin/env bash +# scripts/verify-global-install.sh — single-cell verifier for the +# bulletproof-npm-install matrix (planning/bulletproof-npm-install/plan.md +# §Verification Criteria). +# +# Runs ONE matrix cell — `npm install -g ` (or `@opencodehub/cli@rc`) +# in the current shell, applies the 5 hard gates, and runs the 4 smoke +# commands. The 9-cell fan-out is the responsibility of the caller — +# `.github/workflows/verify-global-install.yml` supplies one cell per +# matrix entry; a developer running this directly verifies their current +# environment. +# +# Usage: +# bash scripts/verify-global-install.sh [local|rc] +# +# Modes: +# local (default) pack every publishable (non-private) workspace +# package with `pnpm pack`, install all tarballs globally with npm. +# rc install `@opencodehub/cli@rc` from the public registry. +# Used by post-publish smoke jobs; no packing happens. +# +# Environment: +# INSTALLER informational label printed in the summary +# (mise|nvm|homebrew|volta — the workflow sets this). +# TARBALL_DIR where to drop packed tarballs in local mode +# (default: /tmp/opencodehub-tarballs). +# FIXTURE_DIR path passed to `codehub analyze` (default: +# tests/fixtures/multi-lang). +# MAX_INSTALL_SECS hard upper bound on install wall time +# (default: 60). +# +# Exit codes: +# 0 every gate passed +# 1 one or more gates failed (details in the per-gate PASS/FAIL log) +# +# Idempotent: cleans the global install on entry and on EXIT. +# +# This script does NOT publish anything. RC mode assumes the tag already +# exists. Publishing remains release-please's job.
+ +set -euo pipefail + +MODE="${1:-local}" +INSTALLER="${INSTALLER:-unknown}" +TARBALL_DIR="${TARBALL_DIR:-/tmp/opencodehub-tarballs}" +FIXTURE_DIR="${FIXTURE_DIR:-tests/fixtures/multi-lang}" +MAX_INSTALL_SECS="${MAX_INSTALL_SECS:-60}" + +ROOT="$(cd "$(dirname "$0")/.." && pwd)" +cd "$ROOT" + +PASS_COUNT=0 +FAIL_COUNT=0 +SUMMARY=() + +# -------------------------------------------------------------------- log helpers +log() { printf '[verify-global-install] %s\n' "$*"; } +pass() { PASS_COUNT=$((PASS_COUNT + 1)); SUMMARY+=("[PASS] $1"); printf ' [PASS] %s\n' "$1"; } +fail() { FAIL_COUNT=$((FAIL_COUNT + 1)); SUMMARY+=("[FAIL] $1"); printf ' [FAIL] %s\n' "$1" >&2; } +note() { printf ' ...... %s\n' "$1"; } + +# -------------------------------------------------------------------- cleanup +# shellcheck disable=SC2329 # invoked indirectly via the EXIT trap below. +cleanup() { + # Drop the global install we created so re-runs are idempotent. Errors + # are tolerated — the install may have failed before the binary landed. + npm uninstall -g @opencodehub/cli @opencodehub/ingestion >/dev/null 2>&1 || true + if [ "$MODE" = "local" ] && [ -d "$TARBALL_DIR" ]; then + rm -rf "$TARBALL_DIR" + fi +} +trap cleanup EXIT + +# -------------------------------------------------------------------- preflight +log "mode=$MODE installer=$INSTALLER node=$(node --version 2>/dev/null || echo missing) npm=$(npm --version 2>/dev/null || echo missing)" +log "fixture=$FIXTURE_DIR root=$ROOT" + +if ! command -v npm >/dev/null 2>&1; then + fail "npm is not on PATH" + exit 1 +fi +if ! command -v node >/dev/null 2>&1; then + fail "node is not on PATH" + exit 1 +fi + +# Fresh slate before install — strip any residual global package. +npm uninstall -g @opencodehub/cli @opencodehub/ingestion >/dev/null 2>&1 || true + +# -------------------------------------------------------------------- pack (local mode) +INSTALL_ARGS=() +if [ "$MODE" = "local" ]; then + if ! 
command -v pnpm >/dev/null 2>&1; then + fail "pnpm required for local mode (mise / pnpm/action-setup should provide it)" + exit 1 + fi + mkdir -p "$TARBALL_DIR" + log "packing all publishable @opencodehub/* workspace packages into $TARBALL_DIR" + # Pack every non-private workspace package so npm doesn't fall back to + # registry versions for transitive workspace deps. The CLI depends on + # @opencodehub/pack which depends on @opencodehub/ingestion etc — if + # only cli + ingestion ship locally, npm pulls older pack@ + # which pins an older ingestion@, which still drags native + # tree-sitter and breaks the install. Local-mode must mirror what + # release-please publishes simultaneously. + WORKSPACE_TARBALLS=() + while IFS= read -r pj; do + is_private=$(node -e "process.stdout.write(String(JSON.parse(require('node:fs').readFileSync(process.argv[1],'utf8')).private||false))" "$pj") + if [ "$is_private" = "true" ]; then continue; fi + pkg_dir=$(dirname "$pj") + pnpm pack -C "$pkg_dir" --pack-destination "$TARBALL_DIR" >/dev/null + done < <(find "$ROOT/packages" -maxdepth 2 -name package.json) + + # Order matters: install ingestion + every package that depends on it + # before cli, so the cli's workspace deps resolve to the local tarballs. 
+ while IFS= read -r tgz; do WORKSPACE_TARBALLS+=("$tgz"); done < <(find "$TARBALL_DIR" -maxdepth 1 -name 'opencodehub-*.tgz' -print | sort) + + if [ "${#WORKSPACE_TARBALLS[@]}" -eq 0 ]; then + fail "expected packed tarballs in $TARBALL_DIR" + exit 1 + fi + log "packed ${#WORKSPACE_TARBALLS[@]} workspace tarballs" + INSTALL_ARGS=(--foreground-scripts "${WORKSPACE_TARBALLS[@]}") +elif [ "$MODE" = "rc" ]; then + INSTALL_ARGS=(--foreground-scripts "@opencodehub/cli@rc") +else + fail "unknown mode '$MODE' (expected: local | rc)" + exit 1 +fi + +# -------------------------------------------------------------------- install + capture +INSTALL_LOG=$(mktemp -t verify-global-install-log.XXXXXX) +log "running: npm install -g ${INSTALL_ARGS[*]}" + +INSTALL_START=$(date +%s) +INSTALL_RC=0 +# Capture both stdout + stderr (`2>&1`) so the gate greps see everything +# npm prints; the install itself runs unbuffered. +npm install -g "${INSTALL_ARGS[@]}" >"$INSTALL_LOG" 2>&1 || INSTALL_RC=$? +INSTALL_END=$(date +%s) +INSTALL_SECS=$((INSTALL_END - INSTALL_START)) +note "install exit=$INSTALL_RC duration=${INSTALL_SECS}s" + +# -------------------------------------------------------------------- gate 1: exit 0 + no npm ERR! +if [ "$INSTALL_RC" -eq 0 ]; then + if grep -qE 'npm (ERR|error)!' "$INSTALL_LOG"; then + fail "gate 1: install exited 0 but npm ERR! present in log" + note "first 20 ERR lines:" + grep -nE 'npm (ERR|error)!' "$INSTALL_LOG" | head -20 | sed 's/^/ /' >&2 || true + else + pass "gate 1: install exited 0, no npm ERR! 
lines" + fi +else + fail "gate 1: install exited $INSTALL_RC" + note "tail of install log:" + tail -30 "$INSTALL_LOG" | sed 's/^/ /' >&2 || true +fi + +# -------------------------------------------------------------------- gate 2: zero GHCR / tree-sitter-cli postinstall fetches +if grep -iE 'github\.com.*releases|tree-sitter-cli' "$INSTALL_LOG" >/dev/null; then + fail "gate 2: GHCR or tree-sitter-cli postinstall fetch detected" + note "matching lines:" + grep -niE 'github\.com.*releases|tree-sitter-cli' "$INSTALL_LOG" | head -10 | sed 's/^/ /' >&2 || true +else + pass "gate 2: zero GHCR / tree-sitter-cli postinstall fetches" +fi + +# -------------------------------------------------------------------- gate 3: zero ERESOLVE / peer-dep warnings +if grep -E 'ERESOLVE|peer dep' "$INSTALL_LOG" >/dev/null; then + fail "gate 3: ERESOLVE / peer-dep warning present" + note "matching lines:" + grep -nE 'ERESOLVE|peer dep' "$INSTALL_LOG" | head -10 | sed 's/^/ /' >&2 || true +else + pass "gate 3: zero ERESOLVE / peer-dep warnings" +fi + +# -------------------------------------------------------------------- gate 4: install under MAX_INSTALL_SECS +if [ "$INSTALL_SECS" -le "$MAX_INSTALL_SECS" ]; then + pass "gate 4: install completed in ${INSTALL_SECS}s (<= ${MAX_INSTALL_SECS}s)" +else + fail "gate 4: install took ${INSTALL_SECS}s (> ${MAX_INSTALL_SECS}s budget)" +fi + +# -------------------------------------------------------------------- gate 5: no banned lifecycle scripts in resolved graph +# The install graph lives under the global prefix. Walk every package.json +# under the @opencodehub/* trees and assert none ships wget/curl/download/ +# node-gyp rebuild/prebuild-install in any lifecycle script. +GLOBAL_PREFIX=$(npm root -g 2>/dev/null || true) +if [ -z "$GLOBAL_PREFIX" ] || [ ! 
-d "$GLOBAL_PREFIX" ]; then + fail "gate 5: could not resolve npm global prefix (got '$GLOBAL_PREFIX')" +else + BANNED_RE='wget|curl|download|node-gyp rebuild|prebuild-install' + BANNED_HITS=$(mktemp -t verify-global-install-banned.XXXXXX) + # Look at the @opencodehub trees' package.json + every transitive dep + # they pulled in. `npm ls -g --json` enumerates the resolved graph; we + # walk those directories' package.json files for lifecycle scripts. + RESOLVED_DIRS=$(node -e ' + const { execSync } = require("node:child_process"); + const out = execSync("npm ls -g --all --json", { encoding: "utf8", stdio: ["ignore", "pipe", "ignore"] }); + const tree = JSON.parse(out); + const dirs = new Set(); + function walk(n) { + if (!n) return; + if (n.path) dirs.add(n.path); + const deps = n.dependencies || {}; + for (const k of Object.keys(deps)) walk(deps[k]); + } + walk(tree); + process.stdout.write([...dirs].join("\n")); + ' 2>/dev/null || true) + + if [ -z "$RESOLVED_DIRS" ]; then + note "gate 5: npm ls -g produced no package list — falling back to filesystem walk" + # Portable across GNU + BSD find. `dirname` runs once per match thanks + # to `-exec ... \;`; the global tree is small (~1k pkgs at most), so + # the fork cost is negligible. + RESOLVED_DIRS=$(find "$GLOBAL_PREFIX" -maxdepth 4 -name package.json -exec dirname {} \; 2>/dev/null || true) + fi + + while IFS= read -r dir; do + [ -z "$dir" ] && continue + pkg="$dir/package.json" + [ -f "$pkg" ] || continue + # Extract lifecycle scripts as JSON, scan with a single regex. + # shellcheck disable=SC2016 # backticks/${...} inside JS template + # literals are not shell interpolations — they're JS string parts. 
+ HIT=$(node -e ' + const fs = require("node:fs"); + const p = JSON.parse(fs.readFileSync(process.argv[1], "utf8")); + const s = p.scripts || {}; + const hooks = ["preinstall", "install", "postinstall", "preuninstall", "uninstall", "postuninstall"]; + const out = []; + for (const h of hooks) { + if (typeof s[h] === "string" && /(wget|curl|download|node-gyp rebuild|prebuild-install)/.test(s[h])) { + out.push(`${p.name}@${p.version}: ${h}=${s[h]}`); + } + } + if (out.length) process.stdout.write(out.join("\n")); + ' "$pkg" 2>/dev/null || true) + if [ -n "$HIT" ]; then + printf '%s\n' "$HIT" >> "$BANNED_HITS" + fi + done <<< "$RESOLVED_DIRS" + + if [ -s "$BANNED_HITS" ]; then + fail "gate 5: banned lifecycle script(s) found in resolved graph" + note "matches (regex: $BANNED_RE):" + head -20 "$BANNED_HITS" | sed 's/^/ /' >&2 || true + else + pass "gate 5: no banned lifecycle scripts in resolved graph" + fi + rm -f "$BANNED_HITS" +fi + +# -------------------------------------------------------------------- early exit if install itself failed +# Smoke commands depend on a working binary; when install failed the +# `codehub` shim is missing and every smoke check would just compound the +# original failure. Skip them with a clear note instead. +if [ "$INSTALL_RC" -ne 0 ]; then + note "skipping smoke commands — install failed" +else + # ------------------------------------------------------------------ smoke: codehub --version + if codehub --version >/dev/null 2>&1; then + pass "smoke: codehub --version exits 0" + else + fail "smoke: codehub --version exited non-zero" + fi + + # ------------------------------------------------------------------ smoke: codehub --help + if codehub --help >/dev/null 2>&1; then + pass "smoke: codehub --help exits 0" + else + fail "smoke: codehub --help exited non-zero" + fi + + # ------------------------------------------------------------------ smoke: codehub analyze + if [ ! 
-d "$FIXTURE_DIR" ]; then + fail "smoke: fixture directory '$FIXTURE_DIR' missing" + else + if codehub analyze "$FIXTURE_DIR" >/dev/null 2>&1; then + pass "smoke: codehub analyze $FIXTURE_DIR exits 0" + else + fail "smoke: codehub analyze $FIXTURE_DIR exited non-zero" + fi + fi + + # ------------------------------------------------------------------ smoke: codehub query 'export default' + # The query phase exits 0 even on zero hits, so the gate is "1+ hits". + if [ -d "$FIXTURE_DIR" ]; then + QUERY_OUT=$(cd "$FIXTURE_DIR" && codehub query 'export default' 2>&1 || true) + if printf '%s' "$QUERY_OUT" | grep -qiE 'no results|0 results|0 hits|no matches'; then + fail "smoke: codehub query 'export default' returned no hits" + elif [ -n "$QUERY_OUT" ]; then + pass "smoke: codehub query 'export default' returned at least one hit" + else + fail "smoke: codehub query 'export default' returned empty output" + fi + fi +fi + +# -------------------------------------------------------------------- summary +echo +echo "=== verify-global-install summary (mode=$MODE installer=$INSTALLER) ===" +for line in "${SUMMARY[@]}"; do + printf ' %s\n' "$line" +done +echo " passed=$PASS_COUNT failed=$FAIL_COUNT" +echo " install_log=$INSTALL_LOG" +echo + +if [ "$FAIL_COUNT" -gt 0 ]; then + exit 1 +fi +exit 0 diff --git a/tests/fixtures/multi-lang/AGENTS.md b/tests/fixtures/multi-lang/AGENTS.md new file mode 100644 index 00000000..4dc8d882 --- /dev/null +++ b/tests/fixtures/multi-lang/AGENTS.md @@ -0,0 +1,17 @@ +## OpenCodeHub MCP Tools + +This repository has been indexed by OpenCodeHub. When you are working in this +codebase, prefer the following MCP tools over raw file search — they return +graph-aware results grouped by execution flow and include blast-radius risk +tiers. + +- `list_repos` — enumerate repos currently indexed on this machine. +- `query` — hybrid BM25 + vector search over symbols, grouped by process. +- `context` — inbound/outbound refs and participating flows for one symbol. 
+- `impact` — dependents of a target up to a configurable depth, with a risk tier. +- `detect_changes` — map an uncommitted or committed diff to affected symbols. +- `rename` — graph-assisted multi-file rename; dry-run is the default. +- `sql` — read-only SQL against the local graph store with a 5 s timeout. + +Run `codehub analyze` after pulling new commits so the index stays aligned +with the working tree. `codehub status` reports staleness. diff --git a/tests/fixtures/multi-lang/CLAUDE.md b/tests/fixtures/multi-lang/CLAUDE.md new file mode 100644 index 00000000..4dc8d882 --- /dev/null +++ b/tests/fixtures/multi-lang/CLAUDE.md @@ -0,0 +1,17 @@ +## OpenCodeHub MCP Tools + +This repository has been indexed by OpenCodeHub. When you are working in this +codebase, prefer the following MCP tools over raw file search — they return +graph-aware results grouped by execution flow and include blast-radius risk +tiers. + +- `list_repos` — enumerate repos currently indexed on this machine. +- `query` — hybrid BM25 + vector search over symbols, grouped by process. +- `context` — inbound/outbound refs and participating flows for one symbol. +- `impact` — dependents of a target up to a configurable depth, with a risk tier. +- `detect_changes` — map an uncommitted or committed diff to affected symbols. +- `rename` — graph-assisted multi-file rename; dry-run is the default. +- `sql` — read-only SQL against the local graph store with a 5 s timeout. + +Run `codehub analyze` after pulling new commits so the index stays aligned +with the working tree. `codehub status` reports staleness. diff --git a/tests/fixtures/multi-lang/README.md b/tests/fixtures/multi-lang/README.md new file mode 100644 index 00000000..299b1d94 --- /dev/null +++ b/tests/fixtures/multi-lang/README.md @@ -0,0 +1,12 @@ +# multi-lang fixture + +Tiny TS/Python/Go fixture (~10 LOC each) used by the install-matrix smoke +test in `.github/workflows/verify-global-install.yml`. 
The script +`scripts/verify-global-install.sh` runs `codehub analyze` against this +directory after a global tarball install and asserts that +`codehub query 'export default'` finds at least one hit (the `greet` +function in `greeter.ts`). + +Keep this fixture small. The matrix runs 9 cells across two OS classes; +analyze time multiplies. Do not add binaries, build artifacts, or large +generated files. diff --git a/tests/fixtures/multi-lang/greeter.go b/tests/fixtures/multi-lang/greeter.go new file mode 100644 index 00000000..7a00e358 --- /dev/null +++ b/tests/fixtures/multi-lang/greeter.go @@ -0,0 +1,13 @@ +// Minimal Go fixture for the install-matrix smoke test. +package multilang + +// Greeting is a localized greeting. +type Greeting struct { + Language string + Text string +} + +// Greet returns a greeting for the supplied name. +func Greet(name string) Greeting { + return Greeting{Language: "en", Text: "Hello, " + name + "!"} +} diff --git a/tests/fixtures/multi-lang/greeter.py b/tests/fixtures/multi-lang/greeter.py new file mode 100644 index 00000000..f18f7b61 --- /dev/null +++ b/tests/fixtures/multi-lang/greeter.py @@ -0,0 +1,13 @@ +"""Minimal Python fixture for the install-matrix smoke test.""" + +from dataclasses import dataclass + + +@dataclass +class Greeting: + language: str + text: str + + +def greet(name: str) -> Greeting: + return Greeting(language="en", text=f"Hello, {name}!") diff --git a/tests/fixtures/multi-lang/greeter.ts b/tests/fixtures/multi-lang/greeter.ts new file mode 100644 index 00000000..8742e2e4 --- /dev/null +++ b/tests/fixtures/multi-lang/greeter.ts @@ -0,0 +1,12 @@ +// Minimal TypeScript fixture for the install-matrix smoke test. +// Keep small — the smoke gate only needs `codehub analyze` + `codehub query` +// to find at least one `export default` hit somewhere in the fixture. 
+ +export interface Greeting { + language: string; + text: string; +} + +export default function greet(name: string): Greeting { + return { language: "en", text: `Hello, ${name}!` }; +}
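Note on `scripts/verify-global-install.sh`: the regression this PR documents is two versions of the same `@opencodehub/*` package landing in one install graph (ingestion@0.4.0 next to ingestion@0.3.2). The five gates catch that only indirectly, through the tree-sitter-cli fetch it causes. The sketch below shows how a direct check could look. It assumes a hypothetical helper named `find_duplicate_versions`, which is not part of this diff, and it assumes the caller can supply the resolved graph as one `name@version` spec per line; how that list is produced is deliberately left open.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT part of the diff above.
# find_duplicate_versions
#   Reads "name@version" specs from stdin, one per line, and prints every
#   package name that occurs with more than one distinct version.
find_duplicate_versions() {
  # sort -u collapses repeated identical specs, so the per-name counter
  # below counts distinct versions rather than occurrences.
  sort -u | awk '
    {
      at = 0
      # Split at the LAST "@" so scoped names (@scope/pkg) stay intact.
      # The scan stops at index 2, so a leading scope "@" is never used.
      for (i = length($0); i > 1; i--) {
        if (substr($0, i, 1) == "@") { at = i; break }
      }
      if (at == 0) next          # no version part on this line: skip it
      name = substr($0, 1, at - 1)
      versions[name]++
    }
    END {
      for (name in versions) if (versions[name] > 1) print name
    }
  ' | sort
}
```

One possible use, offered as a suggestion rather than something the script does today: feed it the resolved graph after gate 5 and fail the run when the output is non-empty, which would flag a registry fallback even if the stale copy happened to install cleanly.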