Skip to content

Fix 4 critical bugs causing near-zero recall#17

Closed
suykerbuyk wants to merge 1 commit into
coder:mainfrom
suykerbuyk:fix/recall-bugs
Closed

Fix 4 critical bugs causing near-zero recall#17
suykerbuyk wants to merge 1 commit into
coder:mainfrom
suykerbuyk:fix/recall-bugs

Conversation

@suykerbuyk
Copy link
Copy Markdown
Contributor

Summary

Fixes 4 distinct bugs that caused near-zero recall when using HNSW search on non-trivial datasets. These were discovered through integration testing with real document embeddings (TF-IDF over constitutional texts) and validated against brute-force ground truth.

Bugs fixed

  1. Heap.Max() and PopLast() returned wrong elements (heap/heap.go) — The min-heap's Max() returned the last array element, which is NOT the maximum in a binary min-heap. When evicting the "worst" candidate to make room for a better one, a random element was removed instead. Fixed by scanning leaf nodes (indices n/2 through n-1).

  2. Search termination was too aggressive (graph.go) — Terminated when no neighbor improved result.Min() (the best result), but the HNSW algorithm terminates when the best remaining candidate is farther than result.Max() (the worst result). This stopped exploration far too early.

  3. replenish() hardcoded CosineDistance (graph.go:168) — Neighbor repair after deletion always used CosineDistance regardless of the graph's configured distance function, corrupting topology for non-Cosine metrics.

  4. Stale elevator panic (graph.go) — When traversing layers top-to-bottom, the elevator node could reference a deleted node. Added existence checks with fallback to layer.entry(). Addresses Panic will appear when running this demo. #15.

Test additions

  • Heap tests: 1 → 11 tests, coverage 61% → 100%
  • Graph tests: 16 new tests (SearchWithDistance, Lookup, empty graph, delete with Euclidean/Cosine, stale elevator regression, dimension mismatch)
  • Distance/Encode tests: custom function registration, error paths, string key round-trip
  • Integration tests: constitution corpus with recall@k vs brute-force ground truth, delete+search resilience, export/import round-trip, benchmarks
  • Overall coverage: main 89.1%, heap 100%

Also adds a Makefile (make test for unit, make integration for full suite) and a constitution-search example CLI.

Test plan

  • go test -short ./... — all unit tests pass
  • go test ./... — all integration tests pass with recall thresholds met
  • go test -bench=. ./... — benchmarks run clean

🤖 Generated with Claude Code

This commit fixes the root causes of the 0-2/10 recall measured when
integrating coder/hnsw into vibe-palace's semantic search pipeline.
Four distinct bugs were identified and fixed:

1. Heap.Max() and PopLast() returned wrong elements (heap/heap.go)

   The min-heap's Max() returned data[len-1] (the last array element),
   which is NOT the maximum in a binary min-heap. PopLast() removed
   that same arbitrary element. This corrupted search result sets:
   when evicting the "worst" candidate to make room for a better one,
   a random element was removed instead. Fixed by scanning leaf nodes
   (indices n/2 through n-1) to find the true maximum.

2. Search termination condition was too aggressive (graph.go)

   The search algorithm terminated when no neighbor improved
   result.Min() (the BEST result), but the standard HNSW algorithm
   terminates when the best remaining CANDIDATE is farther than
   result.Max() (the WORST result). The old condition stopped
   exploring far too early, leaving most of the k result slots
   filled with suboptimal nodes. After this fix, recall on synthetic
   200-node/64-dim data jumped from ~0.46 to ~0.96.

3. replenish() hardcoded CosineDistance (graph.go:168)

   When a node was deleted, the neighbor repair function always used
   CosineDistance regardless of the graph's configured distance
   function. This corrupted graph topology for any non-Cosine metric.
   Fixed by threading DistanceFunc through the addNeighbor() ->
   replenish() -> isolate() -> Delete() call chain.

4. Stale elevator panic (graph.go Add/search methods)

   When traversing layers top-to-bottom, the elevator node (best
   entry point from the previous layer) could reference a node that
   had been deleted between layers. Added existence checks with
   fallback to layer.entry(). Addresses upstream issue coder#15.

Test embeddings: TF-IDF weighted feature hashing

The integration tests and sample CLI use a deterministic text embedding
based on the hashing trick with TF-IDF weighting. Each word hashes to
a dimension index (feature hashing), weighted by its inverse document
frequency. Stop words (~40 common English words) are removed so that
rare discriminating terms dominate the vector. Embedding dimension is
2048 to minimize hash collisions (~6400 unique terms in the corpus).

This produces meaningful lexical similarity — a query for "Arms" finds
the 2nd Amendment and Mexico's Article 10 (right to keep arms). This
is a test/demo utility only; vibe-palace uses real transformer
embeddings (all-MiniLM-L6-v2 via hugot) which are unaffected by this.

Test changes:
- Heap tests: 1 test -> 11 tests, coverage 61.1% -> 100%
- Graph tests: added 16 new tests covering SearchWithDistance,
  Lookup, empty graph ops, Dims, Len, AddReplace, delete with
  Euclidean/Cosine, stale elevator regression, dimension mismatch
- Distance tests: RegisterDistanceFunc, name roundtrip, unknown func
- Encode tests: error paths for bad version, unknown distance,
  truncated data, string key round-trip
- Integration tests: constitution corpus (US/Canada/Mexico/Amendments)
  with recall@k vs brute-force ground truth, delete+search,
  export/import round-trip, benchmarks
- Updated TestGraph_AddSearch expected results to reflect correct
  nearest neighbors after heap fix: {64,65,63,66} not {64,65,62,63}
- Overall coverage: main 89.1%, heap 100%

New files:
- integration_test.go: integration tests with TF-IDF feature hashing
- example/constitution-search/main.go: sample CLI application
- test-data/*.md: US Constitution, Amendments, Canada, Mexico
- Makefile: standard build facade (help, build, test, integration,
  bench, vet, check, install, clean)
@ammario
Copy link
Copy Markdown
Member

ammario commented Apr 9, 2026

I'm not interested in reviewing a generated PR this large. If you want to make this reviewable get it down to a few test cases showing the bugs and minimal fix.

@suykerbuyk
Copy link
Copy Markdown
Contributor Author

Closing to restructure into smaller, focused PRs as requested. Will resubmit as separate minimal PRs — one per bug fix with compact tests.

@suykerbuyk suykerbuyk closed this Apr 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants