Fix 4 critical bugs causing near-zero recall by suykerbuyk · Pull Request #17 · coder/hnsw

suykerbuyk · 2026-04-08T22:27:32Z

Summary

Fixes 4 distinct bugs that caused near-zero recall when using HNSW search on non-trivial datasets. These were discovered through integration testing with real document embeddings (TF-IDF over constitutional texts) and validated against brute-force ground truth.

Bugs fixed

Heap.Max() and PopLast() returned wrong elements (heap/heap.go) — The min-heap's Max() returned the last array element, which is NOT the maximum in a binary min-heap. When evicting the "worst" candidate to make room for a better one, a random element was removed instead. Fixed by scanning leaf nodes (indices n/2 through n-1).
Search termination was too aggressive (graph.go) — Terminated when no neighbor improved result.Min() (the best result), but the HNSW algorithm terminates when the best remaining candidate is farther than result.Max() (the worst result). This stopped exploration far too early.
replenish() hardcoded CosineDistance (graph.go:168) — Neighbor repair after deletion always used CosineDistance regardless of the graph's configured distance function, corrupting topology for non-Cosine metrics.
Stale elevator panic (graph.go) — When traversing layers top-to-bottom, the elevator node could reference a deleted node. Added existence checks with fallback to layer.entry(). Addresses Panic will appear when running this demo. #15.

Test additions

Heap tests: 1 → 11 tests, coverage 61% → 100%
Graph tests: 16 new tests (SearchWithDistance, Lookup, empty graph, delete with Euclidean/Cosine, stale elevator regression, dimension mismatch)
Distance/Encode tests: custom function registration, error paths, string key round-trip
Integration tests: constitution corpus with recall@k vs brute-force ground truth, delete+search resilience, export/import round-trip, benchmarks
Overall coverage: main 89.1%, heap 100%

Also adds a Makefile (make test for unit, make integration for full suite) and a constitution-search example CLI.

Test plan

go test -short ./... — all unit tests pass
go test ./... — all integration tests pass with recall thresholds met
go test -bench=. ./... — benchmarks run clean

🤖 Generated with Claude Code

This commit fixes the root causes of the 0-2/10 recall measured when integrating coder/hnsw into vibe-palace's semantic search pipeline. Four distinct bugs were identified and fixed: 1. Heap.Max() and PopLast() returned wrong elements (heap/heap.go) The min-heap's Max() returned data[len-1] (the last array element), which is NOT the maximum in a binary min-heap. PopLast() removed that same arbitrary element. This corrupted search result sets: when evicting the "worst" candidate to make room for a better one, a random element was removed instead. Fixed by scanning leaf nodes (indices n/2 through n-1) to find the true maximum. 2. Search termination condition was too aggressive (graph.go) The search algorithm terminated when no neighbor improved result.Min() (the BEST result), but the standard HNSW algorithm terminates when the best remaining CANDIDATE is farther than result.Max() (the WORST result). The old condition stopped exploring far too early, leaving most of the k result slots filled with suboptimal nodes. After this fix, recall on synthetic 200-node/64-dim data jumped from ~0.46 to ~0.96. 3. replenish() hardcoded CosineDistance (graph.go:168) When a node was deleted, the neighbor repair function always used CosineDistance regardless of the graph's configured distance function. This corrupted graph topology for any non-Cosine metric. Fixed by threading DistanceFunc through the addNeighbor() -> replenish() -> isolate() -> Delete() call chain. 4. Stale elevator panic (graph.go Add/search methods) When traversing layers top-to-bottom, the elevator node (best entry point from the previous layer) could reference a node that had been deleted between layers. Added existence checks with fallback to layer.entry(). Addresses upstream issue coder#15. Test embeddings: TF-IDF weighted feature hashing The integration tests and sample CLI use a deterministic text embedding based on the hashing trick with TF-IDF weighting. Each word hashes to a dimension index (feature hashing), weighted by its inverse document frequency. Stop words (~40 common English words) are removed so that rare discriminating terms dominate the vector. Embedding dimension is 2048 to minimize hash collisions (~6400 unique terms in the corpus). This produces meaningful lexical similarity — a query for "Arms" finds the 2nd Amendment and Mexico's Article 10 (right to keep arms). This is a test/demo utility only; vibe-palace uses real transformer embeddings (all-MiniLM-L6-v2 via hugot) which are unaffected by this. Test changes: - Heap tests: 1 test -> 11 tests, coverage 61.1% -> 100% - Graph tests: added 16 new tests covering SearchWithDistance, Lookup, empty graph ops, Dims, Len, AddReplace, delete with Euclidean/Cosine, stale elevator regression, dimension mismatch - Distance tests: RegisterDistanceFunc, name roundtrip, unknown func - Encode tests: error paths for bad version, unknown distance, truncated data, string key round-trip - Integration tests: constitution corpus (US/Canada/Mexico/Amendments) with recall@k vs brute-force ground truth, delete+search, export/import round-trip, benchmarks - Updated TestGraph_AddSearch expected results to reflect correct nearest neighbors after heap fix: {64,65,63,66} not {64,65,62,63} - Overall coverage: main 89.1%, heap 100% New files: - integration_test.go: integration tests with TF-IDF feature hashing - example/constitution-search/main.go: sample CLI application - test-data/*.md: US Constitution, Amendments, Canada, Mexico - Makefile: standard build facade (help, build, test, integration, bench, vet, check, install, clean)

ammario · 2026-04-09T15:51:58Z

I'm not interested in reviewing a generated PR this large. If you want to make this reviewable get it down to a few test cases showing the bugs and minimal fix.

suykerbuyk · 2026-04-11T18:25:12Z

Closing to restructure into smaller, focused PRs as requested. Will resubmit as separate minimal PRs — one per bug fix with compact tests.

suykerbuyk mentioned this pull request Apr 8, 2026

Fix search result set bounded by k instead of efSearch #18

Closed

3 tasks

suykerbuyk closed this Apr 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix 4 critical bugs causing near-zero recall#17

Fix 4 critical bugs causing near-zero recall#17
suykerbuyk wants to merge 1 commit into
coder:mainfrom
suykerbuyk:fix/recall-bugs

suykerbuyk commented Apr 8, 2026

Uh oh!

ammario commented Apr 9, 2026

Uh oh!

suykerbuyk commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

suykerbuyk commented Apr 8, 2026

Summary

Bugs fixed

Test additions

Test plan

Uh oh!

ammario commented Apr 9, 2026

Uh oh!

suykerbuyk commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants