Fix 4 critical bugs causing near-zero recall#17
Closed
suykerbuyk wants to merge 1 commit into
Closed
Conversation
This commit fixes the root causes of the 0-2/10 recall measured when integrating coder/hnsw into vibe-palace's semantic search pipeline. Four distinct bugs were identified and fixed: 1. Heap.Max() and PopLast() returned wrong elements (heap/heap.go) The min-heap's Max() returned data[len-1] (the last array element), which is NOT the maximum in a binary min-heap. PopLast() removed that same arbitrary element. This corrupted search result sets: when evicting the "worst" candidate to make room for a better one, a random element was removed instead. Fixed by scanning leaf nodes (indices n/2 through n-1) to find the true maximum. 2. Search termination condition was too aggressive (graph.go) The search algorithm terminated when no neighbor improved result.Min() (the BEST result), but the standard HNSW algorithm terminates when the best remaining CANDIDATE is farther than result.Max() (the WORST result). The old condition stopped exploring far too early, leaving most of the k result slots filled with suboptimal nodes. After this fix, recall on synthetic 200-node/64-dim data jumped from ~0.46 to ~0.96. 3. replenish() hardcoded CosineDistance (graph.go:168) When a node was deleted, the neighbor repair function always used CosineDistance regardless of the graph's configured distance function. This corrupted graph topology for any non-Cosine metric. Fixed by threading DistanceFunc through the addNeighbor() -> replenish() -> isolate() -> Delete() call chain. 4. Stale elevator panic (graph.go Add/search methods) When traversing layers top-to-bottom, the elevator node (best entry point from the previous layer) could reference a node that had been deleted between layers. Added existence checks with fallback to layer.entry(). Addresses upstream issue coder#15. Test embeddings: TF-IDF weighted feature hashing The integration tests and sample CLI use a deterministic text embedding based on the hashing trick with TF-IDF weighting. Each word hashes to a dimension index (feature hashing), weighted by its inverse document frequency. Stop words (~40 common English words) are removed so that rare discriminating terms dominate the vector. Embedding dimension is 2048 to minimize hash collisions (~6400 unique terms in the corpus). This produces meaningful lexical similarity — a query for "Arms" finds the 2nd Amendment and Mexico's Article 10 (right to keep arms). This is a test/demo utility only; vibe-palace uses real transformer embeddings (all-MiniLM-L6-v2 via hugot) which are unaffected by this. Test changes: - Heap tests: 1 test -> 11 tests, coverage 61.1% -> 100% - Graph tests: added 16 new tests covering SearchWithDistance, Lookup, empty graph ops, Dims, Len, AddReplace, delete with Euclidean/Cosine, stale elevator regression, dimension mismatch - Distance tests: RegisterDistanceFunc, name roundtrip, unknown func - Encode tests: error paths for bad version, unknown distance, truncated data, string key round-trip - Integration tests: constitution corpus (US/Canada/Mexico/Amendments) with recall@k vs brute-force ground truth, delete+search, export/import round-trip, benchmarks - Updated TestGraph_AddSearch expected results to reflect correct nearest neighbors after heap fix: {64,65,63,66} not {64,65,62,63} - Overall coverage: main 89.1%, heap 100% New files: - integration_test.go: integration tests with TF-IDF feature hashing - example/constitution-search/main.go: sample CLI application - test-data/*.md: US Constitution, Amendments, Canada, Mexico - Makefile: standard build facade (help, build, test, integration, bench, vet, check, install, clean)
3 tasks
Member
|
I'm not interested in reviewing a generated PR this large. If you want to make this reviewable get it down to a few test cases showing the bugs and minimal fix. |
Contributor
Author
|
Closing to restructure into smaller, focused PRs as requested. Will resubmit as separate minimal PRs — one per bug fix with compact tests. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes 4 distinct bugs that caused near-zero recall when using HNSW search on non-trivial datasets. These were discovered through integration testing with real document embeddings (TF-IDF over constitutional texts) and validated against brute-force ground truth.
Bugs fixed
Heap.Max() and PopLast() returned wrong elements (
heap/heap.go) — The min-heap'sMax()returned the last array element, which is NOT the maximum in a binary min-heap. When evicting the "worst" candidate to make room for a better one, a random element was removed instead. Fixed by scanning leaf nodes (indices n/2 through n-1).Search termination was too aggressive (
graph.go) — Terminated when no neighbor improvedresult.Min()(the best result), but the HNSW algorithm terminates when the best remaining candidate is farther thanresult.Max()(the worst result). This stopped exploration far too early.replenish() hardcoded CosineDistance (
graph.go:168) — Neighbor repair after deletion always used CosineDistance regardless of the graph's configured distance function, corrupting topology for non-Cosine metrics.Stale elevator panic (
graph.go) — When traversing layers top-to-bottom, the elevator node could reference a deleted node. Added existence checks with fallback tolayer.entry(). Addresses Panic will appear when running this demo. #15.Test additions
Also adds a Makefile (
make testfor unit,make integrationfor full suite) and aconstitution-searchexample CLI.Test plan
go test -short ./...— all unit tests passgo test ./...— all integration tests pass with recall thresholds metgo test -bench=. ./...— benchmarks run clean🤖 Generated with Claude Code