17 commits
f330d73
docs(rfc): integrate PiPNN as an alternative graph-index build algorithm
SeliMeli May 11, 2026
e8d1cd2
docs(rfc): rename to 01049-pipnn-integration.md after PR creation
SeliMeli May 11, 2026
4ba6d4a
docs(rfc): add M1 in-memory build/search milestone, list-format miles…
SeliMeli May 11, 2026
4fe210f
docs(rfc): address Copilot review comments
SeliMeli May 11, 2026
77950b3
docs(rfc): address reviewer comments from PR #1049
SeliMeli May 14, 2026
018dbe9
docs(rfc): add fixed-resource trade-off validation experiment (M6)
SeliMeli May 14, 2026
6121119
docs(rfc): address wuw92's cycle + determinism comments
SeliMeli May 14, 2026
61f1ea5
docs(rfc): narrow Stage 1 scope to disk-index full-rebuild only
SeliMeli May 14, 2026
d090895
docs(rfc): clarify PiPNN covers both initial builds and rebuilds
SeliMeli May 14, 2026
7de117e
docs(rfc): tighten prose, lift key points into tables
SeliMeli May 14, 2026
da8b2ac
docs(rfc): remove hybrid update model validation from Stage 1
SeliMeli May 14, 2026
d0baaba
docs(rfc): renumber milestones contiguously, drop edit-history phrasing
SeliMeli May 14, 2026
e1c2b8c
docs(rfc): drop obvious determinism note from checkpoint defer entry
SeliMeli May 14, 2026
4de515c
docs(rfc): strip obvious filler and defensive prose
SeliMeli May 14, 2026
d37bedc
docs(rfc): M1 clarify GEMM-friendly quantization; add M7 abstraction …
SeliMeli May 18, 2026
eb52c60
ci: retrigger workflow
SeliMeli May 18, 2026
7faab09
rfc(pipnn): clarify disk-format vs graph-content equivalence
SeliMeli May 21, 2026
303 changes: 303 additions & 0 deletions rfcs/01049-pipnn-integration.md
@@ -0,0 +1,303 @@
# Integrate PiPNN as an Alternative Graph-Index Build Algorithm

| | |
|---|---|
| **Authors** | Weiyao Luo |
| **Contributors** | DiskANN team |
| **Created** | 2026-05-11 |
| **Updated** | 2026-05-14 |

## Summary

Add **PiPNN** ([Pick-in-Partitions Nearest Neighbors](https://arxiv.org/abs/2602.21247)) as a second algorithm for **DiskANN's disk-index full-build path** — both initial builds and full rebuilds. PiPNN writes its graph in **Vamana's existing disk format** (same headers, node/edge layout, frozen-point convention), so the disk writer, PQ codes, and search pipeline are unchanged. The graph **contents themselves** differ from what Vamana would build at the same parameters — the two algorithms construct graphs differently — but recall@k matches at equivalent L (see Benchmarks). PiPNN is **up to 6.3× faster to build** on measured workloads. Vamana stays the default; PiPNN is opt-in behind a Cargo feature. In-memory PiPNN build is intentionally out of scope — DiskANN's in-mem path is for streaming per-point construction, which PiPNN's batch algorithm cannot do.

## Motivation

### How DiskANN builds today

DiskANN uses one algorithm — **Vamana** — which inserts points one-by-one with greedy search + `RobustPrune`. Clients use it in three modes:

| Mode | Description |
|---|---|
| Incremental | Per-point inserts on an in-memory graph (Vamana). Disk index not mutated. |
| Full rebuild | Rebuild the entire graph from a snapshot; produces an immutable disk index. |
| Partitioned full rebuild | Shard, build, merge — bounds peak RAM (`build_merged_vamana_index`). |

PiPNN is a faster substitute for the full-rebuild and partitioned full-rebuild modes. Incremental mode stays with Vamana because PiPNN's batch design has no efficient per-point insert (see "Batch-only" below).

### What PiPNN does

A four-phase **batch** builder:

1. **Partition (RBC).** Randomized Ball Carving recursively splits the dataset into small *overlapping* leaf clusters; each point lands in `fanout` of its nearest leaders at every level. Recursion stops at a leaf-size cap (`c_max`, ~256–1024).
2. **Local k-NN per leaf (GEMM).** For each leaf, compute the full pairwise distance matrix as one batched GEMM (intra-leaf N×N where N ≈ `c_max`), then extract per-point top-`leaf_k`. This is structurally different from a 1×N flat scan ([#1036](https://github.com/microsoft/DiskANN/issues/1036)) — the batching across `c_max²` evaluations is where PiPNN's wall-clock advantage comes from.
3. **HashPrune merge.** Merge edges from all leaves into a per-point reservoir of bounded size (`l_max` ~64–128), keyed by an LSH angular bucket for diversity. Naturally streamable — see memory mitigation.
4. **Optional final RobustPrune.** Same algorithm Vamana uses, applied as a single pass when the workload wants more geometric diversification.

Output: `Vec<Vec<u32>>` adjacency lists, handed to the existing disk writer. PQ training and search are unchanged.
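
Phase 2 can be sketched as follows. This is a minimal, unoptimized illustration and not the `diskann-pipnn` implementation: the function name `leaf_knn` and the row-major layout are assumptions, and the nested loops stand in for the single batched GEMM a real build would call. It relies on the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b, so every pairwise distance in a leaf comes from one Gram matrix.

```rust
/// Squared-L2 top-`leaf_k` neighbors for every point in one leaf.
/// `points` is row-major: n rows of `dim` f32 values (assumed layout).
fn leaf_knn(points: &[f32], dim: usize, leaf_k: usize) -> Vec<Vec<u32>> {
    let n = points.len() / dim;

    // Row norms ||x_i||^2.
    let norms: Vec<f32> = (0..n)
        .map(|i| points[i * dim..(i + 1) * dim].iter().map(|v| v * v).sum::<f32>())
        .collect();

    // Gram matrix G = X * X^T. A real build would issue one GEMM call here.
    let mut gram = vec![0.0f32; n * n];
    for i in 0..n {
        for j in 0..n {
            let a = &points[i * dim..(i + 1) * dim];
            let b = &points[j * dim..(j + 1) * dim];
            gram[i * n + j] = a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
        }
    }

    // d(i, j) = ||x_i||^2 + ||x_j||^2 - 2 * G[i][j]; keep the leaf_k closest per row.
    (0..n)
        .map(|i| {
            let mut cand: Vec<(f32, u32)> = (0..n)
                .filter(|&j| j != i)
                .map(|j| (norms[i] + norms[j] - 2.0 * gram[i * n + j], j as u32))
                .collect();
            cand.sort_by(|a, b| a.0.total_cmp(&b.0));
            cand.into_iter().take(leaf_k).map(|(_, j)| j).collect()
        })
        .collect()
}
```

For the four 1-D points 0, 1, 3, 7 with `leaf_k = 2`, point 0's neighbor list is points 1 and 2. The per-point lists produced here are what the HashPrune merge would consume across leaves.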

### Problem statement

Vamana full builds at 10M+ are the bottleneck:

| Dataset | Vamana build |
|---|---:|
| Enron 1M (384d) | 70s |
| BigANN 10M (128d) | 358s |
Contributor

Can you help me replicate these results using #856 for the inmem index? If I use the following to compare:

{
  "search_directories": [
    "<snip>"
  ],
  "output_directory": null,
  "jobs": [
    {
      "type": "async-index-build",
      "content": {
        "search_phase": {
          "groundtruth": "sift-1m.gt.bin",
          "num_threads": [
            16
          ],
          "queries": "sift-1m.queries.u8.bin",
          "reps": 5,
          "runs": [
            {
              "recall_k": 10,
              "search_l": [
                10,
                20,
                30,
                40,
                50,
                60,
                70,
                80,
                90,
                100
              ],
              "search_n": 10
            }
          ],
          "search-type": "topk"
        },
        "source": {
          "alpha": 1.2000000476837158,
          "backedge_ratio": 1.0,
          "data": "sift-1m.u8.bin",
          "data_type": "uint8",
          "distance": "squared_l2",
          "index-source": "Build",
          "insert_retry": null,
          "l_build": 50,
          "max_degree": 32,
          "multi_insert": null,
          "num_threads": 16,
          "save_path": null,
          "start_point_strategy": "medoid"
        }
      }
    },
    {
      "type": "async-index-build",
      "content": {
        "search_phase": {
          "groundtruth": "sift-1m.gt.bin",
          "num_threads": [
            16
          ],
          "queries": "sift-1m.queries.u8.bin",
          "reps": 5,
          "runs": [
            {
              "recall_k": 10,
              "search_l": [
                10,
                20,
                30,
                40,
                50,
                60,
                70,
                80,
                90,
                100
              ],
              "search_n": 10
            }
          ],
          "search-type": "topk"
        },
        "source": {
          "alpha": 1.2000000476837158,
          "backedge_ratio": 1.0,
          "data": "sift-1m.u8.bin",
          "data_type": "uint8",
          "distance": "squared_l2",
          "index-source": "Build",
          "insert_retry": null,
          "l_build": 50,
          "max_degree": 32,
          "multi_insert": null,
          "num_threads": 16,
          "save_path": null,
          "pipnn": {
              "algorithm": "PiPNN"
          },
          "start_point_strategy": "medoid"
        }
      }
    }
  ]
}

On my laptop, the index build takes 11.2 seconds with the inmem index and 15.6 seconds with PiPNN, and PiPNN's final recall is worse. Note that I am using the default PiPNN hyperparameters because I don't have a good idea of how to configure them. The PiPNN output is as follows:

Async Full-Precision Index Build

               tag: async-index-build
              file: <snip>
         data_type: uint8
        max degree: 32
           L-build: 50
             alpha: 1.2
start point strategy: Medoid
    backedge ratio: 1
Using Multi Insert: NO
start_point_strategy: Medoid
     build threads: 16
         Save Path: None
   build algorithm: PiPNN
       pipnn.c_max: 1024
       pipnn.c_min: 256
      pipnn.leaf_k: 3
       pipnn.l_max: 128
    pipnn.replicas: 1

PiPNN build: 1000000 points x 128 dims
  Vector store:  0.059s
PiPNN Build Timing
  LSH sketches:   0.111s
  Partition:      1.777s  (65526 leaves)
  Leaf build:     13.009s  (145781104 edges)
  Graph extract:  0.443s
  Final prune:    0.000s
  Total:          15.382s

  Avg degree: 31.9, Max degree: 32, Isolated: 0
  Edge transfer: 0.162s (medoid=89028)
  Total (incl. store + transfer): 15.602s



Index Build Time: 15.602119s
Vectors Inserted: 1000000
Kind: PiPNN
Insert Latencies:
  average: 15us
      p90: 15us
      p99: 15us

Can you provide suggestions on which hyper-parameters to use? Thanks!

Author

Hi Mark, with R=32 and L=50, PiPNN is quite close to Vamana. You can use this config:

  {
    "max_degree": 32,
    "l_build": 50,
    "alpha": 1.2,
    "distance": "squared_l2",
    "num_threads": 16,
    "pipnn": {
      "algorithm": "PiPNN",
      "c_max": 256,
      "c_min": 16,
      "p_samp": 0.005,
      "fanout": [10, 3],
      "leaf_k": 3,
      "l_max": 50,
      "num_hash_planes": 12,
      "final_prune": true
    }
  }

It should match Vamana's build speed and recall.

When max_degree is increased to 64 and L_build to 72 (PiPNN l_max 72), Vamana's build time is 13s while PiPNN's remains 7s.

| Enron 10M (384d) | 844s |

PiPNN completes the same builds **up to 6.3× faster** at matching recall (numbers below).

#### Trade-off hypothesis

> Given a fixed worker (CPU/RAM/SSD), PiPNN delivers higher build throughput than Vamana at matching recall *when its working set fits*. Below that threshold, the three-tier dispatch (one-shot → disk-edges → merged-shards) keeps PiPNN at or under Vamana's RAM footprint.

On BigANN 10M, 16 threads:

| RAM budget | PiPNN strategy | PiPNN build | Vamana build |
|---:|---|---:|---:|
| ≥ 12 GB | one-shot | 80–133s | 358s |
| 6–12 GB | disk-edges | ~126s | 358s |
| 3–6 GB | merged-shards | ~332s | ~358s (partitioned) |
Comment on lines +57 to +61
Contributor

Can you elaborate on the disk-edges and the merged-shards strategies? Not sure if this is mentioned somewhere else

Author

Disk-edges is briefly introduced below. Basically, we split PiPNN into partition and leaf build + HashPrune and graph construction so that the dataset does not have to stay in memory for the whole build, which reduces the memory footprint.
Merged-shards is just Vamana's sharded build.


**Neither algorithm converts surplus RAM into faster builds.** PiPNN's wall-clock is bottlenecked by HashPrune + GEMM; once the working set fits, more RAM doesn't help. PiPNN trades a higher *minimum* RAM budget for a faster build at that budget. Validation: see **M4**.

### Hybrid update model (Stage 2 direction)

Both algorithms write the same disk format, so a graph built by either can be loaded and (once in memory) extended by Vamana. Production update story:

- **Full rebuild → PiPNN.** Several times faster than Vamana.
- **Incremental insert → in-memory Vamana.** Unchanged — applies to the in-memory graph; the disk index file is not mutated in place.
- **Rebuild triggers.** Embedding rotation, schema/parameter retuning, large batch inserts, or periodic safety rebuilds — not just gradual recall drift. DiskANN's claim that incremental updates keep recall healthy still stands; PiPNN just makes the eventual rebuild cheaper.

This is why we don't need PiPNN to support `insert(point)` — the disk format is the integration point between batch and incremental.
Comment on lines +69 to +73
Contributor

I don't understand what this section is providing a solution for. Is the key takeaway here that building the graph with one algorithm doesn't preclude us from using another one in the future?

Author

Yes. Basically, the two graphs are equivalent in quality and identical in disk format.


### Two-stage rollout

- **Stage 1 (this RFC).** Land PiPNN as an alternative builder for the disk-index full-build path (initial builds *and* rebuilds), behind a `pipnn` Cargo feature. Vamana stays default.
- **Stage 2 (separate proposal, gated by Stage-1 milestones).** Retire Vamana's full-build path; keep Vamana for in-memory incremental inserts.

In-memory PiPNN build is not in any stage — see "Out of scope".

### Goals (Stage 1)

1. **Pluggable disk builder** — a selector that routes Vamana (default) vs. PiPNN (opt-in). No behavior change at existing call sites.
2. **Disk-format compatibility** — PiPNN emits Vamana's on-disk schema (node/edge layout, headers, frozen-point convention, PQ/SQ codes all unchanged), so search loads PiPNN-built indexes with no code change. The graph **contents** differ from Vamana's by design — the two algorithms construct graphs differently — but recall@k is equivalent across tested datasets.
3. **API backward compatibility** — `DiskIndexBuilder`, `IndexConfiguration`, JSON schema all stay additive.
4. **Feature parity for the full-build role** — deliver the Vamana disk-build capabilities PiPNN still lacks (quantization, label filters).
5. **Memory mitigation** — a three-tier build path that brings PiPNN's peak RSS to or under Vamana's at a documented build-time cost.

## Proposal

### Workspace structure

Add a crate `diskann-pipnn` depending on `diskann`, `diskann-linalg`, `diskann-vector`, `diskann-quantization`, `diskann-utils`. **It does not depend on `diskann-disk`** — that would form a cycle with the consumer-side feature gate.

```text
diskann, diskann-linalg, diskann-quantization, diskann-vector, diskann-utils
                 │  (shared deps, no edges to builders)
     ┌───────────┴────────────┐
diskann-pipnn            diskann-disk
     │                        ↑ feature "pipnn"
     └───→ Vec<Vec<u32>> ─────┘
```

### `BuildAlgorithm` enum

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Default, PartialEq, Serialize, Deserialize)]
#[serde(tag = "algorithm")]
pub enum BuildAlgorithm {
    #[default]
    Vamana,

    #[cfg(feature = "pipnn")]
    PiPNN {
        c_max: usize, c_min: usize, p_samp: f64,
        fanout: Vec<usize>, leaf_k: usize, replicas: usize,
        l_max: usize, num_hash_planes: usize,
        final_prune: bool, leader_cap: usize,
        saturate_after_prune: bool,
        num_threads: usize, // 0 = all logical CPUs
    },
}
```

`DiskIndexBuildParameters` gains a `build_algorithm` field; JSON config gains an optional `build_algorithm` block.

**`num_threads` is not a RAM knob.** Per-thread overhead is ~16–20 MB (stripe + leaf-build scratch). Use `build_ram_limit_gb` to bound RAM.

**Feature-flag deserialization is config-only, not index-file.** Index files built by either algorithm load with or without the `pipnn` feature. A JSON config naming `"algorithm": "PiPNN"` fed to a binary built without the feature fails fast with `unknown variant 'PiPNN'`; configs that omit `build_algorithm` parse identically across feature builds.

### Builder dispatch

```rust
match build_parameters.build_algorithm() {
    BuildAlgorithm::Vamana => self.build_inmem_vamana_index().await,
    #[cfg(feature = "pipnn")]
    BuildAlgorithm::PiPNN { .. } => self.build_inmem_pipnn_index().await,
}
```

The PiPNN branch produces `Vec<Vec<u32>>` via `diskann_pipnn::builder::build_typed` and hands it to the existing `DiskIndexWriter`.

### Compatibility

| Surface | Status |
|---|---|
| On-disk graph format | unchanged |
| PQ/SQ codes on disk | unchanged |
| Search API | unchanged |
| Public Rust types | additive only (new field with default) |
| Benchmark JSON config | additive only (new optional field) |

### Feature gating

`diskann-disk` gains a `pipnn` Cargo feature, off by default. With it off, `BuildAlgorithm::PiPNN` does not exist at the type level — no runtime branch, no `diskann-pipnn` dependency.

## Trade-offs

### Batch-only (algorithmic, not implementation)

The PiPNN paper "eliminates search from graph-building" — partition first, then one batched GEMM per leaf. The batching advantage **requires knowing leaf membership before computing distances**; at batch size 1, PiPNN reduces to per-point distance work no faster than Vamana's greedy insert. Two phases (partition, leaf k-NN) are batch-by-design; HashPrune and final RobustPrune happen to be online, but you've already paid for the batch phases by the time you reach them.

### Memory vs. build speed

PiPNN holds more working memory than Vamana — dominated by the **HashPrune reservoir** (`l_max × 8 B` per point). On BigANN 10M (`c_max=256, fanout=[10,3], leaf_k=3, l_max=64`):

| | PiPNN one-shot | Vamana |
|---|---:|---:|
| Peak RSS | 10.8 GB | 6.3 GB |
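
As a rough check on the `l_max × 8 B` per-point figure, the arithmetic can be written out. This is a sketch under that stated figure only; the function names are illustrative, and the result covers the reservoir alone, not the rest of the working set.

```rust
/// Bytes held by the HashPrune reservoir, using the `l_max × 8 B`
/// per-point figure quoted in the text.
fn reservoir_bytes(num_points: u64, l_max: u64) -> u64 {
    const BYTES_PER_SLOT: u64 = 8; // per-slot size taken from the text
    num_points * l_max * BYTES_PER_SLOT
}

/// Decimal gigabytes (1 GB = 1e9 bytes).
fn bytes_to_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

For 10M points and `l_max = 64`, `reservoir_bytes(10_000_000, 64)` is 5,120,000,000 bytes, about 5.12 GB. Doubling `l_max` to 128 doubles that to about 10.24 GB.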

Mitigation via the three-tier build (dispatched by the existing `build_ram_limit_gb` knob):

| Strategy | Peak RSS | Build | Recall@10 L=50 | Trigger |
|---|---:|---:|---:|---|
| One-shot | 10.8 GB | 133s | 95.00% | RAM ≥ ~32 GB |
| Disk-edges | 6.4 GB | 126s | 95.00% | RAM 8–32 GB |
| Merged shards | 3.3 GB | 332s | 95.31% | RAM 4–8 GB |

Disk-edges roughly matches Vamana's RAM (6.4 vs. 6.3 GB) at ~3× the build speed (126s vs. 358s). Merged-shards uses *less* RAM than Vamana (3.3 vs. 6.3 GB) at a 2.5× build-time cost relative to one-shot PiPNN (332s vs. 133s).
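
The tier selection could look like the sketch below. The type and function names (`BuildStrategy`, `TierThresholds`, `pick_strategy`) are hypothetical, and the thresholds are passed in rather than hard-coded because the exact cut-offs are still to be validated. Budgets below the lowest documented trigger simply fall through to merged shards here, which is an assumption of this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BuildStrategy {
    OneShot,
    DiskEdges,
    MergedShards,
}

/// Hypothetical thresholds in GB; not part of the proposed API.
#[derive(Debug, Clone, Copy)]
struct TierThresholds {
    one_shot_min_gb: f64,
    disk_edges_min_gb: f64,
}

/// Pick the fastest tier whose RAM trigger the budget satisfies.
fn pick_strategy(build_ram_limit_gb: f64, t: TierThresholds) -> BuildStrategy {
    if build_ram_limit_gb >= t.one_shot_min_gb {
        BuildStrategy::OneShot
    } else if build_ram_limit_gb >= t.disk_edges_min_gb {
        BuildStrategy::DiskEdges
    } else {
        // Assumption: anything smaller uses the sharded path.
        BuildStrategy::MergedShards
    }
}
```

With the "Trigger" column above as thresholds (`one_shot_min_gb = 32.0`, `disk_edges_min_gb = 8.0`), a 16 GB budget selects `DiskEdges` and a 4 GB budget selects `MergedShards`.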

### Alternatives considered

| | Choice | Rejected because |
|---|---|---|
| A | (Chosen) Add PiPNN behind a feature flag | — |
| B | Replace Vamana immediately | PiPNN lacks checkpoint / full quantization / label-filter parity; production-validation gap |
| C | Separate top-level crate/binary | Duplicates PQ training, disk writer, search pipeline — maintenance burden, no compatibility benefit |

### Algorithm risks

Recall depends on partition overlap (`fanout`) and reservoir size (`l_max`) — a larger parameter space than Vamana's `R`/`L_build`. Mitigation: keep Vamana as default and ship reference parameter sets per workload class.

## Benchmark Results

Azure `Standard_L16s_v3`, 16 threads, NVMe, `RUSTFLAGS=-C target-cpu=native`.

### Build time

| Dataset | Vamana | PiPNN | Speedup |
Contributor

Are the Vamana results here with the spherical/scalar quantizer?

Author

No, the results here are from full-precision (FP) builds.

|---|---:|---:|---:|
| Enron 1M (384d) | 70s | 13s | 5.4× |
| BigANN 10M (128d) | 358s | 80s | 4.5× |
| Enron 10M (384d) | 844s | 133s | 6.3× |

### BigANN 10M — recall × QPS

Default PiPNN (`c_max=256, fanout=[10,3], leaf_k=3, l_max=64, pq_chunks=64`) vs. Vamana (`R=64, L=64, pq_chunks=64`):

| L | PiPNN R@10 | PiPNN QPS | Vamana R@10 | Vamana QPS |
|---|---:|---:|---:|---:|
| 10 | 77.76% | 10,670 | 79.23% | 11,618 |
| 50 | 96.31% | 5,574 | 97.10% | 5,940 |
| 100 | 98.61% | 3,430 | 99.01% | 3,568 |

With a higher-recall config (`c_max=512, fanout=[10,4], l_max=128, final_prune`), PiPNN matches/exceeds Vamana at L=50 (97.22%) and L=100 (99.21%) at 143s (still 2.5× faster).

### Enron 10M (384d) — recall × QPS

PiPNN (`c_max=256, fanout=[8,3], leaf_k=2, l_max=64, pq_chunks=192`) vs. Vamana (`R=64, L=72, pq_chunks=192`):

| L | PiPNN R@1000 | PiPNN QPS | Vamana R@1000 | Vamana QPS |
|---|---:|---:|---:|---:|
| 1000 | 89.99% | 378 | 89.33% | 384 |
| 2000 | 96.46% | 192 | 95.36% | 195 |
| 3000 | 97.74% | 129 | 96.68% | 130 |

PiPNN beats Vamana on recall at every L at near-parity QPS (within 2%), with a 6.3× faster build.

## Future Work — Stage-1 Milestones

Stage 1 covers build-from-scratch and full rebuilds with PiPNN. M0 ships in this RFC; M1–M6 are follow-on work, parallelizable where dependencies allow, and M7 is the Stage-2 entry gate.

### M0 — Skeleton integration (this RFC)
Crate, `BuildAlgorithm` enum, dispatch behind `pipnn` Cargo feature. JSON config gains optional `build_algorithm`. CI smoke test (SIFT-1M) with `--features pipnn`.

### M1 — Quantization parity
PiPNN's leaf-build computes pairwise distances via batched GEMM, so its quantization options are constrained by what hardware GEMM units accelerate: **int8** (AVX-512-VNNI, AMX-INT8), **fp16** (AVX-512-FP16), and **bf16** (AVX-512-BF16, AMX-BF16). 1-bit is not in that set — GEMM hardware does multiply-accumulate, not XOR+popcount.

Vamana's existing quantized path is **SQ1 (1-bit) Hamming**, which is a fundamentally different kernel (XOR+popcount, not MAC). PiPNN cannot directly reuse Vamana's SQ1 quantizer for the same speedup story; porting requires adapting the quantization pipeline to a GEMM-compatible format and computing distance through the quantized-GEMM dispatch.
Contributor

Why is supporting quantized GEMM in M1? This feels like a nice to have that can come later no? Unless it's needed for RAM savings...

Author

Supporting quantized GEMM also unblocks native fp16/u8 GEMM, but the priority could be changed.


Scope:
- Extend PiPNN with **SQ_8 (int8)** quantization via the trained `ScalarQuantizer` from `diskann-quantization`, producing the packed int8 buffers the quantized-GEMM interface consumes.
- fp16 and bf16 are natural follow-ons through the same dispatch path.

**Pass:** SQ_8 PiPNN recall within 0.5% of FP-PiPNN on BigANN 10M and Enron 10M.
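
To make the kernel difference concrete, the sketch below contrasts an int8 multiply-accumulate dot product with a 1-bit Hamming distance. It is illustrative only: `quantize_i8` is a simple symmetric max-abs quantizer standing in for the trained `ScalarQuantizer`, whose actual API is not shown in this RFC, and all three function names are assumptions.

```rust
/// Symmetric int8 quantization: x ≈ scale * q, with q in [-127, 127].
/// Stand-in for a trained scalar quantizer (assumption, not the real API).
fn quantize_i8(v: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = v.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q: Vec<i8> = v
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Multiply-accumulate dot product with i32 accumulation:
/// the shape of work that int8 GEMM hardware accelerates.
fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| (x as i32) * (y as i32)).sum()
}

/// 1-bit Hamming distance over packed bits: XOR + popcount, no multiplies.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

`dot_i8` maps onto a GEMM inner loop, so a leaf's pairwise distances can still be batched. `hamming` has no multiply to batch, which is why the SQ1 path does not carry over directly.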

### M2 — Label-filtered indexes
Run filter benchmark configs with `BuildAlgorithm::PiPNN`; confirm filter-recall within ±1% of Vamana. Partition may need label-aware leaf assignment for high-cardinality labels.

### M3 — Three-tier memory dispatch
Implement and validate the disk-edges + merged-shards paths selected by `build_ram_limit_gb`. **Pass:** at `build_ram_limit_gb=4`, PiPNN-merged on BigANN 10M has peak RSS ≤ 4 GB and recall within 1% of one-shot.

Two disk-edges variants are on the table: (i) materialize all leaf edges then stream HashPrune (current prototype), or (ii) interleave leaf-build + HashPrune in chunks. The second avoids full edge-set materialization at the cost of a second partition pass.

### M4 — Fixed-resource trade-off validation
Validates the **trade-off hypothesis** from the Problem Statement.

- **Setup.** Lock CPU/SSD on a fixed worker; enforce RAM via cgroups. Sweep RAM `{3, 6, 8, 12, 16, 24, 32}` GB on BigANN 10M; include rows for Enron 10M and a 100M-scale dataset.
- **Cells.** Vamana one-shot, Vamana partitioned, PiPNN one-shot, PiPNN disk-edges, PiPNN merged-shards. OOM cells are valid results.
- **Metrics.** Wall-clock, peak RSS, CPU util, SSD bytes, recall@K, QPS — reported as **vectors/min/worker** for cross-shape comparison.
- **Pass.** Documented matrix with a clearly-better algorithm (or tie) per budget at matching recall. Surprises are Stage-1 blockers.
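
The throughput metric can be computed as below. The function name and signature are illustrative and not part of any existing benchmark harness.

```rust
/// Build throughput in vectors per minute per worker.
fn vectors_per_min_per_worker(num_vectors: u64, wall_clock_secs: f64, workers: u32) -> f64 {
    let minutes = wall_clock_secs / 60.0;
    num_vectors as f64 / minutes / workers as f64
}
```

For example, a 10M-vector build that takes 80s on one worker is 7.5M vectors/min/worker, while the same dataset at 358s is about 1.68M vectors/min/worker.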

### M5 — Production validation: recall × QPS × dimensionality
End-to-end on the full workload mix. Datasets: BigANN, Enron, plus one production-representative. Scales 10M and 100M (billion if hardware permits). Metrics `squared_l2` and `cosine_normalized`. **Pass:** per cell, PiPNN recall@K within ±1% of Vamana's at matching QPS, or higher QPS at matching recall.
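
A per-cell check of the pass criterion might look like this sketch. `recall_at_k` scores a single query (benchmark numbers are averages over all queries), and `within_tolerance` takes recall as a fraction, so ±1% is `0.01`. Both names are hypothetical.

```rust
use std::collections::HashSet;

/// recall@k for one query: the fraction of the true top-k ids
/// that appear in the returned top-k.
fn recall_at_k(ground_truth: &[u32], returned: &[u32], k: usize) -> f64 {
    let truth: HashSet<u32> = ground_truth.iter().take(k).copied().collect();
    let hits = returned
        .iter()
        .take(k)
        .filter(|id| truth.contains(*id))
        .count();
    hits as f64 / k as f64
}

/// True when two recall values (as fractions) differ by at most `tol`.
fn within_tolerance(pipnn_recall: f64, vamana_recall: f64, tol: f64) -> bool {
    (pipnn_recall - vamana_recall).abs() <= tol
}
```

With ground truth `[1, 2, 3, 4]` and returned ids `[1, 2, 9, 4]`, recall@4 is 0.75.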

### M6 — Operational readiness
Telemetry (per-phase timing + RSS via existing OTel tracer), permanent docs replacing experimental `CLAUDE.md` notes, runbook (OOM, partition timeout, `l_max` saturation), default parameter recommendations per workload class.

### M7 — Abstraction convergence with `diskann` *(Stage-2 entry gate)*

Stage-1 lands PiPNN as a separate crate that touches several primitives that ought to be shared with the rest of `diskann`. Stage 2 doesn't commit to retiring Vamana's full-rebuild path until those primitives converge — otherwise we accept two parallel implementations of the same low-level building blocks long-term.

- **a. Pruning.** Vamana's `RobustPrune` / `occlude_list` (currently `pub(crate)` in `diskann/src/graph/internal/prune.rs`) is exposed via a shared pruning surface — either promoted to `pub` or extracted to a `diskann-prune` crate. PiPNN's `final_prune_from_candidates` (currently a duplicate of that logic in `diskann-pipnn/src/builder.rs`) is replaced by a call into the shared API. PiPNN's HashPrune (LSH-bucketed reservoir merge) co-locates in the same shared pruning module for code-organization consistency. Whether Vamana can usefully consume HashPrune is an exploration item — if yes, Vamana reuses; if no, the code still lives in the right place.

- **b. Quantized GEMM.** PiPNN's quantized leaf-build distance consumes the **quantized-GEMM interface** that `diskann-quantization` is developing for multi-vectors (int8/bf16/fp16 inputs, f32 accumulation). Whichever side ships first contributes: if the multi-vector quantized GEMM lands first, PiPNN reuses; if PiPNN ships its GEMM solution first, that solution is contributed back to `diskann-quantization` so other components (including a future Vamana variant if it wants) can adopt it. Either direction, no PiPNN-internal quantized distance kernel remains.

- **c. GEMM dispatch.** `diskann-pipnn/src/gemm.rs` (today a thin `faer` wrapper) consumes the same shared GEMM / quantized-GEMM interface per (b). PiPNN's leaf-build shapes (m=n=256, k=128 / 384) and partition-stripe shapes (m≈1000, n up to 65k, k=128 / 384) are included in that interface's coverage tests.

- **d. Distance kernels.** PiPNN's per-leaf and per-stripe distance computations fully delegate to `diskann_vector::distance` for both FP and quantized metrics. No bypass paths.

**Validation:** a diff of `diskann-pipnn/src/` showing only algorithm-specific code (partition / scheduler / builder orchestration), no primitive duplication.

### Deferred to Stage 2

- **Hybrid update model validation.** The Stage-2 loop (PiPNN build → incremental Vamana inserts → recall decay → rebuild) belongs with the Stage-2 proposal that adopts the model.

- **Checkpoint / resume.** Vamana's streaming checkpoint design doesn't fit PiPNN's batch phases, and operational value is lower (PiPNN's BigANN-10M build is ~80s).

### Out of scope: not part of any stage

- **In-memory PiPNN build.** The in-mem `DiskANNIndex` exists for streaming construction, which PiPNN can't do efficiently. If a non-streaming in-mem consumer ever wants PiPNN's speed: build to disk, then load.
- **Build-time PQ distance kernel.** Not used by Vamana in production today.
- **PiPNN incremental insert/delete API.** The hybrid update model (Vamana inserts on the in-memory graph, PiPNN for full rebuilds) removes the need.
- **Frozen-point semantics.** PiPNN writes the medoid as the single frozen start point — already byte-compatible with Vamana's default.
- **Multi-vector index support.** Revisit only if a production workload requires it.

## References

1. [PiPNN: Pick-in-Partitions Nearest Neighbors (arXiv:2602.21247)](https://arxiv.org/abs/2602.21247)
2. [Vamana / DiskANN (NeurIPS 2019)](https://papers.nips.cc/paper/9527-rand-nsg-fast-accurate-billion-point-nearest-neighbor-search-on-a-single-node.pdf)
3. Existing disk index layout: `diskann-disk/src/storage/`
4. Existing Vamana builder: `diskann-disk/src/build/builder/build.rs`