Skip to content

TusKANNy/vectorium

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

174 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Vectorium

Vectorium is a Rust library for storing, accessing, and compressing dense and sparse embedding datasets. Multivector support (ColBERT-style late-interaction) is available as an optional feature. The main goal is to provide a unified dataset/encoder interface that can be shared by indexing/search crates such as Seismic and kANNolo.

If you are new to KNN: exhaustive KNN searches score every vector in the dataset and return the top‑k closest results. That is accurate but slow at scale. ANN indexes (HNSW, IVF, Seismic, etc.) trade a bit of accuracy for speed by building extra data structures (e.g., proximity graphs, inverted indexes) on top of the same dataset/encoder primitives.

Vectorium includes an exhaustive search API (Dataset::search) and a binary executable for ground-truth computation on CPU. For state‑of‑the‑art ANN indexing, use these tools: Seismic and kANNolo.

Cargo features

Feature What it enables Default
multivec DenseMultiVectorView/Owned, MultiVectorDataset, RerankIndex, and all multivec encoders (PlainMultiVecQuantizer, MultiVecProductQuantizer, MultiVecTwoLevelProductQuantizer) No
cli The compute_groundtruth binary No

Building

This repository targets nightly Rust (rust-toolchain.toml pins it).

If you want to compile the library only (dense and sparse datasets, no multivec, no binary):

RUSTFLAGS="-C target-cpu=native" cargo build --release

If you want the compute_groundtruth CLI binary (dense and sparse):

RUSTFLAGS="-C target-cpu=native" cargo build --release --features cli

If you want multivector support in the library (adds MultiVectorDataset, RerankIndex, all multivec encoders):

RUSTFLAGS="-C target-cpu=native" cargo build --release --features multivec

If you want both the CLI binary and multivector support:

RUSTFLAGS="-C target-cpu=native" cargo build --release --features "cli,multivec"

Using as a dependency

If you only need dense and sparse datasets:

[dependencies]
vectorium = { git = "https://github.com/TusKANNy/vectorium.git" }

If you also need multivector support:

[dependencies]
vectorium = { git = "https://github.com/TusKANNy/vectorium.git", features = ["multivec"] }

Command-line tool

Vectorium ships a binary compute_groundtruth (feature cli) under src/bin/ to exhaustive top‑k for a set of queries, writes a TSV file.

After building with --features cli, you will find compute_groundtruth in target/release/. You can also run it directly via Cargo:

RUSTFLAGS="-C target-cpu=native" cargo run --release --features cli --bin compute_groundtruth -- --help

This tool computes exhaustive top‑k neighbors for each query and writes the results as a TSV file. It is designed for research/benchmarking workflows where you need exact results (ground truth) to compare against an ANN index. It parallelizes across queries using Rayon, which is typically the right granularity for CPU ground‑truth runs.

Inputs and formats

  • --dataset-type dense (default): expects .npy inputs (dataset + queries). The reader loads .npy as f32 and can optionally convert the dataset storage via --value-type.
  • --dataset-type sparse: expects Seismic binary format inputs (dataset + queries) via read_seismic_format.

Output format

Each output line is:

query_id<TAB>doc_id<TAB>rank<TAB>score
  • query_id: 0‑based query index.
  • doc_id: a VectorId (0‑based vector index in the dataset).
  • rank: 1..=k.
  • score:
    • for --distance euclidean: squared Euclidean distance (smaller is better);
    • for --distance dotproduct: dot product score (larger is better).

Examples

Dense vectors stored in .npy, Euclidean, PQ:

RUSTFLAGS="-C target-cpu=native" cargo build --release --features cli
./target/release/compute_groundtruth \
  -i <path>/dataset.npy \
  -q <path>/queries.npy \
  -o sift_gt_l2.tsv \
  --dataset-type dense \
  --value-type f32 \
  --distance euclidean \
  --encoder pq \
  --pq-subspaces 8

Dense vectors stored in .npy, dot product, plain (no PQ):

RUSTFLAGS="-C target-cpu=native" cargo build --release --features cli
./target/release/compute_groundtruth \
  -i <path>/dataset.npy \
  -q <path>/queries.npy \
  -o sift_true_gt_ip.tsv \
  --dataset-type dense \
  --value-type f32 \
  --distance dotproduct \
  --encoder plain

Sparse dataset (Seismic binary), dot product, DotVByte compression:

RUSTFLAGS="-C target-cpu=native" cargo build --release --features cli
./target/release/compute_groundtruth \
  -i <path>/dataset.seismic \
  -q <path>/queries.seismic \
  -o sparse_gt_ip.tsv \
  --dataset-type sparse \
  --component-type u16 \
  --value-type f32 \
  --distance dotproduct \
  --encoder dotvbyte

Parameters and common combinations

  • -i, --input-file <PATH>: dataset path (.npy when --dataset-type dense, Seismic binary when sparse).
  • -q, --query-file <PATH>: query path (same format as the dataset type).
  • -o, --output-path <PATH>: output TSV path.
  • -k, --k <N>: number of neighbors to output per query (default: 10).
  • -d, --distance <euclidean|dotproduct>: scoring metric (default: euclidean).
  • -v, --value-type <f32|f16|bf16|fixedu8|fixedu16>: dataset storage value type (default: f32).
    • Dense queries are always f32.
    • Sparse queries are always f32 (dataset values may be quantized).
  • --dataset-type <dense|sparse>: dataset format (default: dense).
  • --component-type <u16|u32>: sparse component type (default: u32).
  • --encoder <plain|pq|dotvbyte>: encoder (default: plain).
    • pq: dense‑only, requires --value-type f32, and uses --pq-subspaces.
    • dotvbyte: sparse‑only, requires --component-type u16 and --distance dotproduct.
  • --pq-subspaces <M>: number of PQ subspaces (0 auto‑selects a supported value that divides the dataset dimension).
    • Product Quantization splits vectors into M equal‑sized chunks, so dim % M == 0 is required.
    • Supported values are currently {128, 96, 64, 32, 16, 8, 4}.

Rust library

The library is organized around two core concepts:

  1. A Dataset owns encoded vectors and an encoder.
  2. A VectorEncoder defines the encoded representation and how to evaluate distances via a QueryEvaluator.

Indexes typically store a dataset plus extra search structures. During search they repeatedly need distances between a query and candidate vectors inside the dataset. Vectorium is designed to make that path explicit and efficient: you can build a query evaluator once, then score many candidate vectors without decoding them first.

1) Dense datasets (growable and immutable)

use vectorium::{Dataset, DenseDataset, DenseVectorView, DotProduct, DatasetGrowable, PlainDenseDatasetGrowable, PlainDenseQuantizer};

let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut dataset = PlainDenseDatasetGrowable::new(encoder);

dataset.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
dataset.push(DenseVectorView::new(&[0.5, 1.5, 0.0]));

let frozen: DenseDataset<_> = dataset.into();
assert_eq!(frozen.len(), 2);
assert_eq!(frozen.get(0).values(), &[1.0, 0.0, 2.0]);

2) Sparse datasets (component/value pairs)

use vectorium::{Dataset, DotProduct, DatasetGrowable, PlainSparseDataset, PlainSparseDatasetGrowable, PlainSparseQuantizer, SparseVectorView};

let encoder = PlainSparseQuantizer::<u16, f32, DotProduct>::new(5, 5);
let mut dataset = PlainSparseDatasetGrowable::new(encoder);

dataset.push(SparseVectorView::new(&[1_u16, 3], &[1.0, 2.0]));
dataset.push(SparseVectorView::new(&[0_u16, 4], &[0.5, 3.5]));

let frozen: PlainSparseDataset<u16, f32, DotProduct> = dataset.into();
assert_eq!(frozen.len(), 2);
assert_eq!(frozen.get(0).components(), &[1_u16, 3]);

3) Distance computation with a query evaluator (index-style)

The typical pattern for an index is: build the evaluator once, then score many candidates.

use vectorium::{Dataset, DatasetGrowable, DenseDataset, DenseVectorView, DotProduct, PlainDenseDatasetGrowable, PlainDenseQuantizer, QueryEvaluator, VectorEncoder, VectorId};

let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut growable = PlainDenseDatasetGrowable::new(encoder);
growable.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
growable.push(DenseVectorView::new(&[0.5, 1.5, 0.0]));
let dataset: DenseDataset<_> = growable.into();

let query = DenseVectorView::new(&[0.2, 0.1, 0.7]);
let evaluator = dataset.encoder().query_evaluator(query);

let candidate: VectorId = 0;
let score = evaluator.compute_distance(dataset.get(candidate));
let _ = score;

4) Exhaustive search (top‑k baseline)

If you do not have an ANN index yet, Dataset::search(query, k) provides an exhaustive top‑k baseline:

use vectorium::{Dataset, DatasetGrowable, DenseDataset, DenseVectorView, DotProduct, PlainDenseDatasetGrowable, PlainDenseQuantizer};

let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut growable = PlainDenseDatasetGrowable::new(encoder);
growable.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
growable.push(DenseVectorView::new(&[0.5, 1.5, 0.0]));
let dataset: DenseDataset<_> = growable.into();

let query = DenseVectorView::new(&[0.2, 0.1, 0.7]);
let top2 = dataset.search(query, 2);
assert_eq!(top2.len(), 2);

5) Parallel ground-truth computation in Rust (query-level parallelism)

The easiest way to use multiple CPU cores is to parallelize across queries. This is the strategy used by compute_groundtruth.

use rayon::prelude::*;
use vectorium::{Dataset, DatasetGrowable, DenseVectorView, DotProduct, PlainDenseDatasetGrowable, PlainDenseQuantizer};

let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut growable = PlainDenseDatasetGrowable::new(encoder);
growable.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
growable.push(DenseVectorView::new(&[0.5, 1.5, 0.0]));

let queries = vec![vec![0.2_f32, 0.1, 0.7], vec![1.0_f32, 0.0, 0.0]];
let results: Vec<_> = queries
    .par_iter()
    .map(|q| growable.search(DenseVectorView::new(q.as_slice()), 1))
    .collect();

assert_eq!(results.len(), 2);

6) Multivector datasets (requires multivec feature)

Multivector datasets store variable-length sequences of token vectors — the representation used by late-interaction models such as ColBERT. Enable the feature in your Cargo.toml first (see Cargo features).

# #[cfg(feature = "multivec")]
# {
use vectorium::{
    Dataset, DatasetGrowable, DenseMultiVectorView, MultiVectorDatasetGrowable,
    PlainMultiVecQuantizer,
};

// Each document is a sequence of token vectors.  Here dim=2 and each doc has 2 tokens.
let encoder = PlainMultiVecQuantizer::<f32>::new(2);
let mut dataset = MultiVectorDatasetGrowable::new(encoder);

// Doc 0: two tokens [1.0, 0.0] and [0.0, 1.0]
dataset.push(DenseMultiVectorView::new(&[1.0_f32, 0.0, 0.0, 1.0], 2));
// Doc 1: two tokens [0.5, 0.5] and [1.0, 1.0]
dataset.push(DenseMultiVectorView::new(&[0.5_f32, 0.5, 1.0, 1.0], 2));

let frozen = dataset.into_immutable();
assert_eq!(frozen.len(), 2);
# }

7) Range-based access and prefetch

Datasets expose range_from_id/id_from_range so callers can keep lightweight handles to the underlying storage ranges. This is mainly useful for sparse/packed layouts, where range lookups can be a cache miss and prefetching can help.

use vectorium::{Dataset, DatasetGrowable, DenseDataset, DenseVectorView, DotProduct, PlainDenseDatasetGrowable, PlainDenseQuantizer};

let encoder = PlainDenseQuantizer::<f32, DotProduct>::new(3);
let mut growable = PlainDenseDatasetGrowable::new(encoder);
growable.push(DenseVectorView::new(&[1.0, 0.0, 2.0]));
let dataset: DenseDataset<_> = growable.into();

let id = 0;
let range = dataset.range_from_id(id);
dataset.prefetch_with_range(range.clone());
let view = dataset.get_with_range(range);
assert_eq!(view.values(), &[1.0, 0.0, 2.0]);

Design notes

  • Type safety by construction: datasets are sealed and are tied to a specific encoder type, so mixing dense/sparse/packed representations is prevented at compile time.
  • Evaluator-driven search: encoders build a QueryEvaluator from the query once; the evaluator can then score many dataset vectors. This matches how most ANN indexes are structured internally.
  • Distance ordering: distances implement Ord. DotProduct uses reversed ordering (larger is better), while SquaredEuclideanDistance uses the natural ordering (smaller is better). Distance values must not be NaN.
  • Range-based access: range_from_id/get_with_range and prefetch_with_range are meant for index implementations that keep storage ranges around (especially for sparse/packed datasets) and want predictable, low-overhead access.

License

MIT. See LICENSE-MIT.

About

A library for storing and accessing datasets of dense and sparse vectors, with efficient support for the core operations required by vector indexing data structures.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors