Skip to content

Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079

Draft
suhasjs wants to merge 24 commits into
mainfrom
mhildebr/saveload
Draft

Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079
suhasjs wants to merge 24 commits into
mainfrom
mhildebr/saveload

Conversation

@suhasjs
Copy link
Copy Markdown
Contributor

@suhasjs suhasjs commented May 16, 2026

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
  • Does this PR modify any existing APIs?
  • Is the change to the API backwards compatible?
  • Should this result in any changes to our documentation, either updating existing docs or adding new ones?

Reference Issues/PRs

Initial attempt at #737

What does this implement/fix? Briefly explain your changes.

Introduces two new traits (diskann_record::save::Save and diskann_record::load::Load)and a new create diskann-record to support file-based serialization + deserialization. This PR also implements the traits for DiskANNIndex (and all nested structs) for a successful prototype.
Sample output can be found here.

Any other comments?

This PR is meant to be in a draft state until we're happy with the abstractions and the results.

Mark Hildebrand and others added 21 commits April 15, 2026 14:32
…borrow for Writer::finish() since it calls std::io::Write::flush() that needs a mutable borrow
… wherever needed + modify writer return type to Result<>
… diskann-providers with a new SingleUseWriteProvider (also for reads with SingleUseReadProvider)
…ad_from_bin<> to only use one read since VectorDataIterator already has the metadata needed for
@suhasjs suhasjs added this to the 2026-05 milestone May 16, 2026
@suhasjs suhasjs self-assigned this May 16, 2026
@suhasjs suhasjs added the enhancement New feature or request label May 16, 2026
@suhasjs suhasjs moved this to In Progress in DiskANN backlog May 16, 2026
Copy link
Copy Markdown
Contributor

@hildebrandmw hildebrandmw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Suhas! My comments this round are mainly about enum handling. I think we can do this more cleanly by having enums "handle themselves" rather than lifting it into a somewhat fragile interaction in the Save/Load traits.

Other than that, the code inside diskann-record could use considerable hardening and testing.

Do we want to plan out how plugins like VFS would work in this PR, or save that for the future? I ask because VFS is used pretty heavily in tests, so having a replacement that's compliant with this new system will be required. And it will be a good stress test that we can expand the backend serialization to other formats in the future.

Comment thread diskann-record/src/save/mod.rs Outdated
/// load that the tag's presence matches the corresponding [`Load::IS_ENUM`].
fn variant(&self) -> Option<Cow<'_, str>> {
None
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this for longer than I care to admit, I think we can get rid of enum handling entirely at the trait level. If we consider three cases:

enum UnitEnum {
    A,
    B,
    C,
}

enum InlineStruct {
    A(f32),
    B {
       x: String,
       y: f32,
    }
}

enum Struct {
    A(A),
    B(B),
}

struct A {
    field: f32,
}

struct B {
    x: String,
    y: f32,
}

The serialized representations in JSON could be

{
    "$version": "0.0.0",
    "A": null,
}

{
    $version": "0.0.0",
    "B": {
        "x": "hello world",
        "y": 1,
    }
}

{
    "$version": "0.0.0",
    "A": {
        "$version": "0.0.1",
        "0": 1,
    }
}

For versions 1 and 2, there is just a top level version, which makes sense since the structure of these enums evolves in lockstep with the outer enum. For Version 3, this allows nested objects in enums to have distinct versions naturally.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$variant is a little more explicit in how it handles enums. But since most of our enums fall into one of the three variants you described, I've gone ahead and removed $variant handling.

enum ErrorInner {
Light(Kind),
Heavy(anyhow::Error),
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I still really like the Light/Heavy idea, if we're honest it's probably overkill and will likely not get implemented ever.

Instead, the main goal I think is to differentiate crucial errors vs "recoverable" errors, where the latter is used for a probing API. In that context, we'd have something like:

struct Error {
    inner: anyhow::Error,
    recoverable: bool,
}

Most of the constructors would imply critical errors, with probing APIs using an additional (_recoverable) API. This keeps normal code simple and the probing API possible and reasonably efficient.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did seem overkill. I've removed it in the latest commit.

"handle references file {:?} which is not registered in the manifest",
key,
)));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth adding a

if key.contains("..") || key_as_path.is_absolute() {
   return Err(/* ... */);
}

to keep files inside the directory.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Added it.

* Licensed under the MIT license.
*/

//! Adapters that expose a single in-hand [`Read`]/[`Write`] target as a
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we go ahead and start updating the exiting code to use io::Read/io::Write directly so we don't need this thing?

Copy link
Copy Markdown
Contributor Author

@suhasjs suhasjs May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By 'existing code', do you mean all code that uses Storage{Read/Write}Provider traits? Not sure I understood your ask here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the code that currently takes Storage{Read/Write}Provider does so by taking the provider and a file name and immediately opening the necessary file. These could all be pretty easily migrated to taking std::io::Read and std::io::Write directly, moving the file open up one level and avoiding the need for this hacky one-off provider.


pub fn get(&self, key: &str) -> Option<&Value<'a>> {
self.record.get(key)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the vein of moving enums to handling themselves, if we add:

impl<'a> Record<'a> {
    fn keys(&self) -> Keys<'_> {
        Keys::new(self.record.key())
    }
}

struct Keys<'a> {
    keys: std::collections::hash_map::Keys<'a, Cow<'a, str>, Value<'a>>,
}

impl<'a> Iterator for Keys<'a> {
    // iterate
}

impl ExactSizeIterator for Keys<'_> {}

Then enums could have a way of inspecting the keys for loading.

@arrayka arrayka linked an issue May 19, 2026 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Document and extend index serialization format

2 participants