Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079
Support file-based serialize/deserialize for DiskANN indices using Save/Load traits#1079suhasjs wants to merge 24 commits into
Conversation
…borrow for Writer::finish() since it calls std::io::Write::flush() that needs a mutable borrow
…ave::Results<Handle>
…w directly impls for Value<'static>
… wherever needed + modify writer return type to Result<>
… required for their Load/Save impls)
… diskann-providers with a new SingleUseWriteProvider (also for reads with SingleUseReadProvider)
…ad_from_bin<> to only use one read since VectorDataIterator already has the metadata needed for
…ghborProviderAsync
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Suhas! My comments this round are mainly about enum handling. I think we can do this more cleanly by having enums "handle themselves" rather than lifting it into a somewhat fragile interaction in the Save/Load traits.
Other than that, the code inside diskann-record could use considerable hardening and testing.
Do we want to plan out how plugins like VFS would work in this PR, or save that for the future? I ask because VFS is used pretty heavily in tests, so having a replacement that's compliant with this new system will be required. And it will be a good stress test that we can expand the backend serialization to other formats in the future.
| /// load that the tag's presence matches the corresponding [`Load::IS_ENUM`]. | ||
| fn variant(&self) -> Option<Cow<'_, str>> { | ||
| None | ||
| } |
There was a problem hiding this comment.
After thinking about this for longer than I care to admit, I think we can get rid of enum handling entirely at the trait level. If we consider three cases:
enum UnitEnum {
A,
B,
C,
}
enum InlineStruct {
A(f32),
B {
x: String,
y: f32,
}
}
enum Struct {
A(A),
B(B),
}
struct A {
field: f32,
}
struct B {
x: String,
y: f32,
}The serialized representations in JSON could be
{
"$version": "0.0.0",
"A": null,
}
{
$version": "0.0.0",
"B": {
"x": "hello world",
"y": 1,
}
}
{
"$version": "0.0.0",
"A": {
"$version": "0.0.1",
"0": 1,
}
}For versions 1 and 2, there is just a top level version, which makes sense since the structure of these enums evolves in lockstep with the outer enum. For Version 3, this allows nested objects in enums to have distinct versions naturally.
There was a problem hiding this comment.
$variant is a little more explicit in how it handles enums. But since most of our enums fall into one of the three variants you described, I've gone ahead and removed $variant handling.
| enum ErrorInner { | ||
| Light(Kind), | ||
| Heavy(anyhow::Error), | ||
| } |
There was a problem hiding this comment.
While I still really like the Light/Heavy idea, if we're honest it's probably overkill and will likely not get implemented ever.
Instead, the main goal I think is to differentiate crucial errors vs "recoverable" errors, where the latter is used for a probing API. In that context, we'd have something like:
struct Error {
inner: anyhow::Error,
recoverable: bool,
}Most of the constructors would imply critical errors, with probing APIs using an additional (_recoverable) API. This keeps normal code simple and the probing API possible and reasonably efficient.
There was a problem hiding this comment.
It did seem overkill. I've removed it in the latest commit.
| "handle references file {:?} which is not registered in the manifest", | ||
| key, | ||
| ))); | ||
| } |
There was a problem hiding this comment.
Probably worth adding a
if key.contains("..") || key_as_path.is_absolute() {
return Err(/* ... */);
}to keep files inside the directory.
There was a problem hiding this comment.
Good catch. Added it.
| * Licensed under the MIT license. | ||
| */ | ||
|
|
||
| //! Adapters that expose a single in-hand [`Read`]/[`Write`] target as a |
There was a problem hiding this comment.
Should we go ahead and start updating the exiting code to use io::Read/io::Write directly so we don't need this thing?
There was a problem hiding this comment.
By 'existing code', do you mean all code that uses Storage{Read/Write}Provider traits? Not sure I understood your ask here.
There was a problem hiding this comment.
Most of the code that currently takes Storage{Read/Write}Provider does so by taking the provider and a file name and immediately opening the necessary file. These could all be pretty easily migrated to taking std::io::Read and std::io::Write directly, moving the file open up one level and avoiding the need for this hacky one-off provider.
|
|
||
| pub fn get(&self, key: &str) -> Option<&Value<'a>> { | ||
| self.record.get(key) | ||
| } |
There was a problem hiding this comment.
In the vein of moving enums to handling themselves, if we add:
impl<'a> Record<'a> {
fn keys(&self) -> Keys<'_> {
Keys::new(self.record.key())
}
}
struct Keys<'a> {
keys: std::collections::hash_map::Keys<'a, Cow<'a, str>, Value<'a>>,
}
impl<'a> Iterator for Keys<'a> {
// iterate
}
impl ExactSizeIterator for Keys<'_> {}Then enums could have a way of inspecting the keys for loading.
… that wraps anyhow::Error along with a recoverable bool to allow probing APIUs
…-> throws an error now
Reference Issues/PRs
Initial attempt at #737
What does this implement/fix? Briefly explain your changes.
Introduces two new traits (
diskann_record::save::Saveanddiskann_record::load::Load)and a new creatediskann-recordto support file-based serialization + deserialization. This PR also implements the traits forDiskANNIndex(and all nested structs) for a successful prototype.Sample output can be found here.
Any other comments?
This PR is meant to be in a draft state until we're happy with the abstractions and the results.