feat: Deletion vector reader#866
Conversation
laskoviymishka
left a comment
There was a problem hiding this comment.
nice work!
this is a solid foundation for DV read support: envelope parsing is correct, puffin ReadAt integration is clean, and the Java golden test data is a great choice for cross-impl validation.
main concerns:
- OOM from sparse keys — the gap-filling loop in DeserializeRoaringPositionBitmap can allocate billions of empty bitmaps with adversarial input (keys [0, 0xFFFFFFFF]). needs a cap or a map-based representation
- Serialize writes empty gap bitmaps — Java only writes non-empty keys, so Go-serialized output won't match. reader-only for now but it's public — either fix or unexport
- ReadDV cardinality trap — always passes dvFile.Count() which is 0 when unset, rejecting valid DVs
- dvMockDataFile/strPtr missing — tests won't compile, looks like a helper file was left out of the PR
- 4 golden files unused — especially all-container-types-position-index.bin which exercises run containers, the main cross-library bug source
left some comments.
Also general observation - we should add for serialize / write puffin x-engine compatibility e2e tests, this probably beyond this PR scope, but need to keep this in mind once we start to enable DV support.
| const ( | ||
| // DVMagicNumber is the magic number for deletion vectors. | ||
| // Spec bytes: D1 D3 39 64 (big-endian) = 0x6439D3D1 (little-endian uint32) | ||
| DVMagicNumber uint32 = 0x6439D3D1 |
There was a problem hiding this comment.
Will defining the constant this way work for big-endian architectures or do we need to add a conditional compilation etc? Can we just use the actual uint32 value to avoid endian differences?
There was a problem hiding this comment.
change it to
DVMagicNumber uint32 = 1681511377
like here ??
| // - Magic (4 bytes, little-endian): must be 0x6439D3D1 | ||
| // - Bitmap (variable): roaring bitmap in Iceberg portable format | ||
| // - CRC-32 (4 bytes, big-endian): checksum over magic + bitmap |
There was a problem hiding this comment.
it's super weird that the spec jumps back and forth between big and little endian representations....
| } | ||
| defer f.Close() | ||
|
|
||
| reader, err := puffin.NewReader(f) |
There was a problem hiding this comment.
does the puffin reader need a defer reader.Close()?
There was a problem hiding this comment.
puffin reader doesn't own the file IO , it gets it from caller so i would like the caller of puffin file be responsible for closing not puffin reader
let me know if you think otherwise
laskoviymishka
left a comment
There was a problem hiding this comment.
no major concerns from my side.
A few minors to keep in mind for future work on DV:
uint64vs signed-with-guards. If we move touint64later, thepos < 0guards andcount < 0checks go away. Probably the cleaner shape long-term.- X-engine compat e2e tests for serialize/write once DV write support lands. Currently reader-only, so low risk today, but we'll want Spark/Java-produced DVs decoded + Go-produced DVs read back by Spark before enabling write. This is a way bigger item, so need to do a separate research
maxBitmapCountis currently a package constant. If we ever hit a real table that legitimately needs more, it becomes a config knob conversation, worth a brief note in the code on where the number came from.
Signed-off-by: Shreyas220 <shreyas.ny@gmail.com>
Signed-off-by: Shreyas220 <shreyas.ny@gmail.com>
Signed-off-by: Shreyas220 <shreyas.ny@gmail.com>
Signed-off-by: Shreyas220 <shreyas.ny@gmail.com>
2abd5bb to
2d06d13
Compare
No description provided.