Skip to content

feat: Deletion vector reader#866

Open
Shreyas220 wants to merge 8 commits intoapache:mainfrom
Shreyas220:deletion_vector-reader
Open

feat: Deletion vector reader#866
Shreyas220 wants to merge 8 commits intoapache:mainfrom
Shreyas220:deletion_vector-reader

Conversation

@Shreyas220
Copy link
Copy Markdown
Contributor

No description provided.

@Shreyas220 Shreyas220 requested a review from zeroshade as a code owner April 9, 2026 11:28
Copy link
Copy Markdown
Contributor

@laskoviymishka laskoviymishka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work!

this is a solid foundation for DV read support: envelope parsing is correct, puffin ReadAt integration is clean, and the Java golden test data is a great choice for cross-impl validation.

main concerns:

  1. OOM from sparse keys — the gap-filling loop in DeserializeRoaringPositionBitmap can allocate billions of empty bitmaps with adversarial input (keys [0, 0xFFFFFFFF]). needs a cap or a map-based representation
  2. Serialize writes empty gap bitmaps — Java only writes non-empty keys, so Go-serialized output won't match. reader-only for now but it's public — either fix or unexport
  3. ReadDV cardinality trap — always passes dvFile.Count() which is 0 when unset, rejecting valid DVs
  4. dvMockDataFile/strPtr missing — tests won't compile, looks like a helper file was left out of the PR
  5. 4 golden files unused — especially all-container-types-position-index.bin which exercises run containers, the main cross-library bug source

left some comments.

Also general observation - we should add for serialize / write puffin x-engine compatibility e2e tests, this probably beyond this PR scope, but need to keep this in mind once we start to enable DV support.

Comment thread table/deletion_vector.go Outdated
Comment thread table/dv/deletion_vector_test.go
Comment thread manifest.go
Comment thread table/roaring_bitmap.go Outdated
Comment thread table/dv/roaring_bitmap.go Outdated
Comment thread table/dv/roaring_bitmap.go Outdated
Comment thread table/dv/roaring_bitmap.go Outdated
Comment thread table/deletion_vector.go Outdated
Comment thread table/dv/deletion_vector.go
Comment thread table/roaring_bitmap_test.go Outdated
const (
// DVMagicNumber is the magic number for deletion vectors.
// Spec bytes: D1 D3 39 64 (big-endian) = 0x6439D3D1 (little-endian uint32)
DVMagicNumber uint32 = 0x6439D3D1
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will defining the constant this way work for big-endian architectures or do we need to add a conditional compilation etc? Can we just use the actual uint32 value to avoid endian differences?

Copy link
Copy Markdown
Contributor Author

@Shreyas220 Shreyas220 Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread table/dv/deletion_vector.go Outdated
Comment on lines +46 to +48
// - Magic (4 bytes, little-endian): must be 0x6439D3D1
// - Bitmap (variable): roaring bitmap in Iceberg portable format
// - CRC-32 (4 bytes, big-endian): checksum over magic + bitmap
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's super weird that the spec jumps back and forth between big and little endian representations....

}
defer f.Close()

reader, err := puffin.NewReader(f)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the puffin reader need a defer reader.Close()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

puffin reader doesn't own the file IO , it gets it from caller so i would like the caller of puffin file be responsible for closing not puffin reader

let me know if you think otherwise

Comment thread table/dv/roaring_bitmap.go Outdated
Comment thread table/dv/roaring_bitmap.go
Comment thread table/dv/roaring_bitmap.go Outdated
Comment thread table/dv/roaring_bitmap.go Outdated
Comment thread table/dv/roaring_bitmap.go Outdated
Comment thread manifest.go
Copy link
Copy Markdown
Contributor

@laskoviymishka laskoviymishka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: no major concerns from my side.

A few minors to keep in mind for future work on DV:

  1. uint64 vs signed-with-guards. If we move to uint64 later, the pos < 0 guards and count < 0 checks go away. Probably the cleaner shape long-term.
  2. X-engine compat e2e tests for serialize/write once DV write support lands. Currently reader-only, so low risk today, but we'll want Spark/Java-produced DVs decoded + Go-produced DVs read back by Spark before enabling write. This is a way bigger item, so need to do a separate research
  3. maxBitmapCount is currently a package constant. If we ever hit a real table that legitimately needs more, it becomes a config knob conversation, worth a brief note in the code on where the number came from.

Signed-off-by: Shreyas220 <shreyas.ny@gmail.com>
Signed-off-by: Shreyas220 <shreyas.ny@gmail.com>
Signed-off-by: Shreyas220 <shreyas.ny@gmail.com>
Signed-off-by: Shreyas220 <shreyas.ny@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants