feat(table): deletion-vector writer #997

@laskoviymishka

Parent: #589

Depends on #866 (the DV bitmap reader should land first to settle the codec). v3 prefers DVs over Parquet position-delete files. iceberg-go would be the first non-Java client with a DV writer: pyiceberg writes Parquet position-deletes today, and iceberg-rust hasn't shipped a DV writer either.

Shape:

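// New file: table/internal/dv_writer.go

// DVWriter accumulates one deletion-vector blob per data file.
type DVWriter struct {
    fs    io.IO
    blobs []puffin.Blob
}

// Add records the deleted row positions for a single data file.
func (w *DVWriter) Add(dataFilePath string, positions []int64) error

// Flush writes the accumulated blobs to a Puffin file at location
// and returns the delete-file entry to commit.
func (w *DVWriter) Flush(ctx context.Context, location string) (DataFile, error)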

Serializes a Roaring 64-bit bitmap → Puffin blob (blob_type=deletion-vector-v1) → manifest entry with content=1, file_format=PUFFIN, ReferencedDataFile, ContentOffset, ContentSizeInBytes. Commits through the existing RowDelta.AddDeletes() API so the rest of the producer stack is unchanged.

Add a table property write.delete.format=position|dv. On v3 the default flips to dv; v2 stays on position. When dv, the existing position-delete writer in table/internal/parquet_files.go is bypassed in favor of DVWriter.
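The gating could look roughly like the sketch below. resolveDeleteFormat and the constants are hypothetical names, not existing iceberg-go API. Rejecting dv on a pre-v3 table is an assumption on my part; the proposal above only fixes the defaults.

```go
package main

import "fmt"

const (
	deleteFormatKey      = "write.delete.format" // property name proposed in this issue
	deleteFormatPosition = "position"
	deleteFormatDV       = "dv"
)

// resolveDeleteFormat (hypothetical helper) picks the delete writer for a
// table: an explicit property wins, otherwise v3+ defaults to dv and
// earlier versions default to position.
func resolveDeleteFormat(formatVersion int, props map[string]string) (string, error) {
	v, ok := props[deleteFormatKey]
	if !ok {
		if formatVersion >= 3 {
			return deleteFormatDV, nil
		}
		return deleteFormatPosition, nil
	}
	switch v {
	case deleteFormatPosition:
		return v, nil
	case deleteFormatDV:
		// Assumption: DVs are only valid on v3+ tables, so fail early.
		if formatVersion < 3 {
			return "", fmt.Errorf("%s=%s requires format version 3, got %d",
				deleteFormatKey, v, formatVersion)
		}
		return v, nil
	default:
		return "", fmt.Errorf("invalid %s value %q", deleteFormatKey, v)
	}
}

func main() {
	for _, version := range []int{2, 3} {
		format, err := resolveDeleteFormat(version, nil)
		fmt.Println(version, format, err)
	}
}
```

The producer would call this once per commit and route to either the existing position-delete writer or DVWriter based on the result.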

The scope is large enough that it should land across multiple PRs. One reasonable split: (1) roaring serialization plus the writer skeleton, (2) producer wiring plus property gating, (3) cross-client tests. Discussion of the breakdown is welcome in this thread before any code lands.

Spec: Iceberg deletion vectors, Puffin format, RoaringBitmap serialization. Cross-client coverage: write a DV via iceberg-go, read back by Java/pyiceberg, assert filtered rows match.
