Add Cython BinaryEncoder for Avro block encoding#3303

Open
rynewang wants to merge 1 commit into apache:main from rynewang:perf/cython-avro-encoder

Conversation

@rynewang

Summary

Mirrors the existing CythonBinaryDecoder (decoder_fast.pyx). The pure-Python BinaryEncoder emits each varint byte as a fresh bytes([x]) allocation plus a stream-write call; the Cython implementation writes into a growable char* buffer with inlined zigzag encoding and memcpy, then materialises once via getvalue().
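For readers unfamiliar with the format being optimised, the zigzag + varint scheme can be sketched in plain Python (an illustration of Avro's int/long encoding, not the PR's code):

```python
def encode_zigzag_varint(value: int) -> bytes:
    """Sketch of Avro zigzag + varint encoding for a 64-bit integer.

    Zigzag maps small-magnitude signed values to small unsigned values
    (0 -> 0, -1 -> 1, 1 -> 2, ...) so the varint stays short.
    """
    # Python ints are arbitrary precision, so this arithmetic right shift
    # behaves like a 64-bit signed shift for in-range int64 inputs.
    n = (value << 1) ^ (value >> 63)
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

The pure-Python encoder performs the equivalent of one `bytes([byte])` allocation and one stream write per loop iteration; the Cython version appends each byte to a single growable `char*` buffer instead.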

Integration

AvroOutputFile.write_block now constructs its in-memory block encoder via a new new_memory_encoder() factory (same pattern as new_decoder()): returns CythonBinaryEncoder when the extension is built, otherwise a thin MemoryBinaryEncoder wrapper around the existing BinaryEncoder + BytesIO. The header/framing encoder (self.encoder) is unchanged — it writes directly to the output stream and is low-volume.
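The factory-with-fallback pattern described above can be sketched as follows; the names follow the PR text, but the module path in the import and the wrapper internals are assumptions, not the PR's actual code:

```python
import io


class _PurePythonBinaryEncoder:
    """Stand-in for the existing pure-Python BinaryEncoder (sketch):
    it writes encoded bytes to whatever stream it is given."""

    def __init__(self, output_stream) -> None:
        self._out = output_stream

    def write(self, b: bytes) -> None:
        self._out.write(b)


class MemoryBinaryEncoder:
    """Thin wrapper pairing the pure-Python encoder with a BytesIO,
    so callers get the same getvalue() interface as the Cython encoder."""

    def __init__(self) -> None:
        self._buffer = io.BytesIO()
        self._encoder = _PurePythonBinaryEncoder(self._buffer)

    def __getattr__(self, name):
        # Delegate write_* calls to the wrapped encoder.
        return getattr(self._encoder, name)

    def getvalue(self) -> bytes:
        return self._buffer.getvalue()


def new_memory_encoder():
    """Return the compiled encoder when the extension is built,
    otherwise fall back to the wrapper (same shape as new_decoder())."""
    try:
        # Hypothetical module path, by analogy with decoder_fast.
        from pyiceberg.avro.encoder_fast import CythonBinaryEncoder
        return CythonBinaryEncoder()
    except ImportError:
        return MemoryBinaryEncoder()
```

The point of the shared factory is that `write_block` never branches on the build configuration itself; it just asks for "an in-memory encoder" and gets the fastest one available.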

Benchmark

Encoding 50k ManifestEntry records (14 columns with full column stats — column_sizes, value_counts, null_value_counts, lower_bounds, upper_bounds), through the real construct_writer tree:

| encoder | wall | throughput | output bytes |
| --- | --- | --- | --- |
| pure Python | 1.64 s | 30.5 k/s | 18,492,808 |
| Cython | 0.36 s | 138.0 k/s | 18,492,808 |

~4.5× at the encoder-leaf level; the remaining time is the Python Writer tree dispatch, which is unchanged.
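The mechanism behind the gap can be demonstrated with a toy comparison (a sketch, not the PR's benchmark harness): one `bytes([x])` allocation plus stream write per byte versus appending into a single growable buffer that is materialised once.

```python
import io
import time


def varint_bytes(n: int):
    """Yield the varint bytes of a non-negative integer."""
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            yield b | 0x80
        else:
            yield b
            return


def encode_per_byte(values):
    # Pure-Python style: fresh bytes() object + stream write per byte.
    stream = io.BytesIO()
    for v in values:
        for b in varint_bytes(v):
            stream.write(bytes([b]))
    return stream.getvalue()


def encode_buffered(values):
    # Cython-style shape: append into one growable buffer, materialise once.
    buf = bytearray()
    for v in values:
        buf.extend(varint_bytes(v))
    return bytes(buf)


values = list(range(50_000))
t0 = time.perf_counter(); a = encode_per_byte(values); t1 = time.perf_counter()
b = encode_buffered(values); t2 = time.perf_counter()
assert a == b  # byte-identical output, as in the PR's benchmark
print(f"per-byte {t1 - t0:.3f}s  buffered {t2 - t1:.3f}s")
```

Even in pure Python the buffered variant is noticeably faster; the Cython encoder goes further by also removing interpreter dispatch from the inner loop.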

Testing

  • tests/avro/test_encoder.py is parametrised over both implementations so every primitive assertion runs against each.
  • New test_int_round_trip covers zigzag edge cases including int64 min/max via encode→new_decoder→assert.
  • New test_encoders_byte_identical asserts both implementations produce identical bytes for a mixed payload.
  • Existing tests/avro/ (171 tests) and tests/utils/test_manifest.py (manifest write/read round-trip) pass.

Notes

  • write_utf8 / write_bytes accept untyped args (matching the pure-Python duck-typed behaviour) since callers pass str-enum values like FileFormat.PARQUET.
  • write_float / write_double use STRUCT_FLOAT.pack (explicit little-endian) rather than raw memcpy, same as the decoder — they're not on the hot path.
  • Zigzag is done on uint64_t to avoid signed-shift UB.
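The last point can be mirrored in Python by emulating the unsigned 64-bit arithmetic (a sketch of the C-side reasoning: in C, left-shifting a negative signed value is undefined behaviour, so the operand is reinterpreted as uint64_t before shifting):

```python
MASK64 = (1 << 64) - 1


def zigzag_u64(value: int) -> int:
    """Zigzag-encode an int64 using only unsigned 64-bit operations,
    i.e. everything stays well-defined modulo 2**64."""
    u = value & MASK64           # reinterpret the signed value as uint64_t
    sign = -(u >> 63) & MASK64   # all-ones if the sign bit is set, else zero
    return ((u << 1) & MASK64) ^ sign
```

Note the sign mask is derived from the top bit with an unsigned shift, not from an arithmetic right shift of the signed value, so no operation here has a signed-overflow counterpart in C.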

