Add Cython BinaryEncoder for Avro block encoding #3303
Open
rynewang wants to merge 1 commit into apache:main from
Conversation
Summary

Mirrors the existing `CythonBinaryDecoder` (`decoder_fast.pyx`). The pure-Python `BinaryEncoder` emits each varint byte as a fresh `bytes([x])` allocation plus a stream-write call; the Cython implementation writes into a growable `char*` buffer with inlined zigzag encoding and `memcpy`, then materialises once via `getvalue()`.
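For reference, a minimal pure-Python sketch of the zigzag + varint scheme both implementations share (illustrative only, not the PR's code):

```python
def encode_long(n: int) -> bytes:
    """Avro long encoding: zigzag the sign into bit 0, then emit 7 bits
    per byte, low-order group first, high bit set on continuation bytes."""
    z = (n << 1) ^ (n >> 63)  # zigzag; Python's arithmetic >> handles negatives
    out = bytearray()
    while z & ~0x7F:          # more than 7 bits remain
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

assert encode_long(0) == b"\x00"
assert encode_long(-1) == b"\x01"
assert encode_long(1) == b"\x02"
assert encode_long(64) == b"\x80\x01"
```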
Integration

`AvroOutputFile.write_block` now constructs its in-memory block encoder via a new `new_memory_encoder()` factory (same pattern as `new_decoder()`): it returns `CythonBinaryEncoder` when the extension is built, otherwise a thin `MemoryBinaryEncoder` wrapper around the existing `BinaryEncoder` + `BytesIO`. The header/framing encoder (`self.encoder`) is unchanged; it writes directly to the output stream and is low-volume.
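The selection pattern can be sketched as follows (the import path and class names follow the PR description but are assumptions, not verified against the codebase; the fallback class here is a stand-in):

```python
import io


class MemoryBinaryEncoderFallback:
    """Illustrative stand-in for the pure-Python wrapper: writes into a
    BytesIO and materialises the block via getvalue()."""

    def __init__(self) -> None:
        self._buf = io.BytesIO()

    def write(self, b: bytes) -> None:
        self._buf.write(b)

    def getvalue(self) -> bytes:
        return self._buf.getvalue()


def new_memory_encoder():
    # Same pattern the PR describes for new_decoder(): prefer the compiled
    # extension when it imports cleanly, otherwise fall back to pure Python.
    try:
        from pyiceberg.avro.encoder_fast import CythonBinaryEncoder  # hypothetical path
        return CythonBinaryEncoder()
    except ImportError:
        return MemoryBinaryEncoderFallback()


encoder = new_memory_encoder()
```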
Benchmark

Encoding 50k `ManifestEntry` records (14 columns with full column stats: `column_sizes`, `value_counts`, `null_value_counts`, `lower_bounds`, `upper_bounds`) through the real `construct_writer` tree goes from 1.64 s to 0.36 s, roughly 4.5× at the encoder-leaf level, with byte-identical output. The remaining time is the Python `Writer` tree dispatch, which is unchanged.
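The mechanism behind the speedup (one allocation plus one stream write per byte versus a single growable buffer materialised once) can be micro-benchmarked in isolation, independent of Avro. This is an illustrative harness, not the PR's benchmark:

```python
import io
import timeit

PAYLOAD = list(range(256)) * 200  # 51_200 single-byte writes


def per_byte_stream() -> bytes:
    # Pure-Python BinaryEncoder style: fresh bytes([x]) + stream write per byte.
    buf = io.BytesIO()
    for x in PAYLOAD:
        buf.write(bytes([x]))
    return buf.getvalue()


def growable_buffer() -> bytes:
    # Buffered style: append into one growable buffer, materialise once.
    out = bytearray()
    for x in PAYLOAD:
        out.append(x)
    return bytes(out)


assert per_byte_stream() == growable_buffer()  # byte-identical output
print("per-byte:", timeit.timeit(per_byte_stream, number=20))
print("buffered:", timeit.timeit(growable_buffer, number=20))
```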
Testing

`tests/avro/test_encoder.py` is parametrised over both implementations, so every primitive assertion runs against each. `test_int_round_trip` covers zigzag edge cases, including `int64` min/max, via encode → `new_decoder` → assert. `test_encoders_byte_identical` asserts that both implementations produce identical bytes for a mixed payload. `tests/avro/` (171 tests) and `tests/utils/test_manifest.py` (manifest write/read round-trip) pass.
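The int64-boundary round-trip idea can be sketched standalone (hypothetical helper names; the real tests go through the actual encoder and decoder classes):

```python
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1


def encode_long(n: int) -> bytes:
    z = (n << 1) ^ (n >> 63)          # zigzag
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(data: bytes) -> int:
    z, shift = 0, 0
    for b in data:
        z |= (b & 0x7F) << shift
        shift += 7
    return (z >> 1) ^ -(z & 1)        # un-zigzag

# Round-trip across zigzag edge cases, including the int64 boundaries.
for v in (0, -1, 1, 63, -64, INT64_MIN, INT64_MAX):
    assert decode_long(encode_long(v)) == v
```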
Notes

- `write_utf8`/`write_bytes` accept untyped args (matching the pure-Python duck-typed behaviour), since callers pass `str`-enum values like `FileFormat.PARQUET`.
- `write_float`/`write_double` use `STRUCT_FLOAT.pack` (explicit little-endian) rather than raw `memcpy`, same as the decoder; they are not on the hot path.
- The zigzag shift is performed on `uint64_t` to avoid signed-shift UB.
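The float/double point corresponds to Python's `struct` module with an explicit little-endian format character; a small sketch (the constant names follow the PR text but are assumptions):

```python
import struct

STRUCT_FLOAT = struct.Struct("<f")    # explicit little-endian float32
STRUCT_DOUBLE = struct.Struct("<d")   # explicit little-endian float64

# IEEE 754: 1.0 is 0x3F800000 as float32 and 0x3FF0000000000000 as float64,
# so packing little-endian puts the low-order zero bytes first.
assert STRUCT_FLOAT.pack(1.0) == b"\x00\x00\x80\x3f"
assert STRUCT_DOUBLE.pack(1.0) == b"\x00\x00\x00\x00\x00\x00\xf0\x3f"
```

Using `Struct.pack` keeps the byte order platform-independent, whereas a raw `memcpy` of a C `float`/`double` would depend on host endianness.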