Add Cython BinaryEncoder for Avro block encoding#3303

Open
rynewang wants to merge 1 commit into apache:main from rynewang:perf/cython-avro-encoder

Conversation

@rynewang

Summary

Mirrors the existing CythonBinaryDecoder (decoder_fast.pyx). The pure-Python BinaryEncoder emits each varint byte as a fresh bytes([x]) allocation plus a stream-write call; the Cython implementation writes into a growable char* buffer with inlined zigzag encoding and memcpy, then materialises once via getvalue().
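For readers unfamiliar with the format being optimised, the zigzag + varint scheme can be sketched in plain Python (an illustration of Avro's int/long encoding, not the PR's code):

```python
def encode_zigzag_varint(value: int) -> bytes:
    """Sketch of Avro zigzag + varint encoding for a 64-bit integer.

    Zigzag maps small-magnitude signed values to small unsigned values
    (0 -> 0, -1 -> 1, 1 -> 2, ...) so the varint stays short.
    """
    # Python ints are arbitrary precision, so this arithmetic right shift
    # behaves like a 64-bit signed shift for in-range int64 inputs.
    n = (value << 1) ^ (value >> 63)
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

The pure-Python encoder performs the equivalent of one `bytes([byte])` allocation and one stream write per loop iteration; the Cython version appends each byte to a single growable `char*` buffer instead.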

Integration

AvroOutputFile.write_block now constructs its in-memory block encoder via a new new_memory_encoder() factory (same pattern as new_decoder()): returns CythonBinaryEncoder when the extension is built, otherwise a thin MemoryBinaryEncoder wrapper around the existing BinaryEncoder + BytesIO. The header/framing encoder (self.encoder) is unchanged — it writes directly to the output stream and is low-volume.
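The factory-with-fallback pattern described above can be sketched as follows; the names follow the PR text, but the module path in the import and the wrapper internals are assumptions, not the PR's actual code:

```python
import io


class _PurePythonBinaryEncoder:
    """Stand-in for the existing pure-Python BinaryEncoder (sketch):
    it writes encoded bytes to whatever stream it is given."""

    def __init__(self, output_stream) -> None:
        self._out = output_stream

    def write(self, b: bytes) -> None:
        self._out.write(b)


class MemoryBinaryEncoder:
    """Thin wrapper pairing the pure-Python encoder with a BytesIO,
    so callers get the same getvalue() interface as the Cython encoder."""

    def __init__(self) -> None:
        self._buffer = io.BytesIO()
        self._encoder = _PurePythonBinaryEncoder(self._buffer)

    def __getattr__(self, name):
        # Delegate write_* calls to the wrapped encoder.
        return getattr(self._encoder, name)

    def getvalue(self) -> bytes:
        return self._buffer.getvalue()


def new_memory_encoder():
    """Return the compiled encoder when the extension is built,
    otherwise fall back to the wrapper (same shape as new_decoder())."""
    try:
        # Hypothetical module path, by analogy with decoder_fast.
        from pyiceberg.avro.encoder_fast import CythonBinaryEncoder
        return CythonBinaryEncoder()
    except ImportError:
        return MemoryBinaryEncoder()
```

The point of the shared factory is that `write_block` never branches on the build configuration itself; it just asks for "an in-memory encoder" and gets the fastest one available.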

Benchmark

Encoding 50k ManifestEntry records (14 columns with full column stats — column_sizes, value_counts, null_value_counts, lower_bounds, upper_bounds), through the real construct_writer tree:

| encoder | wall | throughput | output bytes |
| --- | --- | --- | --- |
| pure Python | 1.64 s | 30.5 k/s | 18,492,808 |
| Cython | 0.36 s | 138.0 k/s | 18,492,808 |

~4.5× at the encoder-leaf level; the remaining time is the Python Writer tree dispatch, which is unchanged.
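The mechanism behind the gap can be demonstrated with a toy comparison (a sketch, not the PR's benchmark harness): one `bytes([x])` allocation plus stream write per byte versus appending into a single growable buffer that is materialised once.

```python
import io
import time


def varint_bytes(n: int):
    """Yield the varint bytes of a non-negative integer."""
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            yield b | 0x80
        else:
            yield b
            return


def encode_per_byte(values):
    # Pure-Python style: fresh bytes() object + stream write per byte.
    stream = io.BytesIO()
    for v in values:
        for b in varint_bytes(v):
            stream.write(bytes([b]))
    return stream.getvalue()


def encode_buffered(values):
    # Cython-style shape: append into one growable buffer, materialise once.
    buf = bytearray()
    for v in values:
        buf.extend(varint_bytes(v))
    return bytes(buf)


values = list(range(50_000))
t0 = time.perf_counter(); a = encode_per_byte(values); t1 = time.perf_counter()
b = encode_buffered(values); t2 = time.perf_counter()
assert a == b  # byte-identical output, as in the PR's benchmark
print(f"per-byte {t1 - t0:.3f}s  buffered {t2 - t1:.3f}s")
```

Even in pure Python the buffered variant is noticeably faster; the Cython encoder goes further by also removing interpreter dispatch from the inner loop.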

Testing

  • tests/avro/test_encoder.py is parametrised over both implementations so every primitive assertion runs against each.
  • New test_int_round_trip covers zigzag edge cases including int64 min/max via encode→new_decoder→assert.
  • New test_encoders_byte_identical asserts both implementations produce identical bytes for a mixed payload.
  • Existing tests/avro/ (171 tests) and tests/utils/test_manifest.py (manifest write/read round-trip) pass.

Notes

  • write_utf8 / write_bytes accept untyped args (matching the pure-Python duck-typed behaviour) since callers pass str-enum values like FileFormat.PARQUET.
  • write_float / write_double use STRUCT_FLOAT.pack (explicit little-endian) rather than raw memcpy, same as the decoder — they're not on the hot path.
  • Zigzag is done on uint64_t to avoid signed-shift UB.
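The last point can be mirrored in Python by emulating the unsigned 64-bit arithmetic (a sketch of the C-side reasoning: in C, left-shifting a negative signed value is undefined behaviour, so the operand is reinterpreted as uint64_t before shifting):

```python
MASK64 = (1 << 64) - 1


def zigzag_u64(value: int) -> int:
    """Zigzag-encode an int64 using only unsigned 64-bit operations,
    i.e. everything stays well-defined modulo 2**64."""
    u = value & MASK64           # reinterpret the signed value as uint64_t
    sign = -(u >> 63) & MASK64   # all-ones if the sign bit is set, else zero
    return ((u << 1) & MASK64) ^ sign
```

Note the sign mask is derived from the top bit with an unsigned shift, not from an arithmetic right shift of the signed value, so no operation here has a signed-overflow counterpart in C.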

