feat(v2.1-rv64): switch to u16 limbs in riscv adapters and modular airs by shuklaayush · Pull Request #2805 · openvm-org/openvm

shuklaayush · 2026-05-21T20:43:07Z

PR 4 of the memory-bus-u16 split, stacked on PR 2. Shrinks adapter trace columns that only reconstruct packed memory-bus payloads, and migrates the algebra modular is_eq chip to u16 limbs.

This PR sits on PR 2 (not #2794) because the PR 2 squash extended the shared vec-heap u16 adapter family and dropped the INT256_NUM_U8_LIMBS from riscv/circuit::adapters; PR 4 builds on those changes.

Why

#2794 already reshaped these adapters' memory-aux columns and bus payloads to packed u16 blocks, but the trace columns that reconstruct those payloads were still byte-shaped:

alu_w / mul_w stored rs1_high / rs2_high as 4 u8 limbs even though the bus consumes them as 2 u16 cells.
loadstore stored rs1_data as 8 u8 limbs even though pointer arithmetic only needs the low 32 bits as 2 u16 cells.
The byte-shaped vec-heap adapters (vec_heap, vec_heap_branch, eq_mod) reconstructed register pointers through byte limbs and bitwise byte-pair range checks.

All of these can store the column as u16 cells directly and replace the high-byte range check with a scaled 16-bit range check on the high u16 cell, removing the BitwiseOperationLookup traffic.

Separately, algebra's modular is_eq chip operates on packed u16 operands and only needs limb-level range checks at LIMB_BITS = 16, so it can migrate to the new u16 adapter introduced in PR 2.

What changes

RISC-V W-adapters and loadstore

Files: extensions/riscv/circuit/src/adapters/{alu_w,mul_w,loadstore,mod}.rs, extensions/riscv/circuit/src/loadstore/{execution,tests}.rs, extensions/riscv/circuit/src/{base_alu_w,divrem_w,mul_w,shift_w}/tests.rs, extensions/riscv/circuit/cuda/include/riscv/adapters/{alu_w,mul_w,loadstore}.cuh.

alu_w / mul_w: rs1_high, rs2_high are packed as two u16 cells; bus reads zero-extend the lower 32 bits and pull u16-shaped upper cells directly from the trace.
loadstore: rs1_data shrinks to RS1_DATA_U16S = 2 u16 cells (the low 32 bits of the rs1 register). Bus reads zero-extend the upper half on the wire.
STORE-to-DEFERRAL_AS is explicitly unsupported in this adapter and the now-dead timed_write_deferral path is removed — feat(v2.1-rv64): switch memory bus blocks to u16 cells #2794 already dropped that opcode, and the remaining call would have instantiated TracingMemory::write::<F, 8>, which violates BLOCK_FE_WIDTH. The F-celled path needs a separate two-message bus shape if it is ever reintroduced.
Shared helpers handle u8-pair packing, low-32-bit u16 composition, high-u16 pointer scaling, and 16-bit immediate sign-extension.
W-suffix tests (base_alu_w, divrem_w, mul_w, shift_w) updated to prank u16 cells where the column is migrated.

Shared heap-data adapters

Files: extensions/riscv-adapters/src/{vec_heap,vec_heap_branch,eq_mod,eq_mod_u16,lib}.rs, extensions/riscv-adapters/cuda/include/riscv-adapters/{constants,vec_heap,vec_heap_branch}.cuh.

vec_heap, vec_heap_branch, and eq_mod materialize register pointer values (rs_val / rd_val) as u16 cells.
Pointer bounds use VariableRangeChecker on the high u16 cell instead of byte-pair BitwiseOperationLookup send_range.
Adds the byte-shaped vec-heap adapter READ_SIZE / WRITE_SIZE const generics so byte-shaped 256-bit chips (BaseAlu256, Mul256, Shift256) and u16-shaped 256-bit chips (LessThan256) can share the same wrapper without losing the block-size information needed for MemoryWriteAuxCols and pointer arithmetic.
Adds Rv64IsEqualModU16Adapter* for the new u16-shaped modular is_eq.
CUDA mirrors the new column shapes and packing helpers.

Algebra modular `is_eq`

Files: extensions/algebra/circuit/src/modular_chip/{is_eq,addsub,muldiv,mod,tests}.rs, extensions/algebra/circuit/src/extension/{modular,hybrid,fp2}.rs, extensions/algebra/circuit/src/{execution,fields,preflight,lib}.rs, extensions/algebra/circuit/src/fp2_chip/{addsub,mod,muldiv,tests}.rs.

LIMB_BITS 8 → 16 where the chip's limbs are only consumed as packed u16 values.
Wires through the new Rv64IsEqualModU16 adapter variant.
Core uses VariableRangeChecker for b_lt_diff - 1 and c_lt_diff - 1 instead of byte-pair bitwise checks.
CPU, hybrid, and CUDA registration paths all use Rv64IsEqualModU16Adapter*.
Negative tests that previously pranked [F; 4] byte arrays now prank [F; 2] u16 cells where the column is migrated; positive coverage preserved.
Mod-builder generated AIRs are untouched.

is_eq executor memory reads were already adjusted in #2794 for MEMORY_BLOCK_BYTES; PR 4 does not rework that API.

Cascade through dependents

Files: extensions/bigint/circuit/src/{lib,extension/mod,tests}.rs, extensions/bigint/circuit/cuda/src/bigint.cu, extensions/ecc/circuit/src/weierstrass_chip/{add_ne,double}/mod.rs.

Bigint type aliases and ecc weierstrass call sites pick up the new READ_SIZE / WRITE_SIZE generic positions on Rv64VecHeapAdapter*. Byte-shaped 256-bit chips pass MEMORY_BLOCK_BYTES; u16-shaped chips pass BLOCK_FE_WIDTH.
Bigint pulls INT256_NUM_U8_LIMBS from the bigint crate root (per PR 2's reviewer feedback) instead of riscv/circuit::adapters.

Migration notes

Rv64VecHeapAdapter* now takes 5 const generics (NUM_READS, BLOCKS_PER_READ, BLOCKS_PER_WRITE, READ_SIZE, WRITE_SIZE). Byte-shaped 256-bit chips: pass MEMORY_BLOCK_BYTES for both sizes. u16-shaped chips: use Rv64VecHeapU16Adapter.
alu_w / mul_w adapter callsites that materialized rs_high bytes should pull u16-shaped upper cells from the new column slot.
loadstore rs1_data columns shrink from 8 u8s to 2 u16 cells; trace-fillers that previously wrote 8 byte cells should write 2 u16 cells (low 32 bits) and let the AIR zero-extend on the bus.
is_eq modular chip callers should switch from Rv64IsEqualModAdapter* to Rv64IsEqualModU16Adapter* and pass LANE_SIZE = BLOCK_FE_WIDTH (= 4 u16 cells per heap-read block).
STORE-to-DEFERRAL_AS is no longer routed through Rv64LoadStoreAdapter; the F-celled deferral-store path was already removed in feat(v2.1-rv64): switch memory bus blocks to u16 cells #2794 and any reintroduction needs a dedicated two-message bus shape.

resolves int-7830, int-7831

github-actions · 2026-05-21T21:26:20Z

Claude finished @shuklaayush's task in 2m 21s —— View job

Code Review: u16-shape W adapters and modular is-eq

Thorough review of all 48 changed files complete. This is a clean, mechanically consistent migration from u8-shaped to u16-shaped trace columns across the W-adapters, loadstore, heap adapters, and modular is_eq chip.

Soundness — Verified Correct

Pointer range check pattern (adapters/mod.rs:91-98). scale_rv64_ptr_high_u16 computes high_u16 * 2^(32 - pointer_max_bits), then a 16-bit range check enforces high_u16 < 2^(pointer_max_bits - 16). Applied consistently across all adapter AIRs and trace fillers. The host-side counterpart scale_rv64_ptr_high_u16_value at line 101 mirrors this correctly for add_count calls.

W-adapter bus packing (alu_w.rs, mul_w.rs). Bus payloads are constructed as [pack_u8_pair(lo0, lo1), pack_u8_pair(lo2, lo3), rs_high[0], rs_high[1]]. The pack_u8_pair function (lo + hi * 256) correctly produces u16 cells matching the memory bus shape. Sign-extension writes use RV64_U16_LIMB_MASK * result_sign (= 0xFFFF * sign) for upper cells — correct for 32-to-64 sign extension at u16 granularity. pack_high_u16 (adapters/mod.rs:60-64) correctly packs 4 raw record bytes into 2 u16 cells via u16::from_le_bytes.

Loadstore rs1_data shrinkage (loadstore.rs). Reducing from 4 u8 cells to RV64_LOW32_U16_LIMBS (2 u16 cells) is correct — only the low 32 bits are used for pointer arithmetic. expand_to_rv64_block zero-pads the upper two cells on the bus. The carry arithmetic (lines 180-189) uses inverse(RV64_U16_LIMB_BASE) instead of inverse(1 << (RV64_CELL_BITS * 2)) — algebraically identical since RV64_U16_LIMB_BASE = 2^16. Range check widths RV64_U16_LIMB_BITS - 3 and pointer_max_bits - RV64_U16_LIMB_BITS are equivalent to the old byte-pair formulation.

sign_extend_imm16 (adapters/mod.rs:108-110). imm + sign * (u32::MAX << 16) computes imm + 0xFFFF0000 when sign == 1. Since imm is a 16-bit value (max 0xFFFF), the maximum result is 0xFFFFFFFF = u32::MAX — no overflow. Matches the old behavior.

DEFERRAL_AS guards (loadstore.rs:416-418 and 468-470). Both unreachable! guards cover the read-path and write-path halves of STORE operations separately — correct and non-redundant.

Modular is_eq range checks (is_eq.rs:255-263). Replacing send_range(b_lt_diff - 1, c_lt_diff - 1) with two independent range_check(..., LIMB_BITS) calls is correct. Each difference is independently in [1, 2^LIMB_BITS], so (diff - 1) ∈ [0, 2^LIMB_BITS). LIMB_BITS is now RV64_U16_LIMB_BITS = 16, matching the u16 limb representation.

bytes_lt_u16_modulus helper (is_eq.rs:767-776). The host-side fast-execute path correctly decomposes each u16 modulus cell into constituent bytes for big-endian byte-level comparison. The u16_modulus_byte helper extracts even-index bytes as cell as u8 and odd-index bytes as (cell >> 8) as u8 — correctly little-endian.

Heap adapters pointer checks. All three migrated adapters (vec_heap.rs, vec_heap_branch.rs, eq_mod.rs) correctly switch from paired BitwiseOperationLookupBus::send_range to individual VariableRangeCheckerBus::range_check calls, with consistent composition via compose_rv64_low32_u16 and bounding via scale_rv64_ptr_high_u16. The u16 variants (vec_heap_u16.rs, vec_heap_branch_u16.rs) are refactored to use the same shared helpers.

eq_mod_u16.rs heap-read pointer offsets (line 159). The new adapter uses MEMORY_BLOCK_BYTES for pointer offsets, which is correct since memory addresses are always byte-addressed regardless of on-wire cell shape.

CUDA Parity — Complete

All .cuh files mirror the Rust column shape changes:

alu_w.cuh and mul_w.cuh use pack_u8_pairs_le to pack record bytes into Fp u16 cells.
loadstore.cuh uses pack_u32_to_u16_cells_le for both rs1_data and mem_ptr_limbs, and calls sign_extend_imm16 from constants.cuh.
vec_heap.cuh and vec_heap_branch.cuh replace BitwiseOperationLookup with VariableRangeChecker and use scale_rv64_ptr_high_u16 from constants.cuh.
The generalized pack_u8_pairs_le in offline_checker.cuh accepts arbitrary even-length byte arrays via static_assert(NUM_BYTES % 2 == 0).
MemoryWriteU16AuxRecord added alongside the existing MemoryWriteBytesAuxRecord typedef.

Minor Observations (non-blocking)

columns_via proxy type (is_eq.rs:73). ModularIsEqualCoreAir uses columns_via(ModularIsEqualCoreCols<u8, READ_LIMBS>). Since READ_LIMBS now counts u16 limbs (e.g., NUM_LIMBS_32_U16 = 16) rather than u8 limbs, the u8 proxy is slightly misleading for documentation purposes. Consider updating to u16 for clarity, though this is cosmetic — columns_via only uses the type for struct layout computation.
run_unsigned_less_than returns NUM_LIMBS on equality (is_eq.rs:789). When x == y, the function returns (false, NUM_LIMBS) as diff_idx. This is safe because the is_eq filler only calls this function and then uses diff_idx to index lt_marker and modulus_limbs — but only when b_cmp / c_cmp is true (i.e., strict less-than holds). When x == y, the subsequent assertions assert!(b_cmp) / assert!(c_cmp) would fire (since b must be strictly less than the modulus), so the out-of-bounds diff_idx is never actually used for indexing. Might be worth a comment or a debug_assert!(diff_idx < NUM_LIMBS) after each call for defensive clarity, but not a bug.
constants.rs deletion (riscv-adapters/src/constants.rs). The u32_to_le_u16_limbs function was cleanly moved to riscv/circuit/src/adapters/mod.rs. All imports updated correctly.
Test coverage is comprehensive. Negative tests properly use pack_u8_pair_u32 for u16-shaped pranking. The new rv64_write_u16_heap_default test utility correctly handles BLOCK_FE_WIDTH-sized writes with U16_CELL_SIZE byte addressing and has a compile-time guard assert!(NUM_LIMBS.is_multiple_of(BLOCK_FE_WIDTH)).
BitwiseOperationLookupChip removal from modular extension (extension/modular.rs). The modular extension wiring no longer creates a bitwise lookup chip for is_eq. This is correct since is_eq was the only modular consumer of that bus; addsub/muldiv already used VariableRangeCheckerBus via the mod-builder.

Verdict

Looks good to merge. Constraint arithmetic is preserved at u16 granularity across all adapters, BitwiseOperationLookup traffic is correctly replaced with VariableRangeChecker calls, CUDA parity is maintained, and test coverage is thorough. No soundness issues found.

github-actions · 2026-05-21T21:35:12Z

group	app.proof_time_ms	app.cycles	leaf.proof_time_ms
fibonacci	1,549	4,000,051	432
keccak	13,796	14,365,133	2,368
sha2_bench	9,217	11,167,961	1,396
regex	1,460	4,090,656	364
ecrecover	471	112,210	266
pairing	592	592,827	255
kitchen_sink	2,140	1,979,971	407

Note: cells_used metrics omitted because CUDA tracegen does not expose unpadded trace heights.

Commit: 53d1b4b

Benchmark Workflow