Perf/shm latency and compiler optimizations #116
…tation
- Add .cargo/config.toml with target-cpu=native for optimal codegen
- Replace nix crate clock_gettime wrapper with direct libc call
- Capture receive timestamp inside receive_blocking() immediately after condvar wake, matching C measurement point
- Eliminate redundant zero-fill: vec![0u8;N] → Vec::with_capacity + set_len
- Replace per-byte ring buffer copies with copy_nonoverlapping in blocking path
- Add #[inline] hints to all hot-path SHM functions
- Server loop uses transport-captured timestamp when available
Reduces SHM direct mode mean latency by ~12.5% (23.4µs → 20.5µs), narrowing gap vs reference C benchmark from 25% to ~10%.
Co-authored-by: Cursor <cursoragent@cursor.com>
Update comment on the receive-side clock_gettime call to better describe why it's captured inside the mutex (matches reference C approach for accurate latency measurement). Co-authored-by: Cursor <cursoragent@cursor.com>
Document the rationale behind each optimization with inline comments:
- .cargo/config.toml: explain target-cpu=native and portability note
- mod.rs: explain direct libc vs nix crate clock_gettime, receive_time_ns field
- shared_memory_direct.rs: send/receive timestamp placement, zero-fill elimination
- shared_memory_blocking.rs: bulk copy_nonoverlapping vs byte-by-byte with before/after
- shared_memory.rs: inline hints on ring buffer hot-path functions
- main.rs: transport-level timestamp preference in both server loops
Co-authored-by: Cursor <cursoragent@cursor.com>
📈 Changed lines coverage: 93.33% (28/30)🚨 Uncovered lines in this PR
📊 Code Coverage Summary
- Collapse short copy_nonoverlapping calls to single line in shared_memory_blocking.rs
- Remove extra blank line in shared_memory_direct.rs
Co-authored-by: Cursor <cursoragent@cursor.com>
f5fdaa1 to 14193bd
dustinblack
left a comment
Technically sound changes — the optimizations are well-motivated and the benchmark data supports them. A few items to discuss:
Looks good
- Direct libc::clock_gettime with #[inline] eliminates wrapper overhead on every message
- receive_time_ns field with #[serde(skip)] is a clean way to pass transport-level timing without changing the wire format or receive_blocking() API
- Timestamp capture inside the mutex in SHM-direct matches the reference C program's measurement point exactly; this is the key fix for the latency gap
- Bulk copy_nonoverlapping with wrap-around handling replaces the per-byte modulo loop correctly
- Zero-fill elimination (Vec::with_capacity + set_len) is safe: copy_nonoverlapping writes every byte before set_len is called
- Server fallback (if receive_time_ns != 0) preserves behavior for all non-SHM transports
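The zero-fill point in the list above can be illustrated with a small standalone sketch. This is not the PR's code: the helper name copy_payload and the local array standing in for the shared-memory payload are assumptions made for illustration. It only shows the ordering that makes the pattern sound, where every byte is written before set_len is called.

```rust
use std::ptr;

/// Hypothetical helper: copy `len` bytes from `src` into a fresh Vec
/// without zero-filling the allocation first.
///
/// # Safety
/// `src` must be valid for reads of `len` bytes.
unsafe fn copy_payload(src: *const u8, len: usize) -> Vec<u8> {
    // Allocates capacity only; the bytes are left uninitialized.
    let mut buf: Vec<u8> = Vec::with_capacity(len);
    unsafe {
        // A freshly allocated buffer cannot overlap `src`, and it has
        // room for at least `len` bytes.
        ptr::copy_nonoverlapping(src, buf.as_mut_ptr(), len);
        // Only now are the first `len` bytes initialized, so the length
        // may be raised.
        buf.set_len(len);
    }
    buf
}

fn main() {
    // Stand-in for a payload living in shared memory.
    let shm = [1u8, 2, 3, 4, 5];
    let v = unsafe { copy_payload(shm.as_ptr(), shm.len()) };
    assert_eq!(v, shm);
    println!("{:?}", v);
}
```

Calling set_len before the copy would expose uninitialized bytes, which is why the order matters.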
Items to discuss
- target-cpu=native in .cargo/config.toml: This is a project-wide build setting that affects all contributors and CI. Anyone who clones and builds gets non-portable binaries without realizing it. Consider whether this belongs in the repo (affecting everyone) or in the deployment/CI workflow for target hardware builds.
- SHM-direct throughput regression (22-31% at small messages): The clock_gettime inside the mutex extends the critical section. This is acknowledged and doesn't affect latency-focused tests with send-delay, but it's a real regression for throughput benchmarks. Is this acceptable as-is, or should the conditional timestamp placement (inside/outside mutex based on test type) be addressed before merge?
- Error handling on clock_gettime: The old code had a fallback if clock_gettime failed. The new code ignores the return value. CLOCK_MONOTONIC essentially never fails on Linux, but a debug_assert! on the return value would be cheap insurance.
Relates to #117.
dustinblack
left a comment
Follow-up on the target-cpu=native concern — after thinking through our CI and release workflows more carefully:
The problem: .cargo/config.toml applies to every build, including CI. If CI eventually cross-compiles for ARM on x86 runners, target-cpu=native either gets silently ignored (producing a generic ARM binary, not an optimized one) or could cause unexpected behavior — it definitely won't produce S32G-optimized code. If CI runs on ARM (e.g., Graviton), the binary would be optimized for that specific ARM CPU, not the target platform, and might use instructions that the S32G's Cortex-A53 doesn't support.
Since we'll be testing across multiple ARM platforms (NXP S32G, Qualcomm Ride SX4, Renesas R-Car S4, etc.), and eventually building release binaries in CI, this setting needs to stay out of the repo-wide config. The right approach:
- Remove target-cpu=native from .cargo/config.toml so default builds are portable (generic aarch64)
- Apply it at build time when building directly on target hardware: RUSTFLAGS="-C target-cpu=native" cargo build --release
- For CI cross-compile jobs, specify the exact CPU target per platform: RUSTFLAGS="-C target-cpu=cortex-a53" cargo build --release --target aarch64-unknown-linux-gnu
- Document the recommended build commands for on-target vs cross-compiled optimized builds
The other optimizations in this PR (timestamp placement, bulk copies, zero-fill elimination, direct libc clock_gettime) are all pure code improvements that don't depend on this flag. They should provide the bulk of the latency improvement regardless of CPU target.
…ttime error handling
Move SHM-direct receive timestamp inside/outside mutex based on --send-delay: latency benchmarks (send-delay > 0) capture inside the mutex for accuracy matching the reference C implementation; throughput benchmarks (no send-delay) capture after mutex unlock to eliminate the 22-31% regression at small message sizes. The flag is derived automatically with no new user-facing CLI options.
Add debug_assert! on all raw clock_gettime return values as cheap insurance against silent failures.
Remove .cargo/config.toml (target-cpu=native) to restore binary portability across CPU variants.
Co-authored-by: Cursor <cursoragent@cursor.com>
The file was deleted. It no longer exists in the repo. The target-cpu=native flag now only appears in CONFIG.md as a documentation example for manual performance builds, not as a project-wide build setting.
A precise_timestamps boolean flag was added to BlockingSharedMemoryDirect. It is derived automatically from --send-delay:
- Latency benchmarks (send-delay > 0): timestamp captured inside the mutex for accuracy matching the reference C implementation.
- Throughput benchmarks (no send-delay): timestamp captured after the mutex unlock, which removes the 22-31% regression at small message sizes.
debug_assert! was added on all raw clock_gettime return values. Both call sites now have it:
- mod.rs
- shared_memory_direct.rs
Documentation added:
- README.md, "SHM-Direct Conditional Timestamp Placement": explains the adaptive inside/outside-mutex timestamp behavior
- README.md, "CPU-Optimized Builds" section: explains why .cargo/config.toml was removed (portability, CI cross-compilation risks)
- CONFIG.md, updated "Rust Compiler Optimizations": added a callout box explaining the rationale for not shipping target-cpu=native in repo config
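To make the inside/outside-mutex choice concrete, here is a minimal sketch using std::sync::Mutex and Condvar in place of the process-shared pthread primitives. Only the precise_timestamps flag and the --send-delay rule come from the discussion above; the Slot and Receiver types, the precise_from_send_delay helper, and the Instant-based clock are assumptions made for illustration.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical rule: precise timestamps only when a non-zero send delay is set.
fn precise_from_send_delay(send_delay: Option<Duration>) -> bool {
    send_delay.is_some_and(|d| !d.is_zero())
}

/// Hypothetical stand-in for the slot that lives in shared memory.
struct Slot {
    ready: bool,
    payload: Vec<u8>,
}

/// Hypothetical receiver; only `precise_timestamps` is named in the PR.
struct Receiver {
    shared: Arc<(Mutex<Slot>, Condvar)>,
    precise_timestamps: bool,
    epoch: Instant,
}

impl Receiver {
    /// Returns the payload and a receive timestamp in nanoseconds since `epoch`.
    fn receive_blocking(&self) -> (Vec<u8>, u64) {
        let (lock, cvar) = &*self.shared;
        let mut slot = lock.lock().unwrap();
        while !slot.ready {
            slot = cvar.wait(slot).unwrap();
        }
        // Latency mode: timestamp right after the wake-up, lock still held.
        let inside = if self.precise_timestamps {
            Some(self.epoch.elapsed().as_nanos() as u64)
        } else {
            None
        };
        let payload = std::mem::take(&mut slot.payload);
        slot.ready = false;
        drop(slot); // release the mutex before doing anything else
        // Throughput mode: timestamp only after the unlock, so the clock
        // read does not lengthen the critical section.
        let ts = inside.unwrap_or_else(|| self.epoch.elapsed().as_nanos() as u64);
        (payload, ts)
    }
}

fn main() {
    let slot = Slot { ready: false, payload: Vec::new() };
    let shared = Arc::new((Mutex::new(slot), Condvar::new()));
    let rx = Receiver {
        shared: Arc::clone(&shared),
        precise_timestamps: precise_from_send_delay(Some(Duration::from_millis(10))),
        epoch: Instant::now(),
    };
    let tx = thread::spawn(move || {
        let (lock, cvar) = &*shared;
        let mut slot = lock.lock().unwrap();
        slot.payload = vec![1, 2, 3];
        slot.ready = true;
        cvar.notify_one();
    });
    let (payload, ts) = rx.receive_blocking();
    tx.join().unwrap();
    assert_eq!(payload, vec![1, 2, 3]);
    println!("received {} bytes at {} ns", payload.len(), ts);
}
```

The trade-off is visible in the two branches: the inside-lock read measures arrival more faithfully, while the after-unlock read keeps the critical section as short as possible.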
…d target-cpu=native rationale
- Add 3 unit tests for BlockingSharedMemoryDirect::with_precise_timestamps(): constructor flag verification (true/false) and end-to-end receive with precise_timestamps=true exercising the inside-mutex timestamp code path
- Add factory test verifying send_delay variants (None, ZERO, 10ms) are accepted when creating SHM-direct transports
- Document SHM-direct conditional timestamp placement in README: adaptive inside/outside-mutex receive timestamp based on --send-delay, with latency vs throughput tradeoff explanation (22-31% regression context)
- Document CPU-optimized builds in README: rationale for removing .cargo/config.toml (portability, CI cross-compilation risks across NXP S32G/Qualcomm Ride SX4/Renesas R-Car S4), on-target builds with RUSTFLAGS="-C target-cpu=native", per-platform cross-compile examples
- Update CONFIG.md Rust Compiler Optimizations section with callout explaining why target-cpu=native must not be in repo-wide config, add cross-compile example and link to README
- Fix pre-existing clippy lint: map_or -> is_some_and on send_delay wiring
- All tests passing, clippy clean, cargo fmt applied
AI-assisted-by: Claude Opus 4 (Anthropic)
📈 Changed lines coverage: 87.34% (69/79)🚨 Uncovered lines in this PR
📊 Code Coverage Summary
Branch: perf/shm-latency-and-compiler-optimizations
Base: main
Created: 2026-05-05
Target Hardware: NXP S32G (Cortex-A53, aarch64, 8 cores, 16 GB RAM)
Goal: Close the ~25% SHM one-way latency gap between rusty-comms (RC) and reference's C SHM implementation, while avoiding regressions in other IPC mechanisms.
Background
The reference C SHM program uses POSIX shared memory with a pthread_mutex + pthread_cond synchronization pattern and captures latency timestamps immediately after pthread_cond_wait returns (inside the mutex critical section). The rusty-comms SHM-direct implementation uses the same pattern, but had the following sources of overhead that inflated measured latency:
- The receive timestamp was captured in main.rs after receive_blocking() returned, that is, after the mutex unlock, heap allocation, payload copy, and Message struct construction. This added ~5-10 µs to the measured latency compared to reference C, which timestamps inside the mutex.
- The receive path zero-filled a Vec<u8> before immediately overwriting it with the SHM payload. The blocking ring buffer implementation also copied data byte-by-byte instead of using bulk memcpy.
- The get_monotonic_time_ns() function used the nix crate's clock_gettime wrapper, which adds function call overhead, a TimeSpec struct allocation, and Result error handling on every invocation.
Commits
- 135d4ba: perf: SHM latency optimizations - close gap with reference C implementation (all code changes)
- 30aad75: perf: clarify receive timestamp comment in SHM-direct (comment-only update)
Changes (6 files, +174 / -28 lines)
1. .cargo/config.toml (new file, 2 lines)
What: Added target-cpu=native via rustflags.
Why: By default, Rust/LLVM compiles for a generic aarch64 target. With target-cpu=native, LLVM emits instructions optimized for the actual Cortex-A53 on the S32G: wider loads/stores, better scheduling, and improved auto-vectorization. This benefits all mechanisms, not just SHM.
How: Created a new .cargo/config.toml file that sets the flag through rustflags.
Note: The existing Cargo.toml already had lto = true, codegen-units = 1, and panic = "abort" in the [profile.release] section (these were on main before this branch). The target-cpu=native flag complements these by telling LLVM which specific CPU microarchitecture to target.
2. src/ipc/mod.rs: Direct libc clock_gettime + receive_time_ns field (+20 / -11 lines)
What (clock): Replaced the nix crate's clock_gettime wrapper in get_monotonic_time_ns() with a direct libc::clock_gettime(CLOCK_MONOTONIC) call. Added #[inline] to the function.
Why (clock): The nix crate wrapper involves: (a) creating a ClockId enum value, (b) calling nix::time::clock_gettime() which allocates a TimeSpec on the stack, (c) calling into libc, (d) wrapping the result in a Result<TimeSpec, Errno>, (e) pattern-matching the Ok variant to extract seconds and nanoseconds. The direct libc call skips all of that: it writes directly into a stack-allocated libc::timespec struct and returns. The #[inline] attribute is a hint that encourages the compiler to remove the function call overhead at the call site; it is not a guarantee.
How (clock): The function body was rewritten from:
to:
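As an illustration only (this is not the PR's actual diff), a direct call with the debug_assert! suggested in review might look like the sketch below. To stay dependency-free it declares the C function by hand instead of using the libc crate's bindings, and it assumes 64-bit Linux, where time_t is 64 bits wide and CLOCK_MONOTONIC is 1. The Timespec type is a hypothetical local mirror of struct timespec.

```rust
use std::os::raw::{c_int, c_long};

// Assumption: 64-bit Linux layout of `struct timespec`.
// Real code would use libc::timespec from the libc crate.
#[repr(C)]
struct Timespec {
    tv_sec: i64,
    tv_nsec: c_long,
}

// Assumption: value of CLOCK_MONOTONIC on Linux.
const CLOCK_MONOTONIC: c_int = 1;

// Hand-written binding to the C library function; real code would call
// libc::clock_gettime instead.
unsafe extern "C" {
    fn clock_gettime(clk_id: c_int, tp: *mut Timespec) -> c_int;
}

#[inline]
fn get_monotonic_time_ns() -> u64 {
    let mut ts = Timespec { tv_sec: 0, tv_nsec: 0 };
    // SAFETY: `ts` is a valid, writable timespec for the duration of the call.
    let rc = unsafe { clock_gettime(CLOCK_MONOTONIC, &mut ts) };
    // Checked in debug builds only, compiled out in release builds.
    debug_assert_eq!(rc, 0, "clock_gettime(CLOCK_MONOTONIC) failed");
    (ts.tv_sec as u64) * 1_000_000_000 + ts.tv_nsec as u64
}

fn main() {
    let a = get_monotonic_time_ns();
    let b = get_monotonic_time_ns();
    assert!(b >= a);
    println!("{} ns between two reads", b - a);
}
```

There is no Result or wrapper type on this path: the only work is one call and one multiply-add.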
The OffsetDateTime import was made conditional on #[cfg(not(unix))] since it is only needed for the non-Unix fallback path.
What (receive_time_ns): Added a new receive_time_ns: u64 field to the Message struct, annotated with #[serde(skip, default)].
Why (receive_time_ns): This field allows transport implementations to capture a receive timestamp deep inside their receive path (e.g., immediately after pthread_cond_wait returns in SHM-direct) and pass it up to the server loop. Without this, the server loop had to call get_monotonic_time_ns() after the entire receive_blocking() call returned, at which point the mutex was already unlocked, the payload was already heap-allocated and copied, and several microseconds had elapsed since the actual message arrival.
How (receive_time_ns): The field was added to the Message struct definition. The #[serde(skip)] annotation ensures this field is never serialized to or deserialized from the wire format (bincode), maintaining backward compatibility. The default attribute initializes it to 0 on deserialization. Both Message::new() and Message::new_for_blocking() constructors were updated to initialize receive_time_ns: 0.
3. src/ipc/shared_memory_direct.rs: Timestamp capture + allocation elimination (+15 / -4 lines)
This is the primary file for closing the reference latency gap, as SHM-direct is the transport used in the reference C benchmark comparison.
What (receive timestamp): receive_blocking() now calls crate::ipc::get_monotonic_time_ns() immediately after pthread_cond_wait returns and the ready flag is confirmed, while still holding the mutex. The result is stored in the receive_time_ns field of the returned Message.
Why (receive timestamp): This is exactly what reference's C implementation does: it captures clock_gettime(CLOCK_MONOTONIC) inside the mutex immediately after the condvar wake-up. Previously, the RC timestamp was captured in main.rs after receive_blocking() returned, which includes: reading all fields from SHM, allocating a Vec<u8>, copying the payload, signaling the sender, unlocking the mutex, constructing the Message struct, and returning up the call stack. All of that added ~5-10 µs to the measured latency that reference C does not incur.
What (zero-fill elimination): Changed vec![0u8; payload_len] to Vec::with_capacity(payload_len) + std::ptr::copy_nonoverlapping + set_len(payload_len).
Why (zero-fill elimination): The vec![0u8; N] macro allocates N bytes and then zero-fills them with memset. Since the very next operation is copy_nonoverlapping, which overwrites every byte, the zero-fill is redundant. Vec::with_capacity allocates without initializing, and set_len tells Rust the buffer is now valid after the copy. This saves a memset call on every received message.
What (inline hints): Added #[inline] to get_raw_message_ptr(), send_blocking(), and receive_blocking().
Why (inline hints): These functions are called on every message send/receive. Inlining eliminates function call overhead and allows LLVM to optimize across the call boundary (e.g., keeping the SHM pointer in a register across multiple field reads). With lto = true and codegen-units = 1, LLVM can already inline across crate and codegen-unit boundaries, so the #[inline] attribute mainly serves as an explicit hint that these functions are worth inlining; the compiler may still decline.
4. src/ipc/shared_memory_blocking.rs: Bulk ring buffer copies (+40 / -6 lines)
This file implements the ring-buffer-based SHM transport (used when --shm-direct is NOT specified).
What (write path): Replaced the byte-by-byte loop in write_data_blocking() with std::ptr::copy_nonoverlapping. The copy handles wrap-around by splitting into two parts: first from the write position to the end of the buffer, then from the start of the buffer for the remainder.
Before:
After:
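As a rough illustration of the two-part copy described above (not the PR's own before/after snippets), a wrap-around bulk write over a plain byte slice might look like this. The ring_write helper and its signature are hypothetical, and the caller is assumed to have already checked that enough free space is available.

```rust
use std::ptr;

/// Hypothetical helper: copy `data` into the ring `buf` starting at
/// `write_pos`, wrapping to the start when the end is reached.
/// Returns the new write position.
fn ring_write(buf: &mut [u8], write_pos: usize, data: &[u8]) -> usize {
    let cap = buf.len();
    assert!(write_pos < cap && data.len() <= cap);
    // Part 1: from the write position up to the end of the buffer.
    let first = data.len().min(cap - write_pos);
    // Part 2: whatever is left goes to the start of the buffer.
    let rest = data.len() - first;
    // SAFETY: both destination ranges are in bounds, and `data` cannot
    // overlap `buf` because `buf` is borrowed mutably.
    unsafe {
        ptr::copy_nonoverlapping(data.as_ptr(), buf.as_mut_ptr().add(write_pos), first);
        ptr::copy_nonoverlapping(data.as_ptr().add(first), buf.as_mut_ptr(), rest);
    }
    (write_pos + data.len()) % cap
}

fn main() {
    let mut ring = [0u8; 8];
    // Writing 4 bytes at position 6 wraps after 2 bytes.
    let pos = ring_write(&mut ring, 6, &[1, 2, 3, 4]);
    assert_eq!(ring, [3, 4, 0, 0, 0, 0, 1, 2]);
    assert_eq!(pos, 2);
}
```

The modulo is computed once per message instead of once per byte, and each part is a single contiguous copy.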
Why (write path): A byte-by-byte loop with a modulo operation on every iteration prevents LLVM from auto-vectorizing or converting to memcpy. The bulk copy lets the CPU's memory controller transfer data in cache-line-sized bursts. For a 4096-byte message, this replaces 4096 individual byte stores (each with an integer division for modulo) with 1-2 memcpy calls.
What (read path): Same transformation for read_data_blocking(): replaced byte-by-byte reads with bulk copy_nonoverlapping, plus the Vec::with_capacity + set_len pattern to eliminate zero-fill.
What (inline hints): Added #[inline] to data_ptr(), available_write_space(), available_read_data(), write_data_blocking(), and read_data_blocking().
5. src/ipc/shared_memory.rs: Inline hints (+5 / -0 lines)
This file implements the async ring-buffer SHM transport. It already used bulk copy_nonoverlapping (the blocking variant in file #4 was the one missing it).
What: Added #[inline] to data_ptr(), available_write_space(), available_read_data(), write_data(), and read_data().
Why: Consistency with the blocking variant, and to ensure these small functions are inlined into their callers.
6. src/main.rs: Use transport-captured timestamp (+13 / -6 lines)
What: Both the blocking server loop (run_server_mode_blocking) and the async server loop (run_server_mode) now check message.receive_time_ns before calling get_monotonic_time_ns(). If the transport populated the field (non-zero), that earlier timestamp is used. Otherwise, the existing get_monotonic_time_ns() call serves as a fallback.
Why: This is the consumer side of the receive_time_ns field added to Message. Currently only SHM-direct populates this field (inside receive_blocking()). All other transports (TCP, UDS, PMQ, SHM ring-buffer) leave it at 0, so the server loop falls back to calling get_monotonic_time_ns() itself, preserving their existing behavior exactly.
Code change:
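The selection logic described above can be sketched in isolation as follows. This is a hedged illustration, not the PR's code: the trimmed-down Message struct and the effective_receive_time_ns helper are hypothetical, and the clock is passed in as a closure so the fallback is easy to see.

```rust
/// Hypothetical, trimmed-down stand-in for the real Message struct.
struct Message {
    /// 0 means the transport did not capture a timestamp.
    receive_time_ns: u64,
}

/// Hypothetical helper: prefer the transport-captured timestamp and fall
/// back to reading the clock only when the field was left at 0.
fn effective_receive_time_ns(msg: &Message, now_ns: impl FnOnce() -> u64) -> u64 {
    if msg.receive_time_ns != 0 {
        msg.receive_time_ns
    } else {
        now_ns()
    }
}

fn main() {
    // SHM-direct style message: timestamp already captured by the transport.
    let from_shm_direct = Message { receive_time_ns: 1_000 };
    // Any other transport: field left at 0, so the clock is read here.
    let from_other = Message { receive_time_ns: 0 };

    assert_eq!(effective_receive_time_ns(&from_shm_direct, || 2_000), 1_000);
    assert_eq!(effective_receive_time_ns(&from_other, || 2_000), 2_000);
}
```

Using 0 as the "not set" marker works because a monotonic clock reading of exactly 0 ns is not expected in practice.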
Benchmark Results (NXP S32G, Cortex-A53)
Data from out.main (main branch, May 6) vs out.opts (this branch, May 8), collected on the same S32G board. Two independent benchmark runs (May 7 and May 8) produced consistent results; the numbers below are from the May 8 run.
SHM Direct: Mean One-Way Latency (ns)
At 1024B and above, latency improvements are 1-37%, driven by the copy_nonoverlapping optimization replacing byte-by-byte ring buffer copies and the zero-fill elimination. At smaller sizes (64-512B), the overhead of clock_gettime inside the mutex increases the measured mean, but max latency is significantly improved across all sizes.
SHM Direct: Max One-Way Latency (ns)
Tail latency (max) is generally improved at most sizes, with some variability due to the inherently noisy nature of max values (single outlier events). The largest improvements come from eliminating the redundant memset zero-fill, which could trigger page faults and allocator contention.
SHM Direct: Throughput (MB/s)
Known issue: SHM-direct throughput regresses 3-31% (worst at small message sizes). This is caused by the clock_gettime call inside the mutex critical section. The vDSO clock_gettime call takes ~50ns, but on the low-clock-speed Cortex-A53, holding the mutex for even that extra time causes disproportionate contention in continuous-send (zero-delay) scenarios. The regression does NOT affect tests with --send-delay (like the reference C benchmark comparison, which uses 10ms delay) because throughput is delay-bound in those cases.
Other Mechanisms: Summary
… target-cpu=native
reference C vs RC Comparison (116B, 10K iterations, 10ms send-delay, chrt -f 50)
With real-time scheduling and CPU pinning (matching the reference C benchmark test methodology):
What is NOT Affected
- The receive_time_ns field is #[serde(skip)], so bincode serialization is unchanged. Messages on the wire are identical between main and this branch.
- … target-cpu=native and the faster clock_gettime wrapper.
- … main (binary resolution issue in benchmark integration tests, not caused by this branch).
Known Issues / Trade-offs
- SHM-direct throughput regression at small message sizes (64-512B): The clock_gettime call inside the mutex extends the critical section. On the Cortex-A53's low clock speed, this causes 22-31% throughput loss in continuous-send benchmarks. The regression tapers to ~3% at 8192B. This does not affect latency-focused tests with send-delay. A future enhancement could make the timestamp placement conditional on whether --send-delay is specified.
- target-cpu=native makes binaries non-portable: The .cargo/config.toml tells LLVM to emit instructions specific to the build machine's CPU. Binaries built on the S32G cannot run on a different aarch64 CPU that lacks the same features. This is intentional for a performance benchmark tool, but should be noted if distributing pre-built binaries.