redhat-performance · mcurrier2 · May 6, 2026 · May 7, 2026 · May 8, 2026 · May 8, 2026
diff --git a/CONFIG.md b/CONFIG.md
@@ -334,10 +334,26 @@ sudo sysctl -p
 ### Application-Level Tuning
 
 #### Rust Compiler Optimizations
+
+> **Important:** This repo does **not** ship a `.cargo/config.toml` with
+> `target-cpu=native`. That setting would apply to every build,
+> including CI, and produce non-portable binaries. `native` always
+> resolves to the build host's CPU, so when CI builds on one ARM core
+> for another (e.g., Graviton runners targeting a Cortex-A53), the
+> binary may contain instructions the deployment CPU does not support
+> (such as ARMv8.2+ extensions). Apply CPU tuning explicitly at build
+> time instead. See the README's
+> [CPU-Optimized Builds](README.md#cpu-optimized-builds) section
+> for per-platform examples.
+
 ```bash
-# Maximum optimization
+# On-target build (uses every instruction the local CPU supports)
 RUSTFLAGS="-C target-cpu=native -C opt-level=3" cargo build --release
 
+# Cross-compile for a specific ARM CPU
+RUSTFLAGS="-C target-cpu=cortex-a53 -C opt-level=3" cargo build \
+  --release --target aarch64-unknown-linux-gnu
+
 # Link-time optimization
 RUSTFLAGS="-C lto=fat" cargo build --release
 

diff --git a/README.md b/README.md
@@ -205,6 +205,16 @@ send-wait-receive per message).
 3. **Client Side**: Timestamp captured after receiving response
 4. **Latency Calculation**: Total elapsed time from send to receive
 
+#### SHM-Direct Conditional Timestamp Placement
+
+The `--shm-direct` transport uses **adaptive receive timestamp placement** based on the test type, controlled automatically by `--send-delay`:
+
+- **Latency-focused tests** (`--send-delay > 0`): The receive timestamp is captured **inside the mutex**, immediately after the condvar wake-up. This matches the reference C SHM implementation and excludes payload copy, allocation, and mutex unlock from measured latency (~5–10 µs savings). Because the send delay paces the messages, the slightly longer critical section adds negligible contention.
+
+- **Throughput-focused tests** (no `--send-delay`): The receive timestamp is captured **after the mutex unlock**, keeping the critical section minimal. This avoids a 22–31% throughput regression at small message sizes caused by calling `clock_gettime` while the mutex is held.
+
+This behavior is fully automatic, and no additional CLI flags are needed. The `--send-delay` flag alone signals intent: pacing messages for latency measurement gets the most accurate timestamps, while saturating the pipe for throughput gets maximum performance.
+
 #### Streaming Output Columns
 
 The per-message streaming output (JSON and CSV) contains the
@@ -358,6 +368,37 @@ cargo build --release
 
 The optimized binary will be available at `target/release/ipc-benchmark`.
 
+### CPU-Optimized Builds
+
+By default, `cargo build --release` produces **portable binaries** that run on any CPU in the target architecture family (e.g., generic `aarch64`). This is intentional. The repo does not ship a `.cargo/config.toml` with `target-cpu=native` because that setting would silently affect every build, including CI, producing non-portable binaries that may use instructions unsupported on the deployment target.
+
+This matters especially for cross-platform ARM development. If CI runs on AWS Graviton (Neoverse-N1) but the target is an NXP S32G (Cortex-A53), a `target-cpu=native` binary built on Graviton could use ARMv8.2+ instructions that the Cortex-A53 does not support, causing illegal-instruction crashes at runtime.
+
+**When building directly on target hardware**, enable CPU-specific optimizations at build time:
+
+```bash
+# On-target build: let the compiler use every instruction the local CPU supports
+RUSTFLAGS="-C target-cpu=native" cargo build --release
+```
+
+**When cross-compiling in CI**, specify the exact CPU target per platform:
+
+```bash
+# NXP S32G (Cortex-A53)
+RUSTFLAGS="-C target-cpu=cortex-a53" cargo build --release \
+  --target aarch64-unknown-linux-gnu
+
+# Qualcomm Ride SX4 (Cortex-A78AE) — use the closest supported LLVM target
+RUSTFLAGS="-C target-cpu=cortex-a78" cargo build --release \
+  --target aarch64-unknown-linux-gnu
+
+# Renesas R-Car S4 (Cortex-A76)
+RUSTFLAGS="-C target-cpu=cortex-a76" cargo build --release \
+  --target aarch64-unknown-linux-gnu
+```
+
+The performance-critical code paths in this project (timestamp placement, bulk copies, zero-fill elimination, direct `libc::clock_gettime`) are source-level optimizations that do not depend on `target-cpu`. They provide the bulk of the latency improvement regardless of CPU target. The `target-cpu` flag adds a smaller, incremental gain from SIMD auto-vectorization and instruction scheduling tuned for the specific microarchitecture.
+
 ### Quick Start
 
 ```bash

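The SHM-direct timestamp placement described in the README change above can be illustrated with a minimal sketch. It uses `std::sync::Mutex` and `Condvar` in place of the project's shared-memory primitives, and `Instant::now` in place of `libc::clock_gettime`. The names `Slot`, `StampPlacement`, `placement_for`, and `receive` are hypothetical and do not come from the PR. Treating a zero delay the same as no delay is also an assumption.

```rust
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the shared-memory slot (not the project's type).
struct Slot {
    ready: bool,
    payload: Vec<u8>,
}

/// Where the receive timestamp is taken relative to the mutex.
#[derive(Clone, Copy, Debug, PartialEq)]
enum StampPlacement {
    InsideMutex,
    AfterUnlock,
}

/// A non-zero send delay is read as "latency test"; anything else as
/// "throughput test". Handling of a zero delay is an assumption here.
fn placement_for(send_delay: Option<Duration>) -> StampPlacement {
    match send_delay {
        Some(d) if !d.is_zero() => StampPlacement::InsideMutex,
        _ => StampPlacement::AfterUnlock,
    }
}

/// Block until a message is ready, copy it out, and return it together
/// with the receive timestamp taken at the configured point.
fn receive(shared: &(Mutex<Slot>, Condvar), placement: StampPlacement) -> (Vec<u8>, Instant) {
    let (lock, cvar) = shared;
    let mut slot = cvar
        .wait_while(lock.lock().unwrap(), |s| !s.ready)
        .unwrap();

    // Latency mode: stamp right after the wake-up, before the copy.
    let early = match placement {
        StampPlacement::InsideMutex => Some(Instant::now()),
        StampPlacement::AfterUnlock => None,
    };

    let payload = slot.payload.clone(); // payload copy (and allocation)
    slot.ready = false;
    drop(slot); // mutex unlock

    // Throughput mode: stamp only after the lock has been released.
    let stamp = early.unwrap_or_else(Instant::now);
    (payload, stamp)
}
```

In latency mode the copy and the unlock fall outside the measured interval. In throughput mode the clock read stays out of the critical section, so a sender waiting on the lock is not held up by it.
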
diff --git a/src/benchmark_blocking.rs b/src/benchmark_blocking.rs
@@ -477,6 +477,13 @@ impl BlockingBenchmarkRunner {
                 .arg(self.config.pmq_priority.to_string());
         }
 
+        // Forward send-delay to server so SHM-direct can enable precise
+        // (inside-mutex) timestamps for latency-focused benchmarks.
+        if let Some(delay) = self.config.send_delay {
+            let micros = delay.as_micros();
+            cmd.arg("--send-delay").arg(format!("{micros}us"));
+        }
+
         // Add latency file path if provided (for true IPC measurement)
         if let Some(path) = latency_file_path {
             cmd.arg("--internal-latency-file").arg(path);
@@ -813,8 +820,11 @@ impl BlockingBenchmarkRunner {
     /// - `Ok(())`: Warmup completed successfully
     /// - `Err(anyhow::Error)`: Warmup failed
     fn run_warmup(&self, transport_config: &TransportConfig) -> Result<()> {
-        let mut client_transport =
-            BlockingTransportFactory::create(&self.mechanism, self.args.shm_direct)?;
+        let mut client_transport = BlockingTransportFactory::create(
+            &self.mechanism,
+            self.args.shm_direct,
+            self.config.send_delay,
+        )?;
 
         // --- Server Process Spawning ---
         let (mut server_process, mut pipe_reader) = self.spawn_server_process(transport_config)?;
@@ -1000,8 +1010,11 @@ impl BlockingBenchmarkRunner {
         metrics_collector: &mut MetricsCollector,
         mut results_manager: Option<&mut crate::results_blocking::BlockingResultsManager>,
     ) -> Result<()> {
-        let mut client_transport =
-            BlockingTransportFactory::create(&self.mechanism, self.args.shm_direct)?;
+        let mut client_transport = BlockingTransportFactory::create(
+            &self.mechanism,
+            self.args.shm_direct,
+            self.config.send_delay,
+        )?;
 
         // Create a temporary file for server to write latencies
         let latency_file_path = std::env::temp_dir()
@@ -1181,8 +1194,11 @@ impl BlockingBenchmarkRunner {
         metrics_collector: &mut MetricsCollector,
         mut results_manager: Option<&mut crate::results_blocking::BlockingResultsManager>,
     ) -> Result<()> {
-        let mut client_transport =
-            BlockingTransportFactory::create(&self.mechanism, self.args.shm_direct)?;
+        let mut client_transport = BlockingTransportFactory::create(
+            &self.mechanism,
+            self.args.shm_direct,
+            self.config.send_delay,
+        )?;
 
         // --- Server Process Spawning ---
         let (mut server_process, mut pipe_reader) = self.spawn_server_process(transport_config)?;
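
The `--send-delay` forwarding in `spawn_server_process` formats the delay as whole microseconds. The sketch below isolates that formatting to show its rounding behavior. `format_send_delay` and `parse_micros` are hypothetical helpers written for illustration. The server's real argument parser is not part of this diff and may accept other formats.

```rust
use std::time::Duration;

/// Format a delay the way the diff does: whole microseconds plus a `us` suffix.
/// `Duration::as_micros` truncates, so any sub-microsecond remainder is dropped.
fn format_send_delay(delay: Duration) -> String {
    let micros = delay.as_micros();
    format!("{micros}us")
}

/// Hypothetical receiving-side parser; it only understands the `<digits>us` form.
fn parse_micros(arg: &str) -> Option<Duration> {
    let digits = arg.strip_suffix("us")?;
    digits.parse::<u64>().ok().map(Duration::from_micros)
}
```

Under these assumptions a 10 ms delay round-trips exactly as `10000us`, while a delay shorter than 1 µs is forwarded as `0us`. Whether the server then selects the inside-mutex timestamp path depends on how it treats a zero delay, which this diff does not show.
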