18 changes: 17 additions & 1 deletion CONFIG.md
@@ -334,10 +334,26 @@ sudo sysctl -p
### Application-Level Tuning

#### Rust Compiler Optimizations

> **Important:** This repo does **not** ship a `.cargo/config.toml` with
> `target-cpu=native`. That setting would apply to every build,
> including CI, and produce non-portable binaries. When CI builds on a
> machine other than the deployment hardware, `target-cpu=native`
> silently tunes for the build host, not the target, and on a newer
> ARM host may emit instructions the deployment CPU does not support
> (e.g., ARMv8.2-A instructions on a Cortex-A53). Apply CPU tuning
> explicitly at build time instead. See the README's
> [CPU-Optimized Builds](README.md#cpu-optimized-builds) section
> for per-platform examples.

```bash
-# Maximum optimization
+# On-target build (uses every instruction the local CPU supports)
RUSTFLAGS="-C target-cpu=native -C opt-level=3" cargo build --release

# Cross-compile for a specific ARM CPU
RUSTFLAGS="-C target-cpu=cortex-a53 -C opt-level=3" cargo build \
--release --target aarch64-unknown-linux-gnu

# Link-time optimization
RUSTFLAGS="-C lto=fat" cargo build --release

41 changes: 41 additions & 0 deletions README.md
@@ -205,6 +205,16 @@ send-wait-receive per message).
3. **Client Side**: Timestamp captured after receiving response
4. **Latency Calculation**: Total elapsed time from send to receive

#### SHM-Direct Conditional Timestamp Placement

The `--shm-direct` transport uses **adaptive receive timestamp placement** based on the test type, controlled automatically by `--send-delay`:

- **Latency-focused tests** (`--send-delay` greater than zero): The receive timestamp is captured **inside the mutex**, immediately after the condvar wake-up. This matches the reference C SHM implementation and excludes payload copy, allocation, and mutex unlock from the measured latency (a reduction of roughly 5–10 µs). Because messages are paced, the delay between sends dwarfs the extra time the mutex is held.

- **Throughput-focused tests** (no `--send-delay`): The receive timestamp is captured **after the mutex unlock**, keeping the critical section minimal. This avoids a 22–31% throughput regression at small message sizes caused by the extra `clock_gettime` call inside the mutex.

This behavior is fully automatic and needs no additional CLI flags. The `--send-delay` flag alone signals intent: pacing messages for latency measurement selects the most accurate timestamps, while saturating the pipe for throughput selects maximum performance.
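
The selection rule described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `Mailbox`, `TimestampPlacement`, `placement_for`, and `receive` are hypothetical names, an in-process `Mutex`/`Condvar` pair stands in for the shared-memory segment, and treating a zero delay the same as no delay is an assumption.

```rust
use std::sync::{Condvar, Mutex};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the shared-memory slot and its condvar.
struct Mailbox {
    slot: Mutex<Option<Vec<u8>>>,
    ready: Condvar,
}

/// Where the receive timestamp is captured (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimestampPlacement {
    InsideLock,
    AfterUnlock,
}

/// A non-zero send delay signals a latency-focused run.
/// Assumption: a zero delay is treated like no delay at all.
fn placement_for(send_delay: Option<Duration>) -> TimestampPlacement {
    match send_delay {
        Some(d) if !d.is_zero() => TimestampPlacement::InsideLock,
        _ => TimestampPlacement::AfterUnlock,
    }
}

/// Blocks until a payload is available, then returns it with a timestamp.
fn receive(mailbox: &Mailbox, placement: TimestampPlacement) -> (Vec<u8>, Instant) {
    let mut guard = mailbox.slot.lock().unwrap();
    while guard.is_none() {
        guard = mailbox.ready.wait(guard).unwrap();
    }
    // Latency mode: stamp right after wake-up, before the payload is
    // moved out and the lock is released.
    let early = if placement == TimestampPlacement::InsideLock {
        Some(Instant::now())
    } else {
        None
    };
    let payload = guard.take().unwrap();
    drop(guard);
    // Throughput mode: stamp only after the lock has been released,
    // which keeps the critical section as short as possible.
    let stamp = early.unwrap_or_else(Instant::now);
    (payload, stamp)
}
```

A caller would compute `placement_for(config.send_delay)` once at transport creation and pass the result to every `receive` call, so the per-message path contains only a single branch.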

#### Streaming Output Columns

The per-message streaming output (JSON and CSV) contains the
@@ -358,6 +368,37 @@ cargo build --release

The optimized binary will be available at `target/release/ipc-benchmark`.

### CPU-Optimized Builds

By default, `cargo build --release` produces **portable binaries** that run on any CPU in the target architecture family (e.g., generic `aarch64`). This is intentional — the repo does not ship a `.cargo/config.toml` with `target-cpu=native` because that setting would silently affect every build, including CI, producing non-portable binaries that may use instructions unsupported on the deployment target.

This matters especially for cross-platform ARM development. If CI runs on AWS Graviton (Neoverse-N1) but the target is an NXP S32G (Cortex-A53), a `target-cpu=native` binary built on Graviton could use ARMv8.2+ instructions that the Cortex-A53 does not support, causing illegal-instruction crashes at runtime.

**When building directly on target hardware**, enable CPU-specific optimizations at build time:

```bash
# On-target build: let the compiler use every instruction the local CPU supports
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

**When cross-compiling in CI**, specify the exact CPU target per platform:

```bash
# NXP S32G (Cortex-A53)
RUSTFLAGS="-C target-cpu=cortex-a53" cargo build --release \
--target aarch64-unknown-linux-gnu

# Qualcomm Ride SX4 (Cortex-A78AE) — use the closest supported LLVM target
RUSTFLAGS="-C target-cpu=cortex-a78" cargo build --release \
--target aarch64-unknown-linux-gnu

# Renesas R-Car S4 (Cortex-A76)
RUSTFLAGS="-C target-cpu=cortex-a76" cargo build --release \
--target aarch64-unknown-linux-gnu
```

The main optimizations in this project (timestamp placement, bulk copies, zero-fill elimination, direct `libc::clock_gettime`) are source-level changes that do not depend on `target-cpu`. They provide the bulk of the latency improvement regardless of CPU target. The `target-cpu` flag adds a smaller, incremental gain from SIMD auto-vectorization and instruction scheduling tuned for the specific microarchitecture.
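
To check what a given `target-cpu` setting actually enabled, a binary can report its compile-time target features. The sketch below is illustrative only: `compiled_features` is a hypothetical helper that is not part of this project, and the handful of features it probes is an arbitrary selection.

```rust
/// Lists which of a few well-known target features were enabled when
/// this code was compiled. Hypothetical helper; the probed feature set
/// is an illustrative selection, not an exhaustive one.
fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    // AArch64: NEON is part of the baseline on common Linux targets.
    if cfg!(target_feature = "neon") {
        enabled.push("neon");
    }
    // AArch64: LSE atomics arrived with ARMv8.1-A; Cortex-A53 lacks them.
    if cfg!(target_feature = "lse") {
        enabled.push("lse");
    }
    // AArch64: dot-product instructions, optional from ARMv8.2-A.
    if cfg!(target_feature = "dotprod") {
        enabled.push("dotprod");
    }
    // x86_64: SSE2 is part of the baseline; AVX2 needs an explicit opt-in.
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    enabled
}
```

Printing this list at startup (for example with `println!("{:?}", compiled_features())`) makes it visible in benchmark logs whether a binary is a portable build or a CPU-tuned one.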

### Quick Start

```bash
28 changes: 22 additions & 6 deletions src/benchmark_blocking.rs
@@ -477,6 +477,13 @@ impl BlockingBenchmarkRunner {
.arg(self.config.pmq_priority.to_string());
}

// Forward send-delay to server so SHM-direct can enable precise
// (inside-mutex) timestamps for latency-focused benchmarks.
if let Some(delay) = self.config.send_delay {
let micros = delay.as_micros();
cmd.arg("--send-delay").arg(format!("{micros}us"));
}

// Add latency file path if provided (for true IPC measurement)
if let Some(path) = latency_file_path {
cmd.arg("--internal-latency-file").arg(path);
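
Reviewer sketch of the send-delay forwarding in this hunk, as a standalone example. `format_send_delay` mirrors the formatting added above; `parse_micros_arg` is a hypothetical parser for the `<N>us` form only, since the diff does not show how the server's CLI parses the value.

```rust
use std::time::Duration;

/// Mirrors the forwarding code: whole microseconds plus a "us" suffix.
/// `as_micros` truncates, so sub-microsecond delays become "0us".
fn format_send_delay(delay: Duration) -> String {
    let micros = delay.as_micros();
    format!("{micros}us")
}

/// Hypothetical receiver-side parser that accepts only the "<N>us" form.
/// The real CLI parser is not part of this diff.
fn parse_micros_arg(arg: &str) -> Option<Duration> {
    let digits = arg.strip_suffix("us")?;
    let micros: u64 = digits.parse().ok()?;
    Some(Duration::from_micros(micros))
}
```

Because of the truncation, a delay below one microsecond would reach the server as `0us`. Whether the server then treats that as a latency-focused or a throughput-focused run depends on how it interprets a zero delay.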
@@ -813,8 +820,11 @@ impl BlockingBenchmarkRunner {
/// - `Ok(())`: Warmup completed successfully
/// - `Err(anyhow::Error)`: Warmup failed
fn run_warmup(&self, transport_config: &TransportConfig) -> Result<()> {
-let mut client_transport =
-BlockingTransportFactory::create(&self.mechanism, self.args.shm_direct)?;
+let mut client_transport = BlockingTransportFactory::create(
+&self.mechanism,
+self.args.shm_direct,
+self.config.send_delay,
+)?;

// --- Server Process Spawning ---
let (mut server_process, mut pipe_reader) = self.spawn_server_process(transport_config)?;
@@ -1000,8 +1010,11 @@ impl BlockingBenchmarkRunner {
metrics_collector: &mut MetricsCollector,
mut results_manager: Option<&mut crate::results_blocking::BlockingResultsManager>,
) -> Result<()> {
-let mut client_transport =
-BlockingTransportFactory::create(&self.mechanism, self.args.shm_direct)?;
+let mut client_transport = BlockingTransportFactory::create(
+&self.mechanism,
+self.args.shm_direct,
+self.config.send_delay,
+)?;

// Create a temporary file for server to write latencies
let latency_file_path = std::env::temp_dir()
@@ -1181,8 +1194,11 @@ impl BlockingBenchmarkRunner {
metrics_collector: &mut MetricsCollector,
mut results_manager: Option<&mut crate::results_blocking::BlockingResultsManager>,
) -> Result<()> {
-let mut client_transport =
-BlockingTransportFactory::create(&self.mechanism, self.args.shm_direct)?;
+let mut client_transport = BlockingTransportFactory::create(
+&self.mechanism,
+self.args.shm_direct,
+self.config.send_delay,
+)?;

// --- Server Process Spawning ---
let (mut server_process, mut pipe_reader) = self.spawn_server_process(transport_config)?;