perf: speed up fat-tree routing and executor handoff by baochunli · Pull Request #94 · iQua/days

baochunli · 2026-03-15T04:04:19Z

Summary

This PR lands the performance work from the autoresearch branch and captures the follow-on benchmark findings.

Kept code changes

Avoid per-flow graph clones during route computation
- src/flows/flow.rs
- Lets path computation borrow the topology graph instead of cloning it for each flow.
Specialize fat-tree shortest-path routing
- src/flows/route.rs
- Adds an implicit fat-tree traversal fast path before falling back to generic petgraph::algo::astar.
- This keeps shortest-path semantics while cutting routing overhead substantially on fat-tree workloads.
Tune MT executor worker handoff / park behavior
- crates/nexosim/src/executor/mt_executor.rs
- src/topos/topo.rs
- Keeps the executor changes that consistently helped on the original benchmark:
  - hot-worker search-before-park: 5us
  - cold-worker search-before-park: 1us
  - hot-worker linger window: 200us
  - hot-worker-only search policy
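
As an illustration of what "search before park" means, here is a minimal sketch using only the standard library. It is not the nexosim executor code: the constant names, the `Mutex<VecDeque<_>>` queue, and the `search_before_park` function are hypothetical stand-ins, and only the 5us and 1us values come from the list above.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;
use std::time::{Duration, Instant};

// Values from the PR description; the constant names are hypothetical.
const HOT_SEARCH_BEFORE_PARK: Duration = Duration::from_micros(5);
const COLD_SEARCH_BEFORE_PARK: Duration = Duration::from_micros(1);

/// Picks how long a worker keeps looking for work before it parks.
fn search_budget(hot: bool) -> Duration {
    if hot {
        HOT_SEARCH_BEFORE_PARK
    } else {
        COLD_SEARCH_BEFORE_PARK
    }
}

/// Polls a shared queue until a task shows up or the search budget expires.
/// Returns `None` when the caller should go ahead and park.
/// ASSUMPTION: a plain mutex-protected queue stands in for the real
/// executor's work-distribution structures.
fn search_before_park(queue: &Mutex<VecDeque<u32>>, hot: bool) -> Option<u32> {
    let deadline = Instant::now() + search_budget(hot);
    loop {
        if let Some(task) = queue.lock().unwrap().pop_front() {
            return Some(task);
        }
        if Instant::now() >= deadline {
            return None;
        }
        // Hint to the CPU that this is a short busy-wait.
        std::hint::spin_loop();
    }
}
```

The trade-off being tuned is that a longer budget burns CPU while spinning, whereas a shorter one pays the park/unpark cost more often; the values above are the ones that held up on the measured workloads.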

Supporting changes

Record the new sim-wall benchmark target and findings
- autoresearch.md
- autoresearch.ideas.md
- autoresearch.jsonl
- autoresearch.sh
- Switches the active harness to:
  - cargo run --release --bin days configs/benchmarks/flow/fattree_k32_tcp_f32_mt.toml
- Parses the simulator's own final `Elapsed wall-clock time:` log line so the metric excludes setup/routing time.
- Documents that the inherited executor tuning remained the best tested configuration on this smaller benchmark.
Fix the path-from-config endpoint regression test after the borrowed-graph API change
- tests/path_from_config_endpoints.rs
- Updates the test to pass &graph into compute_path.
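
The borrowed-graph API change behind that test update can be sketched as follows. This is a simplified stand-in, not the code in src/flows/flow.rs: the `Topology` type, the adjacency-list representation, and this `compute_path` signature are assumptions for illustration, and plain BFS replaces the real routing.

```rust
use std::collections::VecDeque;

/// Minimal stand-in for the topology graph: adjacency lists indexed by node id.
#[derive(Clone)]
struct Topology {
    adj: Vec<Vec<usize>>,
}

/// Path computation only reads the graph, so a shared borrow is enough and
/// callers no longer need to clone the topology once per flow.
/// Returns a shortest path by hop count, or `None` if `dst` is unreachable.
fn compute_path(graph: &Topology, src: usize, dst: usize) -> Option<Vec<usize>> {
    let mut prev = vec![usize::MAX; graph.adj.len()];
    let mut seen = vec![false; graph.adj.len()];
    let mut queue = VecDeque::from([src]);
    seen[src] = true;
    while let Some(u) = queue.pop_front() {
        if u == dst {
            // Walk the predecessor links back to the source.
            let mut path = vec![dst];
            let mut cur = dst;
            while cur != src {
                cur = prev[cur];
                path.push(cur);
            }
            path.reverse();
            return Some(path);
        }
        for &v in &graph.adj[u] {
            if !seen[v] {
                seen[v] = true;
                prev[v] = u;
                queue.push_back(v);
            }
        }
    }
    None
}
```

With this shape a caller builds the topology once and passes `&graph` for every flow, which is the pattern the updated test now follows.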

Benchmark results

Original optimization target

Command:

target/release/days configs/exp_tcp_fattree.toml

Primary metric: total command wall-clock time (wall_s, lower is better)

Progression of kept wins on this branch:

Commit	Change	Result
`d212294`	warm baseline	`wall_s=10.89`
`9256440`	avoid per-flow route graph clones	`wall_s=10.04`
`b60d7c0`	extend hot-worker linger	`wall_s=9.87`
`38e94c2`	fat-tree routing fast path	`wall_s=8.73`
`04802e5`	5us worker search-before-park	`wall_s=8.11`
`63d432c`	hot-worker-only search window	`wall_s=8.05`
`2c04bbf`	shorten hot linger to 200us	`wall_s=8.01`
`b109311`	1us cold-worker search	`wall_s=8.00`

Net improvement versus the warm baseline:

10.89s → 8.00s
about 26.5% lower wall-clock time
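
For anyone re-checking the arithmetic, the sketch below recomputes the net reduction and the per-commit savings from the table above. The `wall_s` values are copied from that table; the helper names are mine.

```rust
/// (commit, wall_s) pairs copied from the progression table.
const PROGRESSION: [(&str, f64); 8] = [
    ("d212294", 10.89),
    ("9256440", 10.04),
    ("b60d7c0", 9.87),
    ("38e94c2", 8.73),
    ("04802e5", 8.11),
    ("63d432c", 8.05),
    ("2c04bbf", 8.01),
    ("b109311", 8.00),
];

/// Percentage reduction going from `before` to `after`.
fn pct_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Seconds saved by each commit relative to the row before it.
fn step_savings(rows: &[(&'static str, f64)]) -> Vec<(&'static str, f64)> {
    rows.windows(2)
        .map(|w| (w[1].0, w[0].1 - w[1].1))
        .collect()
}
```

Calling `pct_reduction(10.89, 8.00)` gives roughly 26.5, and `step_savings(&PROGRESSION)` shows the fat-tree routing fast path (`38e94c2`) as the largest single step at about 1.14 s of the 2.89 s total.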

Follow-on benchmark target

Command:

cargo run --release --bin days configs/benchmarks/flow/fattree_k32_tcp_f32_mt.toml

Primary metric: simulator-reported elapsed wall time (sim_wall_s, lower is better)
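
Extracting this metric amounts to finding the last `Elapsed wall-clock time:` line in the simulator output. The sketch below shows the idea in Rust rather than the shell used by autoresearch.sh. The exact text after the marker is not shown in this PR, so the format handled here (a decimal number of seconds with an optional `s` suffix) is an assumption, as is the function name.

```rust
/// Marker text quoted in the PR description.
const MARKER: &str = "Elapsed wall-clock time:";

/// Returns the value from the last line that contains the marker and parses.
/// ASSUMPTION: the value follows the marker as decimal seconds, optionally
/// suffixed with `s`; the real log format may differ.
fn parse_sim_wall_s(log: &str) -> Option<f64> {
    log.lines().rev().find_map(|line| {
        let (_, rest) = line.split_once(MARKER)?;
        rest.trim()
            .trim_end_matches('s')
            .trim()
            .parse::<f64>()
            .ok()
    })
}
```

Scanning from the end means setup or routing messages earlier in the log cannot be mistaken for the final figure.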

Best logged result on this target:

sim_wall_s=9.52 at c6986ff

Important finding:

the executor/routing improvements inherited from the original campaign still carried over well,
but no additional wins were found on this smaller sim-wall-only benchmark.

The branch records several tried-and-discarded directions, including:

nearby executor constant retunes,
altered accelerated bundling sizes,
a main-thread pre-park spin,
a successful-linger no-park experiment that broke pool-state invariants,
incremental rebundling changes in Simulation,
FIFO port run-batching,
a time-quantization fast path,
TCP timeout-heap bookkeeping changes.

Why this is worth merging

The core runtime wins are already isolated in the kept commits and materially reduce runtime on the original fat-tree benchmark without changing the workload itself.

The follow-on autoresearch work is also useful to keep in-tree because it:

switches the harness to a cleaner sim-only metric for the smaller benchmark,
preserves a written record of what was tested and rejected,
narrows future work toward structural executor changes and TCP/port hot-path investigation,
avoids re-running the same losing local retunes.

Risks / caveats

The fat-tree routing fast path is performance-sensitive and should continue to be treated as a specialization of shortest-path behavior, not a semantics change.
The MT executor changes are deliberately tuned to measured workloads; they improved the original benchmark and held up as the best tested configuration on the follow-on benchmark, but more validation on other workloads is still worthwhile.
The autoresearch docs/logs are intentionally verbose because they capture the experiment trail.
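
One way to guard the first caveat is a differential check that compares a specialized answer against a generic shortest-path search on the same topology. The sketch below is not the implementation in src/flows/route.rs, which per the description uses an implicit traversal before falling back to petgraph::algo::astar. Here a closed-form hop count and plain BFS act as stand-ins, and the node numbering is an assumption made for the example.

```rust
use std::collections::VecDeque;

/// Builds an explicit k-ary fat-tree as adjacency lists (k must be even).
/// ASSUMED id layout: hosts, then edge, aggregation, and core switches.
fn build_fat_tree(k: usize) -> Vec<Vec<usize>> {
    let h = k / 2;
    let edge0 = k * h * h; // hosts occupy ids 0..k*h*h
    let agg0 = edge0 + k * h;
    let core0 = agg0 + k * h;
    let mut adj = vec![Vec::new(); core0 + h * h];
    let mut link = |a: usize, b: usize| {
        adj[a].push(b);
        adj[b].push(a);
    };
    for pod in 0..k {
        for e in 0..h {
            let edge = edge0 + pod * h + e;
            for i in 0..h {
                link((pod * h + e) * h + i, edge); // host to its edge switch
            }
            for a in 0..h {
                link(edge, agg0 + pod * h + a); // edge to each agg in the pod
            }
        }
        for a in 0..h {
            for c in 0..h {
                link(agg0 + pod * h + a, core0 + a * h + c); // agg to its cores
            }
        }
    }
    adj
}

/// Specialized answer: hop count between two hosts from their ids alone.
/// Returns `None` when the inputs do not fit the assumed layout.
fn fast_hops(k: usize, src: usize, dst: usize) -> Option<usize> {
    let h = k / 2;
    let n_hosts = k * h * h;
    if k % 2 != 0 || src >= n_hosts || dst >= n_hosts {
        return None;
    }
    if src == dst {
        return Some(0);
    }
    let (se, de) = (src / h, dst / h); // global edge-switch index
    let (sp, dp) = (se / h, de / h); // pod index
    Some(if se == de { 2 } else if sp == dp { 4 } else { 6 })
}

/// Generic reference: breadth-first search over the explicit graph.
fn bfs_hops(adj: &[Vec<usize>], src: usize, dst: usize) -> Option<usize> {
    let mut dist = vec![usize::MAX; adj.len()];
    dist[src] = 0;
    let mut queue = VecDeque::from([src]);
    while let Some(u) = queue.pop_front() {
        if u == dst {
            return Some(dist[u]);
        }
        for &v in &adj[u] {
            if dist[v] == usize::MAX {
                dist[v] = dist[u] + 1;
                queue.push_back(v);
            }
        }
    }
    None
}

/// Tries the specialization first and falls back to the generic search.
fn hops(k: usize, adj: &[Vec<usize>], src: usize, dst: usize) -> Option<usize> {
    fast_hops(k, src, dst).or_else(|| bfs_hops(adj, src, dst))
}
```

A test can then loop over every host pair in a small tree (for example k = 4) and assert that `fast_hops` and `bfs_hops` agree, which is the property "specialization, not a semantics change" is asking for.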

Validation

Passing

cargo test --features test --test path_from_config_endpoints -- --show-output

Benchmarks used during this branch

./autoresearch.sh on configs/exp_tcp_fattree.toml during the original campaign
./autoresearch.sh on configs/benchmarks/flow/fattree_k32_tcp_f32_mt.toml after switching the harness to sim_wall_s

Known issue during full-suite validation

cargo test --features test -- --show-output
This run now gets past the borrowed-graph path-from-config test, but one test still fails:
- tests/ring_allreduce_coverage.rs::mixed_tcp_broadcast_and_ring_collectives_emit_runtime_bytes
I have not addressed that ring-allreduce failure in this PR.

Follow-up ideas

structural executor handoff changes that preserve pool-state invariants,
deeper TCP ACK/timer bookkeeping review,
FIFO port scheduling hot-path work,
extra validation around the fat-tree routing specialization.

baochunli added 15 commits March 14, 2026 22:32

Commit	Message	Result
`6a457db`	perf(autoresearch): add exp_tcp_fattree harness and cold-build baseline	`{"status":"keep","wall_s":36.7}`
`d212294`	perf(autoresearch): record warm baseline for exp_tcp_fattree	`{"status":"keep","wall_s":10.89}`
`9256440`	perf(routing): avoid per-flow graph clones when computing routes	`{"status":"keep","wall_s":10.04}`
`b60d7c0`	perf(executor): extend hot-worker linger window to 250us	`{"status":"keep","wall_s":9.87}`
`38e94c2`	perf(routing): specialize fat-tree shortest-path traversal with implicit A* neighbor generation	`{"status":"keep","wall_s":8.73}`
`ade6e4e`	docs(autoresearch): record validated warm rerun of fat-tree routing fast path	`{"status":"keep","wall_s":8.46}`
`04802e5`	perf(executor): let workers search for 5us before parking after fat-tree routing fast path	`{"status":"keep","wall_s":8.11}`
`63d432c`	perf(executor): give only hot workers the 5us search window before parking	`{"status":"keep","wall_s":8.05}`
`2c04bbf`	perf(executor): pair hot-worker-only search with a shorter 200us linger window	`{"status":"keep","wall_s":8.01}`
`b109311`	perf(executor): let cold workers search for 1us before parking	`{"status":"keep","wall_s":8}`
`c6986ff`	perf(autoresearch): switch to fattree_k32_tcp_f32_mt sim-wall metric	`{"status":"keep","sim_wall_s":9.52}`
`563eddc`	test(flow): borrow graph in path-from-config endpoint test	-
`f8ff913`	docs(autoresearch): record k32 fattree findings	-
`23d1418`	chore(clippy): fix all-targets warnings	-
`6d922a8`	fix(routing): validate fat-tree fast-path layout	-

baochunli merged commit ebab0a1 into main Mar 15, 2026
1 check passed

baochunli deleted the autoresearch/tcp-fattree-runtime-2026-03-15 branch March 15, 2026 04:36
