perf: speed up fat-tree routing and executor handoff #94

Merged
baochunli merged 15 commits into main from autoresearch/tcp-fattree-runtime-2026-03-15 on Mar 15, 2026

@baochunli

Summary

This PR lands the performance work from the autoresearch branch and captures the follow-on benchmark findings.

Kept code changes

  1. Avoid per-flow graph clones during route computation

    • src/flows/flow.rs
    • Lets path computation borrow the topology graph instead of cloning it for each flow.
  2. Specialize fat-tree shortest-path routing

    • src/flows/route.rs
    • Adds an implicit fat-tree traversal fast path before falling back to generic petgraph::algo::astar.
    • This keeps shortest-path semantics while cutting routing overhead substantially on fat-tree workloads.
  3. Tune MT executor worker handoff / park behavior

    • crates/nexosim/src/executor/mt_executor.rs
    • src/topos/topo.rs
    • Keeps the executor changes that consistently helped on the original benchmark:
      • hot-worker search-before-park: 5us
      • cold-worker search-before-park: 1us
      • hot-worker linger window: 200us
      • hot-worker-only search policy
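
One way to read the executor constants above is as a spin-then-park policy. The sketch below is illustrative only: `ParkPolicy`, `search_budget`, and `search_before_park` are hypothetical names, not nexosim APIs, and it assumes a worker counts as "hot" if it ran a task within the linger window.

```rust
use std::time::{Duration, Instant};

/// Hypothetical tuning knobs mirroring the values listed above.
/// These names are illustrative and are not taken from nexosim.
#[derive(Debug, Clone, Copy)]
pub struct ParkPolicy {
    /// Search-before-park budget for a hot worker.
    pub hot_search: Duration,
    /// Search-before-park budget for a cold worker.
    pub cold_search: Duration,
    /// How long a worker still counts as hot after its last task (assumed meaning).
    pub hot_linger: Duration,
}

impl ParkPolicy {
    pub const fn tuned() -> Self {
        Self {
            hot_search: Duration::from_micros(5),
            cold_search: Duration::from_micros(1),
            hot_linger: Duration::from_micros(200),
        }
    }

    /// Assumption: "hot" means a task ran within the linger window.
    pub fn is_hot(&self, since_last_task: Duration) -> bool {
        since_last_task <= self.hot_linger
    }

    /// How long the worker may look for work before it parks.
    pub fn search_budget(&self, since_last_task: Duration) -> Duration {
        if self.is_hot(since_last_task) {
            self.hot_search
        } else {
            self.cold_search
        }
    }
}

/// Polls `try_find_work` until it yields a task or the budget is spent.
/// Returns `None` when the caller should park.
pub fn search_before_park<T, F>(
    policy: &ParkPolicy,
    since_last_task: Duration,
    mut try_find_work: F,
) -> Option<T>
where
    F: FnMut() -> Option<T>,
{
    let deadline = Instant::now() + policy.search_budget(since_last_task);
    loop {
        if let Some(task) = try_find_work() {
            return Some(task);
        }
        if Instant::now() >= deadline {
            return None;
        }
        std::hint::spin_loop();
    }
}
```

The short budgets keep a worker that just finished a task from paying a park/unpark round trip when more work is about to arrive, while a cold worker gives up almost immediately.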

Supporting changes

  1. Record the new sim-wall benchmark target and findings

    • autoresearch.md
    • autoresearch.ideas.md
    • autoresearch.jsonl
    • autoresearch.sh
    • Switches the active harness to:
      • cargo run --release --bin days configs/benchmarks/flow/fattree_k32_tcp_f32_mt.toml
    • Parses the simulator's own final Elapsed wall-clock time: log line so the metric excludes setup/routing time.
    • Documents that the inherited executor tuning remained the best tested configuration on this smaller benchmark.
  2. Fix the path-from-config endpoint regression test after the borrowed-graph API change

    • tests/path_from_config_endpoints.rs
    • Updates the test to pass &graph into compute_path.
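
The borrowed-graph API change can be sketched as follows. This is a minimal stand-in, not the code in src/flows/flow.rs: `Graph` replaces the real petgraph topology, and the `compute_path` signature and `route_all` helper are assumptions made for illustration. The point is only that every flow routes against one shared `&Graph` rather than its own clone.

```rust
use std::collections::VecDeque;

/// Stand-in for the topology graph (the real code uses a petgraph graph).
pub struct Graph {
    adj: Vec<Vec<usize>>,
}

impl Graph {
    pub fn new(nodes: usize) -> Self {
        Self { adj: vec![Vec::new(); nodes] }
    }

    /// Adds an undirected link between two nodes.
    pub fn add_link(&mut self, a: usize, b: usize) {
        self.adj[a].push(b);
        self.adj[b].push(a);
    }
}

/// Hypothetical signature: borrows the graph, so callers never clone it.
/// Uses breadth-first search, which is shortest-path on unweighted links.
pub fn compute_path(graph: &Graph, src: usize, dst: usize) -> Option<Vec<usize>> {
    let mut prev = vec![usize::MAX; graph.adj.len()];
    let mut seen = vec![false; graph.adj.len()];
    let mut queue = VecDeque::from([src]);
    seen[src] = true;
    while let Some(node) = queue.pop_front() {
        if node == dst {
            // Walk predecessor links back to the source.
            let mut path = vec![dst];
            let mut cur = dst;
            while cur != src {
                cur = prev[cur];
                path.push(cur);
            }
            path.reverse();
            return Some(path);
        }
        for &next in &graph.adj[node] {
            if !seen[next] {
                seen[next] = true;
                prev[next] = node;
                queue.push_back(next);
            }
        }
    }
    None
}

/// Routes every flow against one shared, borrowed graph.
pub fn route_all(graph: &Graph, flows: &[(usize, usize)]) -> Vec<Option<Vec<usize>>> {
    flows.iter().map(|&(s, d)| compute_path(graph, s, d)).collect()
}
```

A caller that previously handed over an owned graph only needs to pass `&graph` instead, which is the shape of the test fix described above.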

Benchmark results

Original optimization target

Command:

target/release/days configs/exp_tcp_fattree.toml

Primary metric: total command wall-clock time (wall_s, lower is better)

Progression of kept wins on this branch:

| Commit  | Change                            | Result       |
| ------- | --------------------------------- | ------------ |
| d212294 | warm baseline                     | wall_s=10.89 |
| 9256440 | avoid per-flow route graph clones | wall_s=10.04 |
| b60d7c0 | extend hot-worker linger          | wall_s=9.87  |
| 38e94c2 | fat-tree routing fast path        | wall_s=8.73  |
| 04802e5 | 5us worker search-before-park     | wall_s=8.11  |
| 63d432c | hot-worker-only search window     | wall_s=8.05  |
| 2c04bbf | shorten hot linger to 200us       | wall_s=8.01  |
| b109311 | 1us cold-worker search            | wall_s=8.00  |

Net improvement versus the warm baseline:

  • 10.89s → 8.00s
  • about 26.5% lower wall-clock time
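
For reference, the net figure and the per-commit contributions can be reproduced from the progression above. The helper names below (`relative_reduction`, `largest_step`) are illustrative and not part of the repository; the numbers are the `wall_s` values listed for each kept commit.

```rust
/// `wall_s` after each kept commit, in the order listed above.
pub const KEPT_WINS: [(&str, f64); 8] = [
    ("d212294", 10.89),
    ("9256440", 10.04),
    ("b60d7c0", 9.87),
    ("38e94c2", 8.73),
    ("04802e5", 8.11),
    ("63d432c", 8.05),
    ("2c04bbf", 8.01),
    ("b109311", 8.00),
];

/// Fractional reduction going from `before` to `after` (both in seconds).
pub fn relative_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before
}

/// Commit with the largest drop relative to the preceding row, and the seconds saved.
pub fn largest_step(steps: &[(&'static str, f64)]) -> Option<(&'static str, f64)> {
    steps
        .windows(2)
        .map(|w| (w[1].0, w[0].1 - w[1].1))
        .max_by(|a, b| a.1.total_cmp(&b.1))
}
```

Here `relative_reduction(10.89, 8.00)` is 2.89 / 10.89, roughly 0.265, and the largest single step is the fat-tree routing fast path at 38e94c2 (9.87s to 8.73s, about 1.14s saved).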

Follow-on benchmark target

Command:

cargo run --release --bin days configs/benchmarks/flow/fattree_k32_tcp_f32_mt.toml

Primary metric: simulator-reported elapsed wall time (sim_wall_s, lower is better)

Best logged result on this target:

  • sim_wall_s=9.52 at c6986ff

Important finding:

  • the executor/routing improvements inherited from the original campaign still carried over well,
  • but no additional wins were found on this smaller sim-wall-only benchmark.

The branch records several tried-and-discarded directions, including:

  • nearby executor constant retunes,
  • altered accelerated bundling sizes,
  • a main-thread pre-park spin,
  • a successful-linger no-park experiment that broke pool-state invariants,
  • incremental rebundling changes in Simulation,
  • FIFO port run-batching,
  • a time-quantization fast path,
  • TCP timeout-heap bookkeeping changes.

Why this is worth merging

The core runtime wins are already isolated in the kept commits and materially reduce runtime on the original fat-tree benchmark without changing the workload itself.

The follow-on autoresearch work is also useful to keep in-tree because it:

  • switches the harness to a cleaner sim-only metric for the smaller benchmark,
  • preserves a written record of what was tested and rejected,
  • narrows future work toward structural executor changes and TCP/port hot-path investigation,
  • avoids re-running the same losing local retunes.

Risks / caveats

  • The fat-tree routing fast path is performance-sensitive and should continue to be treated as a specialization of shortest-path behavior, not a semantics change.
  • The MT executor changes are deliberately tuned to measured workloads; they improved the original benchmark and held up as the best tested configuration on the follow-on benchmark, but more validation on other workloads is still worthwhile.
  • The autoresearch docs/logs are intentionally verbose because they capture the experiment trail.
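
The "specialization, not a semantics change" caveat for the routing fast path can be checked mechanically by comparing a fat-tree-specific answer against a generic search on the same topology. The sketch below is not the implementation in src/flows/route.rs (which, per the summary, falls back to petgraph::algo::astar); it swaps in a closed-form hop count and a plain breadth-first search, and it assumes a standard k-ary fat-tree with hosts numbered pod by pod and edge switch by edge switch. All function names are hypothetical.

```rust
use std::collections::VecDeque;

/// Closed-form host-to-host hop count in a k-ary fat-tree (k even).
/// Assumed layout: k/2 hosts per edge switch, k/2 edge switches per pod.
pub fn fat_tree_hops(k: usize, src: usize, dst: usize) -> usize {
    let half = k / 2;
    let (src_edge, dst_edge) = (src / half, dst / half); // global edge-switch index
    let (src_pod, dst_pod) = (src_edge / half, dst_edge / half);
    if src == dst {
        0
    } else if src_edge == dst_edge {
        2 // host, edge, host
    } else if src_pod == dst_pod {
        4 // up to one aggregation switch and back down
    } else {
        6 // up to a core switch and back down
    }
}

/// Builds the same fat-tree explicitly as an adjacency list.
pub fn build_fat_tree(k: usize) -> Vec<Vec<usize>> {
    let half = k / 2;
    let hosts = k * half * half;
    let edge_base = hosts;
    let agg_base = edge_base + k * half;
    let core_base = agg_base + k * half;
    let mut adj = vec![Vec::new(); core_base + half * half];
    let mut link = |a: usize, b: usize| {
        adj[a].push(b);
        adj[b].push(a);
    };
    for h in 0..hosts {
        link(h, edge_base + h / half);
    }
    for pod in 0..k {
        for a in 0..half {
            let agg = agg_base + pod * half + a;
            for e in 0..half {
                link(edge_base + pod * half + e, agg);
            }
            // Aggregation switch `a` of every pod shares the same group of cores.
            for c in 0..half {
                link(agg, core_base + a * half + c);
            }
        }
    }
    adj
}

/// Generic fallback: breadth-first search hop count on any adjacency list.
pub fn bfs_hops(adj: &[Vec<usize>], src: usize, dst: usize) -> Option<usize> {
    let mut dist = vec![usize::MAX; adj.len()];
    dist[src] = 0;
    let mut queue = VecDeque::from([src]);
    while let Some(n) = queue.pop_front() {
        if n == dst {
            return Some(dist[n]);
        }
        for &m in &adj[n] {
            if dist[m] == usize::MAX {
                dist[m] = dist[n] + 1;
                queue.push_back(m);
            }
        }
    }
    None
}

/// Fast path when the topology is known to be a fat-tree, generic search otherwise.
pub fn hops(fat_tree_k: Option<usize>, adj: &[Vec<usize>], src: usize, dst: usize) -> Option<usize> {
    match fat_tree_k {
        Some(k) => Some(fat_tree_hops(k, src, dst)),
        None => bfs_hops(adj, src, dst),
    }
}
```

An equivalence test along these lines, looping over all host pairs for a small k and asserting that both functions agree, is one form the extra validation around the routing specialization could take.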

Validation

Passing

  • cargo test --features test --test path_from_config_endpoints -- --show-output

Benchmarks used during this branch

  • ./autoresearch.sh on configs/exp_tcp_fattree.toml during the original campaign
  • ./autoresearch.sh on configs/benchmarks/flow/fattree_k32_tcp_f32_mt.toml after switching the harness to sim_wall_s
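
The sim_wall_s extraction can be sketched as below. The real harness is the autoresearch.sh shell script; this Rust version is only a stand-in for the parsing step. It assumes the value after `Elapsed wall-clock time:` is a decimal number of seconds, optionally followed by `s`, which may not match the simulator's actual log format. `parse_sim_wall_s` is a hypothetical name.

```rust
/// Extracts seconds from the last line containing the marker.
/// Assumption: the value is a decimal number of seconds, optionally followed by "s";
/// the simulator's real log format may differ.
pub fn parse_sim_wall_s(log: &str) -> Option<f64> {
    const MARKER: &str = "Elapsed wall-clock time:";
    log.lines().rev().find_map(|line| {
        // Skip lines without the marker or with an unparseable value.
        let (_, rest) = line.split_once(MARKER)?;
        let value = rest.trim().trim_end_matches('s').trim();
        value.parse::<f64>().ok()
    })
}
```

Scanning from the end picks up the final report, so earlier setup or routing output does not leak into the metric.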

Known issue during full-suite validation

  • cargo test --features test -- --show-output
  • This run now gets through the borrowed-graph path-from-config test, but still ends with a failure in:
    • tests/ring_allreduce_coverage.rs::mixed_tcp_broadcast_and_ring_collectives_emit_runtime_bytes
  • I have not addressed that ring-allreduce failure in this PR.

Follow-up ideas

  • structural executor handoff changes that preserve pool-state invariants,
  • deeper TCP ACK/timer bookkeeping review,
  • FIFO port scheduling hot-path work,
  • extra validation around the fat-tree routing specialization.

…ne\n\nResult: {"status":"keep","wall_s":36.7}
…cit A* neighbor generation\n\nResult: {"status":"keep","wall_s":8.73}
…ast path\n\nResult: {"status":"keep","wall_s":8.46}
…ree routing fast path\n\nResult: {"status":"keep","wall_s":8.11}
…rking\n\nResult: {"status":"keep","wall_s":8.05}
…er window\n\nResult: {"status":"keep","wall_s":8.01}
…n\nResult: {"status":"keep","sim_wall_s":9.52}
@baochunli baochunli merged commit ebab0a1 into main Mar 15, 2026
1 check passed
@baochunli baochunli deleted the autoresearch/tcp-fattree-runtime-2026-03-15 branch March 15, 2026 04:36