Skip to content

Fix completion-hang deadlock so all clients report BENCHMARK_DONE#699

Open
charles-typ wants to merge 17 commits into
facebookresearch:v2-betafrom
charles-typ:export-D110091576-to-v2-beta
Open

Fix completion-hang deadlock so all clients report BENCHMARK_DONE#699
charles-typ wants to merge 17 commits into
facebookresearch:v2-betafrom
charles-typ:export-D110091576-to-v2-beta

Conversation

@charles-typ

Copy link
Copy Markdown
Contributor

Summary:
The distributed run never finalized cleanly: warmup completed (warmup_done=N) but benchmark_done stalled below num_clients (e.g. 13/16). Because the admin server only broadcasts ALL_DONE once benchmarkCompleteClients_ >= numExpectedClients_, the missing clients block EVERY client forever at waitForNotification(0), so the run is torn down with no clean client-side results (only server-aggregated metrics survive).

Root cause is a google::log_mutex/stdio deadlock, not a coroutine hang. gstack on a stuck client showed 48 of 189 threads blocked forever (futex abstime=0x0) on google::log_mutex inside LogMessage::Flush(); the lock holder was blocked in __libc_write(fd=1) to stdout, its buffer full of Benchmark GET error (sample 1/1000): mc_res_tko repeated. The stdout/stderr pipe to benchpress/automark was backpressured, so write() blocked while holding the stdio FILE lock and glog's log_mutex, wedging every logging thread including the coro worker event-base threads -> mainScope.joinAsync() never returns -> BENCHMARK_DONE is never sent. Under the mc_res_tko storm at multi-M QPS, even 1-in-1000 error sampling is thousands of synchronous prints/sec, which is enough to back the pipe up.

Fix: the four hot-path sampled error prints now fire only on the FIRST error per worker (errors == 1 instead of errors % 1000 == 1) -> at most one line per worker for the whole run, flood-proof, while keeping the error-type diagnostic (still gated by --verbose). The error counts themselves are unchanged and still reported at the end.

This change also folds in the related straggler-drain robustness the completion path needs: per-request co_await now uses folly::makePromiseContract + SemiFuture::within(10s) (global-timekeeper based, abandons the underlying future so the await always completes even if the McRouter callback never fires), and the warmup/measurement worker scopes use CancellableAsyncScope::cancelAndJoinAsync() to cancel rather than wait on requests still outstanding at window-end.

This entire diff stack contributed to the final result.

Differential Revision: D110091576

Yupeng Tang and others added 17 commits June 29, 2026 14:22
Summary:
## Problem

When `additional_fanout=500` is used to simulate production's high connection count
(num_proxies=64 × 501 = 32K connections per client), multiple clients cause a TKO
cascade during warmup:

1. **Connection storm**: All 32K lazy connections are established simultaneously on first
   requests, overwhelming the server's TCP accept queue (default backlog=1024).

2. **Warmup burst**: After connections are established, all 128 threads × 500 max_inflight
   per client fire simultaneously. With 2 clients, this is 128K concurrent requests
   hitting the server at once, causing mcrouter internal queue overflow (mc_res_local_error)
   and server TKO marking. Once TKO is set, all subsequent requests fail immediately.

**Previous 2-client benchmark results (without this fix):**
- Client 0: 97.7% error rate
- Client 1: 48.8% error rate

## Solution

Three changes to prevent TKO:

### 1. Server: Increase TCP listen backlog (65536)
Prevents connection refusals during connection storms from multiple clients.

### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`)
Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period
(default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming
the server.

### 3. Client: Adaptive load control during warmup (TCP congestion control)
Instead of launching all 128 threads at max_inflight=500 simultaneously, uses AIMD
(Additive Increase, Multiplicative Decrease):
- Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads)
- **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128)
- **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500)
- **Backoff**: halves inflight if error rate > 5%
- All workers share a dynamic `currentMaxInflight` atomic variable

New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2)

## Results

**2-client benchmark with fix (adaptive load control):**

| Metric | Client 0 | Client 1 |
|--------|----------|----------|
| Warmup QPS | 428,540 | 428,989 |
| Warmup Errors | **0** | **0** |
| Benchmark QPS | **482,367** | **482,775** |
| GET Errors | **0** | **0** |
| SET Errors | **0** | **0** |
| Hit Ratio | 100% | 100% |
| P50 Latency | 130ms | 130ms |
| P99 Latency | 263ms | 263ms |

Combined: **~965K QPS with 0 errors** across both clients.

Differential Revision: D98351095
Summary:
- Add createSameThreadClient() support to eliminate cross-thread message queue hops
- Workers run directly on McRouter proxy EventBases instead of separate thread pool
- Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark)
- Add experiment config files for various benchmark configurations

Differential Revision: D98968871
Summary:
Add configurable per-request CPU overhead simulation to ucachebench server to
help close the CPU utilization gap between ucachebench (~35% idle) and
production ucache (~9% idle). The simulation includes hash computation,
clock_gettime calls, and memory allocations that mimic production ACL checks,
CacheTable key construction, and serialization overhead.

Changes:
- Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy)
- Wire flag through run.py and jobs_internal.yml
- Add folly::hash and BenchmarkUtil deps
- Add exp_y config (fibers enabled) and exp_z config (fibers + overhead)

Experiment results:
- Exp V (baseline): 35% idle, 6.91M QPS
- Exp Y (fibers only): 24% idle, 6.91M QPS -- fibers reduce idle by 11pp
- Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS -- simulation ineffective

Differential Revision: D99338676
Summary:
Implement production-like per-request overhead features to close the CPU
utilization gap between ucachebench (~46% idle) and production ucache (~9% idle).

Features added:
- Compound key construction (McStoredKey-style: "uc:pool:key:v1")
- MurmurHash2 key hashing (matching production getHashForKey)
- ACL prefix checks with F14FastMap lookup
- Overload protection with inflight request counting
- Stats tracking (12+ atomic increments per request)
- Ticket staleness checks
- Egress hash computation
- Response timestamps via clock_gettime

Also adds --production-features flag to run.py, jobs_internal.yml, and
server main.cpp to enable these features via automark config.

Differential Revision: D99338673
Summary:
Adds three new production-like CPU overhead simulations to close the CPU
utilization gap between ucachebench and production ucache:
- CRC32C hardware-accelerated value checksums (integrity verification)
- Thrift compact protocol serialization simulation (varint encoding, field headers)
- IOBuf chain construction and coalescing (header + value chaining)

Also adds benchmark config files for various experiment configurations.

Differential Revision: D99338677
Summary:
Adds thread-local HotHashDetector matching production TLHotKeyTracker.
Production maintains two detectors per IO thread (QPS + egress hotness),
calling bumpHash() on every request and response. Each bumpHash() does
L1 counter increment, conditional L2 probe, and periodic maintenance
(counter decay, threshold adjustment). This adds ~2-3% CPU overhead
matching production ucache.

Differential Revision: D99338674
Summary:
Three additional production-like CPU overhead simulations:
- Egress rate limiting: thread-local F14 map simulating
  NetworkOverloadProtector's ConcurrentLRUHashMap + sliding window
  + token bucket per-response checks
- KCB double-lookup: second CacheLib find for ~25% of requests,
  matching production Key Client Binding version mismatch path
- Per-thread CPU load measurement: CLOCK_THREAD_CPUTIME_ID reads
  matching production shouldLoadShed() per-request checks

Differential Revision: D99338675
Summary:
- Add cpu_work_us parameter: configurable microseconds of CPU busy-work per
  request using hash computation loops. Allows tuning server CPU utilization
  to match production levels.
- Enable bucket_lock_power=20 in config (production default), adding
  CacheTable-style fiber mutex contention per request.
- Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON.

Differential Revision: D99338680
Differential Revision: D99338672
Summary:
- bind_source.c: LD_PRELOAD library that round-robins connect() across
  multiple source IPv6 addresses, allowing a single client process to
  exceed the 64K ephemeral port limit per source IP
- automark integration: auto-discovers host IPv6, adds secondary IPs,
  sets LD_PRELOAD+BIND_ADDRESSES, cleans up on exit via trap
- New config param num_source_ips controls how many secondary IPs per client
- conntest config for 1-client testing with num_source_ips=2

Differential Revision: D99406079
Summary:
Simplify ucachebench client configuration by replacing the interacting knobs (num_proxies, additional_fanout, max_inflight) with two high-level controls:

1. `--num_connections=N`: Specify the target TCP connection count directly. Internally derives num_proxies and additional_fanout, enforcing the mcrouter 32768 ProxyDestination limit. Example: `--num_connections=16000` instead of `--num_proxies=32 --additional_fanout=500`.

2. `--auto_concurrency`: Automatically discovers optimal concurrency during the benchmark using AIMD (Additive Increase, Multiplicative Decrease). Splits the benchmark into a ramp phase (searches for peak QPS) and a steady phase (holds optimal concurrency). Replaces manual max_inflight tuning. Supports `--target_utilization=0.9` to back off from peak.

Also:
- Default `use_same_thread_client` to true (matches production, +4% QPS in testing)
- Add flags to run.py argument parser and CLIENT_ONLY_PARAMS
- Regenerate carbon protocol files

Differential Revision: D106825050
Summary:
Add synchronous request handler methods (processUcbGetSync, processUcbSetSync, processUcbDeleteSync) that return replies directly without wrapping in SemiFuture. The Thrift request handler now calls these sync methods inline, matching production ucache's synchronous request processing pattern.

Previously, every request went through: handler -> processUcbGet() -> SemiFuture -> .get() -> callback. The SemiFuture machinery added heap allocations and extra event loop iterations per request. At 3M+ QPS, this overhead was significant.

Now the path is: handler -> processUcbGetSync() -> callback->result() — all inline on the IO thread, zero future overhead.

The async SemiFuture methods are preserved for backward compatibility but now delegate to the sync versions.

Differential Revision: D106825048
…representative %sys

Summary:
Three features to close the kernel overhead gap between benchmark and production ucache:

1. warmup_max_inflight (client): Use high concurrency during warmup for fast cache population, then low inflight during measurement for realistic packet patterns. When max_inflight <= 5, run.py auto-sets warmup_max_inflight=50 and disables adaptive load control.

2. CPU thread dispatch (server): When rpc_num_cpu_worker_threads > 1, requests dispatch from IO threads to a CPUThreadPoolExecutor, matching production's tao:slow 20K-thread pool pattern. Generates real kernel context switches (IO->CPU->IO).

3. IO latency simulation (server): --io_latency_us / IO_LATENCY_US env var yields the fiber via folly::fibers::Baton::try_wait_for(), simulating downstream I/O waits.

Also fixes admin_port not being forwarded to the server in the debug benchmark.

Tuning results (7 client hosts, 500K connections, fibers enabled):
- %sys: 6.7% (up from 1.0% baseline), production: 14%
- %soft: 4.8%, production: 12%
- Hit rate: 94%, QPS: 2.6M
- Analysis: %sys scales linearly with QPS. At 6M QPS (16+ client hosts), projected %sys matches production.

Differential Revision: D106825049
Summary:
Add a connection activation scan at warmup start that sends concurrent requests through each McRouter proxy to force lazy TCP connection creation for all additional_fanout destinations. Without this, only ~23% of destinations get connections (limited by furc_hash coverage from sequential requests).

Also set reset_inactive_connection_interval=0 to prevent McRouter from pruning idle connections during measurement.

Key changes:
- Create one scan client per proxy (40 total) to ensure full proxy coverage
- Each scan sends 625 concurrent coroutines x 100 sequential requests
- furc_hash(randomKey) % 625 covers all destination buckets
- Write scan diagnostics to /tmp/ucache_scan_debug.txt for remote debugging

Differential Revision: D106825054
Summary:
Add benchmark configs from the May 2026 %sys tuning investigation. These configs explore different combinations of max_inflight, connection count, client scaling, and IP aliasing to match production ucache CPU profiles.

Key configs:
- *_low_inflight.json: max_inflight=1-5 for high %sys per QPS
- *_500k_*.json: 500K connection targets with various client counts
- *_2h_*.json: 2-host configs testing IP aliasing and instance scaling
- *_max_qps.json: Maximum QPS config for server saturation testing
- *_caret.json: Caret transport experiment
- *_simplified.json: Simplified single-knob connection control

These are test/debug configs — the recommended production-representative config is ucache_bench_2h_400k_highqps.json.

Differential Revision: D106825045
Summary:
The distributed run never finalized cleanly: warmup completed (`warmup_done=N`) but `benchmark_done` stalled below `num_clients` (e.g. 13/16). Because the admin server only broadcasts `ALL_DONE` once `benchmarkCompleteClients_ >= numExpectedClients_`, the missing clients block EVERY client forever at `waitForNotification(0)`, so the run is torn down with no clean client-side results (only server-aggregated metrics survive).

Root cause is a `google::log_mutex`/stdio deadlock, not a coroutine hang. `gstack` on a stuck client showed 48 of 189 threads blocked forever (`futex abstime=0x0`) on `google::log_mutex` inside `LogMessage::Flush()`; the lock holder was blocked in `__libc_write(fd=1)` to stdout, its buffer full of `Benchmark GET error (sample 1/1000): mc_res_tko` repeated. The stdout/stderr pipe to benchpress/automark was backpressured, so `write()` blocked while holding the stdio FILE lock and glog's `log_mutex`, wedging every logging thread including the coro worker event-base threads -> `mainScope.joinAsync()` never returns -> `BENCHMARK_DONE` is never sent. Under the `mc_res_tko` storm at multi-M QPS, even 1-in-1000 error sampling is thousands of synchronous prints/sec, which is enough to back the pipe up.

Fix: the four hot-path sampled error prints now fire only on the FIRST error per worker (`errors == 1` instead of `errors % 1000 == 1`) -> at most one line per worker for the whole run, flood-proof, while keeping the error-type diagnostic (still gated by `--verbose`). The error counts themselves are unchanged and still reported at the end.

This change also folds in the related straggler-drain robustness the completion path needs: per-request `co_await` now uses `folly::makePromiseContract` + `SemiFuture::within(10s)` (global-timekeeper based, abandons the underlying future so the await always completes even if the McRouter callback never fires), and the warmup/measurement worker scopes use `CancellableAsyncScope::cancelAndJoinAsync()` to cancel rather than wait on requests still outstanding at window-end.

This entire diff stack contributed to the final result.

Differential Revision: D110091576
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 29, 2026
@meta-codesync

meta-codesync Bot commented Jun 29, 2026

Copy link
Copy Markdown

@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D110091576.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant