Fix completion-hang deadlock so all clients report BENCHMARK_DONE by charles-typ · Pull Request #699 · facebookresearch/DCPerf

charles-typ · 2026-06-29T22:19:38Z

Summary:
The distributed run never finalized cleanly: warmup completed (warmup_done=N) but benchmark_done stalled below num_clients (e.g. 13/16). Because the admin server only broadcasts ALL_DONE once benchmarkCompleteClients_ >= numExpectedClients_, the missing clients block EVERY client forever at waitForNotification(0), so the run is torn down with no clean client-side results (only server-aggregated metrics survive).

Root cause is a google::log_mutex/stdio deadlock, not a coroutine hang. gstack on a stuck client showed 48 of 189 threads blocked forever (futex abstime=0x0) on google::log_mutex inside LogMessage::Flush(); the lock holder was blocked in __libc_write(fd=1) to stdout, its buffer full of Benchmark GET error (sample 1/1000): mc_res_tko repeated. The stdout/stderr pipe to benchpress/automark was backpressured, so write() blocked while holding the stdio FILE lock and glog's log_mutex, wedging every logging thread including the coro worker event-base threads -> mainScope.joinAsync() never returns -> BENCHMARK_DONE is never sent. Under the mc_res_tko storm at multi-M QPS, even 1-in-1000 error sampling is thousands of synchronous prints/sec, which is enough to back the pipe up.

Fix: the four hot-path sampled error prints now fire only on the FIRST error per worker (errors == 1 instead of errors % 1000 == 1) -> at most one line per worker for the whole run, flood-proof, while keeping the error-type diagnostic (still gated by --verbose). The error counts themselves are unchanged and still reported at the end.

This change also folds in the related straggler-drain robustness the completion path needs: per-request co_await now uses folly::makePromiseContract + SemiFuture::within(10s) (global-timekeeper based, abandons the underlying future so the await always completes even if the McRouter callback never fires), and the warmup/measurement worker scopes use CancellableAsyncScope::cancelAndJoinAsync() to cancel rather than wait on requests still outstanding at window-end.

This entire diff stack contributed to the final result.

Differential Revision: D110091576

Summary: ## Problem When `additional_fanout=500` is used to simulate production's high connection count (num_proxies=64 × 501 = 32K connections per client), multiple clients cause a TKO cascade during warmup: 1. **Connection storm**: All 32K lazy connections are established simultaneously on first requests, overwhelming the server's TCP accept queue (default backlog=1024). 2. **Warmup burst**: After connections are established, all 128 threads × 500 max_inflight per client fire simultaneously. With 2 clients, this is 128K concurrent requests hitting the server at once, causing mcrouter internal queue overflow (mc_res_local_error) and server TKO marking. Once TKO is set, all subsequent requests fail immediately. **Previous 2-client benchmark results (without this fix):** - Client 0: 97.7% error rate - Client 1: 48.8% error rate ## Solution Three changes to prevent TKO: ### 1. Server: Increase TCP listen backlog (65536) Prevents connection refusals during connection storms from multiple clients. ### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`) Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period (default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming the server. ### 3. Client: Adaptive load control during warmup (TCP congestion control) Instead of launching all 128 threads at max_inflight=500 simultaneously, uses AIMD (Additive Increase, Multiplicative Decrease): - Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads) - **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128) - **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500) - **Backoff**: halves inflight if error rate > 5% - All workers share a dynamic `currentMaxInflight` atomic variable New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2) ## Results **2-client benchmark with fix (adaptive load control):** | Metric | Client 0 | Client 1 | |--------|----------|----------| | Warmup QPS | 428,540 | 428,989 | | Warmup Errors | **0** | **0** | | Benchmark QPS | **482,367** | **482,775** | | GET Errors | **0** | **0** | | SET Errors | **0** | **0** | | Hit Ratio | 100% | 100% | | P50 Latency | 130ms | 130ms | | P99 Latency | 263ms | 263ms | Combined: **~965K QPS with 0 errors** across both clients. Differential Revision: D98351095

Summary: - Add createSameThreadClient() support to eliminate cross-thread message queue hops - Workers run directly on McRouter proxy EventBases instead of separate thread pool - Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark) - Add experiment config files for various benchmark configurations Differential Revision: D98968871

Summary: Add configurable per-request CPU overhead simulation to ucachebench server to help close the CPU utilization gap between ucachebench (~35% idle) and production ucache (~9% idle). The simulation includes hash computation, clock_gettime calls, and memory allocations that mimic production ACL checks, CacheTable key construction, and serialization overhead. Changes: - Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy) - Wire flag through run.py and jobs_internal.yml - Add folly::hash and BenchmarkUtil deps - Add exp_y config (fibers enabled) and exp_z config (fibers + overhead) Experiment results: - Exp V (baseline): 35% idle, 6.91M QPS - Exp Y (fibers only): 24% idle, 6.91M QPS -- fibers reduce idle by 11pp - Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS -- simulation ineffective Differential Revision: D99338676

Summary: Implement production-like per-request overhead features to close the CPU utilization gap between ucachebench (~46% idle) and production ucache (~9% idle). Features added: - Compound key construction (McStoredKey-style: "uc:pool:key:v1") - MurmurHash2 key hashing (matching production getHashForKey) - ACL prefix checks with F14FastMap lookup - Overload protection with inflight request counting - Stats tracking (12+ atomic increments per request) - Ticket staleness checks - Egress hash computation - Response timestamps via clock_gettime Also adds --production-features flag to run.py, jobs_internal.yml, and server main.cpp to enable these features via automark config. Differential Revision: D99338673

Summary: Adds three new production-like CPU overhead simulations to close the CPU utilization gap between ucachebench and production ucache: - CRC32C hardware-accelerated value checksums (integrity verification) - Thrift compact protocol serialization simulation (varint encoding, field headers) - IOBuf chain construction and coalescing (header + value chaining) Also adds benchmark config files for various experiment configurations. Differential Revision: D99338677

Summary: Adds thread-local HotHashDetector matching production TLHotKeyTracker. Production maintains two detectors per IO thread (QPS + egress hotness), calling bumpHash() on every request and response. Each bumpHash() does L1 counter increment, conditional L2 probe, and periodic maintenance (counter decay, threshold adjustment). This adds ~2-3% CPU overhead matching production ucache. Differential Revision: D99338674

Summary: Three additional production-like CPU overhead simulations: - Egress rate limiting: thread-local F14 map simulating NetworkOverloadProtector's ConcurrentLRUHashMap + sliding window + token bucket per-response checks - KCB double-lookup: second CacheLib find for ~25% of requests, matching production Key Client Binding version mismatch path - Per-thread CPU load measurement: CLOCK_THREAD_CPUTIME_ID reads matching production shouldLoadShed() per-request checks Differential Revision: D99338675

Summary: - Add cpu_work_us parameter: configurable microseconds of CPU busy-work per request using hash computation loops. Allows tuning server CPU utilization to match production levels. - Enable bucket_lock_power=20 in config (production default), adding CacheTable-style fiber mutex contention per request. - Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON. Differential Revision: D99338680

Differential Revision: D99338672

Differential Revision: D99338678

Summary: - bind_source.c: LD_PRELOAD library that round-robins connect() across multiple source IPv6 addresses, allowing a single client process to exceed the 64K ephemeral port limit per source IP - automark integration: auto-discovers host IPv6, adds secondary IPs, sets LD_PRELOAD+BIND_ADDRESSES, cleans up on exit via trap - New config param num_source_ips controls how many secondary IPs per client - conntest config for 1-client testing with num_source_ips=2 Differential Revision: D99406079

Summary: Simplify ucachebench client configuration by replacing the interacting knobs (num_proxies, additional_fanout, max_inflight) with two high-level controls: 1. `--num_connections=N`: Specify the target TCP connection count directly. Internally derives num_proxies and additional_fanout, enforcing the mcrouter 32768 ProxyDestination limit. Example: `--num_connections=16000` instead of `--num_proxies=32 --additional_fanout=500`. 2. `--auto_concurrency`: Automatically discovers optimal concurrency during the benchmark using AIMD (Additive Increase, Multiplicative Decrease). Splits the benchmark into a ramp phase (searches for peak QPS) and a steady phase (holds optimal concurrency). Replaces manual max_inflight tuning. Supports `--target_utilization=0.9` to back off from peak. Also: - Default `use_same_thread_client` to true (matches production, +4% QPS in testing) - Add flags to run.py argument parser and CLIENT_ONLY_PARAMS - Regenerate carbon protocol files Differential Revision: D106825050

Summary: Add synchronous request handler methods (processUcbGetSync, processUcbSetSync, processUcbDeleteSync) that return replies directly without wrapping in SemiFuture. The Thrift request handler now calls these sync methods inline, matching production ucache's synchronous request processing pattern. Previously, every request went through: handler -> processUcbGet() -> SemiFuture -> .get() -> callback. The SemiFuture machinery added heap allocations and extra event loop iterations per request. At 3M+ QPS, this overhead was significant. Now the path is: handler -> processUcbGetSync() -> callback->result() — all inline on the IO thread, zero future overhead. The async SemiFuture methods are preserved for backward compatibility but now delegate to the sync versions. Differential Revision: D106825048

…representative %sys Summary: Three features to close the kernel overhead gap between benchmark and production ucache: 1. warmup_max_inflight (client): Use high concurrency during warmup for fast cache population, then low inflight during measurement for realistic packet patterns. When max_inflight <= 5, run.py auto-sets warmup_max_inflight=50 and disables adaptive load control. 2. CPU thread dispatch (server): When rpc_num_cpu_worker_threads > 1, requests dispatch from IO threads to a CPUThreadPoolExecutor, matching production's tao:slow 20K-thread pool pattern. Generates real kernel context switches (IO->CPU->IO). 3. IO latency simulation (server): --io_latency_us / IO_LATENCY_US env var yields the fiber via folly::fibers::Baton::try_wait_for(), simulating downstream I/O waits. Also fixes admin_port not being forwarded to the server in the debug benchmark. Tuning results (7 client hosts, 500K connections, fibers enabled): - %sys: 6.7% (up from 1.0% baseline), production: 14% - %soft: 4.8%, production: 12% - Hit rate: 94%, QPS: 2.6M - Analysis: %sys scales linearly with QPS. At 6M QPS (16+ client hosts), projected %sys matches production. Differential Revision: D106825049

Summary: Add a connection activation scan at warmup start that sends concurrent requests through each McRouter proxy to force lazy TCP connection creation for all additional_fanout destinations. Without this, only ~23% of destinations get connections (limited by furc_hash coverage from sequential requests). Also set reset_inactive_connection_interval=0 to prevent McRouter from pruning idle connections during measurement. Key changes: - Create one scan client per proxy (40 total) to ensure full proxy coverage - Each scan sends 625 concurrent coroutines x 100 sequential requests - furc_hash(randomKey) % 625 covers all destination buckets - Write scan diagnostics to /tmp/ucache_scan_debug.txt for remote debugging Differential Revision: D106825054

Summary: Add benchmark configs from the May 2026 %sys tuning investigation. These configs explore different combinations of max_inflight, connection count, client scaling, and IP aliasing to match production ucache CPU profiles. Key configs: - *_low_inflight.json: max_inflight=1-5 for high %sys per QPS - *_500k_*.json: 500K connection targets with various client counts - *_2h_*.json: 2-host configs testing IP aliasing and instance scaling - *_max_qps.json: Maximum QPS config for server saturation testing - *_caret.json: Caret transport experiment - *_simplified.json: Simplified single-knob connection control These are test/debug configs — the recommended production-representative config is ucache_bench_2h_400k_highqps.json. Differential Revision: D106825045

Summary: The distributed run never finalized cleanly: warmup completed (`warmup_done=N`) but `benchmark_done` stalled below `num_clients` (e.g. 13/16). Because the admin server only broadcasts `ALL_DONE` once `benchmarkCompleteClients_ >= numExpectedClients_`, the missing clients block EVERY client forever at `waitForNotification(0)`, so the run is torn down with no clean client-side results (only server-aggregated metrics survive). Root cause is a `google::log_mutex`/stdio deadlock, not a coroutine hang. `gstack` on a stuck client showed 48 of 189 threads blocked forever (`futex abstime=0x0`) on `google::log_mutex` inside `LogMessage::Flush()`; the lock holder was blocked in `__libc_write(fd=1)` to stdout, its buffer full of `Benchmark GET error (sample 1/1000): mc_res_tko` repeated. The stdout/stderr pipe to benchpress/automark was backpressured, so `write()` blocked while holding the stdio FILE lock and glog's `log_mutex`, wedging every logging thread including the coro worker event-base threads -> `mainScope.joinAsync()` never returns -> `BENCHMARK_DONE` is never sent. Under the `mc_res_tko` storm at multi-M QPS, even 1-in-1000 error sampling is thousands of synchronous prints/sec, which is enough to back the pipe up. Fix: the four hot-path sampled error prints now fire only on the FIRST error per worker (`errors == 1` instead of `errors % 1000 == 1`) -> at most one line per worker for the whole run, flood-proof, while keeping the error-type diagnostic (still gated by `--verbose`). The error counts themselves are unchanged and still reported at the end. This change also folds in the related straggler-drain robustness the completion path needs: per-request `co_await` now uses `folly::makePromiseContract` + `SemiFuture::within(10s)` (global-timekeeper based, abandons the underlying future so the await always completes even if the McRouter callback never fires), and the warmup/measurement worker scopes use `CancellableAsyncScope::cancelAndJoinAsync()` to cancel rather than wait on requests still outstanding at window-end. This entire diff stack contributed to the final result. Differential Revision: D110091576

meta-codesync · 2026-06-29T22:20:10Z

@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D110091576.

Yupeng Tang and others added 17 commits June 29, 2026 14:22

Add T2 traffic distribution config

2e25256

Differential Revision: D99338672

Add T2 traffic distribution and server improvements

596fe40

Differential Revision: D99338678

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 29, 2026

meta-codesync Bot added the meta-exported label Jun 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix completion-hang deadlock so all clients report BENCHMARK_DONE#699

Fix completion-hang deadlock so all clients report BENCHMARK_DONE#699
charles-typ wants to merge 17 commits into
facebookresearch:v2-betafrom
charles-typ:export-D110091576-to-v2-beta

charles-typ commented Jun 29, 2026

Uh oh!

meta-codesync Bot commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

charles-typ commented Jun 29, 2026

Uh oh!

meta-codesync Bot commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant