Fix completion-hang deadlock so all clients report BENCHMARK_DONE#699
Open
charles-typ wants to merge 17 commits into
Open
Fix completion-hang deadlock so all clients report BENCHMARK_DONE#699charles-typ wants to merge 17 commits into
charles-typ wants to merge 17 commits into
Conversation
Summary: ## Problem When `additional_fanout=500` is used to simulate production's high connection count (num_proxies=64 × 501 = 32K connections per client), multiple clients cause a TKO cascade during warmup: 1. **Connection storm**: All 32K lazy connections are established simultaneously on first requests, overwhelming the server's TCP accept queue (default backlog=1024). 2. **Warmup burst**: After connections are established, all 128 threads × 500 max_inflight per client fire simultaneously. With 2 clients, this is 128K concurrent requests hitting the server at once, causing mcrouter internal queue overflow (mc_res_local_error) and server TKO marking. Once TKO is set, all subsequent requests fail immediately. **Previous 2-client benchmark results (without this fix):** - Client 0: 97.7% error rate - Client 1: 48.8% error rate ## Solution Three changes to prevent TKO: ### 1. Server: Increase TCP listen backlog (65536) Prevents connection refusals during connection storms from multiple clients. ### 2. Client: Connection ramp-up phase (new flag: `--connection_ramp_seconds`) Before warmup, sends paced requests with `maxOutstanding=1` over a configurable period (default 10s) to gradually establish mcrouter's lazy TCP connections without overwhelming the server. ### 3. Client: Adaptive load control during warmup (TCP congestion control) Instead of launching all 128 threads at max_inflight=500 simultaneously, uses AIMD (Additive Increase, Multiplicative Decrease): - Starts at `--warmup_initial_inflight=2` per thread (256 total with 128 threads) - **Slow start**: doubles inflight every 2s while error rate < 1% (2→4→8→16→32→64→128) - **Congestion avoidance**: linear increase (+50/step) once past 25% of max (128→178→228→...→500) - **Backoff**: halves inflight if error rate > 5% - All workers share a dynamic `currentMaxInflight` atomic variable New flags: `--warmup_adaptive_load` (default true), `--warmup_initial_inflight` (default 2) ## Results **2-client benchmark with fix (adaptive load control):** | Metric | Client 0 | Client 1 | |--------|----------|----------| | Warmup QPS | 428,540 | 428,989 | | Warmup Errors | **0** | **0** | | Benchmark QPS | **482,367** | **482,775** | | GET Errors | **0** | **0** | | SET Errors | **0** | **0** | | Hit Ratio | 100% | 100% | | P50 Latency | 130ms | 130ms | | P99 Latency | 263ms | 263ms | Combined: **~965K QPS with 0 errors** across both clients. Differential Revision: D98351095
Summary: - Add createSameThreadClient() support to eliminate cross-thread message queue hops - Workers run directly on McRouter proxy EventBases instead of separate thread pool - Add use_same_thread_client flag/config through full stack (client, run.py, jobs YAML, automark) - Add experiment config files for various benchmark configurations Differential Revision: D98968871
Summary: Add configurable per-request CPU overhead simulation to ucachebench server to help close the CPU utilization gap between ucachebench (~35% idle) and production ucache (~9% idle). The simulation includes hash computation, clock_gettime calls, and memory allocations that mimic production ACL checks, CacheTable key construction, and serialization overhead. Changes: - Add cpu_overhead_level flag (0=disabled, 1=light, 2=medium, 3=heavy) - Wire flag through run.py and jobs_internal.yml - Add folly::hash and BenchmarkUtil deps - Add exp_y config (fibers enabled) and exp_z config (fibers + overhead) Experiment results: - Exp V (baseline): 35% idle, 6.91M QPS - Exp Y (fibers only): 24% idle, 6.91M QPS -- fibers reduce idle by 11pp - Exp Z (fibers + overhead=3): 37% idle, 6.94M QPS -- simulation ineffective Differential Revision: D99338676
Summary: Implement production-like per-request overhead features to close the CPU utilization gap between ucachebench (~46% idle) and production ucache (~9% idle). Features added: - Compound key construction (McStoredKey-style: "uc:pool:key:v1") - MurmurHash2 key hashing (matching production getHashForKey) - ACL prefix checks with F14FastMap lookup - Overload protection with inflight request counting - Stats tracking (12+ atomic increments per request) - Ticket staleness checks - Egress hash computation - Response timestamps via clock_gettime Also adds --production-features flag to run.py, jobs_internal.yml, and server main.cpp to enable these features via automark config. Differential Revision: D99338673
Summary: Adds three new production-like CPU overhead simulations to close the CPU utilization gap between ucachebench and production ucache: - CRC32C hardware-accelerated value checksums (integrity verification) - Thrift compact protocol serialization simulation (varint encoding, field headers) - IOBuf chain construction and coalescing (header + value chaining) Also adds benchmark config files for various experiment configurations. Differential Revision: D99338677
Summary: Adds thread-local HotHashDetector matching production TLHotKeyTracker. Production maintains two detectors per IO thread (QPS + egress hotness), calling bumpHash() on every request and response. Each bumpHash() does L1 counter increment, conditional L2 probe, and periodic maintenance (counter decay, threshold adjustment). This adds ~2-3% CPU overhead matching production ucache. Differential Revision: D99338674
Summary: Three additional production-like CPU overhead simulations: - Egress rate limiting: thread-local F14 map simulating NetworkOverloadProtector's ConcurrentLRUHashMap + sliding window + token bucket per-response checks - KCB double-lookup: second CacheLib find for ~25% of requests, matching production Key Client Binding version mismatch path - Per-thread CPU load measurement: CLOCK_THREAD_CPUTIME_ID reads matching production shouldLoadShed() per-request checks Differential Revision: D99338675
Summary: - Add cpu_work_us parameter: configurable microseconds of CPU busy-work per request using hash computation loops. Allows tuning server CPU utilization to match production levels. - Enable bucket_lock_power=20 in config (production default), adding CacheTable-style fiber mutex contention per request. - Wire cpu_work_us through main.cpp gflags, run.py argparse, and config JSON. Differential Revision: D99338680
Differential Revision: D99338672
Differential Revision: D99338678
Summary: - bind_source.c: LD_PRELOAD library that round-robins connect() across multiple source IPv6 addresses, allowing a single client process to exceed the 64K ephemeral port limit per source IP - automark integration: auto-discovers host IPv6, adds secondary IPs, sets LD_PRELOAD+BIND_ADDRESSES, cleans up on exit via trap - New config param num_source_ips controls how many secondary IPs per client - conntest config for 1-client testing with num_source_ips=2 Differential Revision: D99406079
Summary: Simplify ucachebench client configuration by replacing the interacting knobs (num_proxies, additional_fanout, max_inflight) with two high-level controls: 1. `--num_connections=N`: Specify the target TCP connection count directly. Internally derives num_proxies and additional_fanout, enforcing the mcrouter 32768 ProxyDestination limit. Example: `--num_connections=16000` instead of `--num_proxies=32 --additional_fanout=500`. 2. `--auto_concurrency`: Automatically discovers optimal concurrency during the benchmark using AIMD (Additive Increase, Multiplicative Decrease). Splits the benchmark into a ramp phase (searches for peak QPS) and a steady phase (holds optimal concurrency). Replaces manual max_inflight tuning. Supports `--target_utilization=0.9` to back off from peak. Also: - Default `use_same_thread_client` to true (matches production, +4% QPS in testing) - Add flags to run.py argument parser and CLIENT_ONLY_PARAMS - Regenerate carbon protocol files Differential Revision: D106825050
Summary: Add synchronous request handler methods (processUcbGetSync, processUcbSetSync, processUcbDeleteSync) that return replies directly without wrapping in SemiFuture. The Thrift request handler now calls these sync methods inline, matching production ucache's synchronous request processing pattern. Previously, every request went through: handler -> processUcbGet() -> SemiFuture -> .get() -> callback. The SemiFuture machinery added heap allocations and extra event loop iterations per request. At 3M+ QPS, this overhead was significant. Now the path is: handler -> processUcbGetSync() -> callback->result() — all inline on the IO thread, zero future overhead. The async SemiFuture methods are preserved for backward compatibility but now delegate to the sync versions. Differential Revision: D106825048
…representative %sys Summary: Three features to close the kernel overhead gap between benchmark and production ucache: 1. warmup_max_inflight (client): Use high concurrency during warmup for fast cache population, then low inflight during measurement for realistic packet patterns. When max_inflight <= 5, run.py auto-sets warmup_max_inflight=50 and disables adaptive load control. 2. CPU thread dispatch (server): When rpc_num_cpu_worker_threads > 1, requests dispatch from IO threads to a CPUThreadPoolExecutor, matching production's tao:slow 20K-thread pool pattern. Generates real kernel context switches (IO->CPU->IO). 3. IO latency simulation (server): --io_latency_us / IO_LATENCY_US env var yields the fiber via folly::fibers::Baton::try_wait_for(), simulating downstream I/O waits. Also fixes admin_port not being forwarded to the server in the debug benchmark. Tuning results (7 client hosts, 500K connections, fibers enabled): - %sys: 6.7% (up from 1.0% baseline), production: 14% - %soft: 4.8%, production: 12% - Hit rate: 94%, QPS: 2.6M - Analysis: %sys scales linearly with QPS. At 6M QPS (16+ client hosts), projected %sys matches production. Differential Revision: D106825049
Summary: Add a connection activation scan at warmup start that sends concurrent requests through each McRouter proxy to force lazy TCP connection creation for all additional_fanout destinations. Without this, only ~23% of destinations get connections (limited by furc_hash coverage from sequential requests). Also set reset_inactive_connection_interval=0 to prevent McRouter from pruning idle connections during measurement. Key changes: - Create one scan client per proxy (40 total) to ensure full proxy coverage - Each scan sends 625 concurrent coroutines x 100 sequential requests - furc_hash(randomKey) % 625 covers all destination buckets - Write scan diagnostics to /tmp/ucache_scan_debug.txt for remote debugging Differential Revision: D106825054
Summary: Add benchmark configs from the May 2026 %sys tuning investigation. These configs explore different combinations of max_inflight, connection count, client scaling, and IP aliasing to match production ucache CPU profiles. Key configs: - *_low_inflight.json: max_inflight=1-5 for high %sys per QPS - *_500k_*.json: 500K connection targets with various client counts - *_2h_*.json: 2-host configs testing IP aliasing and instance scaling - *_max_qps.json: Maximum QPS config for server saturation testing - *_caret.json: Caret transport experiment - *_simplified.json: Simplified single-knob connection control These are test/debug configs — the recommended production-representative config is ucache_bench_2h_400k_highqps.json. Differential Revision: D106825045
Summary: The distributed run never finalized cleanly: warmup completed (`warmup_done=N`) but `benchmark_done` stalled below `num_clients` (e.g. 13/16). Because the admin server only broadcasts `ALL_DONE` once `benchmarkCompleteClients_ >= numExpectedClients_`, the missing clients block EVERY client forever at `waitForNotification(0)`, so the run is torn down with no clean client-side results (only server-aggregated metrics survive). Root cause is a `google::log_mutex`/stdio deadlock, not a coroutine hang. `gstack` on a stuck client showed 48 of 189 threads blocked forever (`futex abstime=0x0`) on `google::log_mutex` inside `LogMessage::Flush()`; the lock holder was blocked in `__libc_write(fd=1)` to stdout, its buffer full of `Benchmark GET error (sample 1/1000): mc_res_tko` repeated. The stdout/stderr pipe to benchpress/automark was backpressured, so `write()` blocked while holding the stdio FILE lock and glog's `log_mutex`, wedging every logging thread including the coro worker event-base threads -> `mainScope.joinAsync()` never returns -> `BENCHMARK_DONE` is never sent. Under the `mc_res_tko` storm at multi-M QPS, even 1-in-1000 error sampling is thousands of synchronous prints/sec, which is enough to back the pipe up. Fix: the four hot-path sampled error prints now fire only on the FIRST error per worker (`errors == 1` instead of `errors % 1000 == 1`) -> at most one line per worker for the whole run, flood-proof, while keeping the error-type diagnostic (still gated by `--verbose`). The error counts themselves are unchanged and still reported at the end. This change also folds in the related straggler-drain robustness the completion path needs: per-request `co_await` now uses `folly::makePromiseContract` + `SemiFuture::within(10s)` (global-timekeeper based, abandons the underlying future so the await always completes even if the McRouter callback never fires), and the warmup/measurement worker scopes use `CancellableAsyncScope::cancelAndJoinAsync()` to cancel rather than wait on requests still outstanding at window-end. This entire diff stack contributed to the final result. Differential Revision: D110091576
|
@charles-typ has exported this pull request. If you are a Meta employee, you can view the originating Diff in D110091576. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
The distributed run never finalized cleanly: warmup completed (
warmup_done=N) butbenchmark_donestalled belownum_clients(e.g. 13/16). Because the admin server only broadcastsALL_DONEoncebenchmarkCompleteClients_ >= numExpectedClients_, the missing clients block EVERY client forever atwaitForNotification(0), so the run is torn down with no clean client-side results (only server-aggregated metrics survive).Root cause is a
google::log_mutex/stdio deadlock, not a coroutine hang.gstackon a stuck client showed 48 of 189 threads blocked forever (futex abstime=0x0) ongoogle::log_mutexinsideLogMessage::Flush(); the lock holder was blocked in__libc_write(fd=1)to stdout, its buffer full ofBenchmark GET error (sample 1/1000): mc_res_tkorepeated. The stdout/stderr pipe to benchpress/automark was backpressured, sowrite()blocked while holding the stdio FILE lock and glog'slog_mutex, wedging every logging thread including the coro worker event-base threads ->mainScope.joinAsync()never returns ->BENCHMARK_DONEis never sent. Under themc_res_tkostorm at multi-M QPS, even 1-in-1000 error sampling is thousands of synchronous prints/sec, which is enough to back the pipe up.Fix: the four hot-path sampled error prints now fire only on the FIRST error per worker (
errors == 1instead oferrors % 1000 == 1) -> at most one line per worker for the whole run, flood-proof, while keeping the error-type diagnostic (still gated by--verbose). The error counts themselves are unchanged and still reported at the end.This change also folds in the related straggler-drain robustness the completion path needs: per-request
co_awaitnow usesfolly::makePromiseContract+SemiFuture::within(10s)(global-timekeeper based, abandons the underlying future so the await always completes even if the McRouter callback never fires), and the warmup/measurement worker scopes useCancellableAsyncScope::cancelAndJoinAsync()to cancel rather than wait on requests still outstanding at window-end.This entire diff stack contributed to the final result.
Differential Revision: D110091576