[release-7.3] Fix stale peers for client facing roles (storage server, commit proxy, grv proxy)#13367
[release-7.3] Fix stale peers for client facing roles (storage server, commit proxy, grv proxy)#13367spraza wants to merge 20 commits into
Conversation
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang-73 on Linux CentOS 7
|
Result of foundationdb-pr-73 on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests-73 on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang-73 on Linux CentOS 7
|
Result of foundationdb-pr-73 on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests-73 on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-macos on macOS 14.x
|
|
please merge in latest |
Result of foundationdb-pr-clang-73 on Linux CentOS 7
|
Result of foundationdb-pr-73 on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests-73 on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-cluster-tests-73 on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS 14.x
|
| // proxy list changes, so a killed proxy's RequestStream is dropped from | ||
| // cx->commitProxies/grvProxies on clientInfo rotation rather than waiting | ||
| // for the next transaction's lazy getCommitProxies()/getGrvProxies(). | ||
| if (CLIENT_KNOBS->DBCONTEXT_EAGER_PROXY_UPDATE) { |
There was a problem hiding this comment.
Do you think this one we'd want to turn on in 7.4? Unlike the location cache sweep or the shrink proxy fix in general, I could see it being more debatable. Maybe we want to to be more lazy with this update, but I'm not sure. My guess is it probably doesn't make too much of a difference in production workloads, though
gxglass
left a comment
There was a problem hiding this comment.
AI review attached. The thing it calls B1 may be pretty harmless, I can't really tell. My suggestion is either take its advice on the assignment logic or alternatively if there is a very simple thing that can be done to suppress this class of false positive (a comment above the assignment, maybe) then do it. Alternatively just say "the AI is being too wishy washy about future risks here" and we move on.
gxglass
left a comment
There was a problem hiding this comment.
I don't fully understand the tracking (off by default) or the test case, but the production logic seems pretty tightly constrained and anyway can be disabled, so in the interest of progress I am going to click Approve.
|
Made a small change: 231f326 @alecgrieser and @gxglass - can you reapprove when you get a chance? |
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-73 on Linux CentOS 7
|
Result of foundationdb-pr-clang-73 on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-73 on Linux CentOS 7
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-73 on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests-73 on Linux CentOS 7
|
Result of foundationdb-pr-cluster-tests-73 on Linux CentOS 7
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-macos on macOS 14.x
|
Problem
When a server process a client talks to is killed (storage server, commit/grv proxy, etc.), the client can keep stale references to the dead process (via RequestStream). For example:
Storage servers:
DatabaseContext::locationCacheholds the dead SS'sStorageServerInterfaceuntil either a read hits that shard and fails (baseline error-driveninvalidateCache), orclientInfo/shard reassignment replaces it. ADatabaseContextthat doesn't actively read the affected shards (for example a server-hosted/internal client, or an idle one) can hold the dead SS's interface for a long time.Proxies: when the recruited proxy count exceeds
MAX_*_PROXY_CONNECTIONS,MonitorLeader::shrinkProxyListcaches a sampled subset (lastCommitProxies/lastGrvProxies). If a proxy is killed and the recruited count then drops back under the cap, the cached subset keeps pinning the killed proxy'sRequestStreams.Each pinned
RequestStreamholds aFlowTransportpeer reference, which keepsconnectionKeeperalive retrying the dead address. That produces connection-timeout (CTO) churn and wasted reconnect attempts. At scale this can show up as a problem with various symptoms.Solution
The core fixes make the client drain its connections to dead peers. Each fix is an opt-in knob (default off in production, randomized in simulation, forced on in the new test):
locationCachePeerEvictor(NativeAPI, knobLOCATION_CACHE_PEER_EVICTOR_ENABLED): a per-DatabaseContextactor that periodically (LOCATION_CACHE_PEER_EVICTOR_DELAY) samples FlowTransport's persistent per-address connect-failed counter, and for any address whose count went up beyondLOCATION_CACHE_PEER_EVICTOR_FAILED_THRESHOLDsince the previous sweep, callsinvalidateCacheByAddressto evict every cached location that references it. Dropping the cache'sReference<LocationInfo>lets the dead SS's streams release.FlowTransport, knobPERSISTENT_CONNECT_FAILED_COUNT_TTL):connectionKeeperrecords a per-destination cumulative connect-failure count plus last-failure time that survivesPeerdestruction. That's the stable signal the evictor reads. It's TTL-pruned to stay bounded.shrinkProxyListunder-cap cache-clear (MonitorLeader, knobSHRINK_PROXY_LIST_CLEAR_CACHE_BELOW_THRESHOLD): clears the cached sampled subset on iterations where the recruited count is below the cap, so a killed proxy's pinnedRequestStreams release.Eager proxy rebuild (knob
DBCONTEXT_EAGER_PROXY_UPDATE): rebuildscx->commitProxies/grvProxiesright away on aclientInfoproxy-list change instead of waiting for the next transaction's lazy lookup.Note: this PR is about the problem and the fix, but quite a lot of time was spent on figuring out where and what were the problems leading to high connection timeouts. These changes assisted with that:
InterfaceTracker(knobSTALE_PEER_OBSERVABILITY, default off): at various layers of the RPC stack, diagnostic per-(address, role) interface create/delete accounting plus per-ref backtraces. This helps get close to where the leak was coming from. Every method is a no-op when the knob is off.StalePeerTestworkload + toml: kills a configurable client-facing role, waits for recovery, then asserts no leaked interface copies (Delta == 0) remain on the client. A big part of this PR is a result of constant iteration on this failing test, and leveraging observability above to figure out why the test was failing. The goal was to make the test pass, and meanwhile, ensure general db functionality is not broken (general 100K signal).Testing
Simulation: ran
StalePeerTest10K times, all passed:20260622-234822-praza-stalepeer-v41m4-on7.3.77-c37d2130797b9 compressed=True data_size=34905210 duration=433655 ended=10000 fail_fast=100 max_runs=10000 pass=10000 priority=100 remaining=0 runtime=0:08:02 sanity=False started=10000 stopped=20260622-235624 submitted=20260622-234822 timeout=5400 username=praza-stalepeer-v41m4-on7.3.77-c37d2130797b99acc9da0313be445d751fee5d62
Simulation: ran the general 100K, all passed:
20260622-234826-praza-stalepeer-v41m4-on7.3.77-c37d2130797b9 compressed=True data_size=34872287 duration=3302230 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:39:25 sanity=False started=100000 stopped=20260623-002751 submitted=20260622-234826 timeout=5400 username=praza-stalepeer-v41m4-on7.3.77-c37d2130797b99acc9da0313be445d751fee5d62
Perf cluster: saw up to a 100x reduction in CTO/min aggregated across clients (CTO = connection timeout).