DD finishMoveKeys: move waitForShardReady outside transaction#13364
DD finishMoveKeys: move waitForShardReady outside transaction#13364saintstack wants to merge 1 commit into
Conversation
e563bce to
a835ec7
Compare
|
(Address copilot feedback suggesting we sort dest before comparing...) |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
|
Some initial review comments. I'd prefer this on main only for a few reasons:
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Yes and yes usually, but was trying target backport to 7.3 so minimizing change.
Hard to discuss here an offline doc. Lets talk offline.
I was trying to do better than opinions by going off and spending a bunch of time reproducing the incident so we could see which patches actually rather speculatively helped. FDB already has admission control against 'too much work' whether constraint on loading by RateKeeper, the bound on how many concurrent starts and stop datamoves are allowed, through to DD's limit of how many datamoves each SS can have running at any one time. An incident occurred when an operator performed a task that they have done many times before without incident. In fact, the cluster ran for hours at its configured limit but then it went out of equilibrium when an exclude completed and a team rebuild was triggered. Adding another 'admission control' that bounds datamove is a useful just-in-case but why now do we need it (as you'd say yourself). It may mitigate. But then exclude moves bypass this 'cap' mechanism (though they add to the overall total count) and it was exclude moves that triggered the incident. The 'cap' does not address the cause of the cascade where the finish datamove transaction is unable to complete because getShardState puts an already involved transaction over the 5s limit. Thats what this PR is about.
Yeah. Sorry. Didn't really do compares. Was focused on pass/fail (the test runs take a while to setup and then run long enough to allow for assessment). My admission control test ran with the max set to 100, and then 200... which was probably too constraining. I could do reruns? |
alecgrieser
left a comment
There was a problem hiding this comment.
This LGTM. I guess I agree with @gxglass in the abstract that it would be nice if this were also refactored a bit to avoid (another) large method in the code base with a lot of duplicate work. But I also think this is a pretty clear win, and that we do want this on 7.3. It seems like with just the admission control fixes, we will still get into cases where DD can't make progress, though not spiraling to infinity, but we still very much care about making DD succeed more (which is what this PR does). I guess the extra ablation experiment is useful if not everyone is convinced
| if (checkDest != destServers) { destChanged = true; break; } | ||
| } | ||
| if (destChanged) { | ||
| CODE_PROBE(true, "finishMoveShards dest changed during waitForShardReady"); |
There was a problem hiding this comment.
Have you looked at the simulation code coverage to confirm whether we've been able to hit this?
There was a problem hiding this comment.
Tried. Nada. Let me mix in some buggify ... it's appropriate adding it in here. Will be back to you... Thanks.
There was a problem hiding this comment.
I ran a bunch of seeds with buggify via MoveKeysCycle, MoveKeysClean, and then MoveKeysSideband trying to trip this code probe but no luck. I suppose it makes sense. CancelConflictingDataMoves cancels existing move if a conflicting one so we don't get here (at least not in single DD test scenario). Would need two DDs contending.... I could try writing a test? Thanks @alecgrieser
There was a problem hiding this comment.
Related to this, the comment says:
If the destination changed during the wait (another DD reassigned the
shard),
By "another DD" we mean - the current DD failed another DD took over that role and that DD completed another MoveKeys transaction?
There was a problem hiding this comment.
Another comment related to this: does using multiple transactions violate the transactional total ordering property in any way? Wanted to confirm this.
There was a problem hiding this comment.
Transaction 2: re-verify state hasn't changed (dest still ours,
dataMove still in Running phase for finishMoveShards),
then commit metadata writes
I think we need to make sure that the read and write sets of this transaction ("Transaction 2") are same as the read and write sets of the original transaction. Then the resolver would have the context needed to check for any conflicts (of this transaction with other transactions, and of other transactions with this transaction).
There was a problem hiding this comment.
Nod. Refactored so both transactions use a single method readShardState (Only the second transaction writes)
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr on Linux RHEL 9
|
|
Address agent feedback
Fail is I ran it locally and it OOM'd because of backup lag... -- seems unrelated. |
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Fixes for above are inbound: this #13395 and others coming. |
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS 14.x
|
SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction with a
~5s lifetime (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination
servers were slow to respond, the wait alone consumed the txn budget,
and commits failed with transaction_too_old — retries too, cascading
into the DD pipeline stalls observed in incidents.
Restructure both finish-move functions (finishMoveKeys / finishMoveShards,
dispatched on SHARD_ENCODE_LOCATION_METADATA) into a two-transaction
pattern with the wait in between:
Transaction 1: read keyServers / serverTags / serverList (and dataMove
metadata for finishMoveShards) via the new
readShardState() helper. Save the read version and
drop the transaction (tr.reset()).
Wait: waitForShardReady — runs OUTSIDE any transaction;
the 15s timeout is now safe.
Transaction 2: re-verify via destUnchanged() (dest hasn't changed,
dataMove still in Running phase for finishMoveShards),
then commit metadata writes.
If the destination changed during the wait, the inner loop retries from
the top — same as today's behaviour on transient errors, just without
burning the txn budget on the wait itself.
Verification & retry details:
* destUnchanged() loops every sub-range, not just keyServers[0]. The
keys-flavor (`expectedDataMoveId={}`) tolerates empty-dest entries
whose src ⊆ expectedDest — matching the planning loop's
`alreadyMoved = dest2.empty() && isSubset` branch, which lets sibling
iterations of OUR move that already completed pass through. A foreign
src on an empty-dest entry signals a different move owns the range
and forces a retry to avoid clobbering it. The shards-flavor uses
the dataMoveId stamp as the per-sub-range invariant.
* All retry paths (dest-changed, data-move-deleted, phase-changed, plus
the count-mismatch else-branch in finishMoveShards) are bounded by
FINISH_MOVE_KEYS_MAX_RETRIES and back off via finishMoveKeysBackoff().
Without the cap the dest-changed branch could livelock; the shards-
side count-mismatch path was previously unbounded.
* Txn 2's writes use the post-wait snapshot: keyServersValue uses
reread.uidToTagMap; finishMoveShards introduces a postWaitDataMove
local so the partial-complete / deleteCheckpoints / dataMoveValue
writes reflect the fresh dataMove state.
* runPreCheck=false is set inline on each retry path. The
partial-complete success-continue deliberately leaves runPreCheck
alone so chunks 2..N of a multi-transaction move each get their own
AUDIT_DATAMOVE_PRE_CHECK.
* finishMoveShards now takes finishMoveKeysParallelismLock per
iteration (mirroring finishMoveKeys). The lock had been function-
scoped and released mid-iteration before the wait, so retries ran
lockless and silently exceeded MOVE_KEYS_PARALLELISM.
* taskID = TaskPriority::MoveKeys is restored on txn 2 in both functions
(tr.reset() drops it).
* New CODE_PROBEs ("dest changed", "data move deleted",
"phase changed") are marked probe::decoration::rare; reaching them
requires a concurrent reassignment racing into the wait window.
* SHARD_READY_DELAY is buggified to 5.0s to exercise the slow-dest
scenario in simulation.
Validation:
* k8s rig: 3 transaction_too_old events vs 360,289 in the prior run
without the patch; 0 OOMs.
* Simulation (DDPipelineStall.toml, knob ON): cascade trigger
eliminated; transaction_too_old goes to zero; residual
TryFinishMoveShardsError events are not_committed which retry
cleanly.
Thanks |
6496def to
0323a86
Compare
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS 14.x
|
Result of foundationdb-pr-macos on macOS 14.x
|
Result of foundationdb-pr-clang-arm on Linux RHEL 9
|
|
Addressed feedback in round 2. Rebased, squashed commits to remove noise. Two joshua runs below: |
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
(Forward port of #12981 -- though it needs update to match this version)
SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction that must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers are slow to respond, the wait alone consumes the txn budget — and the surrounding transaction has additional reads (\xff/serverTags, \xff/keyServers, \xff/dataMoves with the SHARD_ENCODE_LOCATION_METADATA knob ON, serverList per dest) as well as writes. Result: commits start to fail with transaction_too_old as do the retries.
We saw this issue in recent incidents:
DD has TWO finish-move functions, dispatched on the metadata knob in rawFinishMovement: finishMoveKeys (knob OFF) and finishMoveShards (knob ON). cluster1 had the knob ON, so its code path was finishMoveShards. This patch applies the same fix to BOTH.
For each function: split the single transaction into two, with the wait in between:
If the destination changed during the wait (another DD reassigned the shard), the inner loop retries from the top — same as today's behaviour on transient errors, just without burning the txn budget on the wait itself.
Notes
What finishMoveKeys / finishMoveShards actually does
A single transaction does ~10–14 async round-trips to FDB:
On a healthy cluster the whole transaction averages ~1.8 seconds — already 36 % of the 5 s budget, with waitForShardReady returning in milliseconds when the dest is already ready. With the metadata knob ON, steps 1 and 9 add two extra round-trips that further reduce headroom.
waitForShardReady (step 5) polls each dest SS via getShardState at intervals of SHARD_READY_DELAY (default 0.25 s) until a quorum reports ready, with an outer cap of SERVER_READY_QUORUM_TIMEOUT=15 s. The 15 s cap is rarely the trigger in practice — transaction_too_old fires at ~5 s for the whole txn first.
What happed on cluster1
dest SSes were CPU-saturated from concurrent fetchKeys operations on large shards (100–500 MB). At 80 % CPU the SS event loop couldn't process any RPC promptly. The entire finishMoveShards transaction slowed down, not just step 5: reads (steps 1–3) hit slow SSes, waitForShardReady (step 5) saw more "not ready" responses, writes (steps 6–7) hit the same storage layer. The 1.8 s baseline became 5+ seconds, transaction_too_old fired, retries hit the same wall, and the storm was self-sustaining.
The dest-overload was the trigger; the multi-step transaction with waitForShardReady embedded inside it was the latent bug. With the metadata knob ON, steps 1 and 9 also actively contributed by inflating the critical-path transaction — beyond the well-known restart-fatal-via-isRestore-replay issue.
What the simulation and the k8s emulation do
We reproduced the 'cascade' of unfinished moves phenomenon in a k8s test rig and in simulation.
https://github.com/saintstack/foundationdb/tree/dd-pipeline-stall-test is a simulation test that manufactures a cluster1 like situation. https://github.com/saintstack/fdb-kubernetes-tests/tree/backup_recreate is a k8s test that does a similar reproduction.
Two failure modes emerge from the same recipe:
Both converge on the same death: DD OOMs from accumulated actor state, restarts, isRestore replays accumulated \xff/dataMoves/, DD OOMs again.
The convergence-check trace events (FinishMoveShardsDestChanged, *DataMoveDeletedAfterWait, *PhaseChangedAfterWait) fire 0 times in our runs — the safety check is conservative without false-positive retries.
How we triggered the cascade across rigs
All three paths end at the same transaction_too_old. The k8s run with the extended fix exercised the actual production code path (finishMoveShards, knob ON) and showed the cascade trigger eliminated.
Tests
Here are the k8s test runs synopsized:
Running #12981 in Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated on the code path; transaction_too_old goes to zero, residual TryFinishMoveShardsError events are not_committed which retry cleanly.
Why this matters
The latent bug has existed for years. A reasonable concern is whether it's worth the added complexity. Two points of evidence:
Two incidents. Both had this exact trigger. cluster1 plus the metadata knob compounded into a fatal isRestore replay loop on every DD restart.
Admission control complements but doesn't substitute. PR #13112 (cap) and PR #12981 (root cause) work on different parts of the failure chain. Whether the cap alone is sufficient depends on how persistent the trigger is. Intermittent trigger (production team-rebuild bursts → some slow polls → eventual recovery): cap can ride out the burst (Run 35 alone drained 2.3 TB). Sustained trigger (every poll slow because dests stay saturated): simulation shows the cap bounds memory but the queue doesn't drain. Deploying both gives PR #12981 removing the structural anti-pattern and PR #13112 as defense-in-depth.
Why not a simpler fix? Setting SERVER_READY_QUORUM_TIMEOUT below 5 s is two characters in a knob file, but that's not what fires in production. The budget is blown by accumulated 0.25 s polls × N + RPC time well before the 15 s outer timeout, so trimming it changes very little. To cover the real failure mode that way you'd also have to shrink SHARD_READY_DELAY, raising the rate of move failures under transient slowness. The two-transaction restructure costs more lines but eliminates the structural problem regardless of any knob value.