DD finishMoveKeys: move waitForShardReady outside transaction by saintstack · Pull Request #13364 · apple/foundationdb

saintstack · 2026-06-17T19:31:16Z

(Forward port of #12981 -- though it needs update to match this version)

SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction that must commit within ~5s (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers are slow to respond, the wait alone consumes the txn budget — and the surrounding transaction has additional reads (\xff/serverTags, \xff/keyServers, \xff/dataMoves with the SHARD_ENCODE_LOCATION_METADATA knob ON, serverList per dest) as well as writes. Result: commits start to fail with transaction_too_old as do the retries.

We saw this issue in recent incidents:

cluster1: SHARD_ENCODE_LOCATION_METADATA=true compounded into an isRestore replay loop after DD died.
cluster2: same trigger, knob OFF, OOMed but recovered.

DD has TWO finish-move functions, dispatched on the metadata knob in rawFinishMovement: finishMoveKeys (knob OFF) and finishMoveShards (knob ON). cluster1 had the knob ON, so its code path was finishMoveShards. This patch applies the same fix to BOTH.

For each function: split the single transaction into two, with the wait in between:

  Transaction 1: read keyServers/serverTags/serverList (and dataMoves
                 metadata for finishMoveShards)
  Save the read version, drop the transaction (tr.reset())
  Wait:          waitForShardReady — runs OUTSIDE any transaction;
                 the 15s timeout is now safe
  Transaction 2: re-verify state hasn't changed (dest still ours,
                 dataMove still in Running phase for finishMoveShards),
                 then commit metadata writes

If the destination changed during the wait (another DD reassigned the shard), the inner loop retries from the top — same as today's behaviour on transient errors, just without burning the txn budget on the wait itself.
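
To make the shape of the change concrete, here is a toy model of the two control flows. It is a sketch, not the Flow actor code: ShardMeta, Outcome, finishInsideOneTxn, finishWithWaitOutside and the fixed 5 s budget are hypothetical names introduced for illustration, and elapsed time is passed in as plain numbers rather than measured.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only; none of these names exist in FoundationDB.
constexpr double kTxnBudgetSeconds = 5.0; // stands in for the ~5 s transaction lifetime

struct ShardMeta {
    std::vector<std::string> dest; // destination servers recorded for the range
    bool moveFinished;             // set once the finish-move writes are committed
};

enum class Outcome { Committed, TransactionTooOld, DestChanged };

// Old shape: the readiness wait is charged against the transaction's lifetime.
Outcome finishInsideOneTxn(ShardMeta& meta, double readSeconds, double waitSeconds, double writeSeconds) {
    double elapsed = readSeconds + waitSeconds + writeSeconds;
    if (elapsed > kTxnBudgetSeconds) return Outcome::TransactionTooOld;
    meta.moveFinished = true;
    return Outcome::Committed;
}

// New shape: txn 1 reads, the wait runs with no transaction open, txn 2 re-verifies and writes.
Outcome finishWithWaitOutside(ShardMeta& meta, double readSeconds, double waitSeconds, double writeSeconds,
                              const std::vector<std::string>& destAfterWait) {
    // Transaction 1: read-only snapshot of the destination set.
    if (readSeconds > kTxnBudgetSeconds) return Outcome::TransactionTooOld;
    const std::vector<std::string> expectedDest = meta.dest;

    // Wait outside any transaction: waitSeconds is charged to neither budget.
    (void)waitSeconds;
    meta.dest = destAfterWait; // models whatever happened to the metadata meanwhile

    // Transaction 2: re-read, verify nothing changed, then write.
    if (readSeconds + writeSeconds > kTxnBudgetSeconds) return Outcome::TransactionTooOld;
    if (meta.dest != expectedDest) return Outcome::DestChanged; // caller retries from the top
    meta.moveFinished = true;
    return Outcome::Committed;
}
```

In this model a 6 s wait fails the single-transaction shape with TransactionTooOld, while the split shape commits as long as the destination set is the same after the wait.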

Notes

What finishMoveKeys / finishMoveShards actually does

A single transaction does ~10–14 async round-trips to FDB:

1. Read \xff/dataMoves/<id> to fetch metadata (knob ON only)
2. Read \xff/serverTags/
3. Read \xff/keyServers/ range via krmGetRanges
4. Decode src/dest, validate
5. waitForShardReady for each destination server
6. Write updated \xff/keyServers/ via krmSetRangeCoalescing
7. Write updated \xff/serverKeys/ (~6 servers per move)
8. Delete checkpoints
9. Clear \xff/dataMoves/<id> (knob ON only)
10. Commit

On a healthy cluster the whole transaction averages ~1.8 seconds — already 36 % of the 5 s budget, with waitForShardReady returning in milliseconds when the dest is already ready. With the metadata knob ON, steps 1 and 9 add two extra round-trips that further reduce headroom.

waitForShardReady (step 5) polls each dest SS via getShardState at intervals of SHARD_READY_DELAY (default 0.25 s) until a quorum reports ready, with an outer cap of SERVER_READY_QUORUM_TIMEOUT=15 s. The 15 s cap is rarely the trigger in practice — transaction_too_old fires at ~5 s for the whole txn first.
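
A rough feel for the headroom, as a sketch: the helper below counts how many consecutive not-ready polls fit into what is left of the transaction lifetime. notReadyPollsThatFit is a made-up name, and the model is deliberately simplified (it ignores per-poll RPC time and treats the rest of the transaction as a fixed cost).

```cpp
#include <cassert>
#include <cmath>

// Number of consecutive not-ready polls that fit before the transaction
// lifetime is exceeded. All arguments are in seconds.
// Simplifying assumption: each poll costs exactly pollInterval.
int notReadyPollsThatFit(double txnBudget, double otherWork, double pollInterval) {
    double headroom = txnBudget - otherWork;
    if (headroom <= 0.0 || pollInterval <= 0.0) return 0;
    return static_cast<int>(std::floor(headroom / pollInterval));
}
```

With the numbers above, notReadyPollsThatFit(5.0, 1.8, 0.25) gives 12 polls of headroom; with a 2.5 s poll interval only one poll fits.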

What happened on cluster1

dest SSes were CPU-saturated from concurrent fetchKeys operations on large shards (100–500 MB). At 80 % CPU the SS event loop couldn't process any RPC promptly. The entire finishMoveShards transaction slowed down, not just step 5: reads (steps 1–3) hit slow SSes, waitForShardReady (step 5) saw more "not ready" responses, writes (steps 6–7) hit the same storage layer. The 1.8 s baseline became 5+ seconds, transaction_too_old fired, retries hit the same wall, and the storm was self-sustaining.

The dest-overload was the trigger; the multi-step transaction with waitForShardReady embedded inside it was the latent bug. With the metadata knob ON, steps 1 and 9 also actively contributed by inflating the critical-path transaction — beyond the well-known restart-fatal-via-isRestore-replay issue.

What the simulation and the k8s emulation do

We reproduced the 'cascade' of unfinished moves phenomenon in a k8s test rig and in simulation.
https://github.com/saintstack/foundationdb/tree/dd-pipeline-stall-test is a simulation test that manufactures a cluster1-like situation. https://github.com/saintstack/fdb-kubernetes-tests/tree/backup_recreate is a k8s test that does a similar reproduction.

Two failure modes emerge from the same recipe:

Mode | Trigger | Distinctive error pattern
cluster1 cascade | mako + exclude | 96–99% transaction_too_old
Rebalance overload | mako only | 70% not_committed, 16% transaction_too_old

Both converge on the same death: DD OOMs from accumulated actor state, restarts, isRestore replays accumulated \xff/dataMoves/, DD OOMs again.

The convergence-check trace events (FinishMoveShardsDestChanged, *DataMoveDeletedAfterWait, *PhaseChangedAfterWait) fire 0 times in our runs — the safety check is conservative without false-positive retries.

How we triggered the cascade across rigs

Rig | What slows waitForShardReady
cluster1 | Dest SS event loop CPU-saturated → all reads/RPCs slow; getShardState returns not-ready across polls
K8s rig | SHARD_READY_DELAY raised from 0.25 s to 2.5 s → 2 not-ready polls = 5 s
Simulation | buggify_get_shard_state_delay = 2 (concurrency limit) simulates SS event loop saturation

All three paths end at the same transaction_too_old. The k8s run with the extended fix exercised the actual production code path (finishMoveShards, knob ON) and showed the cascade trigger eliminated.

Tests

Here are the k8s test runs synopsized:

PR | Mako-only k8s run | Mako+exclude k8s run
PR #12981 | - | ✅ test-5jmt4wg6: 3 tx_too_old (vs 360,289 in the previous run without the patch), 0 Phase-4 OOMs, 0 cap engagements in Phase 4 ('cap' refers to PR #13112)
PR #13112 alone | Run 34: ✅ 1 restart in 4 h, self-recovered | Run 35: ✅ 2.3 TB drained, 274 cap engagements

Running #12981 in Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated on the code path; transaction_too_old goes to zero, residual TryFinishMoveShardsError events are not_committed which retry cleanly.

Why this matters

The latent bug has existed for years. A reasonable concern is whether it's worth the added complexity. Two points of evidence:

Two incidents. Both had this exact trigger. cluster1 plus the metadata knob compounded into a fatal isRestore replay loop on every DD restart.

Admission control complements but doesn't substitute. PR #13112 (cap) and PR #12981 (root cause) work on different parts of the failure chain. Whether the cap alone is sufficient depends on how persistent the trigger is. Intermittent trigger (production team-rebuild bursts → some slow polls → eventual recovery): cap can ride out the burst (Run 35 alone drained 2.3 TB). Sustained trigger (every poll slow because dests stay saturated): simulation shows the cap bounds memory but the queue doesn't drain. Deploying both gives PR #12981 removing the structural anti-pattern and PR #13112 as defense-in-depth.

Why not a simpler fix? Setting SERVER_READY_QUORUM_TIMEOUT below 5 s is two characters in a knob file, but that's not what fires in production. The budget is blown by accumulated 0.25 s polls × N + RPC time well before the 15 s outer timeout, so trimming it changes very little. To cover the real failure mode that way you'd also have to shrink SHARD_READY_DELAY, raising the rate of move failures under transient slowness. The two-transaction restructure costs more lines but eliminates the structural problem regardless of any knob value.
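
The arithmetic behind that argument can be sketched as follows. effectiveWaitLimit is a hypothetical helper, not a function in the codebase; it assumes the wait sits inside the transaction and that the rest of the transaction has a fixed cost.

```cpp
#include <algorithm>
#include <cassert>

// Seconds of waiting that can elapse before something fails when the wait is
// inside the transaction: whichever comes first, the remaining transaction
// lifetime or the quorum timeout. All arguments are in seconds.
double effectiveWaitLimit(double txnBudget, double otherWork, double quorumTimeout) {
    double headroom = std::max(0.0, txnBudget - otherWork);
    return std::min(headroom, quorumTimeout);
}
```

With 1.8 s of other work, a 15 s quorum timeout and a 5 s quorum timeout yield the same limit in this model, because the transaction lifetime is always hit first; the timeout only starts to matter once it drops below the remaining headroom.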

saintstack · 2026-06-17T19:48:55Z

(Address copilot feedback suggesting we sort dest before comparing...)
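
For context, a minimal sketch of the order-insensitive comparison that feedback is about. sameServers is a hypothetical helper and server IDs are shown as strings rather than the real ID type; the actual change in the PR may look different.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Compare two server-ID lists while ignoring order. The arguments are taken
// by value so the caller's vectors are left untouched by the sort.
bool sameServers(std::vector<std::string> a, std::vector<std::string> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    return a == b;
}
```

Without the sort, a plain != on the vectors would report a change whenever the same servers come back in a different order.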

foundationdb-ci · 2026-06-17T19:53:53Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: e563bce
Duration 0:22:28
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T20:03:45Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: e563bce
Duration 0:32:17
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T20:16:53Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: e563bce
Duration 0:45:24
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T20:16:57Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: e563bce
Duration 0:45:32
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T20:19:45Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: e563bce
Duration 0:48:20
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T20:27:17Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: e563bce
Duration 0:55:51
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T20:28:47Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: a835ec7
Duration 0:40:09
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T20:33:55Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: a835ec7
Duration 0:45:13
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T20:34:47Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: a835ec7
Duration 0:46:07
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

gxglass · 2026-06-17T20:37:33Z

Some initial review comments. I'd prefer this on main only for a few reasons:

Echoing a concern on a prior PR, this involves a super long and complicated function that is growing here (1759 - 1324 => 400+ lines) and I'd like to refactor that to simplify it. I realize this sort of thing (giant ~untestable-in-isolation functions) is somewhat endemic in the code base, but it doesn't mean we shouldn't make improvements as we go. We keep running into thinly tested code in complicated actor chains with actual bugs surfacing in production (e.g. PR 13200, PR 13312 in the last six weeks), so this isn't just a stylistic/code smell preference.
I'm not convinced this is actually needed on release-7.3. The offline doc has language (obviously agent written) which to me suggests that the agent is overly indexed on this change, for example it refers to this as fixing the root cause of various problems. In my opinion the root cause was letting too much work into the system, and this change improves a long standing weakness of the design that gets exacerbated when there is too much work. "Too much work" explains various other problems, notably a) excess recovery work on startup, b) OOMs due to just flat out too many actors. I also don't see an ablation experiment to assess system performance (as opposed to "runs to completion with minimal OOMs") without this change. Specifically, with admission control only, how fast did the storage migration take, and how does that compare to expected/optimal? If it's already reasonably close to expected/optimal, then I think release-7.3 need not wait for this. By "reasonably close" I mean anything sane, say 20% of expected speed or faster. We have not been doing migrations for months and $NORMAL_MIGRATION_SPEED/0.2 probably implies something measured in single digit days.

foundationdb-ci · 2026-06-17T20:40:06Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: e563bce
Duration 1:08:38
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2026-06-17T21:11:44Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: a835ec7
Duration 1:22:05
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2026-06-17T21:12:19Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: a835ec7
Duration 1:23:40
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T21:39:02Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: a835ec7
Duration 1:50:23
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-17T22:18:25Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: a835ec7
Duration 2:29:47
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

saintstack · 2026-06-18T00:00:35Z

I realize this sort of thing (giant ~untestable-in-isolation functions) is somewhat endemic in the code base, but it doesn't mean we shouldn't make improvements as we go.

Yes and yes usually, but I was trying to target a backport to 7.3, so I was minimizing change.

The offline doc has language (obviously agent written) which to me suggests that the agent is overly indexed on this change, for example it refers to this as fixing the root cause of various problems.

Hard to discuss an offline doc here. Let's talk offline.

In my opinion the root cause was letting too much work into the system...

I was trying to do better than opinions by going off and spending a bunch of time reproducing the incident so we could see which patches actually, rather than speculatively, helped.

FDB already has admission control against 'too much work', from the constraint on loading by RateKeeper, to the bound on how many concurrent datamove starts and stops are allowed, through to DD's limit on how many datamoves each SS can have running at any one time. An incident occurred when an operator performed a task that they had done many times before without incident. In fact, the cluster ran for hours at its configured limit, but then it went out of equilibrium when an exclude completed and a team rebuild was triggered. Adding another 'admission control' that bounds datamoves is a useful just-in-case, but why do we need it now (as you'd say yourself)? It may mitigate. But then exclude moves bypass this 'cap' mechanism (though they add to the overall total count), and it was exclude moves that triggered the incident. The 'cap' does not address the cause of the cascade, where the finish datamove transaction is unable to complete because getShardState puts an already involved transaction over the 5s limit. That's what this PR is about.

I also don't see an ablation experiment to assess system performance (as opposed to "runs to completion with minimal OOMs") without this change. Specifically, with admission control only, how fast did the storage migration take, and how does that compare to expected/optimal?

Yeah. Sorry. Didn't really do comparisons. Was focused on pass/fail (the test runs take a while to set up and then have to run long enough to allow for assessment). My admission control test ran with the max set to 100, and then 200... which was probably too constraining. I could do reruns?

alecgrieser

This LGTM. I guess I agree with @gxglass in the abstract that it would be nice if this were also refactored a bit to avoid (another) large method in the code base with a lot of duplicate work. But I also think this is a pretty clear win, and that we do want this on 7.3. It seems like with just the admission control fixes, we will still get into cases where DD can't make progress, though not spiraling to infinity, but we still very much care about making DD succeed more (which is what this PR does). I guess the extra ablation experiment is useful if not everyone is convinced

alecgrieser · 2026-06-18T14:24:29Z

+						if (checkDest != destServers) { destChanged = true; break; }
+					}
+					if (destChanged) {
+						CODE_PROBE(true, "finishMoveShards dest changed during waitForShardReady");


Have you looked at the simulation code coverage to confirm whether we've been able to hit this?

Tried. Nada. Let me mix in some buggify ... it's appropriate adding it in here. Will be back to you... Thanks.

I ran a bunch of seeds with buggify via MoveKeysCycle, MoveKeysClean, and then MoveKeysSideband trying to trip this code probe, but no luck. I suppose it makes sense: CancelConflictingDataMoves cancels an existing move if there is a conflicting one, so we don't get here (at least not in a single-DD test scenario). We would need two DDs contending... I could try writing a test? Thanks @alecgrieser

Related to this, the comment says:

If the destination changed during the wait (another DD reassigned the
shard),

By "another DD" do we mean that the current DD failed, another DD took over that role, and that DD completed another MoveKeys transaction?

Another comment related to this: does using multiple transactions violate the transactional total ordering property in any way? Wanted to confirm this.

Transaction 2: re-verify state hasn't changed (dest still ours,
dataMove still in Running phase for finishMoveShards),
then commit metadata writes

I think we need to make sure that the read and write sets of this transaction ("Transaction 2") are same as the read and write sets of the original transaction. Then the resolver would have the context needed to check for any conflicts (of this transaction with other transactions, and of other transactions with this transaction).

Nod. Refactored so both transactions use a single method readShardState (Only the second transaction writes)

foundationdb-ci · 2026-06-18T18:36:47Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: 7c2372f
Duration 0:38:20
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-18T18:37:26Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: b4504cd
Duration 0:39:24
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-18T18:44:45Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: 7c2372f
Duration 0:46:17
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-18T18:46:34Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: b4504cd
Duration 0:48:30
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-18T18:51:46Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: b4504cd
Duration 0:53:39
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-18T18:52:15Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: b4504cd
Duration 0:54:06
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-18T19:02:03Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 7c2372f
Duration 1:03:35
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

saintstack · 2026-06-25T22:08:40Z

Address agent feedback

20260625-211414-stack_review3-61dd05b199483a82 compressed=True data_size=37286408 duration=2515327 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=0:30:37 sanity=False started=100000 stopped=20260625-214451 submitted=20260625-211414 timeout=5400 username=stack_review3

Fail is

RandomSeed="1638661882" SourceVersion="6496defe0f07d5d53589c01d8fdbf8740c0f95a1" Time="1782422454" BuggifyEnabled="1"
DeterminismCheck="0" FaultInjectionEnabled="1" TestFile="tests/fast/AtomicBackupCorrectness.toml"

I ran it locally and it OOM'd because of backup lag... -- seems unrelated.

foundationdb-ci · 2026-06-25T22:26:32Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: 6496def
Duration 0:23:11
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-25T22:36:29Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: 6496def
Duration 0:33:05
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-25T22:50:47Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 6496def
Duration 0:47:23
Result: ❌ FAILED
Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-25T22:51:06Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: 6496def
Duration 0:47:42
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-25T22:51:10Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: 6496def
Duration 0:47:45
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-25T23:09:16Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: 6496def
Duration 1:05:53
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

saintstack · 2026-06-25T23:27:24Z

ThreadID="16013673033581053930" Machine="127.0.0.1:60705" LogGroup="default" Roles="CC,CP,SS,TL" />

      Start 50: fdb_c_wiggle_and_upgrade
62/62 Test #50: fdb_c_wiggle_and_upgrade ..........................   Passed   69.25 sec

98% tests passed, 1 tests failed out of 62

Total Test time (real) = 514.58 sec

The following tests FAILED:
	 49 - fdb_c_wiggle_only (Failed)
Errors while running CTest

Fixes for above are inbound: this #13395 and others coming.

foundationdb-ci · 2026-06-25T23:43:24Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: 6496def
Duration 1:40:02
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-25T23:49:19Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: 6496def
Duration 0:21:39
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-25T23:59:40Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: 6496def
Duration 0:32:02
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-26T00:13:34Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: 6496def
Duration 0:45:58
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-26T00:22:44Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: 6496def
Duration 0:55:08
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-26T00:25:03Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 6496def
Duration 0:57:23
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-26T00:31:05Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: 6496def
Duration 1:03:26
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2026-06-26T02:26:38Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: 6496def
Duration 2:58:59
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

gxglass

Another round.

pr13364-review-round2.md

SERVER_READY_QUORUM_TIMEOUT (15s) was used inside a transaction with a ~5s lifetime (MAX_WRITE_TRANSACTION_LIFE_VERSIONS). When destination servers were slow to respond, the wait alone consumed the txn budget, and commits failed with transaction_too_old, as did the retries, cascading into the DD pipeline stalls observed in incidents.

Restructure both finish-move functions (finishMoveKeys / finishMoveShards, dispatched on SHARD_ENCODE_LOCATION_METADATA) into a two-transaction pattern with the wait in between:

  Transaction 1: read keyServers / serverTags / serverList (and dataMove
                 metadata for finishMoveShards) via the new readShardState()
                 helper. Save the read version and drop the transaction
                 (tr.reset()).
  Wait:          waitForShardReady runs OUTSIDE any transaction;
                 the 15s timeout is now safe.
  Transaction 2: re-verify via destUnchanged() (dest hasn't changed,
                 dataMove still in Running phase for finishMoveShards),
                 then commit metadata writes.

If the destination changed during the wait, the inner loop retries from the top: same as today's behaviour on transient errors, just without burning the txn budget on the wait itself.

Verification & retry details:

* destUnchanged() loops every sub-range, not just keyServers[0]. The keys-flavor (`expectedDataMoveId={}`) tolerates empty-dest entries whose src ⊆ expectedDest, matching the planning loop's `alreadyMoved = dest2.empty() && isSubset` branch, which lets sibling iterations of OUR move that already completed pass through. A foreign src on an empty-dest entry signals a different move owns the range and forces a retry to avoid clobbering it. The shards-flavor uses the dataMoveId stamp as the per-sub-range invariant.
* All retry paths (dest-changed, data-move-deleted, phase-changed, plus the count-mismatch else-branch in finishMoveShards) are bounded by FINISH_MOVE_KEYS_MAX_RETRIES and back off via finishMoveKeysBackoff(). Without the cap the dest-changed branch could livelock; the shards-side count-mismatch path was previously unbounded.
* Txn 2's writes use the post-wait snapshot: keyServersValue uses reread.uidToTagMap; finishMoveShards introduces a postWaitDataMove local so the partial-complete / deleteCheckpoints / dataMoveValue writes reflect the fresh dataMove state.
* runPreCheck=false is set inline on each retry path. The partial-complete success-continue deliberately leaves runPreCheck alone so chunks 2..N of a multi-transaction move each get their own AUDIT_DATAMOVE_PRE_CHECK.
* finishMoveShards now takes finishMoveKeysParallelismLock per iteration (mirroring finishMoveKeys). The lock had been function-scoped and released mid-iteration before the wait, so retries ran lockless and silently exceeded MOVE_KEYS_PARALLELISM.
* taskID = TaskPriority::MoveKeys is restored on txn 2 in both functions (tr.reset() drops it).
* New CODE_PROBEs ("dest changed", "data move deleted", "phase changed") are marked probe::decoration::rare; reaching them requires a concurrent reassignment racing into the wait window.
* SHARD_READY_DELAY is buggified to 5.0s to exercise the slow-dest scenario in simulation.

Validation:

* k8s rig: 3 transaction_too_old events vs 360,289 in the prior run without the patch; 0 OOMs.
* Simulation (DDPipelineStall.toml, knob ON): cascade trigger eliminated; transaction_too_old goes to zero; residual TryFinishMoveShardsError events are not_committed which retry cleanly.
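
The keys-flavor rule in the verification details can be read as the following check. This is a sketch under assumptions, not the PR's destUnchanged(): SubRange and destUnchangedSketch are hypothetical names, server IDs are plain strings, and the shards-flavor (dataMoveId stamp) is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for one decoded keyServers sub-range.
struct SubRange {
    std::set<std::string> src;
    std::set<std::string> dest;
};

// Every sub-range must either still carry the expected destination, or have an
// empty dest whose src is a subset of the expected destination (a sibling
// iteration of the same move already completed it).
bool destUnchangedSketch(const std::vector<SubRange>& ranges, const std::set<std::string>& expectedDest) {
    for (const SubRange& r : ranges) {
        if (r.dest.empty()) {
            // std::includes needs sorted ranges; std::set iterates in sorted order.
            bool srcWithinExpected =
                std::includes(expectedDest.begin(), expectedDest.end(), r.src.begin(), r.src.end());
            if (!srcWithinExpected) return false; // foreign src: a different move owns this range
        } else if (r.dest != expectedDest) {
            return false; // destination was reassigned during the wait
        }
    }
    return true;
}
```

Checking every sub-range rather than only the first is the point of the loop: one foreign or reassigned sub-range anywhere in the span is enough to force a retry.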

saintstack · 2026-06-28T01:14:47Z

Another round.

pr13364-review-round2.md

Thanks

foundationdb-ci · 2026-06-28T01:36:08Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: 0323a86
Duration 0:20:50
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T01:49:44Z

Result of foundationdb-pr-macos-m1 on macOS 14.x

Commit ID: 0323a86
Duration 0:34:25
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T02:01:42Z

Result of foundationdb-pr-macos on macOS 14.x

Commit ID: 0323a86
Duration 0:46:23
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T02:02:42Z

Result of foundationdb-pr-clang-arm on Linux RHEL 9

Commit ID: 0323a86
Duration 0:47:23
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

saintstack · 2026-06-28T02:03:59Z

Addressed feedback in round 2. Rebased, squashed commits to remove noise. Two joshua runs below:

  20260627-231305-stack_watch-26cda105e0b9b575       compressed=True data_size=37295307 duration=3162465 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:28:51 sanity=False started=100000 stopped=20260627-234156 submitted=20260627-231305 timeout=5400 username=stack_watch
  20260627-234937-stack-2-26cda105e0b9b575           compressed=True data_size=37295307 duration=3690997 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:28:17 sanity=False started=100000 stopped=20260628-001754 submitted=20260627-234937 timeout=5400 username=stack-2

foundationdb-ci · 2026-06-28T02:07:37Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: 0323a86
Duration 0:52:16
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T02:12:26Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 0323a86
Duration 0:57:09
Result: ❌ FAILED
Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-06-28T02:17:17Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: 0323a86
Duration 1:01:58
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

saintstack force-pushed the pr12981-extended-refactor branch from e563bce to a835ec7 on June 17, 2026 19:48

saintstack requested review from gxglass and sbodagala June 17, 2026 19:49

saintstack mentioned this pull request Jun 17, 2026: Move waitForShardReady outside transaction in finishMoveKeys #12981 (Open)

saintstack requested a review from alecgrieser June 17, 2026 20:21

alecgrieser approved these changes Jun 18, 2026

saintstack closed this Jun 25, 2026

saintstack reopened this Jun 25, 2026

gxglass reviewed Jun 26, 2026

saintstack force-pushed the pr12981-extended-refactor branch from 6496def to 0323a86 on June 28, 2026 01:15
