Skip to content

refactor(fdv2): add server FDv2 heuristic fallback and recovery#531

Open
beekld wants to merge 10 commits into
mainfrom
beeklimt/SDK-2099
Open

refactor(fdv2): add server FDv2 heuristic fallback and recovery#531
beekld wants to merge 10 commits into
mainfrom
beeklimt/SDK-2099

Conversation

@beekld
Copy link
Copy Markdown
Contributor

@beekld beekld commented May 11, 2026

Summary

Heuristic FDv2 fallback (interrupted ≥ 2 min) and recovery (valid ≥ 5 min) for the server SDK, modeled on Java's event-driven Condition design. The orchestrator races a condition future against synchronizer.Next(); whichever resolves first wins. Synchronizer rotation is cyclic with Available/Blocked state, mirroring Java's SourceManager.

What's in

  • IFDv2Condition + concrete FallbackCondition / RecoveryCondition with internal timers and an Inform / Execute interface.
  • Conditions aggregator over multiple conditions, built on a new async::WhenAny vector overload.
  • SourceManager owns the synchronizer factory list with per-factory Available/Blocked state. Iteration is cyclic (wraps + skips blocked). BlockCurrentSynchronizer on TerminalError; ResetSourceIndex on recovery.
  • FDv2DataSystem builds conditions keyed off AvailableSynchronizerCount/IsPrimeSynchronizer (no conditions if only one available; prime gets fallback only; others get fallback + recovery).
  • Drops the timeout argument from IFDv2Synchronizer::Next() and removes the unreachable FDv2SourceResult::Timeout variant.

Deferred to future PRs

  • FDv1 fallback directive (X-LD-FD-Fallback response header) — server-driven, one-way switch to an FDv1 synchronizer.
  • FDv1 synchronizer adapter wrapping the existing FDv1 sources as IFDv2Synchronizer.
  • Most of the spec's orchestration-logging requirements.

Test plan

Unit tests for each new component and integration tests for fallback advance, wrap-around on last synchronizer, recovery reset, and exhaustion via
block.


Note

Medium Risk
Changes FDv2 orchestration and synchronizer interfaces (removing Next() timeouts/Timeout results) while introducing timer-driven fallback/recovery and cyclic source rotation, which could affect update reliability and shutdown behavior.

Overview
Adds a new FDv2 condition abstraction (IFDv2Condition) with timed FallbackCondition/RecoveryCondition implementations plus a Conditions aggregator that races condition futures against synchronizer.Next() to trigger source transitions.

Refactors FDv2 synchronizer rotation into a new SourceManager that cycles through synchronizer factories, skips blocked sources, blocks on TerminalError, and resets to the most-preferred source on recovery.

Removes FDv2 per-Next() timeouts by dropping the timeout parameter from IFDv2Synchronizer::Next() and deleting the FDv2SourceResult::Timeout variant; updates polling/streaming synchronizers and tests accordingly. Also adds a vector overload of async::WhenAny and comprehensive unit/integration tests for the new rotation and condition behavior.

Reviewed by Cursor Bugbot for commit 1f7f26e. Bugbot is set up for automated code reviews on this repo. Configure here.

@beekld beekld marked this pull request as ready for review May 12, 2026 22:30
@beekld beekld requested a review from a team as a code owner May 12, 2026 22:30
Comment thread libs/server-sdk/src/data_systems/fdv2/conditions.cpp
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 0bb7c75. Configure here.

OnConditionFired(*cond_future.GetResult());
} else {
OnSynchronizerResult(*next_future.GetResult());
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded continuation accumulation on conditions future in orchestration loop

Medium Severity

Each iteration of RunSynchronizerNext calls WhenAny(cond_future, next_future). The variadic WhenAny registers a Then continuation on cond_future, which shares the same PromiseInternal across all iterations (obtained via active_conditions_->GetFuture()). When the synchronizer result wins the race (normal path), the continuation on cond_future remains in the PromiseInternal::continuations_ vector, each holding a shared_ptr to its WhenAny's promise. For a stable primary connection where the fallback condition never fires, these continuations accumulate for the lifetime of the synchronizer — potentially days or weeks in a server SDK — causing linear memory growth proportional to the number of processed results.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 0bb7c75. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, I gotta think about this. Normally, I would cancel the promise that wasn't resolved, but for conditions, we re-use the same promise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so...

  1. Having the conditions have a single long-lived future that multiple callers wait on inherently has this problem.
  2. This isn't a problem with our Promise implementation -- every promise implementation would have this issue, even if the language had GC.
  3. AFAICT, our Java FDv2 implementation has this same issue, using CompletableFuture and anyOf, and no one has noticed an issue.
  4. The only reasonable solution that I can see here would be creating some kind of "subscriber" interface for conditions so that callers could "unsubscribe". But it's a bunch more code, and kind of messy.

Given that there's no simple workaround, and this hasn't been an issue for Java, I think we should just punt on this and maybe clean it up in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant