adapter: add LaunchDarkly reconnect integration test#37391

Draft
jasonhernandez wants to merge 1 commit into MaterializeInc:main from jasonhernandez:jason/ld-reconnect-test-standalone

@jasonhernandez

Contributor

Motivation

incident-984 was a runtime failure: the LaunchDarkly data source stopped reconnecting after its streaming connection dropped with a non-Eof error, silently wedging flag sync. The existing test/launchdarkly nightly covers value sync, persistence, targeting, and the kill switch, but nothing exercises reconnect after a mid-stream drop. A prior attempt to add such a test was abandoned because the failure couldn't be reproduced against real LaunchDarkly ("the test would need to cut the connection"). This reproduces it deterministically against a mock.

This is the reconnect test from #37026, un-stacked from the SDK bump (#37025) and rebased onto current main, so it lands ahead of the upstream-SDK migration. The goal is to simplify review and increase confidence: the test characterizes reconnect behavior against today's (fork) SDK first, and the SDK bump must keep it green.

What changed

Production: a hidden --launchdarkly-base-uri flag (env LAUNCHDARKLY_BASE_URI) that overrides the SDK's streaming, polling, and events endpoints with a single base URL, via the SDK's ServiceEndpointsBuilder::relay_proxy. Generally useful for relay-proxy setups; here it lets tests point the SDK at a mock. Threaded through SystemParameterSyncClientConfig::LaunchDarkly into ld_config. The current fork SDK (2.6.x) already exposes this API, so no SDK change is needed.

Test (test/launchdarkly-reconnect):

  • mock_ld.py — a minimal mock of the LD streaming API. The first streaming client gets an initial flag value (2 GiB), then the connection is reset mid-stream with a TCP RST (SO_LINGER 0) so the SDK sees a non-Eof transport error (the incident class), not the Eof it always recovered from. Every reconnecting client gets an updated value (3 GiB).
  • mzcompose.py — boots environmentd pointed at the mock (mapping max_result_size to the flag) and asserts SHOW max_result_size reaches 3GB. That can only happen if the data source reconnected after the reset; a regressed SDK stays stuck at 2GB and the assertion times out.
  • Wired into the nightly pipeline. Needs no real LaunchDarkly credentials (unlike test/launchdarkly).
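
For reviewers unfamiliar with the SO_LINGER 0 trick, here is a minimal sketch of the reset mechanism. It is not the code in mock_ld.py: the function names, the payload, and the plain-socket client are illustrative assumptions, and the `"ii"` linger layout assumes a POSIX platform. The server sends one event, waits until the client has read it, then closes with a zero linger timeout so the peer sees a connection reset rather than a clean end of stream.

```python
import socket
import struct
import threading


def reset_connection(conn: socket.socket) -> None:
    """Close `conn` so the peer sees a TCP RST instead of a normal FIN."""
    # struct linger {l_onoff=1, l_linger=0}: close() drops unsent data and
    # sends RST. The "ii" layout is an assumption that holds on Linux/macOS.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    conn.close()


def serve_then_reset(listener: socket.socket, payload: bytes,
                     client_ready: threading.Event) -> None:
    """Hypothetical stand-in for the mock: send one event, then reset."""
    conn, _ = listener.accept()
    conn.sendall(payload)
    # Wait until the client has consumed the event so the reset is
    # unambiguously "mid-stream" and not racing with the initial data.
    client_ready.wait(timeout=5)
    reset_connection(conn)


def demo() -> tuple:
    """Return (bytes received before the drop, how the stream ended)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # ephemeral port on loopback
    listener.listen(1)
    client_ready = threading.Event()
    payload = b"event: put\ndata: {}\n\n"  # placeholder, not real LD data
    server = threading.Thread(
        target=serve_then_reset, args=(listener, payload, client_ready)
    )
    server.start()

    client = socket.create_connection(listener.getsockname(), timeout=5)
    received = b""
    while len(received) < len(payload):
        chunk = client.recv(4096)
        if not chunk:
            break
        received += chunk
    client_ready.set()
    try:
        # A clean close would return b"" (Eof); a RST raises instead.
        outcome = "eof" if client.recv(4096) == b"" else "data"
    except ConnectionResetError:
        outcome = "reset"
    finally:
        client.close()
        server.join()
        listener.close()
    return received, outcome
```

Calling `demo()` on a POSIX host should return the payload together with `"reset"`. The distinction matters because a client that only handles the `b""` (Eof) case would never reach its reconnect path here.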

Validation status

  • cargo check clean (mz-adapter, mz-environmentd, mz-sqllogictest); bin/fmt clean.
  • The test was previously validated green against the fixed upstream 3.1.1 SDK (in the stacked form on #37026). It has not yet been run against the current fork SDK on main. It should pass (we reverted to the fork precisely because it handles incident-984's non-Eof reconnect), but the mock uses a TCP RST specifically and that path is unverified on the fork.
  • Draft until the nightly LaunchDarkly reconnect job is confirmed green on this branch (fork SDK). Trigger via the ci-nightly label.

🤖 Generated with Claude Code

Add an integration test that reproduces incident-984: the LaunchDarkly data
source must reconnect after its streaming connection drops with a non-Eof
error, so flag updates keep syncing. This lands ahead of the upstream SDK bump
so the behavior is characterized against the current (fork) SDK first, and the
bump must keep the test green.

To make this testable against a controlled server, add a hidden
`--launchdarkly-base-uri` flag (env `LAUNCHDARKLY_BASE_URI`) that overrides the
SDK's streaming, polling, and events endpoints with a single base URL via the
SDK's relay-proxy support. This is also generally useful for pointing at a
LaunchDarkly relay proxy. It is threaded through
`SystemParameterSyncClientConfig::LaunchDarkly` into `ld_config`.

The test (test/launchdarkly-reconnect) runs a mock LaunchDarkly streaming
server that serves an initial flag value, resets the first streaming connection
mid-stream with a TCP RST (a non-Eof transport error, as in the incident), and
serves an updated value to every reconnecting client. environmentd is pointed
at the mock, so reaching the updated value proves the data source reconnected;
a regressed SDK stays stuck on the initial value. Unlike test/launchdarkly,
this needs no real LaunchDarkly credentials, and runs in the nightly pipeline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jasonhernandez jasonhernandez added the ci-nightly PR CI control: also trigger Nightly label Jun 30, 2026