WIP - tackling some flakes and failures #2736

Draft
nbbeeken wants to merge 10 commits into main from MONGOSH-2855-flakes

@nbbeeken nbbeeken commented Jun 5, 2026

No description provided.

nbbeeken added 10 commits June 5, 2026 11:56
Windows CI sometimes kills mongod before it emits its port announcement,
leaving `MongoRunnerSetup.start()` with the unhelpful "Server log output
did not include port or socket" error. Retry up to 3 times (2s, 4s back-off)
so transient AV-scan or filesystem-lock failures self-heal. On each failed
attempt, print any log files mongod managed to write so persistent failures
are diagnosable without reading Evergreen task logs manually.

Addresses Groups 1 and 9 from docs/foilage-test-tickets.md.
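
The retry behaviour described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `retryWithBackoff`, `backoffDelays`, and the `onFailure` hook are hypothetical names, and the doubling schedule is an assumption that happens to match the stated 2s and 4s delays.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delays between attempts. With 3 attempts and a 2000 ms base there are
// two waits: 2000 ms and 4000 ms (assumed doubling schedule).
function backoffDelays(attempts: number, baseMs: number): number[] {
  const count = Math.max(attempts - 1, 0);
  return Array.from({ length: count }, (_, i) => baseMs * 2 ** i);
}

// Hypothetical helper: runs `fn` until it succeeds or attempts run out.
// `onFailure` is where diagnostics (for example, dumping log files)
// could be printed after each failed attempt.
async function retryWithBackoff<T>(
  fn: (attempt: number) => Promise<T>,
  options: {
    attempts?: number;
    baseMs?: number;
    sleep?: Sleep;
    onFailure?: (error: unknown, attempt: number) => void;
  } = {}
): Promise<T> {
  const { attempts = 3, baseMs = 2000, sleep = realSleep, onFailure } = options;
  const delays = backoffDelays(attempts, baseMs);
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn(attempt);
    } catch (error) {
      lastError = error;
      onFailure?.(error, attempt);
      // Wait only if another attempt follows.
      if (attempt < attempts) await sleep(delays[attempt - 1]);
    }
  }
  throw lastError;
}
```

Passing `sleep` as an option keeps the helper testable without real waiting; the last error is rethrown unchanged so a persistent failure still surfaces the original message.
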
NAN 2.24.0 references v8::AccessControl in nan.h, which was removed in
Node.js 26 (V8 13.x). NAN 2.27.0 (released 2026-05-12) removes those
references. cpu-features already allows ^2.19.0 so only the lockfile
needs updating.

Fixes the smoke-tests windows/node@latest job that has been failing on
main since 2026-06-03.

…I tests on 8.3+

node-gyp 11.x (bundled with Node.js 26) injects LLVM ThinLTO linker flags
(opt:lldltojobs=2) that MSVC link.exe rejects with LNK1117. This breaks
all native addon compilation (interruptor, cpu-features, etc.) on Windows
with Node.js 26. Linux/macOS use Clang/GCC where the ThinLTO flags are
valid, so only the windows/latest matrix combination is excluded.

Also skip the Queryable Encryption prefix/suffix/substring tests on
server >= 8.3 where the server removed the 'Preview' suffix from QE
query types (prefixPreview → prefix, etc.). The driver hasn't shipped
support for the GA API names yet. See MONGOSH-3336, MONGOSH-3341.
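
A version guard for skips like the one above might look like this. It is a sketch under assumptions: `isServerAtLeast` is a hypothetical helper, not an existing mongosh test utility, and it only understands dotted numeric versions with an optional `-suffix`.

```typescript
// Hypothetical helper: true when `serverVersion` is >= `minimum`.
// Pre-release suffixes such as "-rc1" are ignored for the comparison.
function isServerAtLeast(serverVersion: string, minimum: string): boolean {
  const parse = (v: string): number[] =>
    v
      .replace(/-.*$/, "")
      .split(".")
      .map((part) => Number.parseInt(part, 10) || 0);
  const a = parse(serverVersion);
  const b = parse(minimum);
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    // Missing components count as 0, so "8.3" equals "8.3.0".
    const diff = (a[i] ?? 0) - (b[i] ?? 0);
    if (diff !== 0) return diff > 0;
  }
  return true;
}
```

In a Mocha suite, a `before` hook could call `this.skip()` when `isServerAtLeast(version, "8.3")` returns true, where `version` is whatever the test harness reports for the running server.
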
- cli-repl CTRL-C loop1/loop2 (MONGOSH-3381, MONGOSH-3378, MONGOSH-3382):
  server no longer reliably kills $where ops via killOp in 8.3+
- shell-api maxTimeMS (MONGOSH-3379, MONGOSH-3383):
  $where+maxTimeMS hangs 60s instead of throwing MaxTimeMSExpired in 8.3+
- shard bucketsNs (MONGOSH-3328, MONGOSH-3329, MONGOSH-3330):
  server returns user-visible name instead of system.buckets name in 8.3+;
  use oneOf() to accept both forms
- java-shell GraalVM tests (MONGOSH-3307): skip on >= 8.3 where they fail
- e2e glibc/deviceId (MONGOSH-3334): guard assertions when native addon
  returns N/A/'unknown' on ubuntu2004 builders regardless of Node.js version
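
For the bucketsNs item, accepting both forms amounts to a membership check, which is what Chai's `oneOf` assertion does. The sketch below shows the same idea without a test library. The helper names are hypothetical, and the namespace layout (`<db>.system.buckets.<coll>` versus `<db>.<coll>`) is an assumption about what the two forms look like.

```typescript
// Assumption: older servers report "<db>.system.buckets.<coll>" and
// 8.3+ reports the user-visible "<db>.<coll>".
function acceptedBucketsNs(db: string, coll: string): string[] {
  return [`${db}.system.buckets.${coll}`, `${db}.${coll}`];
}

// Hypothetical check: true when the reported namespace is either form.
function isAcceptedBucketsNs(actual: string, db: string, coll: string): boolean {
  return acceptedBucketsNs(db, coll).includes(actual);
}
```
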
… hooks

The original workaround for nodejs/node#61895 searched
for the REPL's newListener hook by checking listener.toString() for
'ERR_INVALID_REPL_INPUT'. That string is not stable across Node.js patch
versions; on some Node.js 24 builds on ubuntu2004 the hook isn't found and
stays alive, keeping the REPL context reachable and preventing GC.

Snapshot the process newListener set before REPL creation and remove any
listeners added by the REPL using identity comparison instead.

Two fixes to make the GC regression test reliable across Node.js versions:

1. Use identity comparison (snapshot before vs after REPL creation) to
   remove the newListener handler added by the Node.js REPL for
   nodejs/node#61895. The previous toString()
   check for 'ERR_INVALID_REPL_INPUT' is not stable across Node.js patch
   versions — on Node.js 24.16.0 ubuntu2004 the function's source text
   is not present in toString(), so the handler was never removed.

2. After the REPL emits 'exit' in mongosh-repl.close(), null out
   repl.context so that async cleanup callbacks (e.g. the history file
   write) that keep the REPLServer alive don't transitively prevent the
   vm.Context from being garbage collected.
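
The identity-comparison approach can be sketched as below. This is an illustration, not the PR's code: `snapshotListeners` and `removeListenersAddedSince` are hypothetical names, and only the standard `EventEmitter` methods `listeners()` and `removeListener()` are used.

```typescript
import { EventEmitter } from "node:events";

// Records the listeners registered for `event` at this moment.
function snapshotListeners(emitter: EventEmitter, event: string): Set<Function> {
  return new Set<Function>(emitter.listeners(event));
}

// Removes every listener for `event` that was not present in `before`.
// Functions are compared by identity, so nothing depends on toString().
function removeListenersAddedSince(
  emitter: EventEmitter,
  event: string,
  before: Set<Function>
): number {
  let removed = 0;
  for (const listener of emitter.listeners(event)) {
    if (!before.has(listener)) {
      emitter.removeListener(event, listener as (...args: any[]) => void);
      removed++;
    }
  }
  return removed;
}

// Intended usage (the REPL creation step is elided here):
//   const before = snapshotListeners(process, "newListener");
//   ... create the REPL ...
//   removeListenersAddedSince(process, "newListener", before);
```

Because the comparison is by function identity, a listener that was already registered before the snapshot is never touched, even if its source text happens to look like the REPL's hook.
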
1. waitForPrompt (test-shell.ts): split on \r as well as \n so that
   spinner output written with carriage-returns (e.g. from the CSFLE
   library during FLE collection setup) does not prevent prompt
   detection. Fixes FLE 'allows automatic range encryption' flake on
   macos_15_amd64_gui (expected prompt timeout with '| | | |' in output).

2. e2e-analytics before_all (e2e-analytics.spec.ts): increase the
   executeLine timeout for rs.initiate() from 10 s to 30 s. After
   initiating a 4-node replica set mongosh updates its prompt once it
   detects the new topology; on slow CI this detection can exceed 10 s.
   Fixes the before_all hook flake on e2e_tests_darwin_arm64_m805.

3. kerberos.sh: retry docker compose up --build up to 3 times with
   5 s / 10 s back-off. Transient network errors downloading Debian
   packages during Docker image builds (e.g. connection reset fetching
   krb5-user from deb.debian.org) no longer fail the entire suite.

4. evergreen.yml.in / .evergreen.yml: wrap the VSCode docker run in
   retry-with-backoff.sh (ATTEMPTS=3). VSCode Insiders occasionally
   exits with code 255 due to extension-host cleanup races; a retry
   picks up where the first attempt left off.
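
Item 1 above (splitting on `\r` as well as `\n`) can be illustrated with a small sketch. The function names and the default prompt pattern are hypothetical and much simpler than a real mongosh prompt; the point is only that a bare carriage return is treated as a line break.

```typescript
// Splits terminal output into lines, treating "\r\n", "\r", and "\n"
// all as line breaks.
function splitOutputLines(output: string): string[] {
  return output.split(/\r\n|\r|\n/);
}

// Hypothetical prompt check: the last line must consist of a prompt
// such as "test> " and nothing else.
function endsWithPrompt(output: string, promptPattern: RegExp = /^\w+> $/): boolean {
  const lines = splitOutputLines(output);
  const last = lines[lines.length - 1] ?? "";
  return promptPattern.test(last);
}
```

With spinner output such as `"| | | |\rtest> "`, splitting on `\n` alone leaves the spinner characters on the same line as the prompt, so an anchored pattern does not match; splitting on `\r` as well isolates `"test> "` as the last line.
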
The script was tracked as 100644 (non-executable) so Evergreen's bash
rejected it with exit 126 when invoked directly.