fix(bg): cancel superseded connect job to stop false sidecar-readiness failures#64
Conversation
…ness
onStartCommand launched the connect coroutine via runOnMainDispatcher { } but
never stored the returned Job in data.connectingJob (only nulled it in finally).
So stopRunner(restart=true)'s connectingJob?.cancelAndJoin() was a no-op: an
in-flight awaitExternalProcessesReady() from a superseded start (reload/profile
switch) was never cancelled. With MasterDnsVPN's 60s readiness window the orphan
kept polling a now-killed sidecar port and then threw 'sidecar listener not ready',
surfacing a false 'connection failed' even though the live instance was carrying
traffic.
- BaseService: assign the connect Job to data.connectingJob so stopRunner can
actually cancel it.
- BoxInstance.awaitExternalProcessesReady: honor cancellation in the poll loop
(ensureActive) and, if the process pool is no longer active, treat unbound ports
as orphans of a superseded start (log + return) instead of throwing.
📝 WalkthroughWalkthroughTwo coordinated fixes address a race condition where a superseded service start continued polling after teardown. ChangesSuperseded-start race condition fix
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
Comment |
Summary
Fixes an intermittent "connection failed" shown when starting (or switching to) a
MasterDnsVPN profile, even though the MasterDnsVPN sidecar actually comes up and proxies
traffic. The real cause is a service start/reload race, not a duplicate outbound.
BaseService.Interface.onStartCommandlaunched the connect coroutine viarunOnMainDispatcher { ... }but never stored the returnedJobindata.connectingJob(it was only set to
nullin thefinally). SostopRunner(restart = true)'s existingdata.connectingJob?.cancelAndJoin()was a no-op, and an in-flightawaitExternalProcessesReady()from a superseded start was never cancelled.For a slow-binding sidecar with a long readiness window (MasterDnsVPN waits up to 60s because
it runs a DNS MTU probe + session init before opening its local SOCKS listener), the orphaned
readiness wait kept polling a port whose sidecar had already been torn down by the restart, then
threw
sidecar listener not ready on port(s): <port>after the full window — surfacing a falsefailure even though the live instance had bound a different port and was carrying traffic.
This is a pre-existing bug (the connect
Jobhas never been assigned in this code path);it is independent of the 1.13 port and applies to either line, so it is branched off
main.Changes
BaseService.kt— assign the connect coroutine todata.connectingJobsostopRunner()/reload()can actually cancel an in-flight start. No new field/lifecycle:connectingJobalready exists and is already cancelled instopRunner.BoxInstance.awaitExternalProcessesReady(defensive hardening):ensureActive()), so a superseded start exitspromptly instead of running to the full readiness deadline.
GuardedProcessPoolis no longer active (its sidecars were torn down by arestart), unbound ports are treated as orphans of a superseded start — logged and ignored
— instead of throwing.
No changes to config generation (
ConfigBuilder.kt), the MasterDnsVPN protect-hook path, thesidecar's protocol/encryption, or
libcore. Each generated config already contains exactly oneMasterDnsVPN outbound; the duplication was across separate start attempts, which the cancel fix
resolves.
Testing
:app:compileOssDebugKotlin—BUILD SUCCESSFUL(JDK 17), no newwarnings in the changed files.
coderabbit review --plain) on the committed diff: No findings.switch into it, and a regression check of an existing profile) to follow in a comment.
Notes / follow-up
MTU probing + session init complete. An optional client-side "bind early, queue until session
ready" refactor could shorten perceived start time, but it is out of scope here and
unnecessary once this race is fixed. No server-side rebuild is required (protocol/encryption
unchanged).
Greptile Summary
This PR fixes an intermittent "connection failed" false positive that appeared during rapid start/reload of profiles that use an external sidecar (especially MasterDnsVPN) with a long readiness window. The root cause was that the
Jobreturned byrunOnMainDispatcherinonStartCommandwas never stored indata.connectingJob, makingstopRunner'scancelAndJoin()a no-op against in-flight starts.BaseService.kt: Assigns the returnedJobfromrunOnMainDispatchertodata.connectingJob, sostopRunner/reloadcan actually cancel an in-flight connect coroutine instead of silently discarding it.BoxInstance.kt: AddsensureActive()at the top of each readiness-poll iteration so cancellation is detected promptly, and adds a!processes.isActiveguard after the poll loop to treat a timed-out read against an already-stopped process pool as an orphaned start rather than a hard failure.Confidence Score: 5/5
Safe to merge. The change is minimal and surgical — one assignment in BaseService.kt and two cooperative-cancellation guards in BoxInstance.kt, all on a pre-existing field and code path.
The primary fix correctly assigns the Job returned by runOnMainDispatcher to data.connectingJob, which has always been read by stopRunner's cancelAndJoin(). runOnMainDispatcher is GlobalScope.launch(Dispatchers.Main.immediate, ...) which returns a Job synchronously before the coroutine body runs, so the assignment is safe. The ensureActive() addition is a straightforward cooperative cancellation check, and the !processes.isActive guard is a correct fallback for the deadline-expiry path.
No files require special attention. Both changed files are small and self-contained within the existing start/stop lifecycle.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%% sequenceDiagram participant SC as onStartCommand participant CJ as connectingJob participant SR as stopRunner participant AR as awaitExternalProcessesReady participant PP as GuardedProcessPool Note over SC,PP: Before fix SC->>CJ: runOnMainDispatcher CJ->>AR: awaitExternalProcessesReady() SC-->>SR: reload triggers stopRunner SR->>SR: cancelAndJoin() no-op (null) SR->>PP: killProcesses() AR--xCJ: false failure after 60s Note over SC,PP: After fix SC->>CJ: "data.connectingJob = runOnMainDispatcher" CJ->>AR: awaitExternalProcessesReady() SC-->>SR: reload triggers stopRunner SR->>CJ: cancelAndJoin() cancels job AR->>AR: ensureActive() throws CancellationException CJ->>CJ: exits cleanly SR->>PP: killProcesses() orderly%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%% sequenceDiagram participant SC as onStartCommand participant CJ as connectingJob participant SR as stopRunner participant AR as awaitExternalProcessesReady participant PP as GuardedProcessPool Note over SC,PP: Before fix SC->>CJ: runOnMainDispatcher CJ->>AR: awaitExternalProcessesReady() SC-->>SR: reload triggers stopRunner SR->>SR: cancelAndJoin() no-op (null) SR->>PP: killProcesses() AR--xCJ: false failure after 60s Note over SC,PP: After fix SC->>CJ: "data.connectingJob = runOnMainDispatcher" CJ->>AR: awaitExternalProcessesReady() SC-->>SR: reload triggers stopRunner SR->>CJ: cancelAndJoin() cancels job AR->>AR: ensureActive() throws CancellationException CJ->>CJ: exits cleanly SR->>PP: killProcesses() orderlyReviews (1): Last reviewed commit: "fix(bg): cancel superseded connect job; ..." | Re-trigger Greptile