Skip to content

fix(bg): cancel superseded connect job to stop false sidecar-readiness failures#64

Merged
hawkff merged 1 commit into
mainfrom
fix/mdvpn-sidecar-race
Jun 24, 2026
Merged

fix(bg): cancel superseded connect job to stop false sidecar-readiness failures#64
hawkff merged 1 commit into
mainfrom
fix/mdvpn-sidecar-race

Conversation

@hawkff

@hawkff hawkff commented Jun 24, 2026

Copy link
Copy Markdown
Owner

Summary

Fixes an intermittent "connection failed" shown when starting (or switching to) a
MasterDnsVPN profile, even though the MasterDnsVPN sidecar actually comes up and proxies
traffic. The real cause is a service start/reload race, not a duplicate outbound.

BaseService.Interface.onStartCommand launched the connect coroutine via
runOnMainDispatcher { ... } but never stored the returned Job in data.connectingJob
(it was only set to null in the finally). So stopRunner(restart = true)'s existing
data.connectingJob?.cancelAndJoin() was a no-op, and an in-flight
awaitExternalProcessesReady() from a superseded start was never cancelled.

For a slow-binding sidecar with a long readiness window (MasterDnsVPN waits up to 60s because
it runs a DNS MTU probe + session init before opening its local SOCKS listener), the orphaned
readiness wait kept polling a port whose sidecar had already been torn down by the restart, then
threw sidecar listener not ready on port(s): <port> after the full window — surfacing a false
failure even though the live instance had bound a different port and was carrying traffic.

This is a pre-existing bug (the connect Job has never been assigned in this code path);
it is independent of the 1.13 port and applies to either line, so it is branched off main.

Changes

  • BaseService.kt — assign the connect coroutine to data.connectingJob so
    stopRunner() / reload() can actually cancel an in-flight start. No new field/lifecycle:
    connectingJob already exists and is already cancelled in stopRunner.
  • BoxInstance.awaitExternalProcessesReady (defensive hardening):
    • The poll loop now honors cancellation (ensureActive()), so a superseded start exits
      promptly instead of running to the full readiness deadline.
    • If the GuardedProcessPool is no longer active (its sidecars were torn down by a
      restart), unbound ports are treated as orphans of a superseded start — logged and ignored
      — instead of throwing.

No changes to config generation (ConfigBuilder.kt), the MasterDnsVPN protect-hook path, the
sidecar's protocol/encryption, or libcore. Each generated config already contains exactly one
MasterDnsVPN outbound; the duplication was across separate start attempts, which the cancel fix
resolves.

Testing

  • Local Kotlin typecheck: :app:compileOssDebugKotlinBUILD SUCCESSFUL (JDK 17), no new
    warnings in the changed files.
  • CodeRabbit CLI (coderabbit review --plain) on the committed diff: No findings.
  • On-device verification on Android (cold start of the MasterDnsVPN profile, a live profile
    switch into it, and a regression check of an existing profile) to follow in a comment.

Notes / follow-up

  • The sidecar's ~12–27s bind delay is structural: its local SOCKS listener is opened only after
    MTU probing + session init complete. An optional client-side "bind early, queue until session
    ready" refactor could shorten perceived start time, but it is out of scope here and
    unnecessary once this race is fixed. No server-side rebuild is required (protocol/encryption
    unchanged).

Greptile Summary

This PR fixes an intermittent "connection failed" false positive that appeared during rapid start/reload of profiles that use an external sidecar (especially MasterDnsVPN) with a long readiness window. The root cause was that the Job returned by runOnMainDispatcher in onStartCommand was never stored in data.connectingJob, making stopRunner's cancelAndJoin() a no-op against in-flight starts.

  • BaseService.kt: Assigns the returned Job from runOnMainDispatcher to data.connectingJob, so stopRunner/reload can actually cancel an in-flight connect coroutine instead of silently discarding it.
  • BoxInstance.kt: Adds ensureActive() at the top of each readiness-poll iteration so cancellation is detected promptly, and adds a !processes.isActive guard after the poll loop to treat a timed-out read against an already-stopped process pool as an orphaned start rather than a hard failure.

Confidence Score: 5/5

Safe to merge. The change is minimal and surgical — one assignment in BaseService.kt and two cooperative-cancellation guards in BoxInstance.kt, all on a pre-existing field and code path.

The primary fix correctly assigns the Job returned by runOnMainDispatcher to data.connectingJob, which has always been read by stopRunner's cancelAndJoin(). runOnMainDispatcher is GlobalScope.launch(Dispatchers.Main.immediate, ...) which returns a Job synchronously before the coroutine body runs, so the assignment is safe. The ensureActive() addition is a straightforward cooperative cancellation check, and the !processes.isActive guard is a correct fallback for the deadline-expiry path.

No files require special attention. Both changed files are small and self-contained within the existing start/stop lifecycle.

Important Files Changed

Filename Overview
app/src/main/java/io/nekohasekai/sagernet/bg/BaseService.kt One-line fix: assigns the Job returned by runOnMainDispatcher to data.connectingJob. The field already existed and was already read by stopRunner's cancelAndJoin(), so this is a minimal, correct change with no new lifecycle implications.
app/src/main/java/io/nekohasekai/sagernet/bg/proto/BoxInstance.kt Adds ensureActive() in the readiness-poll loop to honor cooperative cancellation, and a !processes.isActive fallback after the loop to suppress false failures on an already-stopped process pool. Both changes are correct and complementary to the primary fix.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant SC as onStartCommand
    participant CJ as connectingJob
    participant SR as stopRunner
    participant AR as awaitExternalProcessesReady
    participant PP as GuardedProcessPool
    Note over SC,PP: Before fix
    SC->>CJ: runOnMainDispatcher
    CJ->>AR: awaitExternalProcessesReady()
    SC-->>SR: reload triggers stopRunner
    SR->>SR: cancelAndJoin() no-op (null)
    SR->>PP: killProcesses()
    AR--xCJ: false failure after 60s
    Note over SC,PP: After fix
    SC->>CJ: "data.connectingJob = runOnMainDispatcher"
    CJ->>AR: awaitExternalProcessesReady()
    SC-->>SR: reload triggers stopRunner
    SR->>CJ: cancelAndJoin() cancels job
    AR->>AR: ensureActive() throws CancellationException
    CJ->>CJ: exits cleanly
    SR->>PP: killProcesses() orderly
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant SC as onStartCommand
    participant CJ as connectingJob
    participant SR as stopRunner
    participant AR as awaitExternalProcessesReady
    participant PP as GuardedProcessPool
    Note over SC,PP: Before fix
    SC->>CJ: runOnMainDispatcher
    CJ->>AR: awaitExternalProcessesReady()
    SC-->>SR: reload triggers stopRunner
    SR->>SR: cancelAndJoin() no-op (null)
    SR->>PP: killProcesses()
    AR--xCJ: false failure after 60s
    Note over SC,PP: After fix
    SC->>CJ: "data.connectingJob = runOnMainDispatcher"
    CJ->>AR: awaitExternalProcessesReady()
    SC-->>SR: reload triggers stopRunner
    SR->>CJ: cancelAndJoin() cancels job
    AR->>AR: ensureActive() throws CancellationException
    CJ->>CJ: exits cleanly
    SR->>PP: killProcesses() orderly
Loading

Reviews (1): Last reviewed commit: "fix(bg): cancel superseded connect job; ..." | Re-trigger Greptile

…ness

onStartCommand launched the connect coroutine via runOnMainDispatcher { } but
never stored the returned Job in data.connectingJob (only nulled it in finally).
So stopRunner(restart=true)'s connectingJob?.cancelAndJoin() was a no-op: an
in-flight awaitExternalProcessesReady() from a superseded start (reload/profile
switch) was never cancelled. With MasterDnsVPN's 60s readiness window the orphan
kept polling a now-killed sidecar port and then threw 'sidecar listener not ready',
surfacing a false 'connection failed' even though the live instance was carrying
traffic.

- BaseService: assign the connect Job to data.connectingJob so stopRunner can
  actually cancel it.
- BoxInstance.awaitExternalProcessesReady: honor cancellation in the poll loop
  (ensureActive) and, if the process pool is no longer active, treat unbound ports
  as orphans of a superseded start (log + return) instead of throwing.
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Two coordinated fixes address a race condition where a superseded service start continued polling after teardown. BaseService.onStartCommand now saves the connect coroutine to data.connectingJob so stopRunner/reload can cancel it. BoxInstance.awaitExternalProcessesReady adds ensureActive() in its polling loop and early-exits when the process pool is already stopped.

Changes

Superseded-start race condition fix

Layer / File(s) Summary
Track connectingJob in onStartCommand
app/src/main/java/io/nekohasekai/sagernet/bg/BaseService.kt
onStartCommand assigns the connect coroutine to data.connectingJob so stopRunner/reload can call cancelAndJoin() on it. Previously the coroutine was launched anonymously, leaving connectingJob null and making cancellation a no-op.
Cancellation-aware sidecar readiness polling
app/src/main/java/io/nekohasekai/sagernet/bg/proto/BoxInstance.kt
awaitExternalProcessesReady calls ensureActive() each polling iteration so a cancelled connect job exits the loop immediately. Adds a separate guard: if the processes pool is already inactive while ports remain pending, logs a warning about orphaned/superseded ports and returns early instead of reaching the timeout/exception path.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • hawkff/NekoBoxForAndroid#46: Also modifies BoxInstance.awaitExternalProcessesReady(), adjusting timeout and fatality rules for the sidecar readiness loop that this PR now makes cancellation-aware.

Poem

🐇 A job once launched and lost to air,
Now tracked with care beyond compare.
The polling loop checks "am I here?"
And orphaned ports no longer sneer.
No false alarms when starts are shed —
The rabbit fixed the race instead! 🎉

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly states the main fix: canceling superseded connect jobs to prevent false sidecar-readiness failures.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The description accurately explains the same race fix and BoxInstance cancellation hardening reflected in the changed files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Comment @coderabbitai help to get the list of available commands.

@hawkff hawkff merged commit f977e30 into main Jun 24, 2026
5 checks passed
@hawkff hawkff deleted the fix/mdvpn-sidecar-race branch June 24, 2026 14:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant