fix: kill zombie geckodriver when Firefox is closed#86
Conversation
When Firefox is closed while the MCP server is running, geckodriver survives as a zombie. Calling close() hangs because quit() sends a Marionette command to the dead Firefox and waits forever. resetFirefox() now races close() against a 5s timeout. If it doesn't resolve, falls back to killService() (SIGTERM via selenium's onQuit_) and reset() (tear down in-memory state). - Add killService() to FirefoxCore and FirefoxClient - Add log file fd close to reset() - Add .catch() on fire-and-forget quit() in reset() - Make resetFirefox() async with closeSucceeded pattern - Update all resetFirefox() callers to await - Replace raw close() in SIGTERM handler with resetFirefox()
|
I updated the PR text for better understanding. |
juliandescottes
left a comment
There was a problem hiding this comment.
Thanks for the patch. Let's try to keep the complexity in firefox/core.ts
| */ | ||
| killService(): void { | ||
| if (this.driver) { | ||
| void (this.driver as any).onQuit_().catch(() => {}); |
There was a problem hiding this comment.
We should check if onQuit_ is available before, it's an optional argument on the webdriver side if I remember correctly.
| // already closed | ||
| } | ||
| this.logFileFd = undefined; | ||
| } |
There was a problem hiding this comment.
This is duplicated from close(), I imagine you had issues in the next connect() call when trying to open the same file again?
At this point it seems like reset() and close() should apply the same logic and could be merged. For instance, reset is currently not resetting env variables, but that's most likely a bug which means on the next connect() the "custom" env variables will now be considered as original ones.
|
Will review today, thanks |
juliandescottes
left a comment
There was a problem hiding this comment.
A few comments to address, especially on the test, but we should be close.
| try { | ||
| webdriver._bidiConnection.close(); | ||
| } catch { | ||
| /* already dead */ | ||
| } | ||
| // In connect-existing mode, geckodriver's DELETE /session releases Marionette | ||
| // without terminating Firefox (since geckodriver was started with --connect-existing). | ||
| if ('quit' in this.driver) { | ||
| await (this.driver as { quit(): Promise<void> }).quit(); | ||
| webdriver._bidiConnection = undefined; |
There was a problem hiding this comment.
| try { | |
| webdriver._bidiConnection.close(); | |
| } catch { | |
| /* already dead */ | |
| } | |
| // In connect-existing mode, geckodriver's DELETE /session releases Marionette | |
| // without terminating Firefox (since geckodriver was started with --connect-existing). | |
| if ('quit' in this.driver) { | |
| await (this.driver as { quit(): Promise<void> }).quit(); | |
| webdriver._bidiConnection = undefined; | |
| try { | |
| webdriver._bidiConnection.close(); | |
| } finally { | |
| webdriver._bidiConnection = undefined; | |
| } |
Use try / finally here?
| try { | ||
| await this.core.close(); | ||
| } catch { | ||
| /* never throws per contract */ |
There was a problem hiding this comment.
This comment is a bit vague. Which contract is this referring to? Is firefox.close never supposed to throw?
It's fine to swallow failures here, but let's drop the confusing comment and instead log something to note that close failed.
| // Timer has passed but webdriver.quit() still hasn't returned. | ||
| // This means Geckodriver is unresponsive. Kill geckodriver. | ||
| // Calling webdriver.onQuit_ (notice the underscore at the end) kills the geckodriver process |
There was a problem hiding this comment.
We also hit this codepath if quit() throws/rejects early (and one of the tests you added exercises that). Let's simplify the comment.
| // Timer has passed but webdriver.quit() still hasn't returned. | |
| // This means Geckodriver is unresponsive. Kill geckodriver. | |
| // Calling webdriver.onQuit_ (notice the underscore at the end) kills the geckodriver process | |
| // webdriver.quit() either failed or timed out. | |
| // Call webdriver.onQuit_ to kill geckodriver. |
(edit: I just noticed you have a logDebug below, it might be self-explanatory enough thanks to this... feel free to completely remove the comment)
| // --------------------------------------------------------------- | ||
| // Scenario A: Firefox frozen (SIGSTOP) | ||
| // --------------------------------------------------------------- | ||
| console.log('Scenario A: Firefox frozen (SIGSTOP)'); |
There was a problem hiding this comment.
The comment just repeats the content of the log, let's remove it. Same comment on the rest of the file.
| // Verify geckodriver death BEFORE cleaning up frozen Firefox (avoids false positives) | ||
| console.log(' 4. Verifying geckodriver was killed by close()...'); | ||
| if (!(await waitForDeath(a.geckoPid, 5000))) { | ||
| console.log(' [FAIL] Geckodriver still alive after close()'); |
There was a problem hiding this comment.
Let's use console.error for failures, same comment for the other [FAIL] logs in the file.
| killAll(a.firefoxPids); | ||
|
|
||
| console.log(' 6. Reconnecting...'); | ||
| usedPids.push(a.geckoPid, await reconnect(geckosBefore, usedPids)); |
There was a problem hiding this comment.
If the intent here is to have a.geckoPid in usedPids for the reconnect call, we should split it on 2 lines
| usedPids.push(a.geckoPid, await reconnect(geckosBefore, usedPids)); | |
| usedPids.push(a.geckoPid); | |
| usedPids.push(await reconnect(geckosBefore, usedPids)); |
Same comment for the other similar spots.
When Firefox is closed while the MCP server is running, the server freezes and never recovers. This is especially frustrating when you accidentally close Firefox windows that the AI agents use. This fix makes it recover automatically.
The problem
TLDR: The broken geckodriver TCP connection with Firefox is never detected and Selenium has no command timeout. The MCP server waits "forever".
getFirefox()checks isConnected() which fails because Firefox is gonegetFirefox()then callsresetFirefox()which callsFirefoxCore.close()which calls SeleniumsWebDriver.quit()WebDriver.quit()sends HTTP DELETE/sessionto geckodriverWebDriver.quit()keeps waiting for the HTTP response from geckodriver which never comesWebDriver.quit()never executes itsonQuitcallback that stops geckodriver.So there are 2 upstream bugs: One in Selenium and one in Geckodriver. If those are fixed, this patch becomes unecessary. But the MCP server can still work even without those fixes.
The solution
TLDR: Give
FirefoxCore.close()a 5-second timeout. If it hangs, force-kill geckodriver and tear down the dead session . The next tool call starts a fresh Firefox automatically.The code that should kill geckodriver but is never triggered, the
onQuitcallback, is actually secretly exposed in the Selenium WebDriver object:WebDriver.onQuit_(). Notice the underscore at the end. If you call this, geckodriver just dies. We use this to bypass the mistake in Selenium.There is no custom kill command.
onQuit_()uses the internal Seleniumproc.kill('SIGTERM'), which is cross-platform and has no shell commands.Changes
FirefoxCore.killService()-- kills the geckodriver process using selenium's onQuit_()FirefoxClient.killService()-- exposes killService on the public facadeFirefoxCore.reset()-- close log file fd, swallowquit()rejectionresetFirefox()-- Add 5s timeout to close, with a fallback to killService() + reset()Test
scripts/test-closed-window.jscovers two scenarios:Run test with:
node scripts/test-closed-window.js