Skip to content

fix: kill zombie geckodriver when Firefox is closed#86

Open
csamu wants to merge 3 commits into
mozilla:mainfrom
csamu:fix/zombie-geckodriver
Open

fix: kill zombie geckodriver when Firefox is closed#86
csamu wants to merge 3 commits into
mozilla:mainfrom
csamu:fix/zombie-geckodriver

Conversation

@csamu

@csamu csamu commented May 14, 2026

Copy link
Copy Markdown
Contributor

When Firefox is closed while the MCP server is running, the server freezes and never recovers. This is especially frustrating when you accidentally close Firefox windows that the AI agents use. This fix makes it recover automatically.

The problem

TLDR: The broken geckodriver TCP connection with Firefox is never detected and Selenium has no command timeout. The MCP server waits "forever".

  1. A non-headless MCP session is started
  2. User closes Firefox (or it crashes)
  3. Any MCP tool call arrives (e.g. take_snapshot)
    • getFirefox() checks isConnected() which fails because Firefox is gone
    • getFirefox() then calls resetFirefox() which calls FirefoxCore.close() which calls Seleniums WebDriver.quit()
    • WebDriver.quit() sends HTTP DELETE /session to geckodriver
      • geckodriver translates the DELETE into a Marionette command and sends it to Firefox over raw TCP
      • Since Firefox is gone, geckodriver tries to send commands over a broken TCP connection, and that blocks indefinitely. Probably a race condition according to AI.
    • WebDriver.quit() keeps waiting for the HTTP response from geckodriver which never comes
    • WebDriver.quit() never executes its onQuit callback that stops geckodriver.
    • Geckodriver is now zombie
  4. The tool call never returns. The entire MCP server is now stuck.

So there are 2 upstream bugs: One in Selenium and one in Geckodriver. If those are fixed, this patch becomes unecessary. But the MCP server can still work even without those fixes.

The solution

TLDR: Give FirefoxCore.close() a 5-second timeout. If it hangs, force-kill geckodriver and tear down the dead session . The next tool call starts a fresh Firefox automatically.

The code that should kill geckodriver but is never triggered, the onQuit callback, is actually secretly exposed in the Selenium WebDriver object: WebDriver.onQuit_(). Notice the underscore at the end. If you call this, geckodriver just dies. We use this to bypass the mistake in Selenium.

There is no custom kill command. onQuit_() uses the internal Selenium proc.kill('SIGTERM'), which is cross-platform and has no shell commands.

Changes

  • Add: FirefoxCore.killService() -- kills the geckodriver process using selenium's onQuit_()
  • Add: FirefoxClient.killService() -- exposes killService on the public facade
  • Edit: FirefoxCore.reset() -- close log file fd, swallow quit() rejection
  • Edit: resetFirefox() -- Add 5s timeout to close, with a fallback to killService() + reset()
  • Minor related edits

Test

scripts/test-closed-window.js covers two scenarios:

  • SIGSTOP: Freezes Firefox so close() genuinely hangs
  • SIGKILL: Kills Firefox completely

Run test with: node scripts/test-closed-window.js

When Firefox is closed while the MCP server is running, geckodriver
survives as a zombie. Calling close() hangs because quit() sends a
Marionette command to the dead Firefox and waits forever.

resetFirefox() now races close() against a 5s timeout. If it doesn't
resolve, falls back to killService() (SIGTERM via selenium's onQuit_)
and reset() (tear down in-memory state).

- Add killService() to FirefoxCore and FirefoxClient
- Add log file fd close to reset()
- Add .catch() on fire-and-forget quit() in reset()
- Make resetFirefox() async with closeSucceeded pattern
- Update all resetFirefox() callers to await
- Replace raw close() in SIGTERM handler with resetFirefox()
@csamu

csamu commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

I updated the PR text for better understanding.

@juliandescottes juliandescottes left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch. Let's try to keep the complexity in firefox/core.ts

Comment thread src/index.ts
Comment thread src/firefox/core.ts Outdated
*/
killService(): void {
if (this.driver) {
void (this.driver as any).onQuit_().catch(() => {});

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should check if onQuit_ is available before, it's an optional argument on the webdriver side if I remember correctly.

Comment thread src/firefox/core.ts Outdated
// already closed
}
this.logFileFd = undefined;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is duplicated from close(), I imagine you had issues in the next connect() call when trying to open the same file again?

At this point it seems like reset() and close() should apply the same logic and could be merged. For instance, reset is currently not resetting env variables, but that's most likely a bug which means on the next connect() the "custom" env variables will now be considered as original ones.

@juliandescottes juliandescottes self-requested a review June 2, 2026 07:08
@juliandescottes

Copy link
Copy Markdown
Collaborator

Will review today, thanks

@juliandescottes juliandescottes left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments to address, especially on the test, but we should be close.

Comment thread src/firefox/core.ts
Comment on lines +390 to +395
try {
webdriver._bidiConnection.close();
} catch {
/* already dead */
}
// In connect-existing mode, geckodriver's DELETE /session releases Marionette
// without terminating Firefox (since geckodriver was started with --connect-existing).
if ('quit' in this.driver) {
await (this.driver as { quit(): Promise<void> }).quit();
webdriver._bidiConnection = undefined;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try {
webdriver._bidiConnection.close();
} catch {
/* already dead */
}
// In connect-existing mode, geckodriver's DELETE /session releases Marionette
// without terminating Firefox (since geckodriver was started with --connect-existing).
if ('quit' in this.driver) {
await (this.driver as { quit(): Promise<void> }).quit();
webdriver._bidiConnection = undefined;
try {
webdriver._bidiConnection.close();
} finally {
webdriver._bidiConnection = undefined;
}

Use try / finally here?

Comment thread src/firefox/index.ts
try {
await this.core.close();
} catch {
/* never throws per contract */

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a bit vague. Which contract is this referring to? Is firefox.close never supposed to throw?

It's fine to swallow failures here, but let's drop the confusing comment and instead log something to note that close failed.

Comment thread src/firefox/core.ts
Comment on lines +411 to +413
// Timer has passed but webdriver.quit() still hasn't returned.
// This means Geckodriver is unresponsive. Kill geckodriver.
// Calling webdriver.onQuit_ (notice the underscore at the end) kills the geckodriver process

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also hit this codepath if quit() throws/rejects early (and one of the tests you added exercises that). Let's simplify the comment.

Suggested change
// Timer has passed but webdriver.quit() still hasn't returned.
// This means Geckodriver is unresponsive. Kill geckodriver.
// Calling webdriver.onQuit_ (notice the underscore at the end) kills the geckodriver process
// webdriver.quit() either failed or timed out.
// Call webdriver.onQuit_ to kill geckodriver.

(edit: I just noticed you have a logDebug below, it might be self-explanatory enough thanks to this... feel free to completely remove the comment)

Comment on lines +140 to +143
// ---------------------------------------------------------------
// Scenario A: Firefox frozen (SIGSTOP)
// ---------------------------------------------------------------
console.log('Scenario A: Firefox frozen (SIGSTOP)');

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment just repeats the content of the log, let's remove it. Same comment on the rest of the file.

// Verify geckodriver death BEFORE cleaning up frozen Firefox (avoids false positives)
console.log(' 4. Verifying geckodriver was killed by close()...');
if (!(await waitForDeath(a.geckoPid, 5000))) {
console.log(' [FAIL] Geckodriver still alive after close()');

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use console.error for failures, same comment for the other [FAIL] logs in the file.

killAll(a.firefoxPids);

console.log(' 6. Reconnecting...');
usedPids.push(a.geckoPid, await reconnect(geckosBefore, usedPids));

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the intent here is to have a.geckoPid in usedPids for the reconnect call, we should split it on 2 lines

Suggested change
usedPids.push(a.geckoPid, await reconnect(geckosBefore, usedPids));
usedPids.push(a.geckoPid);
usedPids.push(await reconnect(geckosBefore, usedPids));

Same comment for the other similar spots.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants