Skip to content

fix(cli): bypass HTTP_PROXY for runner webhook via raw socket#563

Closed
XWang20 wants to merge 1 commit into
tiann:mainfrom
XWang20:fix/runner-webhook-bypass-bun-proxy
Closed

fix(cli): bypass HTTP_PROXY for runner webhook via raw socket#563
XWang20 wants to merge 1 commit into
tiann:mainfrom
XWang20:fix/runner-webhook-bypass-bun-proxy

Conversation

@XWang20

@XWang20 XWang20 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Problem

In an environment with HTTP_PROXY set and a NO_PROXY that does not literally include 127.0.0.1, every "new session" / "resume session" from the web UI fails after a 15s wait with:

Runner last spawn error: Session webhook timeout for PID <n>

Reproducer: HTTP_PROXY=http://127.0.0.1:7890 + NO_PROXY="127.*,localhost" (the wildcard form glibc / wget accept but libcurl-style parsers don't). bun routes the loopback webhook through the proxy, the runner never sees it, and the parent gives up at the 15s mark. cloudflared escapes this only because its origin is the literal localhost, which NO_PROXY does match.

Root cause

bun honors HTTP_PROXY at process start for both fetch and node:http; neither API exposes a per-request bypass that actually takes effect (verified proxy: '', proxy: undefined, runtime mutation of process.env.NO_PROXY). The only transport in bun that ignores the proxy stack is node:net, since it is raw TCP.

Fix

Replace runnerPost's fetch with a minimal HTTP/1.1 client on node:net.connect. Public signature, return shape, error message format, and HAPI_RUNNER_HTTP_TIMEOUT semantics are unchanged, so all callers (notifyRunnerSessionStarted, listRunnerSessions, stopRunnerSession, spawnRunnerSession, stopRunnerHttp) keep working without modification.

The client sends Connection: close, so the runner control server closes after each response and the parser resolves on end.

Testing

  • cli unit tests pass (bun test runnerIdentity buildCliArgs)
  • End-to-end on a real spawn under bun + bad proxy env: webhook delivery 389ms vs the previous 15s timeout

Known follow-up

The minimal HTTP/1.1 parser does not honor Content-Length for early return and does not handle close without end. With Connection: close against the existing runner control server this is fine in practice, but a future patch should add a close handler so a mid-response runner crash fails immediately rather than waiting for HAPI_RUNNER_HTTP_TIMEOUT.

When HAPI runs in an environment with HTTP_PROXY/HTTPS_PROXY set and a
NO_PROXY that does not literally bypass 127.0.0.1, the loopback webhook
each spawned session sends back to the runner gets routed through the
user's proxy, fails to reach the runner's control port, and the parent
runner reports "Session webhook timeout for PID <n>" 15s later — every
"new session" / "resume session" from the web UI.

Reproducer: with HTTP_PROXY=http://127.0.0.1:7890 and the common
NO_PROXY="127.*,localhost" (the form glibc / wget accept but libcurl-
style parsers don't), bun's fetch and bun's node:http both forward
loopback requests to the proxy. cloudflared escapes this only because
its origin uses the literal "localhost", which NO_PROXY does match.
The runner uses 127.0.0.1 and gets caught.

bun honors HTTP_PROXY at process start for both fetch and node:http;
neither API exposes a per-request bypass that actually takes effect
(verified `proxy: ''`, `proxy: undefined`, runtime mutation of
`process.env.NO_PROXY`). The only transport in bun that ignores the
proxy stack is node:net, since it is raw TCP.

Replace `runnerPost`'s `fetch` call with a minimal HTTP/1.1 client
written on `node:net.connect`. Public signature, return shape, error
messages and `HAPI_RUNNER_HTTP_TIMEOUT` semantics are unchanged, so
all callers (`notifyRunnerSessionStarted`, `listRunnerSessions`,
`stopRunnerSession`, `spawnRunnerSession`, `stopRunnerHttp`) keep
working without modification.

Verified end-to-end on a real spawn under bun + bad proxy env: webhook
delivery 389ms vs the previous 15s timeout.
Copilot AI review requested due to automatic review settings May 5, 2026 06:30

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • No high-confidence issues found in the added/modified lines.

Summary

  • Review mode: initial
  • Reviewed cli/src/runner/controlClient.ts plus surrounding runner control server/client context. Residual risk: this hand-rolled HTTP path is covered only by the existing runner integration flow, which depends on the integration environment; a small focused unit test with a local node:net/HTTP server would better lock down response parsing and proxy-bypass behavior.

Testing

  • Not run (automation); static review only.

HAPI Bot

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the CLI’s runner control-plane HTTP client to reliably reach the local runner control server even when HTTP_PROXY is set and NO_PROXY is misformatted for Bun’s proxy parsing, by bypassing the proxy stack via a raw TCP socket.

Changes:

  • Replaced fetch()-based POSTs to the runner control server with a minimal HTTP/1.1 client over node:net.connect().
  • Preserved existing caller-facing API surface (return shape and error message patterns) while changing the underlying transport to avoid proxy interference.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


if (!response.ok) {
const errorMessage = `Request failed: ${path}, HTTP ${response.status}`;
const timeout = process.env.HAPI_RUNNER_HTTP_TIMEOUT ? parseInt(process.env.HAPI_RUNNER_HTTP_TIMEOUT) : 10_000;
`Content-Length: ${payload.length}\r\n` +
`Connection: close\r\n\r\n`;
socket.write(head);
socket.write(payload);

socket.on('error', (error) => {
settle(() => fail(error.message));
});
@tiann

tiann commented May 6, 2026

Copy link
Copy Markdown
Owner

Maybe you should set process.env.NO_PROXY earlier, i think this should work.

@tiann

tiann commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Thanks for the report and the detailed reproducer — the diagnosis of what breaks (bun routing the loopback webhook through HTTP_PROXY when NO_PROXY uses the 127.* wildcard form) is spot on, and it helped us connect this to a wider family of failures (Claude SessionStart hook delivery / transcript sync, local MCP connections, not just the runner webhook).

However, we couldn't reproduce the core claim that no bypass short of raw TCP works. On Bun 1.3.14 (the exact version our release CI builds with), runtime mutation of process.env.NO_PROXY does take effect for both fetch and node:http — including when mutated after a prior fetch (no startup caching):

control (no mutation):    fetch → HTTP 503 via proxy, node:http → HTTP 503 via proxy
after env mutation:       fetch → ConnectionRefused (direct), node:http → ECONNREFUSED (direct)

(My guess: the verification kept the 127.* entry, which Bun indeed doesn't match — that's the bug itself rather than evidence that mutation is ignored.)

Given that, we went with a root-cause fix in #868: normalize NO_PROXY (append literal localhost,127.0.0.1,::1) once at the CLI entrypoint. That covers runnerPost and the other loopback paths this PR doesn't reach (hook-forwarder → hook server, claude → local MCP server, health checks), and child processes inherit the patched env. It also fixes your exact reproducer: NO_PROXY="127.*,localhost"127.*,localhost,127.0.0.1,::1.

On the raw-socket client itself, a few things we'd want to avoid maintaining long-term: socket.setTimeout is an idle timeout (the previous AbortSignal.timeout was a total deadline), close-without-end hangs until timeout (you flagged this), and the parser doesn't handle chunked transfer-encoding — it works today only because Fastify sets Content-Length on JSON responses.

Closing in favor of #868 — could you verify it in your 127.* proxy environment? Much appreciated, and thanks again for the legwork on the reproducer.

tiann added a commit that referenced this pull request Jun 10, 2026
Bun's fetch and node:http honor HTTP_PROXY/HTTPS_PROXY env vars, which can
route loopback traffic through a configured proxy (e.g. Surge/Clash). When
NO_PROXY doesn't explicitly exclude localhost, this breaks loopback
communication: SessionStart hooks fail to arrive (transcripts don't sync, web
UI stays empty), runner control client times out, and MCP server connections
fail.

Normalize NO_PROXY at the CLI entrypoint to always cover loopback
(localhost, 127.0.0.1, ::1). Child processes inherit the patched env so their
loopback traffic is covered too. Non-loopback traffic continues using the
configured proxy. Supersedes the runner control client workaround from #563.
@tiann tiann closed this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants