V3: Take over ports held by our own stale instance at startup (works right after an update) #160

Merged: AlvinShenSSW merged 14 commits into main from feat/port-takeover-stale-instance on Jul 2, 2026

@AlvinShenSSW


Closes #159

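Problem

Every time a new version is installed, launch fails ("backend did not start within 30s / port in use") because the previous version's backend is still running and holding the agent/hub ports.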

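Approach (the "clean up after itself at startup" you asked for)

At startup, if a target port is held by this app's own stale backend, automatically terminate that stale instance, wait for the port to be released, and then bind. The agent takes over its network and control ports; the hub takes over its API port.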

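Safety (critical)

Never kill a foreign process on the port: the holder is terminated only when it is positively identified as this app's backend of the same role (process name taskpaw-backend[.exe] or the source entrypoint backend_main.py, with an agent/hub role in argv). Foreign or different-role holders are left untouched, and claim_port still fails loudly on a real conflict (Constitution §3). Termination is graceful (terminate → wait → kill as fallback), followed by polling until the port is released. A missing psutil, insufficient permissions, or any exception degrades to a no-op and never crashes. Every step is logged.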
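The terminate → wait → kill fallback and the port-release polling described above can be sketched roughly as follows. This is a minimal illustration, not the code in core/net.py: `stop_gracefully`, `wait_port_free`, and the `ProcLike` protocol are hypothetical names, and the process is duck-typed so the sketch runs without psutil (a `psutil.Process` exposes the same three methods).

```python
import time
from typing import Callable, Protocol


class ProcLike(Protocol):
    """Hypothetical minimal process interface (psutil.Process has these methods)."""

    def terminate(self) -> None: ...
    def kill(self) -> None: ...
    def wait(self, timeout: float) -> object: ...


def stop_gracefully(proc: ProcLike, grace: float = 5.0) -> bool:
    """Ask politely, then force. Never raises; False means 'gave up, did nothing more'."""
    try:
        proc.terminate()
        try:
            proc.wait(timeout=grace)
        except Exception:
            # Did not exit in time (or wait failed): escalate to a hard kill.
            proc.kill()
            proc.wait(timeout=grace)
        return True
    except Exception:
        # Permission problems, vanished process, stuck process: degrade to a no-op.
        return False


def wait_port_free(is_free: Callable[[], bool], timeout: float = 10.0,
                   interval: float = 0.2) -> bool:
    """Poll is_free() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_free():
            return True
        time.sleep(interval)
    return is_free()
```

A caller would attempt the normal bind regardless of the return values, so a failed reclaim still surfaces as the usual loud claim_port error rather than a crash in the reclaim path.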

Location

  • core/net.py: reclaim_port_from_stale_instance + _listener_pids/_is_our_backend (module-level optional psutil, easy to mock).
  • agent/server/launcher.py, hub/server/app.py: called before claim.

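Tests (test_net_reclaim.py, fake psutil, no real processes killed)

Covered: reclaiming a stale same-role backend; leaving a foreign process (nginx) untouched; leaving a different-role backend untouched; matching the source entrypoint backend_main.py; no-op without psutil; _is_our_backend name + role matching. uv run pytest: 465 passed; ruff/mypy all green.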

🤖 Generated with Claude Code

AlvinShenSSW and others added 14 commits July 2, 2026 22:31
…tes)

Every version update failed to launch because the old backend still held the
agent/hub ports. On startup, reclaim a port held by THIS app's own stale backend
(same role) — terminate it + wait for release, then bind. Strictly self-only:
identifies the holder as our taskpaw-backend/backend_main.py of the matching role;
a foreign service is left untouched and claim_port still fails loudly.

- core/net.py: reclaim_port_from_stale_instance + _listener_pids/_is_our_backend
  (module-level optional psutil, mockable).
- agent launcher: reclaim network + control ports (role=agent) before claim_port.
- hub run_hub: reclaim API port (role=hub) before claim_port.
- test_net_reclaim.py: fake-psutil tests incl. "never kill a foreign process".

Design: docs/specs/2026-07-02-port-takeover-design.md
Closes 159

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…out (Codex external gate); ad-hoc sign the macOS bundle

- _is_our_backend matches taskpaw-backend* by prefix (incl. the target-triple
  sidecar the Tauri shell may launch), not an exact-name set.
- reclaim's outer handler also catches psutil.TimeoutExpired so a stuck process
  that won't die after kill() can't abort startup (claim_port still fails loudly).
- tauri.conf bundle.macOS.signingIdentity="-" → Tauri produces a consistent ad-hoc
  signature so locally-built DMGs aren't rejected as "damaged" on Apple Silicon
  (spctl was failing: "no resources but signature indicates they must be present").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t takeover semantics (Codex external gate r2)

- [P1] Revert the hardcoded tauri.conf signingIdentity="-" (it would override
  APPLE_SIGNING_IDENTITY and break Developer-ID signing + notarization in
  release.yml). Instead build.py adds ad-hoc signing to the --config override ONLY
  when no APPLE_SIGNING_IDENTITY is set → local/unsigned builds get a consistent
  ad-hoc signature (no more "damaged" on Apple Silicon), signed releases untouched.
  Verified: codesign --verify --deep --strict → "valid on disk / satisfies its
  Designated Requirement".
- [P2] Document reclaim as intentional "last-launch-wins supersede" for a
  single-agent/hub-per-machine box; preventing accidental double-launch of the same
  version is the Tauri shell's single-instance job (follow-up), not this port logic.
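The [P1] rule above (add ad-hoc signing only when no APPLE_SIGNING_IDENTITY is set) might look roughly like this inside a build script. This is a sketch under assumptions: `signing_override` and `config_args` are hypothetical helper names, and the real build.py merges the setting into an existing `--config` override whose other contents are not shown in this PR.

```python
import json
import os
from typing import List, Mapping, Optional


def signing_override(env: Mapping[str, str]) -> Optional[dict]:
    """Config fragment for ad-hoc signing, or None when a real identity is configured."""
    if env.get("APPLE_SIGNING_IDENTITY"):
        # Leave Developer-ID signing and notarization untouched.
        return None
    # "-" asks for an ad-hoc signature on local/unsigned builds.
    return {"bundle": {"macOS": {"signingIdentity": "-"}}}


def config_args(env: Mapping[str, str] = os.environ) -> List[str]:
    """Extra CLI arguments carrying the override as an inline JSON string."""
    override = signing_override(env)
    return ["--config", json.dumps(override)] if override else []
```

Keeping the decision in the build script rather than in tauri.conf means the checked-in config never overrides the identity supplied by the release workflow.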

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
backend_main defaults to agent when launched with no role arg, so an agent must
also reclaim a role-less taskpaw-backend; hub still requires an explicit "hub".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… external gate r4)

The agent needs BOTH its network and control ports. Reclaiming them one-by-one
could kill our own stale agent for one port and then still fail claim_port on the
other if a FOREIGN service holds it — leaving no agent running. Add
reclaim_ports_from_stale_instance(): inspect every required port first and abort
the whole reclaim if any holder is foreign, so the old agent is only superseded
when all its ports are free or ours. Factor out _terminate_backend (shared with
the single-port hub path).
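The all-or-nothing rule can be illustrated with a small planning function: classify every required port first, and only then decide whether anything gets terminated. This is a hedged sketch, not the body of reclaim_ports_from_stale_instance; `Holder`, `plan_reclaim`, and the `classify` callback are hypothetical names.

```python
from enum import Enum
from typing import Callable, Iterable, List


class Holder(Enum):
    FREE = "free"        # nobody is listening
    OURS = "ours"        # our own stale backend of the same role
    FOREIGN = "foreign"  # anything else


def plan_reclaim(ports: Iterable[int], classify: Callable[[int], Holder]) -> List[int]:
    """Ports to reclaim, or [] if any required port is held by a foreign process."""
    states = {port: classify(port) for port in ports}
    if any(state is Holder.FOREIGN for state in states.values()):
        # Startup would fail anyway, so keep the old agent alive.
        return []
    return [port for port, state in states.items() if state is Holder.OURS]
```

With this shape, the stale agent is terminated only when the returned list is non-empty, which by construction means no required port has a foreign holder.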

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odex external gate r5)

_listener_pids matched by port number alone, so a foreign 127.0.0.1:P listener
would (falsely) block an agent configured for 192.168.x.y:P and abort the reclaim.
Filter by address conflict (_addr_conflicts): same address, or a wildcard on
either side, and only within the same IP family.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odex external gate r6)

- _addr_conflicts: a `localhost` bind (allowed by the Hub guard) is reported by
  psutil as a numeric 127.x, so treat any two loopback addresses as conflicting —
  a stale localhost-bound backend is now reclaimed.
- _is_our_backend: match the full package path taskpaw_v3/packaging/backend_main.py
  (or the -m module) instead of a bare backend_main.py, so an unrelated project's
  script on the port is never mistaken for ours.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A stale V3 process launched with the documented headless commands (deployment.md:
python -m taskpaw_v3.agent / python -m taskpaw_v3.hub run) has process name
`python` and no backend_main in its argv, so it was treated as foreign and the port
wasn't reclaimed. Add _backend_role/_role_from_module: derive the role from the
module name (agent|hub), covering the -m module string and its resolved path form,
alongside the existing sidecar + packaging entrypoints. Doc updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CRITICAL: system-wide psutil.net_connections() needs root on macOS, so on the
logged-in-user desktop app it raised AccessDenied and the whole takeover silently
no-op'd — the exact update/restart failure this feature fixes. Flip the approach:
enumerate OUR OWN processes (process_iter) and read each one's own sockets
(_proc_listen_conns), which works for a same-user process without root. Foreign
holders are no longer directly visible, so the agent's all-or-nothing multi-port
reclaim classifies a port as foreign when it's occupied (not port_available) yet not
ours. Handles the psutil 6 Process.connections→net_connections rename. Doc updated.
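A rough sketch of the per-process approach: read each candidate process's own sockets and tolerate both the old and the new psutil method name. The helper names `proc_listen_ports` and `own_holders` are hypothetical (the PR names its helper _proc_listen_conns but does not show it). In real use `procs` would come from `psutil.process_iter()`; here the process objects are duck-typed so the sketch runs without psutil installed.

```python
from typing import Any, Callable, Iterable, List


def proc_listen_ports(proc: Any) -> List[int]:
    """Listening ports of one process, read from its own socket table."""
    try:
        # psutil 6.0 renamed Process.connections() to Process.net_connections().
        getter = getattr(proc, "net_connections", None) or getattr(proc, "connections")
        conns = getter(kind="inet")
    except Exception:
        # AccessDenied, NoSuchProcess, ZombieProcess, ...: treat as "nothing visible".
        return []
    # psutil.CONN_LISTEN is the string "LISTEN"; laddr is empty for unbound sockets.
    return [c.laddr.port for c in conns if c.status == "LISTEN" and c.laddr]


def own_holders(procs: Iterable[Any], port: int,
                is_ours: Callable[[Any], bool]) -> List[Any]:
    """Processes identified as ours that are listening on the given port."""
    return [p for p in procs if is_ours(p) and port in proc_listen_ports(p)]
```

Because only our own processes are inspected, a port that is occupied but has no entry in `own_holders` has to be treated as foreign, which matches the classification rule in the commit message.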

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- _backend_role: parse the role as an EXACT argv token ('agent'/'hub') via
  _explicit_role instead of `"hub" in cmd`, so a path/flag that merely contains the
  word (e.g. /Users/hubert/…) can't misclassify an agent backend as a hub and leave
  the stale agent running.
- Tighten the sidecar name check from a loose startswith to a regex matching only the
  base name or a base+target-triple (underscores allowed, e.g. x86_64), so a foreign
  helper like taskpaw-backend-logger isn't treated as ours.
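Both tightened checks can be sketched in a few lines. The regex below is an assumption about what "base name or base + target triple" means (three or four dash-separated segments, underscores allowed); the actual pattern in core/net.py is not shown in this PR, and `is_sidecar_name` / `explicit_role` are illustrative names.

```python
import re
from typing import Optional, Sequence

# Base name, optionally "-<target-triple>" (3 or 4 segments), optionally ".exe".
# Assumed pattern for illustration only.
_SIDECAR_RE = re.compile(r"^taskpaw-backend(?:(?:-[A-Za-z0-9_]+){3,4})?(?:\.exe)?$")


def is_sidecar_name(name: str) -> bool:
    """True for the bare sidecar name or the name plus a target triple."""
    return _SIDECAR_RE.match(name) is not None


def explicit_role(argv: Sequence[str]) -> Optional[str]:
    """Role given as an exact argv token; substrings of paths or flags never count."""
    for token in argv[1:]:
        if token in ("agent", "hub"):
            return token
    return None
```

For example, `explicit_role(["/Users/hubert/taskpaw-backend", "agent"])` yields `"agent"` even though the executable path contains the letters "hub".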

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…x (Kimi final review r2)

- Catch the full psutil.Error hierarchy (incl. ZombieProcess) in identity/enumerate/
  terminate paths — a zombie's name()/cmdline()/wait() no longer crashes startup;
  degrades to the documented no-op.
- Stronger positive ID: the sidecar must match the name regex AND (when available) the
  real proc.exe() basename, so a foreign process can't pass by spoofing proc.name().
  Match the packaging module only as an actual `-m <module>` pair, and the headless
  taskpaw_v3.agent|hub module only as a `-m` value or a package .py script path — not a
  bare arg. (Install-dir containment intentionally avoided: onefile runs from a _MEI
  temp path, which would miss our own stale backend.)
- _addr_conflicts: check loopback-equivalence BEFORE the IPv4/IPv6 split so a stale ::1
  backend is reclaimed for a `localhost` start.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…able (Codex external gate)

If control_port == bind_port on the same/overlapping host (accepted by AgentConfig,
editable in the UI), the all-or-nothing reclaim would kill our old working agent and
then fail startup on the self-colliding second socket. Detect non-mutually-bindable
required ports up front and reclaim nothing; claim_port fails loudly, old agent lives.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(Kimi final review r2)

- _role_from_module: anchor the `-m` module to exactly taskpaw_v3.agent|hub or a
  submodule prefix (and the script path to `/taskpaw_v3/{agent,hub}/`), so a foreign
  `python -m my.taskpaw_v3.agent` on the port is no longer misidentified as ours.
- Duplicate required-port check: use a stricter _same_bind_target (wildcard / same
  literal address / localhost↔canonical-loopback) instead of _addr_conflicts, so a
  valid two-loopback config (127.0.0.1 + 127.0.0.2) isn't wrongly judged non-bindable
  and skipped.
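The module anchoring from the first bullet can be sketched as an exact-or-submodule match on the value that follows `-m`. This covers only the `-m` form, not the script-path form, and `role_from_module` here is an illustrative reimplementation rather than the code in the PR.

```python
from typing import Optional, Sequence

# Module roots we accept, mapped to the role they imply.
_MODULES = {"taskpaw_v3.agent": "agent", "taskpaw_v3.hub": "hub"}


def role_from_module(argv: Sequence[str]) -> Optional[str]:
    """Role implied by a `-m <module>` pair, or None if the module is not ours."""
    for flag, module in zip(argv, argv[1:]):
        if flag != "-m":
            continue
        for base, role in _MODULES.items():
            # Exactly the root module, or a dotted submodule of it.
            if module == base or module.startswith(base + "."):
                return role
    return None
```

Anchoring at the start of the module string is what rejects `my.taskpaw_v3.agent`, while the trailing-dot check rejects look-alikes such as `taskpaw_v3.agentx`.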

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…loopback (Kimi final review)

- Anchor the from-source backend_main match to a `/` boundary (or exact relative path)
  so `.../clonetaskpaw_v3/packaging/backend_main.py` isn't taken as ours.
- _role_from_module: match path COMPONENTS (taskpaw_v3/agent|hub), not a substring, so
  `.../mytaskpaw_v3/agent/…` can't match.
- _addr_conflicts: replace blanket "both loopback" with _loopback_equiv — localhost ≡
  any loopback, 127.0.0.1 ≡ ::1, distinct numeric loopbacks (127.0.0.1 vs .2) do NOT
  collide.
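The address rules accumulated across these commits (loopback equivalence checked first, then same address or a wildcard within one IP family) can be sketched with the standard `ipaddress` module. This is an illustration of the stated semantics, not the PR's code: `loopback_equiv` and `addr_conflicts` are written without leading underscores, and treating `localhost` as 127.0.0.1 for the family check is an assumption of this sketch.

```python
import ipaddress
from typing import Optional, Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]
_CANONICAL = {ipaddress.ip_address("127.0.0.1"), ipaddress.ip_address("::1")}


def _parse(host: str) -> Optional[IPAddress]:
    """Parse a bind host; 'localhost' is mapped to 127.0.0.1 (sketch assumption)."""
    try:
        return ipaddress.ip_address("127.0.0.1" if host == "localhost" else host)
    except ValueError:
        return None


def loopback_equiv(a: str, b: str) -> bool:
    """localhost matches any loopback; 127.0.0.1 matches ::1; 127.0.0.1 vs 127.0.0.2 does not."""
    pa, pb = _parse(a), _parse(b)
    if pa is None or pb is None or not (pa.is_loopback and pb.is_loopback):
        return False
    if "localhost" in (a, b):
        return True
    return pa == pb or (pa in _CANONICAL and pb in _CANONICAL)


def addr_conflicts(a: str, b: str) -> bool:
    """Would a listener on host a block a bind on host b (same port assumed)?"""
    if loopback_equiv(a, b):  # checked before the IPv4/IPv6 split
        return True
    pa, pb = _parse(a), _parse(b)
    if pa is None or pb is None or pa.version != pb.version:
        return False
    return pa == pb or pa.is_unspecified or pb.is_unspecified
```

Under these rules a foreign 127.0.0.1 listener no longer blocks an agent bound to a LAN address, while a stale `::1` backend is still reclaimed for a `localhost` start.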

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@AlvinShenSSW AlvinShenSSW merged commit 2fca4dc into main Jul 2, 2026
7 checks passed
@AlvinShenSSW AlvinShenSSW deleted the feat/port-takeover-stale-instance branch July 2, 2026 15:53

Development

Successfully merging this pull request may close these issues.

V3: Launch fails after an update (old instance holds the ports): automatically take over our own stale backend's ports at startup
