fix(codex-models): cache model list + reduce RPC timeout to prevent intermittent 500s#807
fix(codex-models): cache model list + reduce RPC timeout to prevent intermittent 500s#807creeep123 wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Findings
- [Major] Cursor model RPCs now inherit the shortened Codex timeout —
MODEL_LIST_RPC_TIMEOUT_MSis shared by Codex and Cursor model listing, so lowering it to 15s also changeslistCursorModelsForSession/listCursorModelsForMachine; the CLI Cursor probe is allowed to run for 30s, so the hub can now abort first and return intermittent 500s before the CLI reports success/failure. Evidence:hub/src/sync/rpcGateway.ts:24,hub/src/sync/rpcGateway.ts:246,cli/src/modules/common/cursorModels.ts:71.
Suggested fix:const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000 const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000 async listCodexModelsForSession(sessionId: string): Promise<RpcListCodexModelsResponse> { return await this.sessionRpc(sessionId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse } async listCursorModelsForSession(sessionId: string): Promise<RpcListCursorModelsResponse> { return await this.sessionRpc(sessionId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse }
Summary
- Review mode: initial
- One issue found: the Codex timeout change also regresses Cursor model discovery because both paths share the same timeout constant. The existing RPC timeout test should also be updated to cover the split behavior.
Testing
- Not run (automation):
bunis not installed in this runner image.
HAPI Bot
|
|
||
| const DEFAULT_RPC_TIMEOUT_MS = 30_000 | ||
| const MODEL_LIST_RPC_TIMEOUT_MS = 120_000 | ||
| const MODEL_LIST_RPC_TIMEOUT_MS = 15_000 |
There was a problem hiding this comment.
[MAJOR] This constant is also used by Cursor model RPCs at listCursorModelsForSession and listCursorModelsForMachine, so changing it from 120s to 15s shortens Cursor discovery too. The CLI Cursor probe itself waits up to 30s (cli/src/modules/common/cursorModels.ts:71), which means the hub can now time out at 15s and return a 500 before the CLI finishes.
Suggested fix:
const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000
async listCodexModelsForSession(sessionId: string): Promise<RpcListCodexModelsResponse> {
return await this.sessionRpc(sessionId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse
}
async listCursorModelsForSession(sessionId: string): Promise<RpcListCursorModelsResponse> {
return await this.sessionRpc(sessionId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse
}…ntermittent 500s Fixes tiann#806 Problem: GET /api/sessions/{id}/codex-models intermittently returned 500 after 120s because listCodexModels() spawned a new codex app-server process on every call, and ~3% of those spawns exited early, causing the hub-side RPC to wait the full 120s timeout. Changes: - cli: Add module-level model list cache (TTL 5 min) with separate caches for visible and hidden models, plus clearCodexModelCache() export - cli: Preserve app-server exit details in error cause chain for faster failure detection - hub: Reduce MODEL_LIST_RPC_TIMEOUT_MS from 120s to 15s (normal requests complete in 1-5s; cached requests are instant)
b5dc56e to
db2cb9d
Compare
There was a problem hiding this comment.
Findings
- [Major] Cursor model RPCs now inherit the shortened Codex timeout, and the existing timeout test is left stale.
MODEL_LIST_RPC_TIMEOUT_MSis shared by Codex and Cursor listing, so changing it to 15s athub/src/sync/rpcGateway.ts:24also changeslistCursorModelsForSession/listCursorModelsForMachineathub/src/sync/rpcGateway.ts:246. The CLI Cursor probe can run for 30s before returning its own result (cli/src/modules/common/cursorModels.ts:71), so the hub can now time out first and return an intermittent 500. The existinghub/src/sync/rpcGateway.test.ts:61assertion still expects 120s, so this diff also leaves that test failing/stale.
Suggested fix:const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000 const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000 async listCodexModelsForSession(sessionId: string): Promise<RpcListCodexModelsResponse> { return await this.sessionRpc(sessionId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse } async listCodexModelsForMachine(machineId: string): Promise<RpcListCodexModelsResponse> { return await this.machineRpc(machineId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse } async listCursorModelsForSession(sessionId: string): Promise<RpcListCursorModelsResponse> { return await this.sessionRpc(sessionId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse } async listCursorModelsForMachine(machineId: string): Promise<RpcListCursorModelsResponse> { return await this.machineRpc(machineId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse }
Questions
- None.
Summary
- Review mode: follow-up after new commits
- One confirmed issue remains: the Codex timeout change still affects Cursor model discovery because both paths share the same timeout constant, and the existing timeout test was not updated to cover the intended split.
Testing
- Not run:
bunis not installed in this runner image.
HAPI Bot
|
|
||
| const DEFAULT_RPC_TIMEOUT_MS = 30_000 | ||
| const MODEL_LIST_RPC_TIMEOUT_MS = 120_000 | ||
| const MODEL_LIST_RPC_TIMEOUT_MS = 15_000 |
There was a problem hiding this comment.
[MAJOR] This constant is also used by Cursor model RPCs at listCursorModelsForSession and listCursorModelsForMachine, so lowering it to 15s shortens Cursor discovery too. The CLI Cursor probe itself waits up to 30s (cli/src/modules/common/cursorModels.ts:71), which means the hub can now time out at 15s and return a 500 before the CLI finishes. The existing hub/src/sync/rpcGateway.test.ts:61 assertion also still expects the old 120s value, so the timeout test is stale/failing.
Suggested fix:
const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000
async listCodexModelsForSession(sessionId: string): Promise<RpcListCodexModelsResponse> {
return await this.sessionRpc(sessionId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse
}
async listCodexModelsForMachine(machineId: string): Promise<RpcListCodexModelsResponse> {
return await this.machineRpc(machineId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse
}
async listCursorModelsForSession(sessionId: string): Promise<RpcListCursorModelsResponse> {
return await this.sessionRpc(sessionId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse
}
async listCursorModelsForMachine(machineId: string): Promise<RpcListCursorModelsResponse> {
return await this.machineRpc(machineId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse
}…oid Cursor regression
There was a problem hiding this comment.
Findings
- [Major] Cursor model RPC timeout no longer matches the current Cursor discovery path. The PR sets Cursor model listing to 30s at
hub/src/sync/rpcGateway.ts:25, but the current target branch added a Cursor model-list timeout expectation of 120s inhub/src/sync/rpcGateway.test.ts:77, and the Cursor discovery path can run an ACP catalog probe before falling back to the 30sagent --list-modelsprobe (cli/src/modules/common/cursorModels.ts:238andcli/src/modules/common/cursorModels.ts:243). With the hub timeout equal to only one fallback probe, the hub can time out first on cold/slow Cursor discovery and return a 500 even though the CLI is still working.
Suggested fix:const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000 const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 120_000
Questions
- None.
Summary
- Review mode: follow-up after new commits
- One confirmed issue remains after the follow-up: the previous shared-timeout regression is fixed, but Cursor’s timeout was reduced below the current target branch’s expected budget and can still make cold Cursor model discovery fail early.
Testing
- Not run:
bunis not installed in this runner image. Static check found the merge-ref test expectation conflict.
HAPI Bot
| const DEFAULT_RPC_TIMEOUT_MS = 30_000 | ||
| const MODEL_LIST_RPC_TIMEOUT_MS = 120_000 | ||
| const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000 | ||
| const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000 |
There was a problem hiding this comment.
[MAJOR] This reduces Cursor model-list RPCs to 30s, but the current target branch expects Cursor model discovery to keep the previous 120s budget (hub/src/sync/rpcGateway.test.ts:77). Cursor listing can now run an ACP catalog probe and then fall back to the 30s agent --list-models probe (cli/src/modules/common/cursorModels.ts:238 and cli/src/modules/common/cursorModels.ts:243), so a 30s hub timeout can fire before the CLI finishes cold/slow discovery and surface another 500.
Suggested fix:
const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 120_000
Summary
Fixes #806
GET /api/sessions/{id}/codex-modelsintermittently returned 500 after 120 seconds becauselistCodexModels()spawned a freshcodex app-serverprocess on every call. ~3% of those spawns exited early (~1s), but the hub-side RPC waited the full 120s timeout before returning an error.Changes
cli/src/modules/common/codexModels.tsclearCodexModelCache()export for manual invalidationerror.causefor faster failure detectionhub/src/sync/rpcGateway.tsMODEL_LIST_RPC_TIMEOUT_MSfrom 120s → 15s (normal requests complete in 1-5s; cached requests are instant)Impact
CodexAppServerClientclass or other RPC methods