Skip to content

fix(codex-models): cache model list + reduce RPC timeout to prevent intermittent 500s#807

Open
creeep123 wants to merge 2 commits into
tiann:mainfrom
creeep123:fix/codex-model-list-cache-timeout
Open

fix(codex-models): cache model list + reduce RPC timeout to prevent intermittent 500s#807
creeep123 wants to merge 2 commits into
tiann:mainfrom
creeep123:fix/codex-model-list-cache-timeout

Conversation

@creeep123

Copy link
Copy Markdown

Summary

Fixes #806

GET /api/sessions/{id}/codex-models intermittently returned 500 after 120 seconds because listCodexModels() spawned a fresh codex app-server process on every call. ~3% of those spawns exited early (~1s), but the hub-side RPC waited the full 120s timeout before returning an error.

Changes

cli/src/modules/common/codexModels.ts

  • Added model list cache with 5-minute TTL, separate caches for visible and hidden models
  • Added clearCodexModelCache() export for manual invalidation
  • Improved error chain — app-server exit details are now preserved via error.cause for faster failure detection

hub/src/sync/rpcGateway.ts

  • Reduced MODEL_LIST_RPC_TIMEOUT_MS from 120s → 15s (normal requests complete in 1-5s; cached requests are instant)
  • Applies to both Codex and Cursor model listing (same constant)

Impact

  • 97%+ of requests will now hit the cache and return instantly (0ms)
  • On cache miss, requests still complete in 1-5s
  • Worst case (app-server spawn fails): fails in 15s instead of 120s
  • No changes to CodexAppServerClient class or other RPC methods

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Cursor model RPCs now inherit the shortened Codex timeout — MODEL_LIST_RPC_TIMEOUT_MS is shared by Codex and Cursor model listing, so lowering it to 15s also changes listCursorModelsForSession / listCursorModelsForMachine; the CLI Cursor probe is allowed to run for 30s, so the hub can now abort first and return intermittent 500s before the CLI reports success/failure. Evidence: hub/src/sync/rpcGateway.ts:24, hub/src/sync/rpcGateway.ts:246, cli/src/modules/common/cursorModels.ts:71.
    Suggested fix:
    const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
    const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000
    
    async listCodexModelsForSession(sessionId: string): Promise<RpcListCodexModelsResponse> {
        return await this.sessionRpc(sessionId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse
    }
    
    async listCursorModelsForSession(sessionId: string): Promise<RpcListCursorModelsResponse> {
        return await this.sessionRpc(sessionId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse
    }

Summary

  • Review mode: initial
  • One issue found: the Codex timeout change also regresses Cursor model discovery because both paths share the same timeout constant. The existing RPC timeout test should also be updated to cover the split behavior.

Testing

  • Not run (automation): bun is not installed in this runner image.

HAPI Bot

Comment thread hub/src/sync/rpcGateway.ts Outdated

const DEFAULT_RPC_TIMEOUT_MS = 30_000
const MODEL_LIST_RPC_TIMEOUT_MS = 120_000
const MODEL_LIST_RPC_TIMEOUT_MS = 15_000

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This constant is also used by Cursor model RPCs at listCursorModelsForSession and listCursorModelsForMachine, so changing it from 120s to 15s shortens Cursor discovery too. The CLI Cursor probe itself waits up to 30s (cli/src/modules/common/cursorModels.ts:71), which means the hub can now time out at 15s and return a 500 before the CLI finishes.

Suggested fix:

const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000

async listCodexModelsForSession(sessionId: string): Promise<RpcListCodexModelsResponse> {
    return await this.sessionRpc(sessionId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse
}

async listCursorModelsForSession(sessionId: string): Promise<RpcListCursorModelsResponse> {
    return await this.sessionRpc(sessionId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse
}

…ntermittent 500s

Fixes tiann#806

Problem: GET /api/sessions/{id}/codex-models intermittently returned 500
after 120s because listCodexModels() spawned a new codex app-server process
on every call, and ~3% of those spawns exited early, causing the hub-side
RPC to wait the full 120s timeout.

Changes:
- cli: Add module-level model list cache (TTL 5 min) with separate caches
  for visible and hidden models, plus clearCodexModelCache() export
- cli: Preserve app-server exit details in error cause chain for faster
  failure detection
- hub: Reduce MODEL_LIST_RPC_TIMEOUT_MS from 120s to 15s (normal requests
  complete in 1-5s; cached requests are instant)
@tiann tiann force-pushed the fix/codex-model-list-cache-timeout branch from b5dc56e to db2cb9d Compare June 6, 2026 03:13

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Cursor model RPCs now inherit the shortened Codex timeout, and the existing timeout test is left stale. MODEL_LIST_RPC_TIMEOUT_MS is shared by Codex and Cursor listing, so changing it to 15s at hub/src/sync/rpcGateway.ts:24 also changes listCursorModelsForSession / listCursorModelsForMachine at hub/src/sync/rpcGateway.ts:246. The CLI Cursor probe can run for 30s before returning its own result (cli/src/modules/common/cursorModels.ts:71), so the hub can now time out first and return an intermittent 500. The existing hub/src/sync/rpcGateway.test.ts:61 assertion still expects 120s, so this diff also leaves that test failing/stale.
    Suggested fix:
    const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
    const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000
    
    async listCodexModelsForSession(sessionId: string): Promise<RpcListCodexModelsResponse> {
        return await this.sessionRpc(sessionId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse
    }
    
    async listCodexModelsForMachine(machineId: string): Promise<RpcListCodexModelsResponse> {
        return await this.machineRpc(machineId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse
    }
    
    async listCursorModelsForSession(sessionId: string): Promise<RpcListCursorModelsResponse> {
        return await this.sessionRpc(sessionId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse
    }
    
    async listCursorModelsForMachine(machineId: string): Promise<RpcListCursorModelsResponse> {
        return await this.machineRpc(machineId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse
    }

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • One confirmed issue remains: the Codex timeout change still affects Cursor model discovery because both paths share the same timeout constant, and the existing timeout test was not updated to cover the intended split.

Testing

  • Not run: bun is not installed in this runner image.

HAPI Bot

Comment thread hub/src/sync/rpcGateway.ts Outdated

const DEFAULT_RPC_TIMEOUT_MS = 30_000
const MODEL_LIST_RPC_TIMEOUT_MS = 120_000
const MODEL_LIST_RPC_TIMEOUT_MS = 15_000

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This constant is also used by Cursor model RPCs at listCursorModelsForSession and listCursorModelsForMachine, so lowering it to 15s shortens Cursor discovery too. The CLI Cursor probe itself waits up to 30s (cli/src/modules/common/cursorModels.ts:71), which means the hub can now time out at 15s and return a 500 before the CLI finishes. The existing hub/src/sync/rpcGateway.test.ts:61 assertion also still expects the old 120s value, so the timeout test is stale/failing.

Suggested fix:

const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000

async listCodexModelsForSession(sessionId: string): Promise<RpcListCodexModelsResponse> {
    return await this.sessionRpc(sessionId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse
}

async listCodexModelsForMachine(machineId: string): Promise<RpcListCodexModelsResponse> {
    return await this.machineRpc(machineId, RPC_METHODS.ListCodexModels, {}, CODEX_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCodexModelsResponse
}

async listCursorModelsForSession(sessionId: string): Promise<RpcListCursorModelsResponse> {
    return await this.sessionRpc(sessionId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse
}

async listCursorModelsForMachine(machineId: string): Promise<RpcListCursorModelsResponse> {
    return await this.machineRpc(machineId, RPC_METHODS.ListCursorModels, {}, CURSOR_MODEL_LIST_RPC_TIMEOUT_MS) as RpcListCursorModelsResponse
}

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  • [Major] Cursor model RPC timeout no longer matches the current Cursor discovery path. The PR sets Cursor model listing to 30s at hub/src/sync/rpcGateway.ts:25, but the current target branch added a Cursor model-list timeout expectation of 120s in hub/src/sync/rpcGateway.test.ts:77, and the Cursor discovery path can run an ACP catalog probe before falling back to the 30s agent --list-models probe (cli/src/modules/common/cursorModels.ts:238 and cli/src/modules/common/cursorModels.ts:243). With the hub timeout equal to only one fallback probe, the hub can time out first on cold/slow Cursor discovery and return a 500 even though the CLI is still working.
    Suggested fix:
    const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
    const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 120_000

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • One confirmed issue remains after the follow-up: the previous shared-timeout regression is fixed, but Cursor’s timeout was reduced below the current target branch’s expected budget and can still make cold Cursor model discovery fail early.

Testing

  • Not run: bun is not installed in this runner image. Static check found the merge-ref test expectation conflict.

HAPI Bot

const DEFAULT_RPC_TIMEOUT_MS = 30_000
const MODEL_LIST_RPC_TIMEOUT_MS = 120_000
const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 30_000

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MAJOR] This reduces Cursor model-list RPCs to 30s, but the current target branch expects Cursor model discovery to keep the previous 120s budget (hub/src/sync/rpcGateway.test.ts:77). Cursor listing can now run an ACP catalog probe and then fall back to the 30s agent --list-models probe (cli/src/modules/common/cursorModels.ts:238 and cli/src/modules/common/cursorModels.ts:243), so a 30s hub timeout can fire before the CLI finishes cold/slow discovery and surface another 500.

Suggested fix:

const CODEX_MODEL_LIST_RPC_TIMEOUT_MS = 15_000
const CURSOR_MODEL_LIST_RPC_TIMEOUT_MS = 120_000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

listCodexModels spawns new codex app-server on every request, causing intermittent 120s timeouts

1 participant