RAG training becomes slow over time with frequent MCP retrieve timeouts #504

Description

@wolf-yang

Describe the bug

When training a RAG task with AgentLightning, the training gradually becomes extremely slow and starts to produce frequent errors related to MCP retrieve timeouts and OTLP trace handling.

From the logs:

  • The time per iteration (s/it) keeps increasing, from about 66 s at the start to over 1,000 s, with one step reported at more than 6,000 s:
Training Progress:   0%|          | 0/125000 [00:00<?, ?it/s]
Training Progress:   0%|          | 1/125000 [01:05<2288:35:44, 65.91s/it]
Training Progress:   0%|          | 2/125000 [02:13<2329:03:57, 67.08s/it]
Training Progress:   0%|          | 3/125000 [04:13<3167:15:15, 91.22s/it]
……
Training Progress:   1%|          | 698/125000 [103:28:02<47135:04:52, 1365.11s/it]
ERROR:2026-02-18 04:35:13,188:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
(TaskRunner pid=2492560)
Training Progress:   1%|          | 699/125000 [103:46:37<44547:19:03, 1290.18s/it]
ERROR:2026-02-18 07:30:21,360:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
ERROR:2026-02-18 07:36:57,659:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
ERROR:2026-02-18 08:27:00,963:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.
(TaskRunner pid=2492560)
Training Progress:   1%|          | 700/125000 [108:31:43<208373:35:24, 6034.96s/it]
...
(TaskRunner pid=2492560)
Training Progress:   1%|          | 719/125000 [118:23:48<42790:52:24, 1239.51s/it]
ERROR:2026-02-18 20:20:41,745:Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.

Observed behavior:

  • In the early phase, training speed is acceptable.
  • After several hundred global steps (~700), each step becomes slower and slower.
  • Logs start to show many "Error invoking MCP tool retrieve: Timed out while waiting for response to ClientRequest. Waited 60.0 seconds." messages.
  • starlette.requests.ClientDisconnect is raised in the OTLP traces endpoint.
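
To make the slowdown easier to quantify, the log can be reduced to a list of (step, s/it) pairs plus a count of retrieve timeouts. The following is a minimal sketch, not part of AgentLightning: the function name `summarize_log` and both patterns are assumptions based only on the log lines quoted above, and they may need adjusting if the log format differs.

```python
import re

# Matches tqdm-style progress lines such as:
# "Training Progress:   1%|          | 698/125000 [103:28:02<47135:04:52, 1365.11s/it]"
# Lines that report a rate in it/s (or "?it/s") are deliberately not matched.
PROGRESS_RE = re.compile(r"\|\s*(\d+)/(\d+)\s*\[[^\]]*?,\s*([\d.]+)s/it\]")

# Substring taken from the error messages in this report.
TIMEOUT_MARKER = "Error invoking MCP tool retrieve: Timed out"


def summarize_log(lines):
    """Return ([(step, seconds_per_iteration), ...], timeout_count)."""
    steps = []
    timeouts = 0
    for line in lines:
        match = PROGRESS_RE.search(line)
        if match:
            steps.append((int(match.group(1)), float(match.group(3))))
        elif TIMEOUT_MARKER in line:
            timeouts += 1
    return steps, timeouts


if __name__ == "__main__":
    sample = [
        "Training Progress:   0%|          | 1/125000 [01:05<2288:35:44, 65.91s/it]",
        "Training Progress:   1%|          | 698/125000 [103:28:02<47135:04:52, 1365.11s/it]",
        "ERROR:2026-02-18 04:35:13,188:Error invoking MCP tool retrieve: "
        "Timed out while waiting for response to ClientRequest. Waited 60.0 seconds.",
    ]
    steps, timeouts = summarize_log(sample)
    print(steps, timeouts)
```

Note that the s/it value printed by tqdm is a smoothed rate rather than the exact duration of the last step, so differences between consecutive elapsed-time stamps may give a more precise per-step figure.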
