fix(dataplane): fix batch replay oom #2669
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5c00519.
  eventRouter.With(handler.RequireEnabledProject(), handler.RequireEnabledOrganisation()).Post("/broadcast", handler.CreateBroadcastEvent)
  eventRouter.With(handler.RequireEnabledProject(), handler.RequireEnabledOrganisation()).Post("/dynamic", handler.CreateDynamicEvent)
- eventRouter.With(handler.RequireEnabledProject(), handler.RequireEnabledOrganisation()).Post("/batchreplay", handler.BatchReplayEvents)
+ eventRouter.With(handler.RequireEnabledProject(), handler.RequireEnabledOrganisation(), middleware.Pagination).Post("/batchreplay", handler.BatchReplayEvents)
This changes batch replay semantics when the caller includes list pagination params. The dashboard filter type already carries next_page_cursor, prev_page_cursor, and direction, and batchReplayEvent() sends the saved queryParams to /events/batchreplay. With this middleware, the replay starts from that cursor while /countbatchreplayevents still counts the full filter set, so the confirmation count can say N events but only the current page/window is replayed.
For a bulk replay endpoint, we should ignore caller cursors and use pagination only internally: start from the first cursor, force direction=next, cap perPage, and loop until done. Alternatively, do not attach middleware.Pagination to this route and let BatchReplayEventService build its own internal pageable.
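The normalization suggested above could look roughly like the sketch below. It is not the code from this PR: the `Pageable` struct, its field names, and the convention that an empty cursor means "start from the beginning" are all assumptions made for illustration. Only the 1000 cap and the forced forward direction come from the discussion.

```go
package main

import "fmt"

// Pageable is a hypothetical stand-in for the repository's paging type.
type Pageable struct {
	PerPage    int
	Direction  string // "next" or "prev"
	NextCursor string
	PrevCursor string
}

const (
	// maxBatchReplayPerPage is the cap named in the PR description.
	maxBatchReplayPerPage = 1000
	// firstCursor assumes an empty cursor means "start from the beginning".
	firstCursor = ""
)

// normalizeBatchReplayPageable discards whatever cursors and direction the
// caller sent and returns a pageable suited to an internal full scan.
func normalizeBatchReplayPageable(in Pageable) Pageable {
	perPage := in.PerPage
	if perPage <= 0 || perPage > maxBatchReplayPerPage {
		perPage = maxBatchReplayPerPage
	}
	return Pageable{
		PerPage:    perPage,
		Direction:  "next",
		NextCursor: firstCursor,
		PrevCursor: "",
	}
}

func main() {
	// Simulates the saved dashboard query params being forwarded to the endpoint.
	fromDashboard := Pageable{
		PerPage:    2000000000,
		Direction:  "prev",
		NextCursor: "cursor-from-ui",
		PrevCursor: "older-cursor-from-ui",
	}
	fmt.Printf("%+v\n", normalizeBatchReplayPageable(fromDashboard))
}
```

Applying this inside the handler or service, rather than trusting the middleware output, keeps the replayed set aligned with what /countbatchreplayevents counts.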

This change fixes a batch replay OOM risk: the handler defaulted to a page
size of 2,000,000,000 and loaded all matching events into memory in one request.
It caps the page size at 1000, paginates through results in
BatchReplayEventService, and applies the pagination middleware to the batch
replay routes.
Note
Medium Risk
Changes event replay throughput and behavior on mid-run failures (partial replays may occur); memory risk is reduced but large replays still enqueue many queue jobs in one request.
Overview
Fixes batch replay OOM by stopping the handler from forcing an enormous single-page fetch (~2B events) and loading everything into memory at once.
NormalizeBatchReplayPageable (page size 1000, forward direction, cursors reset to the start) is applied in BatchReplayEvents and BatchReplayEventService, which now loops through LoadEventsPaged until there is no next page. Dashboard list-view cursors are ignored so replay starts from the full filtered set, not a UI page.
If a fetch fails after some replays, the API returns 500 with success/failure counts and an incomplete message instead of a generic error only.
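As a rough illustration of the loop and the partial-failure reporting described above, here is a minimal sketch. The `fetchPage` and `replayFn` hooks, the `page` and `replayResult` types, and the in-memory fetcher are hypothetical and are not the signatures of LoadEventsPaged or BatchReplayEventService. The sketch only shows the shape: fetch one bounded page at a time, count outcomes, and surface the counts if a later fetch fails.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

type Event struct{ ID string }

// page is a hypothetical result of one bounded fetch.
type page struct {
	Events     []Event
	NextCursor string
	HasNext    bool
}

// fetchPage and replayFn are hypothetical hooks standing in for the
// repository call and the per-event enqueue step.
type fetchPage func(cursor string, perPage int) (page, error)
type replayFn func(Event) error

type replayResult struct {
	Successes  int
	Failures   int
	Incomplete bool // true when a fetch failed before the last page
}

// replayAll walks pages forward from the first cursor. Only one page of
// events is held in memory at a time.
func replayAll(fetch fetchPage, replay replayFn, perPage int) (replayResult, error) {
	var res replayResult
	cursor := ""
	for {
		p, err := fetch(cursor, perPage)
		if err != nil {
			res.Incomplete = true
			return res, fmt.Errorf("batch replay incomplete after %d successes, %d failures: %w",
				res.Successes, res.Failures, err)
		}
		for _, ev := range p.Events {
			if err := replay(ev); err != nil {
				res.Failures++
				continue
			}
			res.Successes++
		}
		if !p.HasNext {
			return res, nil
		}
		cursor = p.NextCursor
	}
}

// newSliceFetcher pages over an in-memory slice, using the start offset as the
// cursor. It fails any fetch starting at or past failAtOffset (use -1 to never fail).
func newSliceFetcher(events []Event, failAtOffset int) fetchPage {
	return func(cursor string, perPage int) (page, error) {
		start := 0
		if cursor != "" {
			n, err := strconv.Atoi(cursor)
			if err != nil {
				return page{}, err
			}
			start = n
		}
		if failAtOffset >= 0 && start >= failAtOffset {
			return page{}, errors.New("simulated fetch failure")
		}
		end := start + perPage
		if end > len(events) {
			end = len(events)
		}
		return page{Events: events[start:end], NextCursor: strconv.Itoa(end), HasNext: end < len(events)}, nil
	}
}

func main() {
	events := []Event{{ID: "a"}, {ID: "b"}, {ID: "c"}, {ID: "d"}, {ID: "e"}}
	enqueue := func(Event) error { return nil }

	res, err := replayAll(newSliceFetcher(events, 4), enqueue, 2)
	fmt.Printf("%+v err=%v\n", res, err)
}
```

A handler built on something like this could map a non-nil error with `Incomplete` set to the 500 response carrying the counts, which matches the behavior described for mid-run fetch failures.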
Reviewed by Cursor Bugbot for commit ef2a027.