Summary
The gateway supports resume-after-disconnect via a per-session event cursor, but the replay is not integrity-checked. The retained event buffer is trimmed to a fixed size, and when a reconnecting client asks to replay from a cursor older than the oldest retained event, the server silently returns a partial (or empty) set of events — with no signal that a gap occurred and a full resync is required. The client believes it is up to date when it is not. For a gateway that promises reliable real-time delivery across flaky networks, resume must either replay completely or explicitly tell the client it must resync.
Current behaviour
Events get a monotonic cursor, and the buffer is trimmed once it grows past a bound:
```python
# src/praisonai/praisonai/gateway/server.py:104-117
def add_event(self, event: GatewayEvent) -> int:
    self._event_cursor += 1
    event.data['cursor'] = self._event_cursor
    self._events.append(event)
    self._last_activity = time.time()
    # Keep events bounded to prevent unbounded growth
    if len(self._events) > self._max_messages * 2:
        self._events = self._events[-self._max_messages:]  # drops oldest events
    return self._event_cursor

def get_events_since(self, cursor: int) -> List[GatewayEvent]:
    return [e for e in self._events if e.data.get('cursor', 0) > cursor]  # no floor check
```
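To see the silent gap concretely, here is a toy copy of the two methods quoted above, run with the default max_messages = 1000. ToyEvent and ToySession are hypothetical stand-ins, not PraisonAI classes; only the cursor, trim, and filter lines are taken from the excerpt.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyEvent:
    # Hypothetical stand-in for GatewayEvent: only the data dict matters here.
    data: Dict[str, Any] = field(default_factory=dict)


class ToySession:
    """Hypothetical session that copies the quoted trim and replay logic."""

    def __init__(self, max_messages: int = 1000) -> None:
        self._max_messages = max_messages
        self._event_cursor = 0
        self._events: List[ToyEvent] = []

    def add_event(self, event: ToyEvent) -> int:
        self._event_cursor += 1
        event.data["cursor"] = self._event_cursor
        self._events.append(event)
        # Same bound as the excerpt: past 2 * max_messages, keep the newest max_messages.
        if len(self._events) > self._max_messages * 2:
            self._events = self._events[-self._max_messages:]
        return self._event_cursor

    def get_events_since(self, cursor: int) -> List[ToyEvent]:
        # Same filter as the excerpt: no check against the oldest retained cursor.
        return [e for e in self._events if e.data.get("cursor", 0) > cursor]


session = ToySession()
for _ in range(4200):
    session.add_event(ToyEvent())

replayed = session.get_events_since(5)
print(len(replayed), replayed[0].data["cursor"])  # prints: 1197 3004
```

The call succeeds and returns a plausible-looking list, but it starts at cursor 3004, so events 6..3003 are simply missing and nothing in the return value says so.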
On reconnect, the client passes since and receives a joined frame plus a stream of replay events:
```python
# src/praisonai/praisonai/gateway/server.py:973-985
await self._send_to_client(client_id, {
    "type": "joined", "session_id": session.session_id,
    "agent_id": agent_id, "resumed": session._was_resumed,
    "cursor": session._event_cursor,  # current head only
})
for event in replay_events:
    await self._send_to_client(client_id, {"type": "replay", "event": event.to_dict()})
```
The problem: with max_messages = 1000 (the default, src/praisonai-agents/praisonaiagents/gateway/config.py:26), the buffer is cut back to the newest 1000 events every time it grows past 2000. A reconnecting client is therefore safe only while its lag does not exceed the current buffer length, which swings between 1000 and 2000 events depending on where in the trim cycle the reconnect lands. If a client was at cursor 5 and reconnects after the buffer has rolled past it, get_events_since(5) returns only the events still retained; the gap between cursor 5 and the oldest retained cursor is lost. The joined frame reports only the current head cursor and a boolean resumed. It reports neither the oldest available cursor nor a truncated/resync-required flag, so the client cannot detect the loss.

There is also no top-level per-event sequence number on the wire: the cursor is buried in event.data['cursor'], so a client cannot cheaply detect a skipped sequence during a live stream either.
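The trim rule quoted above can be reduced to arithmetic. The sketch below computes the oldest cursor still buffered after a given number of events, and whether a client at a given cursor has already lost events. The function names are hypothetical; the only assumption is the quoted rule (append, then keep the newest max_messages once the length exceeds 2 * max_messages), with cursors starting at 1.

```python
def retained_floor(head: int, max_messages: int = 1000) -> int:
    """Oldest cursor still buffered once `head` events have been added (head >= 1).

    The first trim fires at head = 2m + 1. After that the buffer refills from
    m to 2m + 1 entries, so a trim fires every m + 1 events.
    """
    m = max_messages
    first_trim = 2 * m + 1
    if head < first_trim:
        return 1  # nothing has been dropped yet
    trims_since_first = (head - first_trim) // (m + 1)
    last_trim = first_trim + trims_since_first * (m + 1)
    return last_trim - m + 1  # that trim kept cursors last_trim - m + 1 .. last_trim


def loses_events(client_cursor: int, head: int, max_messages: int = 1000) -> bool:
    """True if the first event the client is missing (client_cursor + 1) was trimmed."""
    return client_cursor + 1 < retained_floor(head, max_messages)


# Right after the first trim, a client 1001 events behind has already lost an event.
print(retained_floor(2001), loses_events(1000, 2001), loses_events(1001, 2001))
# prints: 1002 True False
```

Under the default config, then, data loss can start after only 1001 events of lag if the disconnect straddles a trim, not after a flat 2000.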
Desired behaviour
- The joined/resume acknowledgement reports the oldest retained cursor alongside the head cursor, and sets an explicit resync_required: true (or truncated: true) when events after the requested since are no longer retained (since < oldest retained cursor - 1), telling the client to discard local state and re-fetch a fresh snapshot rather than assume continuity.
- Every delivered event carries a monotonic top-level seq, so clients can detect a skipped sequence mid-stream and request resume, instead of relying on a value nested inside the payload.
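On the client side, a top-level seq makes mid-stream gap detection a one-line comparison. The following is a minimal sketch, assuming every frame carries the proposed seq field and that seq values are consecutive integers; SeqTracker is a hypothetical helper, not part of any existing client.

```python
from typing import Any, Dict, Optional


class SeqTracker:
    """Hypothetical client helper that detects gaps using the proposed top-level seq."""

    def __init__(self, last_seq: Optional[int] = None) -> None:
        # Last seq the client has applied; None means nothing applied yet.
        self.last_seq = last_seq

    def accept(self, frame: Dict[str, Any]) -> str:
        """Classify a frame as "ok", "duplicate" or "gap".

        On "gap" the tracker does not advance, so last_seq is exactly the
        since value to send when requesting a resume.
        """
        seq = frame["seq"]
        if self.last_seq is None or seq == self.last_seq + 1:
            self.last_seq = seq
            return "ok"
        if seq <= self.last_seq:
            return "duplicate"  # already applied, e.g. overlap after a resume
        return "gap"


tracker = SeqTracker(last_seq=10)
print(tracker.accept({"seq": 11}), tracker.accept({"seq": 11}), tracker.accept({"seq": 13}))
# prints: ok duplicate gap
print(tracker.last_seq)  # prints: 11
```

Treating duplicates separately from gaps lets a client tolerate the small overlap that can occur right after a resume without triggering another one.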
Layer placement
- Primary layer: wrapper (praisonai). The retention/trim, cursor accounting, and the joined/replay emission all live in src/praisonai/praisonai/gateway/server.py, which is where the floor check and resync flag must be enforced.
- Why not core: core owns the event contract but not the live session buffer or the replay emission; only the wire fields (seq, oldest_cursor, resync_required) belong in the core contract.
- Why not tools: delivery integrity is gateway transport, not an agent-callable integration.
- Why not plugins: this is the gateway's own delivery guarantee, not a lifecycle guardrail wrapped around an agent run.
- Secondary touch: core (praisonaiagents/gateway/protocols.py), to add a top-level seq on the event envelope and the resume-ack fields so all server/client implementations share them.
- 3-way surface (CLI + YAML + Python): no — this is a delivery-guarantee correctness fix in the gateway runtime, not a user-authored feature.
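The core-contract touch might look roughly like the types below: seq promoted to the envelope's top level, and the resume acknowledgement carrying the two new fields. EventEnvelope and JoinedAck are hypothetical names chosen for illustration; the actual definitions in praisonaiagents/gateway/protocols.py may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, TypedDict


@dataclass
class EventEnvelope:
    """Hypothetical wire envelope: seq sits at the top level, not inside data."""

    seq: int
    type: str
    data: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        return {"seq": self.seq, "type": self.type, "data": dict(self.data)}


class JoinedAck(TypedDict):
    """Hypothetical shape of the resume acknowledgement with the proposed fields."""

    type: str              # always "joined"
    session_id: str
    cursor: int            # head: newest cursor issued
    oldest_cursor: int     # oldest cursor still replayable
    resync_required: bool  # True => drop local state and wait for a snapshot


frame = EventEnvelope(seq=42, type="message", data={"text": "hi"}).to_dict()
ack: JoinedAck = {"type": "joined", "session_id": "s1", "cursor": 4200,
                  "oldest_cursor": 3201, "resync_required": True}
print(frame["seq"], ack["resync_required"])  # prints: 42 True
```

Keeping these types in core means the server, the bundled client, and any third-party client agree on field names without each re-deriving them from server.py.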
Proposed approach
- Extension point: protocol envelope + gateway server replay logic.
- Track the oldest retained cursor on the session; compute resync on resume; stamp seq on every outbound event.
```python
# Resume acknowledgement with gap detection
# Oldest cursor still buffered; with an empty buffer, the next cursor to be issued.
oldest = session._events[0].data["cursor"] if session._events else session._event_cursor + 1
# A gap exists only if the first event the client lacks (since_cursor + 1) is already gone;
# since_cursor == oldest - 1 is a complete replay, not a truncation.
truncated = since_cursor is not None and since_cursor + 1 < oldest
await self._send_to_client(client_id, {
    "type": "joined",
    "session_id": session.session_id,
    "cursor": session._event_cursor,  # head
    "oldest_cursor": oldest,          # floor of what can be replayed
    "resync_required": truncated,     # client must drop local state + refetch
})
if truncated:
    snapshot = session.snapshot()  # send authoritative state instead of partial replay
    await self._send_to_client(client_id, {"type": "snapshot", "state": snapshot})
else:
    for event in replay_events:
        await self._send_to_client(client_id, {"type": "replay", "event": event.to_dict()})
```
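The resume decision can be isolated as a pure function, which makes the boundary easy to test without a socket. This is a sketch under stated assumptions: resume_frames is a hypothetical helper, retained events are plain dicts ordered oldest first, a snapshot at the head cursor is available, and seq simply reuses the event's cursor so that seq and since share one numbering.

```python
from typing import Any, Dict, List, Optional


def resume_frames(
    session_id: str,
    retained: List[Dict[str, Any]],  # buffered events, oldest first, each with a "cursor"
    head: int,                       # newest cursor issued so far
    since: Optional[int],            # cursor the client last applied; None for a fresh join
    snapshot: Dict[str, Any],        # authoritative state at head (assumed available)
) -> List[Dict[str, Any]]:
    # Floor of what can be replayed; with an empty buffer, the next cursor to be issued.
    oldest = retained[0]["cursor"] if retained else head + 1
    # Gap only if the first event the client lacks (since + 1) is already gone.
    truncated = since is not None and since + 1 < oldest
    frames: List[Dict[str, Any]] = [{
        "type": "joined", "session_id": session_id, "cursor": head,
        "oldest_cursor": oldest, "resync_required": truncated,
    }]
    if truncated:
        frames.append({"type": "snapshot", "state": snapshot})
    elif since is not None:
        for event in retained:
            if event["cursor"] > since:
                # Assumption: seq reuses the cursor, stamped at the top level of the frame.
                frames.append({"type": "replay", "seq": event["cursor"], "event": event})
    return frames


retained = [{"cursor": c} for c in range(3201, 4201)]
gap = resume_frames("s1", retained, 4200, since=5, snapshot={"messages": 3})
ok = resume_frames("s1", retained, 4200, since=3200, snapshot={"messages": 3})
print(gap[0]["resync_required"], [f["type"] for f in gap])  # prints: True ['joined', 'snapshot']
print(ok[0]["resync_required"], len(ok) - 1, ok[1]["seq"])  # prints: False 1000 3201
```

The since=3200 case is the boundary that a plain since < oldest comparison gets wrong: every event the client needs is still buffered, so it should receive a normal replay rather than a forced resync.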
Resolution sketch
```python
# Before (today)
ws.send({"type": "join", "agent_id": "support", "since": 5})
# -> {"type": "joined", "cursor": 4200, "resumed": true}
# -> replay events 3201..4200 only; events 6..3200 are gone, client never told. Silent loss.

# After (proposed)
ws.send({"type": "join", "agent_id": "support", "since": 5})
# -> {"type": "joined", "cursor": 4200, "oldest_cursor": 3201, "resync_required": true}
# -> {"type": "snapshot", "state": {...}}  # client rebuilds from authoritative state, no silent gap
```
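A client consuming the proposed frames needs only a small state machine. The sketch below is hypothetical (ResumingClient is not an existing class) and assumes that the snapshot reflects gateway state at the head cursor reported in the joined frame, and that replay frames carry the proposed top-level seq.

```python
from typing import Any, Dict


class ResumingClient:
    """Hypothetical client that honours resync_required and the top-level seq."""

    def __init__(self, cursor: int = 0) -> None:
        self.cursor = cursor  # last cursor/seq applied locally
        self.state: Dict[str, Any] = {}
        self.awaiting_snapshot = False

    def handle(self, frame: Dict[str, Any]) -> None:
        kind = frame["type"]
        if kind == "joined":
            if frame.get("resync_required"):
                # Local state can no longer be trusted; drop it and wait for the snapshot.
                self.state = {}
                self.awaiting_snapshot = True
                # Assumption: the snapshot reflects the gateway at the reported head cursor.
                self.cursor = frame["cursor"]
        elif kind == "snapshot":
            self.state = dict(frame["state"])
            self.awaiting_snapshot = False
        elif kind == "replay":
            self.state.setdefault("events", []).append(frame["event"])
            self.cursor = frame["seq"]


client = ResumingClient(cursor=5)
client.state = {"events": ["stale"]}
for frame in [
    {"type": "joined", "cursor": 4200, "oldest_cursor": 3201, "resync_required": True},
    {"type": "snapshot", "state": {"messages": 3}},
]:
    client.handle(frame)
print(client.state, client.cursor)  # prints: {'messages': 3} 4200
```

If the gateway can emit live events between the joined frame and the snapshot, the snapshot would also need to carry its own cursor; that detail is outside what the proposal specifies.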
Severity
Medium — resume works on the happy path, but under real reconnect conditions (long disconnect, busy session) clients can silently miss events and diverge from gateway state with no way to detect it; correct resume integrity is required before the gateway can be relied on for production real-time delivery.
Validation
Confirmed by reading src/praisonai/praisonai/gateway/server.py:104-117 (event buffer trimmed to max_messages; get_events_since filters with no floor check), :973-985 (the joined/replay frames expose only the head cursor and a boolean resumed, with no oldest-cursor or truncation/resync flag and no top-level seq on events), and src/praisonai-agents/praisonaiagents/gateway/config.py:26 (max_messages = 1000, so the buffer rolls after ~2000 events). The retention trim discards the oldest events while the replay path provides no signal that they are missing.