Skip to content

feat(agent,timeline): subagent factory pattern + per-session timeline reload#12

Open
ngoclam9415 wants to merge 5 commits into
developfrom
feat/subagent-factory-timeline-resume
Open

feat(agent,timeline): subagent factory pattern + per-session timeline reload#12
ngoclam9415 wants to merge 5 commits into
developfrom
feat/subagent-factory-timeline-resume

Conversation

@ngoclam9415
Copy link
Copy Markdown
Contributor

Summary

Fixes two coupled correctness bugs in Dana subagent dispatch
(plan: 260519-0848-subagent-factory-and-timeline-resume).

Phase 1 — Shared-instance corruption. TaskResource held one long-lived
agent instance per subagent_type. Concurrent same-type Task calls (already
possible — tool batches run through asyncio.gather) interleaved on shared
mutable state; sequential calls accumulated into one ever-growing timeline.
TaskResource now dispatches to agent factories (functools.partial) —
every spawn constructs a fresh agent with a disjoint object graph (timeline,
star loop count, session id, EventLog).

Phase 2 — session_id is now a real context boundary. aquery(session_id=...)
previously only relabeled the save target — it never reloaded or reset the
in-memory timeline, so resume across restart silently lost data and "fresh"
sessions inherited prior context. set_session_id now flushes the outgoing
session, rebuilds the timeline via the extracted _build_timeline() (resetting
ALL session-scoped state, including compaction-tracking fields), and rehydrates
from disk via the new CompressedTimeline.rehydrate() wrapper over read_since.

Changes

  • task_resource.py — factory registration + .func contract validation,
    _descriptions cache, per-spawn construction, notifiable propagation at
    spawn; _get_agent_tools() dropped. Failed aquery marks the session
    "failed" instead of leaving it "running" forever.
  • star_agent.py — extracted _build_timeline() (single source of truth);
    rewrote set_session_id (fast-path → flush → assign id → rebuild → rehydrate).
  • timeline_serializer.py — added CompressedTimeline.rehydrate().
  • dana_coding_agent.py — explore subagent registered as functools.partial.

Legacy register_agent(name, instance) still works (wrapped as a constant
factory) but emits DeprecationWarning — migration is incremental.

Testing

  • 16 new unit tests (10 test_task_resource.py, 6 test_set_session_id.py)
  • Full unit suite: 1288 passed (3 pre-existing unrelated test_llm_providers.py
    failures); regression suite incl. reasoning-replay green
  • code-reviewer: 9/10, no critical issues — all findings (N1/N2/N3) resolved

Notes

  • Internal architecture change — no docs update needed (doc references to
    TaskResource are one-line tree nodes).

… reload

TaskResource dispatches to agent factories instead of shared instances.
Each Task call constructs a fresh agent with a disjoint object graph
(timeline, star loop count, session id, EventLog), eliminating both the
concurrent-same-type state corruption and the sequential timeline-
accumulation bug. Legacy instance registration still works but warns.

set_session_id is now a real context boundary: it flushes the outgoing
session, rebuilds the timeline via the extracted _build_timeline() (which
resets all session-scoped state including compaction-tracking fields), and
rehydrates from disk via the new CompressedTimeline.rehydrate() wrapper over
read_since. Unknown session yields an empty timeline; same id is a no-op.

- task_resource.py: factory registration, _descriptions cache, per-spawn
  construction, notifiable propagation at spawn; drop _get_agent_tools
- star_agent.py: extract _build_timeline(); rewrite set_session_id
- timeline_serializer.py: add CompressedTimeline.rehydrate()
- dana_coding_agent.py: register explore subagent as functools.partial
- tests: 9 TaskResource + 6 set_session_id unit tests
Previously a sub-agent whose aquery() raised left its session stuck at
status='running', so task_output reported it running indefinitely. task()
now catches the exception, records status='failed' plus the error, and
re-raises so the caller still observes the failure. task_output surfaces
the stored error for failed sessions.

Addresses code-review finding N3.
The persisted config flag sat one line above _timeline with a near-
identical name, reading as if it held a timeline. Rename to match
CompressedTimeline.compression_enabled. Constructor param unchanged.
…reload

set_session_id gains reload_timeline (default True). When False it is a
pure relabel — the current in-memory timeline is kept and carried into the
new session id instead of being flushed, rebuilt, and rehydrated. This lets
a subclass that seeds its own timeline (e.g. a persona entry before the STAR
loop) keep that timeline across a session switch.

Gating only the rehydrate() call is insufficient — the _build_timeline()
rebuild also discards the caller's timeline — so the flag gates the whole
reload.

Threaded through aquery, aquery_stream, and aconverse (-> Communicator.
aconverse -> aquery). Default True everywhere preserves the session-boundary
contract for subagents and existing callers; opt out with reload_timeline=
False. Directed @agent messages keep the default.
…o True

Flip the reload_timeline default to False across set_session_id, aquery,
aquery_stream, and aconverse. Most callers — including subclasses that seed
their own timeline before the STAR loop — manage their own context, so a
pure relabel is the safer default.

TaskResource.task() now passes reload_timeline=True explicitly: a sub-agent's
session_id is a hard context boundary, so each spawn gets a disjoint,
disk-accurate timeline (fresh for a new session, rehydrated on resume).

Tests exercising the reload path updated to pass reload_timeline=True.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant