Autonomous research agents that plan, execute, and self-critique. Five demos showing progressively more sophisticated agent patterns with the Claude Agent SDK — from single-agent one-shot to multi-agent orchestration with plan-and-execute + reflection. Reports average 2,253 words with 8-18 verified sources, at $0.42-$1.97 per run.
- Demos Overview — all 5 demos at a glance
- Demo 1: Single Agent — one-shot `query()`
- Demo 2: Multi-Agent — parallel sub-agents
- Demo 3: HTTP API — Starlette server + n8n
- Demo 4: Plan + Reflect — Haiku plan, Sonnet execute, Haiku reflect
- Demo 5: Multi + Plan — orchestrator + sub-agents + reflection
- Comparison Runner — Demo 1 vs Demo 4 side-by-side
- Dashboard — visual HTML comparison
- Testing — 42 pytest tests, no API calls
- Docker — containerized Demo 3
- Cost — verified run costs
- Project Structure
| Demo | Script | Pattern | Description |
|---|---|---|---|
| 1 | `research_agent.py` | Single agent | One agent researches topic end-to-end |
| 2 | `multi_agent_research.py` | Multi-agent | Orchestrator spawns parallel sub-agents per subtopic |
| 3 | `n8n_hybrid_server.py` | HTTP API + n8n | Webhook triggers research, n8n formats and emails |
| 4 | `plan_reflect_agent.py` | Plan + Reflect | Structured planning, sequential execution, self-critique |
| 5 | `plan_reflect_multi_agent.py` | Multi + Plan | Orchestrator plans, delegates to sub-agents, reflects |
| — | `run_comparison.py` | Comparison | Runs Demo 1 vs Demo 4 side-by-side on same topic |
- Python 3.12+
- Claude Code CLI installed and authenticated (`npm install -g @anthropic-ai/claude-code`)
- Anthropic API access (the CLI handles authentication — no manual API key export needed)
- n8n (optional, for Demo 3 email workflow) — v2.11+ installed locally
```bash
cd ~/projects/managed-agent-poc

# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Unit tests for shared utilities, system prompt structure, and output parsing. All tests run locally without API calls.
```bash
source venv/bin/activate
pytest tests/ -v
```

42 tests covering:
- `test_utils.py` — slugify, strip_preamble, check_report_structure, config constants (sample test below)
- `test_prompts.py` — validates BASIC/PLAN_REFLECT/RESEARCHER prompt structure
- `test_output_parsing.py` — report section detection, meta-info parsing, source URL extraction
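A representative test from this suite might look like the following sketch; the assertions are illustrative and only assume the `slugify()` helper exported by `utils.py`:

```python
# tests/test_utils.py (illustrative excerpt; the real assertions may differ)
from utils import slugify

def test_slugify_is_filesystem_safe():
    slug = slugify('n8n vs Make.com vs Zapier: "2026"?')
    # No path separators, quotes, colons, or whitespace in the generated filename
    assert not any(ch in slug for ch in '/\\:"? ')

def test_slugify_is_deterministic():
    topic = "State of AI Coding Agents 2026"
    assert slugify(topic) == slugify(topic)
```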
Visual HTML dashboard comparing all demos side-by-side with charts and sample outputs:
```bash
cd static && python3 -m http.server 8080
# Open http://localhost:8080/comparison.html
```

Or open `static/comparison.html` directly in a browser.
Single agent that researches a topic autonomously and produces a structured Markdown report.
```mermaid
flowchart LR
    A["topic"] --> B["query() — Sonnet\nWebSearch + WebFetch"]
    B --> C["Decompose → Search → Evaluate → Write"]
    C --> D["output/{slug}.md"]
```
```bash
python3 research_agent.py "State of AI Coding Agents 2026"
python3 research_agent.py "KI-Telefonie im DACH-Mittelstand" -o reports/
```

| Flag | Default | Description |
|---|---|---|
| `topic` (positional) | required | The topic to research |
| `-o, --output-dir` | `output/` | Directory to save the report |
| Metric | Value |
|---|---|
| Report length | 2,463 words |
| Sources | 14 verified URLs |
| Agent turns | 32 |
| Cost | $0.81 |
| Runtime | ~3 minutes |
Tested with topic "State of AI Coding Agents 2026" on 2026-04-10.
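The core of Demo 1 is a single `query()` call with web tools enabled. A minimal sketch of that pattern (simplified: the real script adds argument parsing and cost reporting, and the exact `ClaudeAgentOptions` fields and message handling here are best-effort assumptions):

```python
# Illustrative core of research_agent.py (simplified, not the actual script)
import anyio
from claude_agent_sdk import query, ClaudeAgentOptions
from utils import (BASIC_SYSTEM_PROMPT, DEFAULT_MODEL, DEFAULT_TOOLS,
                   DEFAULT_PERMISSION_MODE, slugify, strip_preamble)

async def research(topic: str) -> None:
    options = ClaudeAgentOptions(
        model=DEFAULT_MODEL,
        system_prompt=BASIC_SYSTEM_PROMPT,
        allowed_tools=DEFAULT_TOOLS,              # ["WebSearch", "WebFetch"]
        permission_mode=DEFAULT_PERMISSION_MODE,  # "bypassPermissions"
    )
    parts = []
    # query() streams messages until the agent decides the report is complete
    async for message in query(prompt=f"Research this topic: {topic}", options=options):
        for block in getattr(message, "content", []) or []:
            if hasattr(block, "text"):
                parts.append(block.text)
    with open(f"output/{slugify(topic)}.md", "w") as f:
        f.write(strip_preamble("".join(parts)))

anyio.run(research, "State of AI Coding Agents 2026")
```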
Orchestrator agent decomposes the topic into subtopics and spawns parallel sub-agents — each researching one subtopic independently. Results are stitched into a unified report.
```mermaid
flowchart TD
    A["CLI: topic string"] --> B["Orchestrator — Sonnet"]
    B --> C["Decompose into 3-5 subtopics"]
    C --> D1["Researcher 1\nWebSearch + WebFetch"]
    C --> D2["Researcher 2\nWebSearch + WebFetch"]
    C --> D3["Researcher 3\nWebSearch + WebFetch"]
    C --> D4["Researcher ...\nWebSearch + WebFetch"]
    D1 --> E["Orchestrator stitches report"]
    D2 --> E
    D3 --> E
    D4 --> E
    E --> F["output/multi-{slug}.md"]
```
```bash
python3 multi_agent_research.py "Claude Managed Agents vs LangChain"
python3 multi_agent_research.py "RAG Architekturen 2026" -o reports/
```

| Flag | Default | Description |
|---|---|---|
| `topic` (positional) | required | The topic to research |
| `-o, --output-dir` | `output/` | Directory to save the report |
- Orchestrator receives the topic and decomposes it into 3-5 subtopics
- For each subtopic, spawns a researcher sub-agent via the Agent tool
- Sub-agents run in parallel, each using WebSearch/WebFetch independently
- Progress tracked via `TaskStartedMessage` and `TaskNotificationMessage`
- Orchestrator collects all results and writes the final unified report
- Per-agent token breakdown printed to stdout
| Metric | Value |
|---|---|
| Report length | 2,178 words |
| Sub-agents spawned | 5 (parallel) |
| Total tokens | ~65,000 across all agents |
| Cost | $1.70 |
| Runtime | ~7 minutes |
Tested with topic "Claude Managed Agents vs LangChain" on 2026-04-10.
Sub-agent breakdown from test run:
| Sub-agent | Tokens |
|---|---|
| Claude Managed Agents architecture | 22,081 |
| LangChain architecture and features | 7,429 |
| Developer experience comparison | 11,440 |
| Performance and production readiness | 12,076 |
| Use cases and adoption trends | 11,811 |
Demo 1 uses `query()` (one-shot, single agent). Demo 2 uses `ClaudeSDKClient` (streaming, bidirectional) with `AgentDefinition` to define sub-agents that the orchestrator can spawn.
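A minimal sketch of that wiring, assuming the SDK's `AgentDefinition` and `ClaudeSDKClient` interfaces (field names are best-effort; the real script adds token accounting and progress reporting):

```python
# Illustrative orchestrator setup for Demo 2 (not the actual multi_agent_research.py)
import anyio
from claude_agent_sdk import AgentDefinition, ClaudeAgentOptions, ClaudeSDKClient
from utils import DEFAULT_MODEL, RESEARCHER_PROMPT

async def main(topic: str) -> None:
    options = ClaudeAgentOptions(
        model=DEFAULT_MODEL,
        permission_mode="bypassPermissions",
        # Sub-agents the orchestrator may spawn via the Agent tool
        agents={
            "researcher": AgentDefinition(
                description="Researches one subtopic with web tools",
                prompt=RESEARCHER_PROMPT,
                tools=["WebSearch", "WebFetch"],
            ),
        },
    )
    async with ClaudeSDKClient(options=options) as client:
        await client.query(
            f"Decompose '{topic}' into 3-5 subtopics, delegate each to a "
            f"'researcher' sub-agent, then stitch the results into one report."
        )
        # Stream orchestrator and sub-agent messages until the final result arrives
        async for message in client.receive_response():
            print(type(message).__name__)

anyio.run(main, "Claude Managed Agents vs LangChain")
```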
HTTP API server that any client (n8n, curl, Postman) can call to trigger research. Includes an importable n8n workflow that receives a webhook, calls the API, formats the report, and sends it via email.
Shows: "n8n + Managed Agents = complementary, not competitive"
```mermaid
flowchart LR
    A["n8n Webhook\nor curl"] -->|"POST /research"| B["Starlette Server\nport 8000"]
    B --> C["query() — Sonnet\nWebSearch + WebFetch"]
    C -->|"JSON response"| B
    B -->|"report + cost + meta"| D["n8n: Format\n& Send Email"]
```
Start the server:
```bash
python3 n8n_hybrid_server.py              # Default: port 8000
python3 n8n_hybrid_server.py --port 9000  # Custom port
```

Test with curl:
```bash
# Health check
curl http://localhost:8000/health

# Trigger research — basic mode (default, ~3 minutes)
curl -X POST http://localhost:8000/research \
  -H "Content-Type: application/json" \
  -d '{"topic": "AI Coding Agents 2026"}'

# Trigger research — plan-reflect mode (~2 minutes)
curl -X POST http://localhost:8000/research \
  -H "Content-Type: application/json" \
  -d '{"topic": "AI Coding Agents 2026", "mode": "plan-reflect"}'
```

`POST /research` triggers autonomous research on a topic.
Request:
{"topic": "State of AI Coding Agents 2026", "mode": "basic"}| Field | Required | Default | Description |
|---|---|---|---|
topic |
yes | — | The topic to research |
mode |
no | "basic" |
"basic" (Demo 1 prompt) or "plan-reflect" (Demo 4 prompt with plan + reflection) |
Response (from actual test run):
```json
{
  "topic": "n8n vs Make.com vs Zapier 2026",
  "mode": "basic",
  "report": "# n8n vs Make.com vs Zapier: 2026 Automation Platform Comparison\n...",
  "words": 1956,
  "turns": 18,
  "cost_usd": 0.415,
  "elapsed_seconds": 164.5
}
```

`GET /health` returns `{"status": "ok", "service": "research-agent"}`.
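Under the hood the server is a small Starlette app that hands the topic to the same `query()` pattern as Demo 1. A rough sketch of how the two routes might be wired (handler and helper names here are illustrative, not the actual implementation):

```python
# Illustrative Starlette wiring; the real n8n_hybrid_server.py differs in details
import time
import uvicorn
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route

async def run_research(topic: str, mode: str) -> dict:
    # Placeholder: call query() with the BASIC or PLAN_REFLECT system prompt and
    # return {"report": ..., "words": ..., "turns": ..., "cost_usd": ...}
    raise NotImplementedError

async def health(request: Request) -> JSONResponse:
    return JSONResponse({"status": "ok", "service": "research-agent"})

async def research(request: Request) -> JSONResponse:
    body = await request.json()
    started = time.monotonic()
    result = await run_research(body["topic"], body.get("mode", "basic"))
    result.update(topic=body["topic"], mode=body.get("mode", "basic"),
                  elapsed_seconds=round(time.monotonic() - started, 1))
    return JSONResponse(result)

app = Starlette(routes=[
    Route("/health", health, methods=["GET"]),
    Route("/research", research, methods=["POST"]),
])

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```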
- Import `n8n_workflow.json` into n8n (Settings → Import Workflow)
- Configure SMTP credentials in the "Send Email" node
- Set `EMAIL_FROM` and `EMAIL_TO` environment variables (or edit the node directly)
- Ensure the Python server is running on `localhost:8000`
- Activate the workflow — the webhook is ready at `POST /webhook/research`
Workflow flow: Webhook Trigger → HTTP Request (calls Python API) → Format Report → Send Email + Respond to Webhook
| Server Flag | Default | Description |
|---|---|---|
| `--port` | `8000` | Port to listen on |
| `--host` | `0.0.0.0` | Host to bind to |
Three separate `query()` calls, using the optimal model per phase: planning and reflection use Haiku (10x cheaper), while execution uses Sonnet, which needs the stronger reasoning quality.
Inspired by: Plan-and-Execute and Reflection patterns from the LangGraph ecosystem.
```mermaid
flowchart LR
    A["topic"] --> B["Plan\nHaiku"]
    B -->|"plan"| C["Execute\nSonnet + WebSearch"]
    C -->|"report"| D["Reflect\nHaiku"]
    D --> E["output/plan-reflect-{slug}.md"]
```
```bash
python3 plan_reflect_agent.py "State of AI Coding Agents 2026"
python3 plan_reflect_agent.py "RAG Architekturen 2026" -o reports/
```

| Flag | Default | Description |
|---|---|---|
| `topic` (positional) | required | The topic to research |
| `-o, --output-dir` | `output/` | Directory to save the output |
Unlike Demo 1, which outputs only the report, Demo 4 outputs a complete research artifact:
- Research Plan — Table of planned steps with questions and target sources
- Report — Executive Summary, Key Findings, Sources, Conclusions (same structure as Demo 1)
- Reflection Notes — Self-critique: plan coverage, source quality, contradictions, overall assessment
- Meta — Step count, web search count, whether reflection triggered a correction
```markdown
## Research Plan

| Step | Research Question | Target Source Type |
|------|------------------|--------------------|
| 1 | What are the leading AI coding agents in 2026? | News, official docs |
| 2 | How do they compare on code generation quality? | Benchmarks, papers |
| 3 | What enterprise adoption patterns are emerging? | Industry reports |
| 4 | What are the key limitations and risks? | Expert analyses |

## Report

[... standard report content ...]

## Reflection Notes

1. **Plan Coverage**: All 4 steps adequately addressed. Step 3 (enterprise adoption)
   had fewer primary sources than ideal.
2. **Source Quality**: 12 sources, mostly authoritative. 2 blog posts are weaker but
   corroborated by other sources.
3. **Contradictions**: Benchmark results vary by provider — noted in findings.
4. **Overall Assessment**: Adequate

## Meta

- **Research steps planned**: 4
- **Research steps completed**: 4
- **Total web searches performed**: 11
- **Reflection triggered correction**: No
- **Correction details**: N/A
```

v2 (multi-model: Haiku plan/reflect + Sonnet execute) — 2026-04-14:
| # | Topic | Words | Turns | Cost | Plan $ | Exec $ | Reflect $ | Correction |
|---|---|---|---|---|---|---|---|---|
| 1 | State of AI Coding Agents 2026 | 3,107 | 32 | $0.86 | $0.04 | $0.78 | $0.04 | Yes |
| 2 | n8n vs Make.com vs Zapier 2026 | 2,956 | 26 | $0.65 | $0.01 | $0.61 | $0.03 | No |
Haiku cost for Plan+Reflect: $0.04-0.08 per run (vs ~$0.15-0.20 with Sonnet for those phases).
v1 results (single Sonnet, 2026-04-13)
| # | Topic | Words | Turns | Cost | Correction |
|---|---|---|---|---|---|
| 1 | State of AI Coding Agents 2026 | 2,338 | 25 | $0.60 | No |
| 2 | n8n vs Make.com vs Zapier 2026 | 2,145 | 18 | $0.50 | No |
| 3 | KI-Telefonie im DACH-Mittelstand | 2,176 | 32 | $0.82 | No |
| 4 | Claude Managed Agents vs LangChain | 2,588 | 30 | $0.53 | No |
| 5 | RAG Architekturen 2026 | 1,823 | 20 | $0.57 | No |
| Aspect | Demo 1 (Basic) | Demo 4 (Plan + Reflect) |
|---|---|---|
| Approach | "Research this topic" one-shot | Structured plan → sequential execution → self-critique |
| Models | Sonnet only | Haiku (plan/reflect) + Sonnet (execute) |
| Planning | Implicit (agent decides internally) | Explicit plan output before any research |
| Evaluation | None — agent decides when done | Per-step sufficiency check during execution |
| Self-critique | None | Reflection phase with optional correction |
| Output | Report only | Plan + Report + Reflection + Meta |
| query() calls | 1 | 3 (one per phase) |
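The three phases map onto three sequential `query()` calls, each with its own model and phase prompt. A condensed sketch (the `collect()` helper and exact option fields are assumptions; the real script also tracks per-phase cost and handles the optional correction pass):

```python
# Illustrative three-phase flow for Demo 4 (not the actual plan_reflect_agent.py)
import anyio
from claude_agent_sdk import query, ClaudeAgentOptions
from utils import (HAIKU_MODEL, DEFAULT_MODEL, PLAN_PHASE_PROMPT,
                   EXECUTE_PHASE_PROMPT, REFLECT_PHASE_PROMPT)

async def collect(prompt: str, model: str, tools: list[str] | None = None) -> str:
    """Run one query() call and concatenate the assistant's text output."""
    options = ClaudeAgentOptions(model=model, allowed_tools=tools or [],
                                 permission_mode="bypassPermissions")
    parts = []
    async for message in query(prompt=prompt, options=options):
        for block in getattr(message, "content", []) or []:
            if hasattr(block, "text"):
                parts.append(block.text)
    return "".join(parts)

async def plan_execute_reflect(topic: str) -> str:
    plan = await collect(f"{PLAN_PHASE_PROMPT}\n\nTopic: {topic}", HAIKU_MODEL)
    report = await collect(f"{EXECUTE_PHASE_PROMPT}\n\nPlan:\n{plan}", DEFAULT_MODEL,
                           tools=["WebSearch", "WebFetch"])
    reflection = await collect(f"{REFLECT_PHASE_PROMPT}\n\nReport:\n{report}", HAIKU_MODEL)
    return (f"## Research Plan\n{plan}\n\n## Report\n{report}\n\n"
            f"## Reflection Notes\n{reflection}")

print(anyio.run(plan_execute_reflect, "State of AI Coding Agents 2026"))
```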
Combines Demo 2's multi-agent orchestration with Demo 4's plan-and-execute + reflection. The orchestrator creates a research plan, delegates each step to parallel sub-agents, then synthesizes and reflects on the combined results.
```mermaid
flowchart TD
    A["CLI: topic string"] --> B["Orchestrator — Sonnet\nPhase 1: Create research plan"]
    B --> C1["Researcher 1\nStep 1"]
    B --> C2["Researcher 2\nStep 2"]
    B --> C3["Researcher 3\nStep 3"]
    B --> C4["Researcher ...\nStep N"]
    C1 --> D["Orchestrator\nPhase 3: Synthesize + Reflect"]
    C2 --> D
    C3 --> D
    C4 --> D
    D --> E["output/plan-reflect-multi-{slug}.md\nPlan + Report + Reflection + Meta"]
```
```bash
python3 plan_reflect_multi_agent.py "State of AI Coding Agents 2026"
python3 plan_reflect_multi_agent.py "RAG Architekturen 2026" -o reports/
```

| Metric | Value |
|---|---|
| Report length | 2,467 words |
| Sub-agents spawned | 5 (parallel) |
| Total tokens | ~53,000 across all agents |
| Cost | $1.97 |
| Runtime | ~5 minutes |
| Structure (Plan/Reflection/Meta) | ✓/✓/✓ |
Tested with topic "State of AI Coding Agents 2026" on 2026-04-13.
Sub-agent breakdown from test run:
| Sub-agent | Tokens |
|---|---|
| AI coding agent products 2026 | 8,631 |
| Capabilities/benchmarks 2026 | 9,904 |
| Enterprise adoption 2026 | 10,261 |
| Market landscape 2026 | 11,867 |
| Limitations and challenges 2026 | 12,312 |
Runs Demo 1 (basic) and Demo 4 (plan+reflect) on the same topic sequentially and prints a side-by-side comparison table.
```bash
python3 run_comparison.py "State of AI Coding Agents 2026"
```

Outputs: comparison table to stdout + two report files (`compare-basic-*.md`, `compare-plan-reflect-*.md`).
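Conceptually the runner is just both pipelines plus a printed table; a stripped-down sketch with the two pipelines stubbed out (shapes assumed, not the actual script):

```python
# Illustrative shape of run_comparison.py
import time
import anyio

async def run_basic(topic: str) -> dict:
    # Placeholder: Demo 1 pipeline; returns {"words": ..., "turns": ..., "cost_usd": ...}
    raise NotImplementedError

async def run_plan_reflect(topic: str) -> dict:
    # Placeholder: Demo 4 pipeline with the same return shape
    raise NotImplementedError

async def compare(topic: str) -> None:
    rows = []
    for mode, run in (("basic", run_basic), ("plan+reflect", run_plan_reflect)):
        started = time.monotonic()
        result = await run(topic)
        rows.append((mode, result["words"], result["turns"], result["cost_usd"],
                     int(time.monotonic() - started)))
    print(f"{'Mode':<14}{'Words':>8}{'Turns':>8}{'Cost $':>8}{'Runtime':>10}")
    for mode, words, turns, cost, secs in rows:
        print(f"{mode:<14}{words:>8}{turns:>8}{cost:>8.2f}{secs:>9}s")

anyio.run(compare, "State of AI Coding Agents 2026")
```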
Topic: "n8n vs Make.com vs Zapier 2026" (2026-04-13):
| Metric | Demo 1 (basic) | Demo 4 (plan+reflect) |
|---|---|---|
| Words | 2,059 | 2,804 |
| Turns | 16 | 31 |
| Cost | $0.45 | $0.68 |
| Runtime | 153s | 233s |
| Has Research Plan | No | Yes |
| Has Reflection | No | Yes |
| Has Meta-info | No | Yes |
Total comparison cost: $1.13
utils.py contains shared code used across all demos:
Functions:
- `slugify()` — filesystem-safe filename generation (see the sketch below)
- `strip_preamble()` — removes non-Markdown artifacts from agent output
- `check_report_structure()` — validates plan-reflect output sections
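A plausible shape for the first helper (illustrative only; the actual `utils.py` implementation may normalize differently):

```python
import re
import unicodedata

def slugify(topic: str, max_length: int = 80) -> str:
    """Turn an arbitrary research topic into a filesystem-safe filename stem."""
    # Fold accents to ASCII, lowercase, then collapse anything non-alphanumeric to hyphens
    ascii_topic = unicodedata.normalize("NFKD", topic).encode("ascii", "ignore").decode()
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_topic.lower()).strip("-")
    return slug[:max_length].rstrip("-") or "report"
```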
System Prompts:
- `BASIC_SYSTEM_PROMPT` — system prompt for Demo 1/3 (basic mode)
- `PLAN_REFLECT_SYSTEM_PROMPT` — system prompt for Demo 3 (plan-reflect mode), 4, and 5
- `RESEARCHER_PROMPT` — sub-agent prompt for Demo 2/5
Configuration Constants:
- `DEFAULT_MODEL` — `claude-sonnet-4-6` (single place to change the model)
- `HAIKU_MODEL` — `claude-haiku-4-5-20251001` (used for plan/reflect phases)
- `DEFAULT_TOOLS` — `["WebSearch", "WebFetch"]`
- `DEFAULT_PERMISSION_MODE` — `bypassPermissions`
Phase-Specific Prompts (Demo 4):
- `PLAN_PHASE_PROMPT` — planning-only prompt for Haiku
- `EXECUTE_PHASE_PROMPT` — research execution prompt for Sonnet
- `REFLECT_PHASE_PROMPT` — self-critique prompt for Haiku
All demos produce reports with this format:
- Executive Summary — 2-3 paragraph overview
- Key Findings — One subsection per subtopic with detailed analysis
- Sources — Numbered list with titles and URLs
- Conclusions — Synthesis of findings, trends, and implications
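A structure check along these lines is presumably what `check_report_structure()` does for the plan-reflect sections; the sketch below is hypothetical and targets the four report headings listed above:

```python
import re

# Hypothetical: the sections every report is expected to contain
REQUIRED_SECTIONS = ["Executive Summary", "Key Findings", "Sources", "Conclusions"]

def check_sections(report_md: str, required: list[str] = REQUIRED_SECTIONS) -> dict[str, bool]:
    """Return {section: found} by looking for a Markdown heading per required section."""
    return {
        name: bool(re.search(rf"^#{{1,6}}\s+{re.escape(name)}\b", report_md, re.MULTILINE))
        for name in required
    }
```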
| # | Topic | Expected Output | Tested |
|---|---|---|---|
| 1 | "State of AI Coding Agents 2026" | ~2000 words, 10+ sources | Demo 1: 2,463w / Demo 4: 2,338w |
| 2 | "n8n vs Make.com vs Zapier 2026" | Comparison table + analysis | Demo 3: 1,956w / Demo 4: 2,145w |
| 3 | "KI-Telefonie im DACH-Mittelstand" | Market analysis, providers, ROI | Demo 4: 2,176w |
| 4 | "Claude Managed Agents vs LangChain" | Technical comparison | Demo 2: 2,178w / Demo 4: 2,588w |
| 5 | "RAG Architekturen 2026: Naive vs Graph vs Wiki" | Architecture guide | Demo 4: 1,823w |
- `query()` for one-shot interactions (Demo 1, Demo 3 API, Demo 4)
- `ClaudeSDKClient` for streaming/multi-agent (Demo 2, Demo 5)
- `AgentDefinition` to declare sub-agents the orchestrator can spawn (Demo 2, Demo 5)
- Tool names are Claude Code built-ins: `WebSearch`, `WebFetch` (not `web_search`)
- All configuration centralized in `utils.py` (`DEFAULT_MODEL`, `DEFAULT_TOOLS`, `DEFAULT_PERMISSION_MODE`)
- Authentication handled by the `claude` CLI — no `ANTHROPIC_API_KEY` export needed
| Component | Version |
|---|---|
| Python | 3.12 |
| claude-agent-sdk | 0.1.58 |
| anthropic | 0.93.0 |
| starlette | 1.0.0 |
| uvicorn | 0.44.0 |
| Claude Code CLI | 2.1.101 |
| n8n | 2.11.2 |
| Models | claude-sonnet-4-6 (execute), claude-haiku-4-5-20251001 (plan/reflect) |
All costs from actual test runs:
| Demo | Cost | Words | Runtime | Topic Tested | Date |
|---|---|---|---|---|---|
| Demo 1 | $0.81 | 2,463 | ~3 min | State of AI Coding Agents 2026 | 2026-04-10 |
| Demo 2 | $1.70 | 2,178 | ~7 min | Claude Managed Agents vs LangChain | 2026-04-10 |
| Demo 3 | $0.42 | 1,956 | 164s | n8n vs Make.com vs Zapier 2026 | 2026-04-10 |
| Demo 4 | $0.60 avg | 2,214 avg | ~2 min | 5 topics (see Demo 4 results) | 2026-04-13 |
| Demo 5 | $1.97 | 2,467 | ~5 min | State of AI Coding Agents 2026 | 2026-04-13 |
Run the HTTP API server (Demo 3) in a container. The image includes Node.js + Claude CLI (@anthropic-ai/claude-code) since claude-agent-sdk requires it at runtime.
Prerequisites: Set your `ANTHROPIC_API_KEY` in a `.env` file or environment:

```bash
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env
```

```bash
# Build
docker compose build

# Agent server only
docker compose up agent

# Agent server + n8n (n8n waits for agent healthcheck)
docker compose --profile with-n8n up

# Test
curl http://localhost:8000/health
curl -X POST http://localhost:8000/research \
  -H "Content-Type: application/json" \
  -d '{"topic": "AI Coding Agents 2026"}'
```

| Service | Port | Notes |
|---|---|---|
| `agent` | 8000 | Always runs, healthcheck on `/health` |
| `n8n` | 5678 | Optional (`--profile with-n8n`), starts after agent is healthy |
```
managed-agent-poc/
├── research_agent.py             # Demo 1: Simple Research Agent
├── multi_agent_research.py       # Demo 2: Multi-Agent Research
├── n8n_hybrid_server.py          # Demo 3: n8n Hybrid Server (supports mode=plan-reflect)
├── plan_reflect_agent.py         # Demo 4: Plan & Reflect Research Agent
├── plan_reflect_multi_agent.py   # Demo 5: Multi-Agent Plan & Reflect
├── run_comparison.py             # Comparison runner (Demo 1 vs Demo 4)
├── utils.py                      # Shared utilities, prompts, config constants
├── requirements.txt              # Python dependencies
├── Dockerfile                    # Demo 3 container (Python 3.12-slim)
├── docker-compose.yml            # Agent + optional n8n services
├── n8n_workflow.json             # Demo 3: Importable n8n workflow
├── static/comparison.html        # Visual comparison dashboard (open in browser)
├── tests/                        # pytest test suite (42 tests)
├── output/                       # Generated reports
├── venv/                         # Python virtual environment
└── README.md
```

