a recursive self-improving harness designed to help your agents (and future iterations of those agents) succeed on any task
Autocontext is a harness. You point it at a goal in plain language. It iterates against real evaluation, keeps what worked, throws out what didn't, and produces a structured trace of the work plus the artifacts, playbooks, datasets, and (optionally) a distilled local model that the next agent inherits. Repeated runs get better, not just different.
The fastest path uses our Pi runtime, a local coding agent that handles its own auth. No API key plumbing, no provider config: install Pi, install autocontext, point one at the other.
```bash
uv tool install autocontext==0.4.9

AUTOCONTEXT_AGENT_PROVIDER=pi \
AUTOCONTEXT_PI_COMMAND=pi \
uv run autoctx solve \
  "improve customer-support replies for billing disputes" \
  --iterations 3
```

Pi runs locally as a subprocess and emits live traces back into the harness. For a hosted Pi, set `AUTOCONTEXT_AGENT_PROVIDER=pi-rpc` and `AUTOCONTEXT_PI_RPC_ENDPOINT` instead.
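A minimal sketch of that hosted variant, assuming a Pi RPC service is already reachable (the endpoint URL below is a placeholder, not a documented default):

```bash
# Hosted Pi: the endpoint value is a placeholder -- point it at your Pi RPC service.
AUTOCONTEXT_AGENT_PROVIDER=pi-rpc \
AUTOCONTEXT_PI_RPC_ENDPOINT=http://localhost:8080 \
uv run autoctx solve \
  "improve customer-support replies for billing disputes" \
  --iterations 3
```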
Prefer TypeScript? Same surface, same command:
```bash
bun add -g autoctx@0.4.9

AUTOCONTEXT_AGENT_PROVIDER=pi bunx autoctx solve \
  "improve customer-support replies for billing disputes" \
  --iterations 5 --json
```

Already on Anthropic, OpenAI, Gemini, Mistral, Groq, OpenRouter, Azure, Claude CLI, Codex CLI, or MLX? Set `AUTOCONTEXT_AGENT_PROVIDER` and the matching credential env var:
```bash
AUTOCONTEXT_AGENT_PROVIDER=anthropic \
ANTHROPIC_API_KEY=sk-ant-... \
uv run autoctx solve "..." --iterations 3
```

See `.env.example` for every provider's variables. Prefer to clone and run a starter? `examples/README.md` has copy-paste recipes for Python CLI, Claude Code MCP, Python SDK, and TypeScript library usage.
If you already work inside a coding agent (Claude Code, Pi, Cursor, or anything MCP-aware), you don't need to learn the CLI. Wire autocontext in once and your agent gets a natural-language entry point.
Pi ships an autocontext skill out of the box. Install the published Pi package and Pi loads natural-language wrappers over live tools such as autocontext_solve_scenario, autocontext_evaluate_output, autocontext_run_improvement_loop, autocontext_run_status, and autocontext_list_scenarios.
```bash
pi install npm:pi-autocontext
```

Then you just ask:
"Solve: improve customer-support replies for billing disputes."
"Judge this output against this rubric and improve it until it scores 0.85."
Claude Code (and any other MCP client) gets the same surface by adding one entry to .claude/settings.json:
```json
{
  "mcpServers": {
    "autocontext": {
      "command": "uv",
      "args": ["run", "--directory", "/path/to/autocontext", "autoctx", "mcp-serve"],
      "env": { "AUTOCONTEXT_AGENT_PROVIDER": "pi", "AUTOCONTEXT_PI_COMMAND": "pi" }
    }
  }
}
```

After that, the Python MCP server exposes prefixed tools such as autocontext_solve_scenario, autocontext_evaluate_output, autocontext_run_improvement_loop, autocontext_run_status, autocontext_list_scenarios, autocontext_export_skill, and autocontext_search_strategies. The MCP server runs on stdio. The TypeScript package exposes the same capabilities with its documented tool names via `bunx autoctx mcp-serve`.
Full integration guide: autocontext/docs/agent-integration.md.
Every run leaves a structured record on disk. Replay it, diff it, export it, feed it back into training.
```
runs/<run_id>/
├── trace.jsonl           # every prompt, tool call, and outcome, in order
├── generations/
│   ├── gen_1/
│   │   ├── strategy.json # what the competitor proposed
│   │   ├── analysis.md   # what the analyst observed
│   │   └── score.json    # how it was evaluated
│   └── gen_2/ ...
├── report.md             # human-readable summary of the whole run
└── artifacts/            # files, configs, packages the run produced

knowledge/<scenario>/
├── playbook.md           # accumulated lessons that carried forward
├── hints.md              # competitor hints that survived the curator
└── tools/                # any helper tools the architect generated
```
A playbook.md is plain markdown the next run reads as context:
```markdown
<!-- PLAYBOOK_START -->
## Billing dispute replies
- Always restate the disputed charge in the first sentence; refunds requested without
  explicit confirmation cause loops.
- "Pending" charges are not yet billable. Don't promise a refund until status flips
  to `posted`. Verified gen_4, regressed in gen_7 when omitted.
- Empathy + specific next step beats empathy alone. Escalation rate dropped from
  0.31 to 0.12 once the second sentence named the next-step owner.
<!-- PLAYBOOK_END -->
```
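Because the body sits between explicit `PLAYBOOK_START`/`PLAYBOOK_END` markers, it is easy to slice out for your own tooling; a minimal sketch (not part of the autocontext API):

```python
from pathlib import Path

# Minimal sketch, not an autocontext API: pull the playbook body out from between
# the PLAYBOOK_START / PLAYBOOK_END markers shown above.
text = Path("knowledge/<scenario>/playbook.md").read_text()  # substitute a real scenario
start = text.index("<!-- PLAYBOOK_START -->") + len("<!-- PLAYBOOK_START -->")
end = text.index("<!-- PLAYBOOK_END -->")
playbook_body = text[start:end].strip()
print(playbook_body)
```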
A `trace.jsonl` line is one event:

```json
{
  "ts": "2026-04-28T17:42:11Z",
  "gen": 4,
  "role": "competitor",
  "event": "strategy_proposed",
  "score": 0.78,
  "tokens_in": 1840,
  "tokens_out": 612,
  "strategy_id": "s_4f2a"
}
```
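Since the trace is plain JSONL, you can also slice it without any tooling; a minimal sketch using only the fields shown in the example event above:

```python
import json
from pathlib import Path

# Minimal sketch, not an autocontext API: filter trace events for one generation.
lines = Path("runs/<run_id>/trace.jsonl").read_text().splitlines()  # substitute a real run id
events = [json.loads(line) for line in lines if line.strip()]
for event in events:
    if event.get("gen") == 4:
        print(event["ts"], event["role"], event["event"], event.get("score"))
```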
Inspect, replay, or compare any of it:

```bash
uv run autoctx list
uv run autoctx status <run_id>
uv run autoctx replay <run_id> --generation 2
```

Inside each run, five roles cooperate:
- competitor proposes a strategy or artifact for the task
- analyst explains what happened and why
- coach turns that analysis into playbook updates and future hints
- architect proposes tools or harness changes when the loop is stuck
- curator gates what knowledge is allowed to persist across runs
Strategies are evaluated through scenario execution, staged validation, and gating. Weak changes are rolled back. Successful changes accumulate as reusable knowledge that future runs (and future agents) inherit automatically.
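As a rough mental model of that loop, here is a conceptual sketch only, not the autocontext implementation or API; the callables and the `Knowledge` type below are hypothetical stand-ins:

```python
from dataclasses import dataclass, field

# Conceptual sketch only -- not the autocontext implementation or API. It mirrors the
# loop described above: the competitor proposes, the scenario scores the proposal,
# weak changes roll back, and only curator-approved lessons persist for later runs.
@dataclass
class Knowledge:
    playbook: list[str] = field(default_factory=list)

def improvement_loop(propose, evaluate, curate, knowledge: Knowledge, iterations: int) -> Knowledge:
    best = float("-inf")
    for _ in range(iterations):
        strategy = propose(knowledge)            # competitor proposes a strategy
        score, lessons = evaluate(strategy)      # scenario execution + analyst/coach output
        if score <= best:
            continue                             # weak change: rolled back, nothing persists
        best = score
        knowledge.playbook.extend(curate(lessons))  # curator gates what knowledge survives
    return knowledge
```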
The full vocabulary (Scenario, Task, Mission, Campaign, Run, Verifier, Knowledge, Artifact, Budget, Policy) lives in docs/concept-model.md.
Autocontext can sit alongside your live application and record what your agents do, then turn that into training data. Wrap your existing Anthropic or OpenAI client once:
```python
from anthropic import Anthropic
from autocontext.production_traces import instrument_client

client = instrument_client(Anthropic(), app="billing-bot", env="prod")
# use `client` exactly like before; calls are captured to JSONL with content blocks,
# cache-aware usage, and Anthropic-native outcome taxonomy.
```

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { instrumentClient } from "autoctx/production-traces";

const client = instrumentClient(new Anthropic(), { app: "billing-bot", env: "prod" });
```

Then build scoped datasets from the captured traces:
```bash
uv run autoctx build-dataset \
  --app billing-bot --provider anthropic \
  --env prod --outcome success \
  --output training/billing.jsonl
```

And distill them into a smaller local model with MLX (Apple Silicon) or CUDA (Linux GPUs):
```bash
uv run autoctx train --scenario support_triage --data training/billing.jsonl --time-budget 300
```

- Pi-shaped scenario hardening keeps schema-evolution simulations from persisting empty mutation plans and improves recovery from Pi-style JSON wrappers.
- Budgeted Pi/Pi-RPC role calls now report effective bounded timeouts and clean up one-shot persistent clients after use.
- RLM convergence guards capture explicit final-answer tags, cautious closure cues, and repeated silent no-progress cases without stealing normal inspection turns.
- Rubric drift monitoring now catches within-generation mean-versus-best compression and slower dimension decline.
- Task-like solve lifecycle metadata separates persisted generations from improvement rounds for hook consumers.
- Pi extension release alignment keeps `pi-autocontext` one package behind `autoctx` while publishing the next extension patch.
| If you want to... | Start here |
|---|---|
| Run the full multi-generation control plane (Python) | autocontext/README.md |
| Run from Node, or operate missions, simulations, investigations | ts/README.md |
| Install the Pi extension package | pi/README.md |
| Wire an external coding agent into autocontext over MCP | autocontext/docs/agent-integration.md |
| Grab copy-paste integration snippets | examples/README.md |
```bash
# Python: library or CLI tool
uv pip install autocontext==0.4.9
uv tool install autocontext==0.4.9

# TypeScript
bun add -g autoctx@0.4.9

# Pi extension
pi install npm:pi-autocontext
```

The PyPI package is `autocontext`. The CLI entrypoint is `autoctx`. The npm packages are `autoctx` and `pi-autocontext` (note: an unrelated package on npm uses the name `autocontext`; that is not this project).
| Surface | Command | When to use it |
|---|---|---|
| `solve` | `autoctx solve "..." --iterations 3` | Hand the harness a goal in plain language; it generates the scenario and runs the loop |
| `run` | `autoctx run <scenario> --iterations 3` | Improve behavior inside a saved scenario across generations |
| `simulate` | `autoctx simulate -d "..."` | Model a system, sweep parameters, replay, compare |
| `investigate` | `autoctx investigate -d "..."` | Evidence-driven diagnosis, either synthetic harness or live iterative LLM session |
| `analyze` | `autoctx analyze --id <id> --type <kind>` | Inspect or compare runs, simulations, investigations, or missions after the fact |
| `mission` | `autoctx mission create --name "..." --goal "..."` | Verifier-driven goal advanced step by step until done |
| `campaign` | `bunx autoctx campaign ...` (TypeScript) | Coordinate multiple missions with budgets, dependencies, progress aggregation |
| `export` | `autoctx export <run-id>` | Share solved knowledge as JSON, skills, or Pi-local package directories |
| `train` | `autoctx train --scenario <name> --data <jsonl>` | Distill stable exported data into a cheaper local runtime |
| `replay` | `autoctx replay <run_id> --generation N` | Inspect what happened before deciding what knowledge should persist |
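A typical sequence chains several of these together (a sketch using commands from the table above; placeholders in angle brackets are yours to fill in):

```bash
uv run autoctx solve "improve customer-support replies for billing disputes" --iterations 3
uv run autoctx list                                     # find the run id
uv run autoctx export <run-id>                          # share the solved knowledge
uv run autoctx train --scenario <name> --data <jsonl>   # distill the exported data locally
```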
All 11 families execute in both Python and TypeScript. TypeScript uses V8 isolate codegen; Python uses subprocess executors.
| Family | Evaluation | What it tests |
|---|---|---|
| `game` | Tournament with Elo | Turn-based strategy (grid_ctf, othello) |
| `agent_task` | LLM judge | Prompt-centric tasks with optional improvement loops |
| `simulation` | Trace evaluation | Action-trace scenarios with mock environments and fault injection |
| `artifact_editing` | Artifact validation | File, config, and schema modification with diff tracking |
| `investigation` | Evidence chains | Diagnosis accuracy with red herring detection |
| `workflow` | Workflow evaluation | Transactional flows with compensation, retry, and side-effect tracking |
| `negotiation` | Negotiation evaluation | Hidden preferences, BATNA constraints, and opponent modeling |
| `schema_evolution` | Schema adaptation | Mid-run state changes where agents must detect stale context |
| `tool_fragility` | Drift adaptation | APIs that drift, requiring agents to adapt to changed tool behavior |
| `operator_loop` | Judgment evaluation | Escalation and clarification judgment in operator-in-the-loop workflows |
| `coordination` | Coordination evaluation | Multi-agent partial context, handoff, merge, and duplication detection |
LLM providers: Anthropic (with instrument_client capture), OpenAI-compatible (vLLM, Ollama, Hermes), Gemini, Mistral, Groq, OpenRouter, Azure OpenAI, MLX (Apple Silicon), CUDA (Linux GPUs), Pi (CLI and RPC).
Agent runtimes: Claude CLI, Codex CLI, Hermes CLI, Direct API, Pi variants, plus branch-aware session and persistent Pi RPC for local agent loops.
Executors: Local subprocess, SSH remote, Monty (pydantic-monty sandbox), PrimeIntellect remote sandbox.
Harness profiles and hooks: The Python control plane supports a Pi-shaped lean profile that caps prompt context during generation and exports a minimal tool-affordance allowlist for agent surfaces that enforce tool gating. Semantic prompt compactions are recorded as Pi-shaped JSONL entries under each run; the TypeScript package now includes a mirrored deterministic prompt compactor plus ArtifactStore ledger read/write/latest APIs for standalone npm runs. Python and TypeScript runs can load AUTOCONTEXT_EXTENSIONS; Python extensions are Python modules, while TypeScript extensions are JavaScript/ESM modules that register hooks around context assembly, semantic compaction, provider calls, judge calls, artifact writes, and run lifecycle events.
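A hedged sketch of loading an extension at run time: `AUTOCONTEXT_EXTENSIONS` is described above, but the exact value syntax (module name vs. file path, single vs. comma-separated) is an assumption here, so check the package guides for the authoritative form.

```bash
# Assumption: AUTOCONTEXT_EXTENSIONS takes a Python module path; the module name
# below is hypothetical. See the Python/TypeScript package guides for the exact syntax.
AUTOCONTEXT_EXTENSIONS=my_team.autocontext_hooks \
AUTOCONTEXT_AGENT_PROVIDER=pi \
uv run autoctx solve "improve customer-support replies for billing disputes" --iterations 3
```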
A deterministic offline provider exists for the test suite. Configuration matrix: .env.example and docs/concept-model.md.
Is autocontext a benchmark?
No. It's a harness for improving real agent behavior on real work. Benchmarks (the 11 scenario families) are one of many surfaces; you can also point it at production tasks, missions, or simulations.
How is this different from DSPy, Inspect, TextGrad, or a prompt optimizer?
Those tools optimize prompts. Autocontext takes a goal in plain language, generates the scenario, runs a multi-role loop with verifier-driven gating, and produces transferable artifacts (playbooks, datasets, distilled models) that the next run inherits. Prompt optimization is a special case.
Do I need API keys?
No. The Pi runtime runs locally and handles its own auth. Anthropic, OpenAI, Gemini, Mistral, Groq, OpenRouter, Azure, MLX, and Claude/Codex CLI are all opt-in via env vars.
Where does the knowledge live?
On your filesystem. Runs go to runs/, accumulated knowledge to knowledge/. Indexed metadata is in SQLite. Everything is inspectable, diffable, and portable.
Can my coding agent drive autocontext directly?
Yes. Wire autoctx mcp-serve (or bunx autoctx mcp-serve) into Claude Code, Cursor, or Pi as an MCP server, and the agent gets natural-language access to solve, judge, improve, status, export_skill, and the rest. See Or Just Talk To Your Agent.
- Canonical vocabulary and object model: docs/concept-model.md
- Docs overview: docs/README.md
- Python package guide: autocontext/README.md
- TypeScript package guide: ts/README.md
- Copy-paste examples: examples/README.md
- External agent integration: autocontext/docs/agent-integration.md
- Recent changes: CHANGELOG.md
- Contributor setup: CONTRIBUTING.md
- Repo layout for coding agents: AGENTS.md
- Sandboxed agents that need to trigger MLX training on the host: autocontext/docs/mlx-training.md
- Sandbox and executor notes: autocontext/docs/sandbox.md
- License: LICENSE
Thanks to George for generously donating the autocontext name on PyPI.