@tangle-network/agent-eval

Trace-first evaluation infrastructure for agent systems.

agent-eval provides the contracts and runtime primitives for measuring agent behavior: traces, harnesses, verifier pipelines, judges, datasets, holdout gates, failure classification, optimization loops, and release reports.

It does not own your product state, credentials, UI, or model routing. Product teams keep those boundaries; this package standardizes how runs are recorded, checked, compared, and promoted.

When To Use It

Use agent-eval when you need one or more of these:

A reproducible eval harness for coding agents, builder agents, or multi-tool workflows.
Structured traces for agent runs: spans, artifacts, events, budgets, tool calls, retrieval, judge output, and sandbox execution.
Deterministic gates around build/test/deploy checks.
LLM-as-judge or deterministic judge fleets with calibration and canaries.
Dataset splits, holdouts, paired statistics, and release confidence gates.
Failure taxonomy that distinguishes prompt, tool, sandbox, retrieval, evaluator, and knowledge-readiness failures.
Optimization loops over prompts, steering, code mutations, or full multi-shot trajectories.
Report data for internal launch reviews, CI gates, and research analysis.
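
To make the "paired statistics" item above concrete, here is a minimal sketch of a paired bootstrap over per-scenario scores. It is an illustration of the general technique, not the package's `pairedBootstrap` implementation; the names `pairedBootstrapSketch`, `seededRandom`, and `BootstrapSummary`, the default of 2000 resamples, and the percentile interval are all assumptions made for this example.

```typescript
// Illustrative sketch only: names and defaults are hypothetical,
// not the agent-eval API.
interface BootstrapSummary {
  meanDiff: number; // observed mean of (candidate - baseline)
  ciLow: number;    // lower percentile bound of resampled means
  ciHigh: number;   // upper percentile bound of resampled means
}

// Small seeded generator (mulberry32) so the sketch is reproducible.
function seededRandom(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pairedBootstrapSketch(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 1,
  alpha = 0.05,
): BootstrapSummary {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error("baseline and candidate must be non-empty and paired");
  }
  // Pairing: each scenario contributes one difference.
  const diffs = candidate.map((c, i) => c - baseline[i]);
  const n = diffs.length;
  const rand = seededRandom(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample scenario differences with replacement.
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const meanDiff = diffs.reduce((s, d) => s + d, 0) / n;
  return {
    meanDiff,
    ciLow: means[Math.floor((alpha / 2) * (resamples - 1))],
    ciHigh: means[Math.ceil((1 - alpha / 2) * (resamples - 1))],
  };
}

const summary = pairedBootstrapSketch([0.5, 0.6, 0.4, 0.7], [0.6, 0.7, 0.5, 0.8]);
console.log(summary);
```

An interval that excludes zero suggests the candidate differs from the baseline on the same scenarios; with only a handful of scenarios the interval is too coarse to gate a release on.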

Architecture

agent/product run
  -> TraceEmitter / TraceStore
  -> SandboxHarness / MultiLayerVerifier / JudgeRunner
  -> failure taxonomy + metrics
  -> paired stats + held-out gates
  -> optimization + release confidence + reports
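
The verification step in the flow above can be pictured as ordered stages where a stage is skipped once something it depends on has not passed. The sketch below shows only that idea; `StageSketch`, `StageStatus`, and `runStagesSketch` are hypothetical names for this example and do not describe the `MultiLayerVerifier` API.

```typescript
// Illustrative sketch only: not the MultiLayerVerifier API.
type StageStatus = "pass" | "fail" | "skipped";

interface StageSketch {
  name: string;
  dependsOn: string[];  // names of stages that must pass first
  check: () => boolean; // hypothetical check, e.g. "did the build succeed?"
}

function runStagesSketch(stages: StageSketch[]): Map<string, StageStatus> {
  const results = new Map<string, StageStatus>();
  for (const stage of stages) {
    // A stage is blocked if any dependency failed, was skipped, or never ran.
    const blocked = stage.dependsOn.some((dep) => results.get(dep) !== "pass");
    if (blocked) {
      results.set(stage.name, "skipped");
      continue;
    }
    results.set(stage.name, stage.check() ? "pass" : "fail");
  }
  return results;
}

const stageResults = runStagesSketch([
  { name: "build", dependsOn: [], check: () => true },
  { name: "test", dependsOn: ["build"], check: () => false },
  { name: "deploy", dependsOn: ["test"], check: () => true },
]);
console.log(Object.fromEntries(stageResults));
```

Here `deploy` is reported as skipped rather than failed, which keeps the failing stage (`test`) as the single thing to investigate.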

Package responsibilities:

agent-eval: run evidence, eval contracts, verification, statistics, optimization, reporting.
Product app: domain state, tools, credentials, UI, storage, deployment, model gateway.
@tangle-network/agent-runtime: production agent-loop/session runtime.
@tangle-network/agent-knowledge: evidence stores, claim/page synthesis, retrieval, knowledge readiness implementation.

Install

pnpm add @tangle-network/agent-eval

To use the wire-protocol server and CLI, install the package globally and start the server:

npm i -g @tangle-network/agent-eval
agent-eval serve --port 5005

Python client source lives in clients/python. Until the PyPI package is published, install it from the repo:

cd clients/python
pip install -e .

Core Primitives

Primitive	Purpose
`TraceEmitter`, `TraceStore`	Append-only run/span/event/artifact/budget records.
`SandboxHarness`	Build/test/runtime checks with captured stdout, stderr, exit codes, wall time, and parsed test counts.
`MultiLayerVerifier`	Ordered verification stages with dependencies, skip-on-fail, findings, scores, and time caps.
`JudgeRunner`	Parallel deterministic or LLM-backed judges over the same artifact/run.
`runAgentControlLoop`	Observe/validate/decide/act loop with budgets, stop policies, and structured eval results.
`Dataset`, `RunRecord`, `HeldOutGate`	Versioned corpora, reproducible run metadata, and held-out promotion decisions.
`pairedBootstrap`, `pairedWilcoxon`, `bhAdjust`	Paired experiment statistics and multiple-comparison correction.
`classifyFailure`	Rule-based failure classification for agent, tool, sandbox, retrieval, and knowledge failures.
`runMultiShotOptimization`	Optimization over full agent trajectories with actionable side information.
`runPromptEvolution`	Prompt/steering/code evolution over scenario scores.
`evaluateReleaseConfidence`	Release scorecard across evidence volume, pass rate, score, overfit, cost, latency, and gates.
`summaryTable`, `paretoChart`, `gainHistogram`	Report-ready structured outputs.
`KnowledgeRequirement`, `KnowledgeBundle`	Shared contracts for knowledge readiness.
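
As background for the multiple-comparison correction listed in the table, the following is a minimal sketch of the standard Benjamini-Hochberg step-up adjustment. It is written from the textbook procedure, not taken from the package; `bhAdjustSketch` is a hypothetical name, and the real `bhAdjust` signature and return shape may differ.

```typescript
// Illustrative sketch only: textbook Benjamini-Hochberg adjustment,
// not the agent-eval bhAdjust implementation.
function bhAdjustSketch(pValues: number[]): number[] {
  const m = pValues.length;
  // Sort p-values ascending while remembering their original positions.
  const order = pValues
    .map((p, i) => ({ p, i }))
    .sort((a, b) => a.p - b.p);
  const adjusted = new Array<number>(m).fill(0);
  let running = 1; // adjusted p-values are capped at 1
  // Walk from the largest p-value down, enforcing monotonicity.
  for (let k = m - 1; k >= 0; k--) {
    const { p, i } = order[k];
    running = Math.min(running, (p * m) / (k + 1));
    adjusted[i] = running;
  }
  return adjusted;
}

// Four comparisons, e.g. four candidate prompts against one baseline.
console.log(bhAdjustSketch([0.01, 0.04, 0.03, 0.005]));
```

Each adjusted value can be compared directly against the chosen false discovery rate, so comparisons are judged together rather than one at a time.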

NoopResearcher is a fail-loud sentinel for wiring tests. Production systems should implement Researcher directly or use CallbackResearcher.
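
The fail-loud sentinel pattern mentioned above can be sketched as follows. The interface and class names here (`ResearcherLike`, `FailLoudResearcherSketch`, `CallbackResearcherSketch`) and the `research` method are hypothetical stand-ins invented for this example; the package's actual `Researcher` contract is not shown in this README and may look different.

```typescript
// Illustrative sketch only: a hypothetical interface, not the package's
// Researcher contract.
interface ResearcherLike {
  research(query: string): string[];
}

// Sentinel: safe to wire in tests, but throws as soon as it is actually used,
// so a missing production implementation cannot go unnoticed.
class FailLoudResearcherSketch implements ResearcherLike {
  research(query: string): string[] {
    throw new Error(
      `No researcher configured (query: "${query}"). Provide a real implementation.`,
    );
  }
}

// Callback-backed variant: delegates to a function supplied by the product.
class CallbackResearcherSketch implements ResearcherLike {
  private readonly fn: (query: string) => string[];
  constructor(fn: (query: string) => string[]) {
    this.fn = fn;
  }
  research(query: string): string[] {
    return this.fn(query);
  }
}

const real = new CallbackResearcherSketch((q) => [`note about ${q}`]);
console.log(real.research("sandbox timeouts"));
```

The point of the sentinel is that a silent no-op would produce empty evidence and misleading eval scores, while a thrown error surfaces the wiring gap immediately.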

Examples

Runnable examples live in the repository's examples/ directory. They are not part of the published npm package.

examples/same-sandbox-harness - run multiple eval passes against the same workspace.
examples/multi-shot-optimization - optimize full agent trajectories with held-out promotion.
examples/benchmarks - benchmark adapter shape and reference benchmark wrappers.

The examples are intentionally kept outside the README so they can be expanded, tested, and copied without turning this page into a tutorial.

Documentation

Development

pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi

Run the local server:

pnpm build
node dist/cli.js serve --port 5005

Python client tests:

pnpm build
cd clients/python
pip install -e ".[dev]"
pytest

Release

@tangle-network/agent-eval publishes to npm. The Python client lives under clients/python and is versioned from this repository.

Related Packages

License

MIT
