diff --git a/api-reference/cli/eval.mdx b/api-reference/cli/eval.mdx
new file mode 100644
index 00000000..e6f931c6
--- /dev/null
+++ b/api-reference/cli/eval.mdx
@@ -0,0 +1,170 @@
+---
+title: eval
+description: "Run behavioral evals against a Pipecat agent, individually or as a suite"
+---
+
+Run scenario-based behavioral evals. `pipecat eval run` tests scenarios against an already-running agent; `pipecat eval suite` spawns the agents listed in a manifest and runs their scenarios concurrently. Both exit `0` when everything passes and `1` otherwise.
+
+The same commands are also available as `python -m pipecat.evals`.
+
+See the [Pipecat Evals guide](/pipecat/evals/overview) for concepts, the scenario format, and manifests.
+
+## eval run
+
+Run one or more scenarios against an already-running agent (started with `-t eval`).
+
+**Usage:**
+
+```shell
+pipecat eval run [OPTIONS] SCENARIOS...
+```
+
+**Arguments:**
+
+
+ One or more scenario YAML files.
+
+
+**Options:**
+
+
+ WebSocket URL of the agent's eval transport.
+
+
+
+ Print a line for each turn and expectation as it resolves.
+
+
+
+ Record each scenario's conversation audio (audio-mode scenarios).
+
+
+
+ Directory for `--audio` recordings: `/.wav`.
+
+
+
+ Directory for cached synthesized user audio. Defaults to
+ `/pipecat/tts`.
+
+
+
+ Disable the user-audio cache: re-synthesize every turn (no reads or writes).
+
+
+
+ Default per-expectation timeout in seconds, for expectations without their own
+ `within_ms`.
+
+
+
+ Directory for each scenario's logs: `/.eval.log` (plus
+ `.debug.log` under `--debug`).
+
+
+
+ Also save `.debug.log` with the harness's full per-pipeline logs.
+
+
+
+ Cancel the agent's pipeline (exit it) after the run. By default the agent is
+ left running so it can serve more scenarios.
+
+
+## eval suite
+
+Spawn the agents in a manifest and run their scenarios concurrently. Every manifest setting except the `suite:` list can also be overridden on the command line (the command line wins).
+
+**Usage:**
+
+```shell
+pipecat eval suite [OPTIONS] MANIFEST_PATH
+```
+
+**Arguments:**
+
+
+ Manifest YAML listing agents and their scenarios.
+
+
+**Options:**
+
+
+ Only run bots whose path contains this substring.
+
+
+
+ Only run this scenario name.
+
+
+
+ Run subdirectory name under `runs_dir`. Defaults to a timestamp.
+
+
+
+ Output base, overriding the manifest's `runs_dir`. A `/` subdirectory
+ with `logs/` and `recordings/` is created under it. Defaults to `eval-runs`.
+
+
+
+ Override the manifest's `bots_dir` (bot paths are relative to it).
+
+
+
+ Override the manifest's `scenarios_dir`.
+
+
+
+ Override the manifest's `concurrency` (how many runs execute at once).
+
+
+
+ Override the manifest's `base_port` (default `7900`). Each run gets `base_port + index`.
+
+
+
+ Override the manifest's `cache_dir` for cached synthesized user audio.
+
+
+
+ Disable the user-audio cache: re-synthesize every turn (no reads or writes).
+
+
+
+ Default per-expectation timeout in seconds, for expectations without their own
+ `within_ms`.
+
+
+
+ Override the manifest's spawn template. Default: `"{python} {bot} -t eval --port {port}"`.
+
+
+
+ Override the Python interpreter used to spawn each agent.
+
+
+
+ Record conversation audio.
+
+
+
+ Also save `.debug.log` with the harness's full per-pipeline logs.
+
+
+## Examples
+
+```shell
+# Run one scenario against a running agent
+pipecat eval run scenarios/capital_question.yaml
+
+# Run a batch of scenarios, verbosely
+pipecat eval run scenarios/*.yaml -v
+
+# Run a full suite
+pipecat eval suite manifest.yaml
+
+# Only the support agent, 8 runs at a time, named output dir
+pipecat eval suite manifest.yaml -p support -c 8 -n nightly
+```
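The exit-code contract documented in `eval.mdx` above (`0` when everything passes, `1` otherwise) is what makes the command scriptable. A minimal sketch of a wrapper, assuming only the `pipecat eval run SCENARIOS...` form and the `-v` flag shown in the examples; `build_eval_run_command` and `evals_passed` are hypothetical helper names, not part of Pipecat:

```python
import subprocess
from typing import Callable, Sequence


def build_eval_run_command(scenarios: Sequence[str], verbose: bool = False) -> list[str]:
    """Build the argv for `pipecat eval run` (hypothetical helper)."""
    if not scenarios:
        raise ValueError("at least one scenario file is required")
    cmd = ["pipecat", "eval", "run", *scenarios]
    if verbose:
        # `-v` is the verbose flag used in the documented examples.
        cmd.append("-v")
    return cmd


def evals_passed(
    cmd: Sequence[str],
    runner: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> bool:
    """Run the command and map its exit code to a boolean: 0 means pass."""
    completed = runner(list(cmd))
    return completed.returncode == 0
```

The `runner` parameter exists only so the wrapper can be exercised without a running agent; in a real script the default `subprocess.run` would be used.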
diff --git a/api-reference/cli/overview.mdx b/api-reference/cli/overview.mdx
index 3c0f7a1f..09156852 100644
--- a/api-reference/cli/overview.mdx
+++ b/api-reference/cli/overview.mdx
@@ -19,11 +19,11 @@ description: "Command-line tool for scaffolding, deploying, and monitoring Pipec
Push your bots to production with one command
- Watch real-time logs, conversations, and metrics
+ Test your agents with scripted scenarios and an LLM judge
@@ -51,7 +51,7 @@ pipecat --version
**[`pipecat init`](/api-reference/cli/init)** - Scaffold new projects with interactive setup
-**[`pipecat tail`](/api-reference/cli/tail)** - Monitor sessions in real-time with a terminal dashboard
+**[`pipecat eval`](/api-reference/cli/eval)** - Run behavioral evals against your agents
**[`pipecat cloud`](/api-reference/cli/cloud/auth)** - Deploy and manage bots on Pipecat Cloud
@@ -62,7 +62,7 @@ View help for any command:
```bash
pipecat --help
pipecat init --help
-pipecat tail --help
+pipecat eval --help
pipecat cloud --help
```
diff --git a/api-reference/cli/tail.mdx b/api-reference/cli/tail.mdx
deleted file mode 100644
index ea56df05..00000000
--- a/api-reference/cli/tail.mdx
+++ /dev/null
@@ -1,51 +0,0 @@
----
-title: tail
-description: "A terminal dashboard for monitoring Pipecat sessions in real-time"
----
-
-**Tail** is a terminal dashboard for monitoring your Pipecat sessions in real-time with logs, conversations, metrics, and audio levels all in one place.
-
-With Tail you can:
-
-- Follow system logs in real time
-- Track conversations as they happen
-- Monitor user and agent audio levels
-- Keep an eye on service metrics and usage
-- Run locally as a pipeline runner or connect to a remote session
-
-**Usage:**
-
-```shell
-pipecat tail [OPTIONS]
-```
-
-**Options:**
-
-
- WebSocket URL to connect to. Defaults to `ws://localhost:9292`.
-
-
-## How to Use Tail
-
-- Add `pipecat-ai-cli` to your project's dependencies.
-
-- Update your Pipecat code to include the `TailObserver`:
-
- ```python
- from pipecat_cli.tail import TailObserver
-
- task = PipelineWorker(
-     pipeline,
-     observers=[TailObserver()]
- )
- ```
-
-- Start the Tail app separately:
-
- ```bash
- # Connect to local session (default)
- pipecat tail
-
- # Connect to remote session
- pipecat tail --url wss://my-bot.example.com
- ```
diff --git a/docs.json b/docs.json
index 79af2cf1..e8d3f7a1 100644
--- a/docs.json
+++ b/docs.json
@@ -80,15 +80,6 @@
"pipecat/fundamentals/saving-transcripts",
"pipecat/fundamentals/recording-audio",
"pipecat/fundamentals/metrics",
- {
- "group": "Evaluations",
- "pages": [
- "pipecat/fundamentals/evaluations/overview",
- "pipecat/fundamentals/evaluations/bluejay",
- "pipecat/fundamentals/evaluations/cekura",
- "pipecat/fundamentals/evaluations/coval"
- ]
- },
"pipecat/fundamentals/voicemail",
"pipecat/fundamentals/ivr",
"pipecat/fundamentals/custom-frame-processor",
@@ -102,6 +93,25 @@
}
]
},
+ {
+ "group": "Evals",
+ "pages": [
+ "pipecat/evals/overview",
+ "pipecat/evals/quickstart",
+ "pipecat/evals/scenarios",
+ "pipecat/evals/suites",
+ "pipecat/evals/library",
+ "pipecat/evals/agent-self-improvement",
+ {
+ "group": "Third-party Platforms",
+ "pages": [
+ "pipecat/evals/platforms/bluejay",
+ "pipecat/evals/platforms/cekura",
+ "pipecat/evals/platforms/coval"
+ ]
+ }
+ ]
+ },
{
"group": "Features",
"pages": [
@@ -816,7 +826,7 @@
"group": "Commands",
"pages": [
"api-reference/cli/init",
- "api-reference/cli/tail",
+ "api-reference/cli/eval",
{
"group": "cloud",
"pages": [
@@ -1345,11 +1355,27 @@
},
{
"source": "/guides/fundamentals/evaluations/overview",
- "destination": "/pipecat/fundamentals/evaluations/overview"
+ "destination": "/pipecat/evals/overview"
},
{
"source": "/guides/fundamentals/evaluations/bluejay",
- "destination": "/pipecat/fundamentals/evaluations/bluejay"
+ "destination": "/pipecat/evals/platforms/bluejay"
+ },
+ {
+ "source": "/pipecat/fundamentals/evaluations/overview",
+ "destination": "/pipecat/evals/overview"
+ },
+ {
+ "source": "/pipecat/fundamentals/evaluations/bluejay",
+ "destination": "/pipecat/evals/platforms/bluejay"
+ },
+ {
+ "source": "/pipecat/fundamentals/evaluations/cekura",
+ "destination": "/pipecat/evals/platforms/cekura"
+ },
+ {
+ "source": "/pipecat/fundamentals/evaluations/coval",
+ "destination": "/pipecat/evals/platforms/coval"
},
{
"source": "/examples",
@@ -2309,7 +2335,7 @@
},
{
"source": "/cli/tail",
- "destination": "/api-reference/cli/tail"
+ "destination": "/api-reference/cli/overview"
},
{
"source": "/cli/cloud/agent",
diff --git a/pipecat/evals/agent-self-improvement.mdx b/pipecat/evals/agent-self-improvement.mdx
new file mode 100644
index 00000000..7869d14c
--- /dev/null
+++ b/pipecat/evals/agent-self-improvement.mdx
@@ -0,0 +1,104 @@
+---
+title: "Agent Self-Improvement"
+description: "Close the loop: let an AI coding assistant write agent code, run evals, and iterate until they pass."
+---
+
+Evals do more than catch regressions. They turn agent quality into a signal that an AI coding assistant can read, which changes how you build: instead of asking an assistant to "improve the prompt" and judging the result by hand, you describe the desired behavior as a scenario and let the assistant iterate until the eval passes.
+
+## The loop
+
+1. **Describe the behavior as a scenario.** A scenario file is an executable specification: the conversation, the expected events, and the criteria a response must meet.
+2. **The assistant changes the agent.** A prompt edit, a new tool, a pipeline change.
+3. **The assistant runs the evals.** One command, either against a running agent (`pipecat eval run`) or letting the suite spawn the agent itself (`pipecat eval suite`).
+4. **The assistant reads the result.** A non-zero exit code, a per-assertion failure message ("turn 1 expectation 0 (llm_response): judge said no: ..."), and a full decision trace in `.eval.log`.
+5. **Repeat until green.**
+
+Steps 2 through 5 need no human in the loop. You review the final diff with the evidence that it works attached.
+
+## Why this works well for coding assistants
+
+The framework was built to be driven by tools, not just humans:
+
+- **One command, one exit code.** `pipecat eval run scenarios/*.yaml` exits `0` on success and `1` on failure, so an assistant knows mechanically whether it's done.
+- **Plain-text output when piped.** Outside a terminal the CLI streams one result line per scenario instead of rendering a live dashboard, which is exactly what an assistant running shell commands sees.
+- **Actionable failures.** Failures name the turn, the expectation, and the reason, including what the judge said. The `.eval.log` decision trace shows every event the harness observed, so "why did this fail" is answerable from files.
+- **Suites are self-contained.** `pipecat eval suite` spawns the agents itself, so an autonomous loop doesn't need to manage processes: edit, run one command, read the result.
+- **Text mode is fast and cheap.** Iterating on prompts and logic skips STT and TTS entirely, so an assistant can afford to run the evals after every change.
+
+## Setting up your project
+
+Keep scenarios in the repo next to the agent and tell your assistant how to run them. For example, in your project's `CLAUDE.md` or `AGENTS.md`:
+
+```markdown
+## Behavioral evals
+
+Evals live in `scenarios/`. To verify any change to the agent's behavior:
+
+1. Start the agent: `uv run bot.py -t eval` (serves ws://localhost:7860)
+2. Run the evals: `pipecat eval run scenarios/*.yaml`
+
+The command exits non-zero on failure and prints each failed assertion.
+Each scenario writes a decision trace to `.eval.log`; read it
+to understand a failure before changing code.
+
+When you add or change agent behavior, add or update a scenario in
+`scenarios/` to cover it.
+```
+
+With that in place, a request like this becomes fully verifiable:
+
+> Add a `get_order_status` tool to the agent and make sure it gets called when the user asks where their order is. Add a scenario for it and run the evals until they pass.
+
+The assistant writes the tool, writes the scenario (a `function_call` assertion plus a judged response), runs `pipecat eval run`, reads any failure, and fixes its own work.
+
+## Evals as acceptance criteria
+
+You can also run the loop in the other direction: write the scenario first, watch it fail, and hand the failure to the assistant. The scenario is the spec, and "make this pass" is the task.
+
+```yaml order_status.yaml
+name: order_status
+
+turns:
+  - user: "Where's my order? The number is 12345."
+    expect:
+      - event: function_call
+        calls:
+          - name: get_order_status
+            args: { order_id: "12345" }
+      - event: response
+        eval: "tells the user the status of their order"
+```
+
+This is test-driven development for agent behavior, with the judge LLM absorbing the fuzziness that makes conversational output hard to assert on with string matching.
+
+## Guardrails
+
+A few practices keep autonomous loops honest:
+
+- **Review scenario changes like code.** An assistant that can edit scenarios can also weaken them. Failing evals should usually be fixed in the agent, not in the scenario.
+- **Keep a regression set.** As behaviors accumulate, so should scenarios. Run the full set (or a suite) before merging, not just the scenario being worked on.
+- **Gate merges in CI.** `pipecat eval suite manifest.yaml` in CI makes "the evals pass" a property of the branch, whoever (or whatever) wrote it. See [Eval Suites](/pipecat/evals/suites).
+- **Use audio mode for the final check.** Iterate in text mode for speed, then run the audio variants before release to cover the full STT, LLM, and TTS path.
+
+## Next steps
+
+
+
+ Give your coding assistant access to Pipecat docs and source context.
+
+
+
+ Layer in simulation platforms and observability once your agent is
+ deployed.
+
+
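Step 4 of the loop in `agent-self-improvement.mdx` above relies on failure messages that name the turn, the expectation, and the reason. A minimal sketch of turning such a line into structured data, assuming the message shape quoted in the page (`turn N expectation M (event): reason`); the `Failure` dataclass and `parse_failure` function are hypothetical and the real CLI output may carry more detail:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches the failure shape quoted in the docs, e.g.
# "turn 1 expectation 0 (llm_response): judge said no: ..."
_FAILURE_RE = re.compile(r"turn (\d+) expectation (\d+) \((\w+)\): (.+)")


@dataclass(frozen=True)
class Failure:
    turn: int
    expectation: int
    event: str
    reason: str


def parse_failure(line: str) -> Optional[Failure]:
    """Return a Failure for a matching line, or None for any other line."""
    match = _FAILURE_RE.search(line)
    if match is None:
        return None
    turn, expectation, event, reason = match.groups()
    return Failure(int(turn), int(expectation), event, reason.strip())
```

Using `search` rather than `match` lets the parser tolerate leading bullets or indentation in the CLI output.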
diff --git a/pipecat/evals/library.mdx b/pipecat/evals/library.mdx
new file mode 100644
index 00000000..f619a127
--- /dev/null
+++ b/pipecat/evals/library.mdx
@@ -0,0 +1,168 @@
+---
+title: "Using the Library"
+description: "Run, build, and orchestrate evals from Python with the pipecat.evals API."
+---
+
+Everything the `pipecat eval` CLI does is available as a library under `pipecat.evals`. Use it to run evals from your own test runner (pytest, a CI script, a custom dashboard), to build scenarios in code instead of YAML, or to customize pieces like the judge LLM.
+
+## Running a scenario
+
+`EvalScenario.load()` parses a scenario file, and `EvalSession.from_scenario()` builds a ready-to-run session, constructing the judge, user speech, and transcriber the scenario calls for:
+
+```python
+import asyncio
+
+from pipecat.evals.harness import EvalSession
+from pipecat.evals.scenario import EvalScenario
+
+
+async def main():
+    scenario = EvalScenario.load("scenarios/capital_question.yaml")
+    session = EvalSession.from_scenario(scenario, "ws://localhost:7860")
+    result = await session.run()
+
+    if result.passed:
+        print(f"PASS ({result.duration_ms}ms)")
+    else:
+        for failure in result.failures:
+            print(f" {failure}")
+
+
+asyncio.run(main())
+```
+
+The agent must already be running with its eval transport (`python bot.py -t eval`), just as with `pipecat eval run`.
+
+### The result
+
+`run()` returns an `EvalResult`:
+
+| Field | Description |
+| --------------- | ------------------------------------------------------------------------------------------- |
+| `scenario_name` | Name of the scenario that ran. |
+| `passed` | Whether every assertion passed. |
+| `failures` | The failed assertions, each with the turn index, expectation index, event name, and reason. |
+| `duration_ms` | Wall-clock time the run took. |
+| `events_seen` | Every semantic event observed, for diagnostics. |
+| `debug_log` | The harness's timestamped decision trace (what the CLI writes to `.eval.log`). |
+| `skipped` | Set (with a reason) when the scenario was not run; such a result is neither pass nor fail. |
+
+This maps cleanly onto a pytest test:
+
+```python
+import pytest
+
+from pipecat.evals.harness import EvalSession
+from pipecat.evals.scenario import EvalScenario
+
+
+@pytest.mark.asyncio
+async def test_capital_question():
+    scenario = EvalScenario.load("scenarios/capital_question.yaml")
+    result = await EvalSession.from_scenario(scenario, "ws://localhost:7860").run()
+    assert result.passed, "\n".join(str(f) for f in result.failures)
+```
+
+## Building scenarios in code
+
+Scenarios are plain dataclasses, so you can construct them programmatically, generating turns from a dataset, parameterizing a template, or skipping YAML entirely:
+
+```python
+from pipecat.evals.scenario import EvalExpectation, EvalScenario, EvalTurn
+
+scenario = EvalScenario(
+    name="capital_question",
+    turns=[
+        EvalTurn(
+            user="What is the capital of Germany?",
+            expect=[
+                EvalExpectation(
+                    event="llm_response",
+                    eval="the response says the capital of Germany is Berlin",
+                )
+            ],
+        )
+    ],
+)
+```
+
+
+ The modality-agnostic `response` event is resolved while parsing YAML. When
+ constructing scenarios in code, use `llm_response` for text mode directly (or
+ `response` only when you also configure audio judging).
+
+
+## Customizing the judge
+
+`from_scenario()` builds the judge from the scenario's `judge:` block, but you can inject your own. `EvalJudge` works with any Pipecat LLM service backed by an OpenAI-compatible API:
+
+```python
+import os
+
+from pipecat.evals.harness import EvalSession
+from pipecat.evals.judge import EvalJudge
+from pipecat.services.openai.llm import OpenAILLMService
+
+llm = OpenAILLMService(
+    api_key=os.environ["OPENAI_API_KEY"],
+    settings=OpenAILLMService.Settings(model="gpt-4o-mini"),
+)
+
+session = EvalSession.from_scenario(
+    scenario,
+    "ws://localhost:7860",
+    judge=EvalJudge(llm),
+)
+```
+
+The same injection points exist for the user's synthesized voice (`speech=`, wrapping any `TTSService` in an `EvalSpeech`) and the transcriber used for the agent's spoken audio (`transcriber=`, wrapping any `STTService` in an `EvalTranscriber`). The wrapped services can be local models or HTTP-based; WebSocket-streaming services are rejected, since they need a running pipeline to manage their connection lifecycle.
+
+## Observing progress
+
+Pass `on_progress` to get a callback as each turn and expectation resolves, which is how the CLI implements its `--verbose` output:
+
+```python
+from pipecat.evals.harness import EvalSession, EvalTurnProgress
+
+
+def show(p: EvalTurnProgress):
+    print(f"turn {p.turn_index} [{p.status}] {p.event_name} {p.detail}")
+
+
+session = EvalSession.from_scenario(scenario, url, on_progress=show)
+```
+
+## Orchestrating suites
+
+`EvalManifest` and `EvalSuite` are the library behind `pipecat eval suite`: the suite spawns each agent with its eval transport on its own port, runs its scenarios, and executes several runs concurrently:
+
+```python
+import asyncio
+from pathlib import Path
+
+from pipecat.evals.suite import EvalManifest, EvalSuite
+
+
+async def main():
+    manifest = EvalManifest.load("manifest.yaml")
+    suite = EvalSuite(manifest)
+
+    # Optionally narrow the runs, like the CLI's -p / -s flags.
+    suite.filter(pattern="support")
+
+    await suite.run(
+        Path("eval-runs/logs"),
+        on_update=lambda run: print(run.bot, run.scenario, run.status),
+    )
+
+    for run in suite.runs:
+        verdict = run.error or ("passed" if run.result and run.result.passed else "failed")
+        print(f"{run.bot} / {run.scenario}: {verdict}")
+
+
+asyncio.run(main())
+```
+
+Each run is mutated in place as it executes (`status`, `result`, `error`, `duration_ms`), so a live display can render directly from `suite.runs`.
+
+`EvalManifest.load()` accepts keyword overrides for every manifest value (`concurrency`, `base_port`, `spawn`, `scenarios_dir`, and so on), mirroring the CLI flags.
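The precedence rule described in `library.mdx` and `eval.mdx` above (manifest values, overridden by keyword or command-line values) and the per-run port assignment (`base_port + index`) can be sketched with plain dictionaries. This is an illustration under stated assumptions, not the library's implementation: `resolve_settings`, `port_for_run`, and `DEFAULTS` are hypothetical names, and only the two defaults the docs state (`7900` and `eval-runs`) are included.

```python
from typing import Any, Mapping, Optional

# Only the defaults named in the docs; other settings have no assumed default.
DEFAULTS: dict[str, Any] = {"base_port": 7900, "runs_dir": "eval-runs"}


def resolve_settings(
    manifest: Mapping[str, Any],
    overrides: Mapping[str, Optional[Any]],
) -> dict[str, Any]:
    """Merge defaults, manifest values, and overrides; overrides win."""
    settings = dict(DEFAULTS)
    settings.update(manifest)
    for key, value in overrides.items():
        # The `suite:` list cannot be overridden, and None means "flag not given".
        if key != "suite" and value is not None:
            settings[key] = value
    return settings


def port_for_run(base_port: int, index: int) -> int:
    """Each run gets its own port: base_port + index."""
    return base_port + index
```

Treating `None` as "not provided" mirrors how optional command-line flags are commonly represented, which keeps an omitted flag from clobbering a manifest value.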
diff --git a/pipecat/evals/overview.mdx b/pipecat/evals/overview.mdx
new file mode 100644
index 00000000..0a6aaa8d
--- /dev/null
+++ b/pipecat/evals/overview.mdx
@@ -0,0 +1,164 @@
+---
+title: "Pipecat Evals"
+sidebarTitle: "Overview"
+description: "Behavioral testing for your agents: scripted conversations, semantic assertions, and an LLM judge."
+---
+
+Pipecat Evals is the framework's built-in system for testing agent behavior. You describe a conversation and the behavior you expect, and Pipecat runs it against your real agent (the same pipeline, the same services, the same code) and tells you whether the expectation still holds.
+
+```yaml capital_question.yaml
+name: capital_question
+
+turns:
+  - user: "What is the capital of Germany?"
+    expect:
+      - event: response
+        eval: "the response says the capital of Germany is Berlin"
+```
+
+```bash
+pipecat eval run capital_question.yaml
+```
+
+## Why evals matter
+
+Voice agents are probabilistic systems. The same agent can answer differently run to run, and a prompt tweak, a model upgrade, or a service swap can quietly break behavior that used to work: a function that no longer gets called, context that stops carrying across turns, an interruption that derails the conversation. Manual testing catches some of this, but it's slow, unrepeatable, and impractical to run on every change.
+
+Evals make agent behavior testable the way unit tests make code testable:
+
+- **Regression safety**: run your scenarios after every prompt, model, or pipeline change and catch breakage before users do.
+- **Fast iteration**: text-mode evals skip STT and TTS entirely, so a full conversation test runs in seconds with no audio service cost.
+- **Semantic assertions**: an LLM judge checks meaning ("the response says the capital is Berlin"), not exact strings, so tests don't break when wording changes.
+- **A feedback signal for AI coding assistants**: evals give a coding assistant a command it can run and a pass/fail result it can read, closing the loop between writing agent code and verifying it. See [Agent Self-Improvement](/pipecat/evals/agent-self-improvement).
+
+Pipecat itself relies on this framework: before every release, an eval suite drives 100+ example agents end to end.
+
+## How it works
+
+Pipecat Evals has two halves:
+
+1. **The eval transport.** Your agent runs unchanged with the eval transport. If your agent uses `create_transport()` and the development runner, this is already built in: start it with `-t eval` and it hosts a local WebSocket server speaking RTVI, instead of connecting to Daily, WebRTC, or telephony.
+
+2. **The eval harness.** The harness connects to that transport as an RTVI client, plays the scenario's user turns (as text, or as synthesized speech in audio mode), collects the events your agent emits, and asserts on them in order: transcriptions, LLM responses, spoken output, function calls, and timing.
+
+When a scenario asserts on meaning rather than exact text, a **judge LLM** evaluates the agent's response against a natural-language criterion. The judge runs locally with [Ollama](https://ollama.com) by default, or against OpenAI or any OpenAI-compatible endpoint.
+
+### Text and audio modes
+
+Every scenario runs in one of two modes:
+
+| Mode | User input | Agent output | Best for |
+| ------------------ | -------------------------------------------------------- | --------------------------------------------------------------- | --------------------------------------------------------------- |
+| **Text** (default) | Sent as text, bypassing the STT | LLM text; TTS is skipped automatically | Fast, cheap iteration on prompts, logic, and function calling |
+| **Audio** | Synthesized by a TTS the harness runs (local by default) | Real synthesized speech, transcribed by an STT the harness runs | True end-to-end coverage of the full STT, LLM, and TTS pipeline |
+
+Text mode exercises your agent's actual pipeline and context handling while skipping the audio services, so it costs nothing in TTS or STT usage and runs fast. Audio mode synthesizes the user's voice, streams it through your agent's real STT, and transcribes the agent's actual spoken audio for judging, catching issues that only surface with real speech (turn detection, homophones, barge-in).
+
+## What you can test
+
+- **Response content**: substring checks (`text_contains`) or semantic judging (`eval`) of the agent's replies.
+- **Multi-turn context**: verify the agent remembers earlier turns.
+- **Function calling**: assert that specific tools were called, with specific arguments.
+- **Interruptions**: barge in mid-response and verify the agent recovers (`send_after`).
+- **Latency**: per-event budgets with `within_ms`.
+- **Vision**: serve an image when the agent requests one and judge its description.
+
+## YAML or Python
+
+Scenarios are YAML files, so they're easy to write, review, and share. Everything is also available as a library: load and run scenarios programmatically, build them in code, inject a custom judge, or orchestrate whole suites from your own tooling. See [Using the Library](/pipecat/evals/library).
+
+## Requirements
+
+- **Pipecat CLI**: the `pipecat eval` commands ship with the CLI extra: `uv tool install "pipecat-ai[cli]"`. The same commands are available as `python -m pipecat.evals`.
+- **A judge LLM** (for `eval:` assertions): Ollama by default (`ollama pull gemma2:9b`), or point the scenario's `judge:` block at OpenAI or any OpenAI-compatible endpoint.
+- **Audio services** (audio mode only): the harness needs a TTS to synthesize the user's voice and an STT to transcribe the agent's speech. Both can be local models or HTTP-based services; the defaults are local (Kokoro and Moonshine or Whisper, installed with `uv add "pipecat-ai[kokoro,moonshine]"` or `uv add "pipecat-ai[kokoro,whisper]"`), which download once on first use and run with no keys and no per-run cost. WebSocket-streaming services aren't supported here, which keeps the harness simple.
+- **Your agent's own credentials**: the agent under test is your real agent, so it needs the same service API keys it normally would.
+
+## Production evaluation
+
+Pipecat Evals is built for development: fast, local, repeatable, and run on every change. Once your agent is deployed, third-party evaluation platforms complement it with testing and monitoring at production scale:
+
+- **Simulations**: scripted or AI-driven test calls over API, WebSocket, or telephony, exercising multi-turn flows, edge cases, and real phone-network conditions before they reach users.
+- **Observability**: continuous evaluation of live traffic, with automated quality scoring of calls and transcripts, and metrics tracked over time to catch quality drift.
+
+
+
+ AI-native simulation and evaluation platform for voice agents, trusted by
+ QA, Engineering, Operations, AI, and Executive teams.
+
+
+
+ Simulation, observability, and evaluation platform with native Pipecat Cloud
+ integration. Supports no-code API, WebSocket, and telephony testing.
+
+
+
+ Automated testing and monitoring platform with native Pipecat integration
+ for WebRTC and text-based testing, plus support for Mock Tools, Custom
+ Dynamic Variables, and more.
+
+
+
+
+ Building an evaluation integration for Pipecat? We welcome contributions to
+ this page. Open a PR on the [docs
+ repository](https://github.com/pipecat-ai/docs).
+
+
+Pipecat's other building blocks feed into any evaluation workflow: [Metrics](/pipecat/fundamentals/metrics) for TTFB, processing time, and usage; [Saving Transcripts](/pipecat/fundamentals/saving-transcripts) for offline analysis; [OpenTelemetry](/api-reference/server/utilities/opentelemetry) for latency traces; and [Observers](/api-reference/server/utilities/observers/observer-pattern) for custom instrumentation.
+
+## Next steps
+
+
+
+ Run your first eval against an existing agent in a few minutes.
+
+
+
+ The full scenario format: turns, expectations, modalities, and the judge.
+
+
+
+ Spawn multiple agents and run many scenarios concurrently from a manifest.
+
+
+
+ Close the loop: let an AI coding assistant write agent code, run evals, and
+ fix failures.
+
+
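`overview.mdx` above says the harness asserts on collected events in order, and the scenario format page in this change adds that other events may arrive in between. That combination amounts to an ordered-subsequence check. A minimal sketch, assuming events are compared by name only; `first_unmatched` is a hypothetical function and the real harness also evaluates criteria, arguments, and timing:

```python
from typing import Iterable, Optional, Sequence


def first_unmatched(expected: Sequence[str], observed: Iterable[str]) -> Optional[int]:
    """Return the index of the first expected event that never arrived in
    order, or None when every expected event was matched."""
    stream = iter(observed)
    for index, wanted in enumerate(expected):
        # `any` consumes the shared iterator up to and including the first
        # match, so later expectations only see events that came afterwards.
        if not any(event == wanted for event in stream):
            return index
    return None
```

Because the iterator is shared across expectations, an event that arrives too early cannot satisfy a later expectation, which is what makes the check order-sensitive.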
diff --git a/pipecat/fundamentals/evaluations/bluejay.mdx b/pipecat/evals/platforms/bluejay.mdx
similarity index 97%
rename from pipecat/fundamentals/evaluations/bluejay.mdx
rename to pipecat/evals/platforms/bluejay.mdx
index 5bfdd1a6..1efbd879 100644
--- a/pipecat/fundamentals/evaluations/bluejay.mdx
+++ b/pipecat/evals/platforms/bluejay.mdx
@@ -140,12 +140,12 @@ OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
- Learn about evaluation strategies for Pipecat agents.
+ Pipecat's built-in behavioral testing framework.
+
+ If your agent uses `create_transport()`, it supports the eval transport with a one-line addition to its `transport_params`:
+
+ ```python
+ from pipecat.transports.websocket.server import WebsocketServerParams
+
+ transport_params = {
+     "eval": lambda: WebsocketServerParams(
+         audio_in_enabled=True,
+         audio_out_enabled=True,
+     ),
+     # ... your other transports (daily, webrtc, twilio, ...)
+ }
+ ```
+
+ Then start the agent with `-t eval`:
+
+ ```bash
+ uv run bot.py -t eval
+ ```
+
+ ```
+ Bot ready! (eval transport on ws://localhost:7860)
+ ```
+
+ Instead of connecting to Daily or WebRTC, the agent now hosts a local WebSocket server and waits for the eval harness to connect. Nothing else in the agent changes: same pipeline, same services, same event handlers.
+
+
+ The harness talks to your agent over RTVI. `PipelineWorker` adds an
+ `RTVIProcessor` and `RTVIObserver` automatically, so the standard agent
+ setup needs no extra wiring. All Pipecat example agents already include
+ the `"eval"` transport entry.
+
+
+
+
+
+ A scenario is a YAML file describing a scripted conversation and the behavior you expect. Save this as `scenarios/capital_question.yaml`:
+
+
+
+ ```yaml
+ name: capital_question
+
+ turns:
+   # The agent greets on connect; wait for the greeting before speaking.
+   - expect:
+       - event: response
+         eval: "the bot opens the conversation with a greeting or an offer to help"
+
+   - user: "What is the capital of Germany?"
+     expect:
+       - event: response
+         eval: "the response says the capital of Germany is Berlin"
+ ```
+
+
+ ```yaml
+ name: capital_question
+
+ judge:
+   eval:
+     service: openai
+     model: gpt-4o-mini
+
+ turns:
+   # The agent greets on connect; wait for the greeting before speaking.
+   - expect:
+       - event: response
+         eval: "the bot opens the conversation with a greeting or an offer to help"
+
+   - user: "What is the capital of Germany?"
+     expect:
+       - event: response
+         eval: "the response says the capital of Germany is Berlin"
+ ```
+
+
+
+ Each turn optionally sends a user utterance and lists the events expected in response. The `eval:` field is a natural-language criterion checked by the judge LLM, so the test passes whether the agent says "Berlin is the capital of Germany" or "That would be Berlin!".
+
+ This scenario runs in **text mode** (the default): the user turn is sent as text and the agent's TTS is skipped automatically, so the whole conversation costs nothing in audio services and finishes in seconds.
+
+
+ Ollama with `gemma2:9b` is the default judge, which is why the first tab
+ has no `judge:` block. To use a different judge LLM, add a `judge.eval:`
+ block as in the OpenAI tab.
+
+
+
+
+
+ With the agent still running, run the scenario from another terminal:
+
+ ```bash
+ pipecat eval run scenarios/capital_question.yaml
+ ```
+
+ The harness connects to `ws://localhost:7860` (override with `--bot-url`), drives the conversation, and reports the result. Pass `-v` to watch each turn resolve:
+
+ ```
+ turn 0 → (observe)
+   ✓ llm_response → "Hello! How can I help you today?"
+ turn 1 → "What is the capital of Germany?"
+   ✓ llm_response → "The capital of Germany is Berlin."
+
+ ✓ ws://localhost:7860 capital_question (3402ms)
+
+ 1/1 passed · 3.4s
+ ```
+
+ The command exits `0` when everything passes and `1` otherwise, so it slots directly into scripts and CI. Each scenario also writes a decision trace to `.eval.log`, which shows every event the harness saw and why each assertion passed or failed.
+
+
+
+
+ Change the criterion to something false, for example `"the response says the capital of Germany is Madrid"`, and run again:
+
+ ```
+ โ ws://localhost:7860 capital_question
+
+ Failed (1):
+ ✗ ws://localhost:7860 capital_question
+ • turn 1 expectation 0 (llm_response): judge said no: the reply says the capital is Berlin, not Madrid
+
+ 0/1 passed, 1 failed · 4.1s
+ ```
+
+ A failing eval tells you which turn, which expectation, and why. That message (plus the `.eval.log` trace) is what you, or your AI coding assistant, iterate against.
+
+
+
+
+## Where to go next
+
+- Learn the full scenario format, including multi-turn conversations, function call assertions, interruptions, latency budgets, and text vs audio modes, in [Writing Scenarios](/pipecat/evals/scenarios).
+- Have many scenarios or agents? Let Pipecat spawn the agents for you with [Eval Suites](/pipecat/evals/suites).
+- Want your coding assistant to run these for you? See [Agent Self-Improvement](/pipecat/evals/agent-self-improvement).
diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx
new file mode 100644
index 00000000..58911980
--- /dev/null
+++ b/pipecat/evals/scenarios.mdx
@@ -0,0 +1,327 @@
+---
+title: "Writing Scenarios"
+description: "The scenario file format: turns, expectations, assertions, and text vs audio modes."
+---
+
+A scenario is a YAML file describing a scripted conversation and the events you expect your agent to emit. This page covers the full format. If you haven't run a scenario yet, start with the [quickstart](/pipecat/evals/quickstart).
+
+## Anatomy of a scenario
+
+```yaml
+name: multi_turn # required: the eval's name
+
+judge: # optional: judge modality and LLM (defaults shown below)
+ eval:
+ service: ollama
+ model: gemma2:9b
+
+turns: # required: the conversation, in order
+ - user: "My name is Alex, and I'm planning a trip to Italy."
+ expect:
+ - event: response
+ eval: "acknowledges the user's message (the name Alex and/or the trip to Italy)"
+
+ - user: "Remind me: what's my name and where am I going?"
+ expect:
+ - event: response
+ eval: "recalls that the user's name is Alex and the destination is Italy"
+```
+
+Each turn optionally sends a user utterance (`user:`) and lists the events expected in response (`expect:`). Expected events must arrive in the order listed, but the agent may emit other events in between, so you don't have to enumerate everything it does.
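
The ordering rule amounts to a subsequence check: the expected events must appear in the observed stream in order, with any number of other events between them. The sketch below illustrates that rule only; `matches_in_order` is a hypothetical helper, not the harness's actual implementation:

```python
def matches_in_order(expected, observed):
    """Return True if `expected` occurs within `observed` in order, gaps allowed."""
    remaining = iter(observed)
    # `name in remaining` advances the iterator past the first match, so each
    # expected event has to be found after the previous one.
    return all(name in remaining for name in expected)


observed = ["llm_started", "function_call", "llm_response", "response"]
print(matches_in_order(["llm_started", "response"], observed))  # True
print(matches_in_order(["response", "llm_started"], observed))  # False
```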
+
+A turn without a `user:` field is observation-only: the harness just waits for the expected events. This is how you test agent-first behavior like an on-connect greeting:
+
+```yaml
+turns:
+ # No user input: just wait for the agent to speak first.
+ - expect:
+ - event: response
+ eval: "the bot opens the conversation with a greeting or an offer to help"
+```
+
+## Events
+
+Scenarios assert on a small set of semantic events, mapped from the RTVI messages the agent emits:
+
+| Event | Meaning |
+| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `response` | The agent's reply. In audio mode this is a transcription of the agent's actual synthesized speech; in text mode it resolves to `llm_response`. Prefer this for content checks. |
+| `llm_response` | The LLM's text output for the turn. Available in both modes. |
+| `tts_response` | The text the TTS reports speaking, one segment at a time. Audio mode only. |
+| `llm_started` | The LLM began generating a response. |
+| `function_call` | The LLM called a function. |
+| `user_transcription` | The agent's STT finalized a transcription of the user. Audio mode only. |
+| `user_started_speaking` | The agent's VAD detected the start of user speech. Audio mode only. |
+| `user_stopped_speaking` | The agent's VAD detected the end of user speech. Audio mode only. |
+
+
+ Use `response` for the agent's reply unless you have a reason not to. It's
+ modality-agnostic: the same scenario judges LLM text in text mode and the
+ transcription of real spoken audio in audio mode, so one file covers both.
+
+
+## Assertions
+
+Each entry in `expect:` names an event and, optionally, asserts on its content or timing.
+
+### Semantic judging with `eval`
+
+The `eval:` field is a natural-language criterion that the event's text must satisfy, decided by the judge LLM:
+
+```yaml
+- user: "What's 2 plus 2?"
+ expect:
+ - event: response
+ eval: "the response says the answer is four"
+```
+
+The judge sees the whole conversation so far, so it can resolve terse or context-dependent replies (like "That's four"). It also understands that audio-mode responses come from a speech-to-text pass and judges intended meaning rather than exact spelling, so "for" transcribed instead of "four" still passes.
+
+The judge handles interim replies gracefully: if the agent says "Let me check on that." before the real answer, the harness keeps accumulating response text and re-judges until the criterion is met or the time budget runs out.
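
The accumulate-and-re-judge behavior can be pictured as a loop. This is a sketch under stated assumptions: `judge_until_met` is a hypothetical name, `judge` is a plain callable standing in for the judge LLM, and `segments` stands in for the response text as it arrives:

```python
import time


def judge_until_met(segments, judge, budget_s, clock=time.monotonic):
    """Accumulate reply text and re-judge after each new segment."""
    deadline = clock() + budget_s
    heard = []
    for segment in segments:
        if clock() > deadline:
            break  # the time budget ran out before the criterion was met
        heard.append(segment)
        # Judge everything heard so far, not just the newest segment.
        if judge(" ".join(heard)):
            return True
    return False


reply = ["Let me check on that.", "The capital of Germany is Berlin."]
print(judge_until_met(reply, lambda text: "Berlin" in text, budget_s=60))  # True
```

The interim "Let me check on that." fails the criterion on its own, but the check passes once the second segment is appended.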
+
+`eval:` only makes sense on the agent's text output (`response`, `llm_response`, `tts_response`).
+
+### Substring checks with `text_contains`
+
+For exact content, `text_contains` does a plain substring check, with no judge round-trip:
+
+```yaml
+- user: "What is the capital of France?"
+ expect:
+ - event: response
+ text_contains: "Paris"
+```
+
+### Latency budgets with `within_ms`
+
+`within_ms` bounds how long after the turn's user send the event may arrive. All of a turn's expectations share that one anchor:
+
+```yaml
+- user: "What is the capital of France?"
+ expect:
+ - event: llm_started
+ within_ms: 2000 # the LLM must start responding within 2s
+ - event: response
+ text_contains: "Paris"
+```
+
+When omitted, an expectation defaults to a generous 60-second budget (configurable with `--timeout`), so timing is only asserted when you ask for it.
+
+Because every deadline is measured from the send, time spent matching earlier expectations counts against later ones. In the example above, if `llm_started` arrives at 1.5 seconds, the `response` (with the default 60-second budget) has 58.5 seconds left, and a turn that stalls completely fails within a single budget rather than one per expectation.
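
The shared-anchor arithmetic can be written out directly. In this sketch, `remaining_ms` is a hypothetical helper and the constant mirrors the documented 60-second default:

```python
DEFAULT_TIMEOUT_MS = 60_000  # the documented default budget


def remaining_ms(elapsed_ms, within_ms=None):
    """Time left for an expectation whose deadline is anchored at the user send."""
    budget = DEFAULT_TIMEOUT_MS if within_ms is None else within_ms
    # Every expectation in a turn measures from the same send time, so time
    # already spent on earlier expectations is subtracted here.
    return max(0, budget - elapsed_ms)


print(remaining_ms(1500))        # 58500
print(remaining_ms(1500, 2000))  # 500
```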
+
+### Function calls
+
+A `function_call` expectation asserts that the turn invoked one or more tools. List the expected calls under `calls:`; they're matched by name in any order, and the expectation passes once all are found:
+
+```yaml
+- user: "What's the weather in San Francisco? And recommend a restaurant."
+ expect:
+ - event: function_call
+ calls:
+ - name: get_current_weather
+ args: { location: "San Francisco" }
+ - name: get_restaurant_recommendation
+ - event: response
+ eval: "describes the weather and recommends a restaurant"
+```
+
+`args` is a subset check: every listed key/value must be present in the call's arguments, and extra arguments are ignored. A single expected call can use the `name:`/`args:` shorthand directly on the expectation, and a bare `function_call` with neither just asserts that some call happened.
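
The matching rules can be sketched as two small predicates. The helper names are hypothetical, and comparing argument values with strict equality is an assumption of this sketch, since the page does not say how values are compared:

```python
def call_matches(expected, actual):
    """One expected call matches if names agree and `args` is a subset."""
    if expected["name"] != actual["name"]:
        return False
    actual_args = actual.get("args", {})
    # Subset check: every listed key/value must be present; extras are ignored.
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected.get("args", {}).items()
    )


def all_calls_found(expected_calls, actual_calls):
    """Every expected call must match some actual call, in any order."""
    return all(
        any(call_matches(exp, act) for act in actual_calls)
        for exp in expected_calls
    )


actual = [
    {"name": "get_restaurant_recommendation", "args": {"location": "San Francisco"}},
    {"name": "get_current_weather", "args": {"location": "San Francisco", "format": "celsius"}},
]
expected = [
    {"name": "get_current_weather", "args": {"location": "San Francisco"}},
    {"name": "get_restaurant_recommendation"},
]
print(all_calls_found(expected, actual))  # True
```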
+
+## Interruptions
+
+`send_after:` schedules a turn's user send relative to a prior event, which is how you script barge-in tests:
+
+```yaml
+turns:
+ - user: "Tell me a long, detailed story about the history of Paris."
+ expect:
+ - event: llm_started
+
+ # Interrupt 2 seconds after the agent starts its long answer.
+ - user: "Actually, never mind that, what's the capital of Japan?"
+ send_after:
+ event: llm_started
+ delay_ms: 2000
+ expect:
+ - event: response
+ eval: "the response says the capital of Japan is Tokyo, instead of continuing the Paris story"
+```
+
+## Vision turns
+
+A turn may register an image with `image:` (a path relative to the scenario file). When a vision agent requests a user image during the turn, the eval transport serves it:
+
+```yaml
+turns:
+ - user: "What do you see in this image?"
+ image: assets/cat.jpg
+ expect:
+ - event: response
+ eval: "the response describes a cat"
+```
+
+## Text and audio modes
+
+Two top-level blocks control a scenario's modalities, and each has its own `modality:` field:
+
+- `user:` sets how each turn's utterance is delivered to the agent: sent as text, bypassing its STT (`modality: text`), or synthesized into real speech (`modality: audio`).
+- `judge:` sets what the judge evaluates: the agent's LLM text, with its TTS skipped (`modality: text`), or a transcription of its actual spoken audio (`modality: audio`).
+
+When `modality:` isn't specified, or a block is omitted entirely, it defaults to `text`. The two sides are also independent: you can drive the agent with text while judging its real speech, or speak to it and judge the LLM text.
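
The defaulting rule is easy to state in code. This sketch works on an already-parsed scenario mapping; `resolve_modalities` is a hypothetical helper, not a Pipecat API:

```python
def resolve_modalities(scenario):
    """Return the (user, judge) modalities for a parsed scenario mapping."""
    # A missing block and a missing `modality:` key both fall back to text.
    user = (scenario.get("user") or {}).get("modality", "text")
    judge = (scenario.get("judge") or {}).get("modality", "text")
    return user, judge


print(resolve_modalities({}))                               # ('text', 'text')
print(resolve_modalities({"judge": {"modality": "audio"}}))  # ('text', 'audio')
```

The second call shows the two sides resolving independently: text in, spoken audio judged.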
+
+A scenario with neither block runs entirely in text mode. No audio flows on either side, so this is the fastest and cheapest way to test prompts, conversational logic, and function calling: no audio service cost, and a multi-turn scenario finishes in seconds. The judge LLM is the only service the harness itself needs (Ollama with `gemma2:9b` by default).
+
+
+ The top-level `user:` block only configures delivery. Each turn's `user:`
+ field is the utterance itself, and is written the same way in both modes.
+
+
+### User input with `user:`
+
+**Text (the default).** Each turn's utterance is sent to the agent as text, bypassing its STT. This needs no configuration; it's equivalent to:
+
+```yaml
+user:
+ modality: text
+```
+
+**Audio.** Each turn's utterance is synthesized by a TTS the harness runs and streamed into your agent's pipeline at real-time cadence, exercising its VAD, turn detection, and STT exactly as a live microphone would. Synthesized audio is cached across runs, so repeated turns don't re-synthesize. The `speech:` block (the TTS service and voice) is required:
+
+```yaml
+user:
+ modality: audio
+ speech:
+ service: kokoro # local TTS, no API key, no per-run cost
+ voice: af_heart
+ sample_rate: 16000
+```
+
+
+ The built-in speech services are `kokoro` (a local model and the recommended
+ default) and `cartesia` (HTTP), for when you want a cloud voice.
+
+
+### Judging with `judge:`
+
+**Text (the default).** The agent's TTS is skipped automatically, including any on-connect greeting, and the judge evaluates the LLM's text output. Fast and silent; equivalent to:
+
+```yaml
+judge:
+ modality: text
+```
+
+**Audio.** The agent speaks for real. The harness captures its synthesized audio, transcribes it with the configured STT, and the `response` event becomes that transcription, so the judge evaluates what a user would actually have heard. Paired with `user.modality: audio`, this is the true end-to-end check: STT in, LLM in the middle, TTS out. The `transcription:` block is required:
+
+```yaml
+judge:
+ modality: audio
+ transcription:
+ service: moonshine # STT for the agent's audio (or: whisper)
+ model: small-streaming
+ eval:
+ service: ollama # the judge LLM
+ model: gemma2:9b
+```
+
+
+ The built-in transcribers are `moonshine` and `whisper`, both local models.
+ When `transcription.service:` is omitted, it defaults to `moonshine`.
+
+
+In either modality, the `judge.eval:` block selects the judge LLM: `ollama` (the default, `gemma2:9b`), `openai`, or any OpenAI-compatible endpoint via `endpoint:`.
+
+### Custom services with `factory:`
+
+To use a TTS or STT beyond the built-ins, both blocks accept a `factory:` escape hatch: a dotted path to a callable that receives the block's mapping and the resolved sample rate, and returns the service. Any extra keys you put in the block are passed through to your factory:
+
+```yaml
+user:
+ modality: audio
+ speech:
+ factory: "my_evals.services.make_tts"
+ voice: luna # available to your factory as speech_cfg["voice"]
+
+judge:
+ modality: audio
+ transcription:
+ factory: "my_evals.services.make_stt"
+```
+
+```python my_evals/services.py
+import os
+
+from pipecat.services.fal.stt import FalSTTService
+from pipecat.services.rime.tts import RimeHttpTTSService
+
+
+def make_tts(speech_cfg, sample_rate):
+ return RimeHttpTTSService(
+ api_key=os.environ["RIME_API_KEY"],
+ settings=RimeHttpTTSService.Settings(voice=speech_cfg["voice"]),
+ sample_rate=sample_rate,
+ )
+
+
+def make_stt(transcription_cfg, sample_rate):
+ return FalSTTService(api_key=os.environ["FAL_KEY"])
+```
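
A dotted path such as `my_evals.services.make_tts` names a module and an attribute inside it. How the harness resolves the path internally is not shown here; the sketch below is one conventional way to do it with the standard library, and `load_factory` is a hypothetical name:

```python
import importlib


def load_factory(dotted_path):
    """Resolve 'package.module.attr' to the object it names."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted path, got {dotted_path!r}")
    # Import the module part, then look up the final component on it.
    return getattr(importlib.import_module(module_name), attr)


# Demonstrated with a standard-library callable so the snippet runs anywhere.
join = load_factory("os.path.join")
print(join("scenarios", "greeting.yaml"))
```

For the factory to be importable this way, the package containing it (`my_evals` in the example above) has to be on the Python path of the process running the evals.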
+
+
+ The service your factory returns must be a local model or an HTTP-based
+ service. WebSocket-streaming services aren't supported: they need a running
+ pipeline to manage their connection lifecycle, and keeping them out keeps the
+ evals code simple.
+
+
+For a fully custom setup (your own caching, a pre-built service instance), construct `EvalSpeech` or `EvalTranscriber` directly and inject them through the [library](/pipecat/evals/library).
+
+### Sharing config across scenarios
+
+Any value can be pulled from another file with `!include`, resolved relative to the scenario file. This keeps per-scenario noise down when a whole directory of scenarios shares the same audio setup:
+
+```yaml
+name: capital_question
+
+user: !include user_audio.yaml
+judge: !include judge_audio.yaml
+
+turns:
+ - user: "What is the capital of Germany?"
+ expect:
+ - event: response
+ eval: "the response says the capital of Germany is Berlin"
+```
+
+## Seed the conversation context with `reset:`
+
+Before driving the turns, the harness resets the agent's LLM context. By default the context is cleared; provide `reset:` to seed it with messages instead, which lets a scenario start mid-conversation:
+
+```yaml
+reset:
+ - role: developer
+ content: "The user has already introduced themselves as Alex."
+ - role: assistant
+ content: "Nice to meet you, Alex! How can I help?"
+```
+
+## Next steps
+
+
+
+ Run many scenarios against many agents concurrently from a manifest.
+
+
+
+ Load, build, and run scenarios from Python instead of YAML.
+
+
diff --git a/pipecat/evals/suites.mdx b/pipecat/evals/suites.mdx
new file mode 100644
index 00000000..5dc6327b
--- /dev/null
+++ b/pipecat/evals/suites.mdx
@@ -0,0 +1,120 @@
+---
+title: "Eval Suites"
+description: "Spawn agents and run many scenarios concurrently from a single manifest."
+---
+
+`pipecat eval run` tests scenarios against an agent you started yourself. A **suite** goes one step further: you list agents and scenarios in a manifest, and `pipecat eval suite` spawns each agent with its eval transport on its own port, runs its scenarios, tears it down, and aggregates the results, several runs at a time.
+
+Suites are the right tool when you have more than one agent, more than a handful of scenarios, or want a single command for CI. Pipecat's own release evals are a manifest with 100+ example agents plus this command.
+
+## The manifest
+
+```yaml manifest.yaml
+concurrency: 4 # how many runs execute at once
+runs_dir: eval-runs # logs + recordings go to //
+record: false # record conversation audio (audio-mode scenarios)
+scenarios_dir: scenarios # scenario names resolve to /.yaml
+
+# How to start each agent. {python}, {bot}, and {port} are substituted per run.
+spawn: "{python} {bot} -t eval --port {port}"
+
+suite:
+ - bot: bots/support-agent.py
+ scenarios: [greeting, capital_question, multi_turn]
+ - bot: bots/sales-agent.py
+ scenarios: [greeting, weather_function_call]
+ - bot: bots/vision-agent.py
+ runner_body: scenarios/vision-body.json # optional --runner-body data
+ scenarios: [vision_describe]
+```
+
+Paths in the manifest (`bots_dir`, `scenarios_dir`, `runs_dir`, the `bot:` entries) resolve relative to the manifest file, so a manifest is portable: check it into your repo and run it from anywhere.
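
Resolving against the manifest's own directory, rather than the current working directory, is what makes the manifest portable. A minimal sketch of the idea with `pathlib` (`resolve_from_manifest` is a hypothetical helper):

```python
from pathlib import Path


def resolve_from_manifest(manifest_file, value):
    """Resolve a manifest path entry against the manifest's own directory."""
    base = Path(manifest_file).parent
    # pathlib behavior: joining with an absolute `value` returns it unchanged.
    return base / value


print(resolve_from_manifest("evals/manifest.yaml", "bots/support-agent.py"))
```

The result depends only on where the manifest lives, so the same manifest resolves the same files no matter which directory the command is launched from.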
+
+Scenarios are reusable across agents. One `greeting` scenario can cover every agent in the suite.
+
+
+ An optional `runner_body:` points at a JSON file passed to the agent as
+ `--runner-body`. It supplies session data the agent would normally receive in
+ a `/start` request body (for example, a vision agent's image path).
+
+
+## Running a suite
+
+```bash
+pipecat eval suite manifest.yaml
+```
+
+In a terminal, a live dashboard shows each run's status, a running tally, and total time. When piped (in CI, or driven by a coding assistant), it streams one plain result line per run instead. The command exits `0` only if every run passes.
+
+Useful flags:
+
+```bash
+pipecat eval suite manifest.yaml -p support # only bots whose path contains "support"
+pipecat eval suite manifest.yaml -s greeting # only the greeting scenario
+pipecat eval suite manifest.yaml -c 8 # 8 runs at a time
+pipecat eval suite manifest.yaml -n nightly # output to eval-runs/nightly/
+pipecat eval suite manifest.yaml -a # record conversation audio
+pipecat eval suite manifest.yaml -d # save full per-pipeline debug logs
+```
+
+Everything except the `suite:` list can live in the manifest or be passed on the command line (the command line wins), so a manifest can be as minimal as a `suite:` list.
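
The precedence rule can be sketched as a dictionary merge. `effective_settings` is a hypothetical helper, and treating `None` as "flag not passed" is an assumption of this sketch:

```python
def effective_settings(manifest, cli):
    """Merge manifest settings with command-line overrides; the command line wins."""
    # The `suite:` list is the one thing that only lives in the manifest.
    merged = {key: value for key, value in manifest.items() if key != "suite"}
    # Only options the user actually passed (not None) override the manifest.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


manifest = {"concurrency": 4, "record": False, "suite": []}
print(effective_settings(manifest, {"concurrency": 8, "record": None}))
# {'concurrency': 8, 'record': False}
```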
+
+## Run output
+
+Each invocation writes to `//` (a timestamp when `-n` is omitted):
+
+```
+eval-runs/20260610_142200/
+ logs/
+ bots_support-agent.py__greeting.log # the agent process output
+ bots_support-agent.py__greeting.eval.log # the harness's decision trace
+ bots_support-agent.py__greeting.debug.log # per-pipeline harness logs (-d only)
+ recordings/
+ bots_support-agent.py__greeting.wav # conversation audio (record: true or -a)
+```
+
+When a run fails, start with the `.eval.log` decision trace: it's a timestamped record of every event the harness saw, what it matched, what the judge said, and why an assertion failed. The agent's own log sits next to it.
+
+## Testing one agent with many scenarios
+
+If you just want to run a batch of scenarios against an agent you already have running, you don't need a manifest. `pipecat eval run` accepts multiple scenario files and shares the suite's dashboard and tally:
+
+```bash
+pipecat eval run scenarios/*.yaml --bot-url ws://localhost:7860
+```
+
+By default the agent is left running afterward so it can serve more evals; pass `--stop-bot` to shut it down when the batch finishes.
+
+## Suites in CI
+
+The exit code makes suites CI-ready with no extra glue:
+
+```yaml
+# e.g. GitHub Actions
+- name: Run behavioral evals
+ run: pipecat eval suite manifest.yaml
+```
+
+For repeatable, key-free CI runs, prefer text-mode scenarios and an OpenAI-compatible judge endpoint you control. Audio-mode scenarios work in CI too, but need the harness's TTS and STT services available (local models by default, which also need more CPU).
+
+## Next steps
+
+
+
+ Orchestrate suites programmatically with `EvalManifest` and `EvalSuite`.
+
+
+
+ Let an AI coding assistant run your suite and iterate until it's green.
+
+
diff --git a/pipecat/fundamentals/evaluations/overview.mdx b/pipecat/fundamentals/evaluations/overview.mdx
deleted file mode 100644
index 791c22c4..00000000
--- a/pipecat/fundamentals/evaluations/overview.mdx
+++ /dev/null
@@ -1,153 +0,0 @@
----
-title: "Evaluations"
-sidebarTitle: "Overview"
-description: "Test and improve your voice AI agents from local prompt iteration to production monitoring."
----
-
-## Overview
-
-Building a voice AI agent is only half the challenge. You also need to know it handles real conversations reliably. A good evaluation strategy progresses through two phases:
-
-1. **Local testing**: Iterate on your LLM prompts quickly without needing live audio services, reducing cost and tightening the feedback loop during development.
-2. **Production evaluation**: Automated simulations and observability for deployed agents, catching regressions and tracking quality over time with real user traffic.
-
-Starting locally and layering in production tooling as your agent matures gives you the fastest path to a reliable, well-tested agent.
-
-## Local prompt testing
-
-Before investing in full end-to-end simulations, focus on getting your LLM prompts right. Pipecat's architecture makes it straightforward to test your agent's conversational logic without running STT or TTS services, saving both time and cost during development.
-
-The most efficient way to iterate on prompts is to bypass audio entirely and send text directly to your LLM pipeline. This lets you validate conversational logic, function calling, and response quality in seconds rather than minutes.
-
-You can configure your pipeline to accept text input instead of audio by replacing STT with a transcript-based input:
-
-```python
-from pipecat.frames.frames import TranscriptionFrame
-
-# Send a simulated user utterance directly into the pipeline
-frame = TranscriptionFrame(
- text="I'd like to schedule an appointment for tomorrow at 3pm",
- user_id="test-user",
- timestamp=0,
-)
-```
-
-This approach lets you:
-
-- Test prompt variations rapidly without waiting for audio processing
-- Validate function calling behavior with specific user inputs
-- Build repeatable test cases for edge cases and failure modes
-- Run tests in CI without audio infrastructure
-
-## Production evaluation
-
-Once your prompts are solid and you've validated the local experience, production evaluation tools help you scale testing and monitor quality across real deployments. This is where evaluation platforms come in.
-
-### Simulations
-
-Automated test conversations exercise your agent's behavior across scenarios, edge cases, and failure modes before they reach users. Simulation platforms can connect to your agent via API, WebSocket, or telephony to run scripted or AI-driven test calls.
-
-Key things to test with simulations:
-
-- **Multi-turn flows**: Verify your agent handles complete conversation paths correctly
-- **Edge cases**: Test interruptions, unexpected input, silence, and barge-in
-- **Telephony behavior**: End-to-end testing over real phone networks catches issues that only surface in production call conditions
-- **Regressions**: Run simulation suites before each deployment to catch breaking changes
-
-### Observability
-
-Continuous evaluation of live calls lets you catch regressions, track quality over time, and close the loop between what you test and what users experience. Common approaches include:
-
-- Submitting call recordings and transcripts for automated quality scoring
-- Tracking evaluation metrics over time to detect quality drift
-- Using [OpenTelemetry traces](/api-reference/server/utilities/opentelemetry) to monitor latency and execution flow
-
-Together, simulations and observability form a feedback loop: simulations validate changes before deployment, and observability surfaces issues that inform your next round of tests.
-
-### Evaluation platforms
-
-Several platforms offer simulation testing and production monitoring for voice AI agents:
-
-
-
- AI-native simulation and evaluation platform for voice agents, trusted by QA, Engineering, Operations, AI, and Executive teams.
-
-
-
- Simulation, observability, and evaluation platform with native Pipecat Cloud integration. Supports no-code API, WebSocket, and telephony testing.
-
-
-
- Automated testing and monitoring platform with native Pipecat Integration for WebRTC/Text based testing and support for Mock Tools, Custom Dynamic Variables and more!
-
-
-
-
- Building an evaluation integration for Pipecat? We welcome contributions to
- this page. Open a PR on the [docs
- repository](https://github.com/pipecat-ai/docs).
-
-
-## Pipecat's built-in tools
-
-Pipecat provides several building blocks that feed into any evaluation workflow:
-
-- **[Metrics](/pipecat/fundamentals/metrics)**: Built-in TTFB, processing time, and usage tracking for LLM and TTS services
-- **[Saving transcripts](/pipecat/fundamentals/saving-transcripts)**: Capture conversation transcripts for offline analysis and evaluation
-- **[OpenTelemetry](/api-reference/server/utilities/opentelemetry)**: Export traces to any OTel-compatible backend for latency and performance monitoring
-- **[Observers](/api-reference/server/utilities/observers/observer-pattern)**: Monitor frame flow without modifying the pipeline, useful for custom instrumentation
-
-## Next steps
-
-
-
- Monitor performance and LLM/TTS usage with Pipecat's built-in metrics.
-
-
-
- Capture conversation transcripts to use with evaluation tools.
-
-
-
- Export traces for performance monitoring and debugging.
-
-
-
- Build custom processors for evaluation-specific instrumentation.
-
-