From 5a781d27c020a78737800c60852dae60149e0a01 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Wed, 10 Jun 2026 19:53:02 -0700 Subject: [PATCH 01/18] Add Pipecat Evals framework documentation Document the new built-in evals framework with a top-level Evals group in the Pipecat tab: - Overview: why evals matter, eval transport + harness + judge architecture, text vs audio modes, requirements - Quickstart: run an existing agent with -t eval and a first scenario - Writing Scenarios: full YAML reference (turns, expectations, judge, audio mode, includes, interruptions, vision) - Eval Suites: manifests, pipecat eval suite, run output, CI - Using the Library: EvalScenario/EvalSession/EvalSuite Python API - Agent Self-Improvement: closing the loop with AI coding assistants Also add a CLI reference page for pipecat eval run/suite, and update the Fundamentals evaluations overview to feature the built-in framework. --- api-reference/cli/eval.mdx | 171 +++++++++++ docs.json | 12 + pipecat/evals/agent-self-improvement.mdx | 103 +++++++ pipecat/evals/library.mdx | 168 +++++++++++ pipecat/evals/overview.mdx | 116 ++++++++ pipecat/evals/quickstart.mdx | 149 ++++++++++ pipecat/evals/scenarios.mdx | 266 ++++++++++++++++++ pipecat/evals/suites.mdx | 120 ++++++++ pipecat/fundamentals/evaluations/overview.mdx | 58 ++-- 9 files changed, 1135 insertions(+), 28 deletions(-) create mode 100644 api-reference/cli/eval.mdx create mode 100644 pipecat/evals/agent-self-improvement.mdx create mode 100644 pipecat/evals/library.mdx create mode 100644 pipecat/evals/overview.mdx create mode 100644 pipecat/evals/quickstart.mdx create mode 100644 pipecat/evals/scenarios.mdx create mode 100644 pipecat/evals/suites.mdx diff --git a/api-reference/cli/eval.mdx b/api-reference/cli/eval.mdx new file mode 100644 index 00000000..0c261a5a --- /dev/null +++ b/api-reference/cli/eval.mdx @@ -0,0 +1,171 @@ +--- +title: eval +description: "Run behavioral evals against a Pipecat agent, 
individually or as a suite" +--- + +Run scenario-based behavioral evals. `pipecat eval run` tests scenarios against an already-running agent; `pipecat eval suite` spawns the agents listed in a manifest and runs their scenarios concurrently. Both exit `0` when everything passes and `1` otherwise. + +The same commands are also available as `python -m pipecat.evals`. + +See the [Pipecat Evals guide](/pipecat/evals/overview) for concepts, the scenario format, and manifests. + +## eval run + +Run one or more scenarios against an already-running agent (started with `-t eval`). + +**Usage:** + +```shell +pipecat eval run [OPTIONS] SCENARIOS... +``` + +**Arguments:** + + + One or more scenario YAML files. + + +**Options:** + + + WebSocket URL of the agent's eval transport. Overridden per-scenario by + `fixtures.bot_url`. + + + + Print a line for each turn and expectation as it resolves. + + + + Record each scenario's conversation audio (audio-mode scenarios). + + + + Directory for `--audio` recordings: `/.wav`. + + + + Directory for cached synthesized user audio. Defaults to + `/pipecat/tts`. + + + + Disable the user-audio cache: re-synthesize every turn (no reads or writes). + + + + Default per-expectation timeout in seconds, for expectations without their own + `within_ms`. + + + + Directory for each scenario's logs: `/.eval.log` (plus + `.debug.log` under `--debug`). + + + + Also save `.debug.log` with the harness's full per-pipeline logs. + + + + Cancel the agent's pipeline (exit it) after the run. By default the agent is + left running so it can serve more scenarios. + + +## eval suite + +Spawn the agents in a manifest and run their scenarios concurrently. Everything except the `suite:` list can be set in the manifest or overridden on the command line (the command line wins). + +**Usage:** + +```shell +pipecat eval suite [OPTIONS] MANIFEST_PATH +``` + +**Arguments:** + + + Manifest YAML listing agents and their scenarios. 
+ + +**Options:** + + + Only run bots whose path contains this substring. + + + + Only run this scenario name. + + + + Run subdirectory name under `runs_dir`. Defaults to a timestamp. + + + + Output base, overriding the manifest's `runs_dir`. A `/` subdirectory + with `logs/` and `recordings/` is created under it. Defaults to `eval-runs`. + + + + Override the manifest's `bots_dir` (bot paths are relative to it). + + + + Override the manifest's `scenarios_dir`. + + + + Override the manifest's `concurrency` (how many runs execute at once). + + + + Override the manifest's `base_port` (default `7900`). Each run gets `base_port + + index`. + + + + Override the manifest's `cache_dir` for cached synthesized user audio. + + + + Disable the user-audio cache: re-synthesize every turn (no reads or writes). + + + + Default per-expectation timeout in seconds, for expectations without their own + `within_ms`. + + + + Override the manifest's spawn template. Default: `"{python} {bot} -t eval + --port {port}"`. + + + + Override the Python interpreter used to spawn each agent. + + + + Record conversation audio. + + + + Also save `.debug.log` with the harness's full per-pipeline logs. 
+ + +## Examples + +```shell +# Run one scenario against a running agent +pipecat eval run scenarios/capital_question.yaml + +# Run a batch of scenarios, verbosely +pipecat eval run scenarios/*.yaml -v + +# Run a full suite +pipecat eval suite manifest.yaml + +# Only the support agent, 8 runs at a time, named output dir +pipecat eval suite manifest.yaml -p support -c 8 -n nightly +``` diff --git a/docs.json b/docs.json index 79af2cf1..d4eab342 100644 --- a/docs.json +++ b/docs.json @@ -102,6 +102,17 @@ } ] }, + { + "group": "Evals", + "pages": [ + "pipecat/evals/overview", + "pipecat/evals/quickstart", + "pipecat/evals/scenarios", + "pipecat/evals/suites", + "pipecat/evals/library", + "pipecat/evals/agent-self-improvement" + ] + }, { "group": "Features", "pages": [ @@ -815,6 +826,7 @@ { "group": "Commands", "pages": [ + "api-reference/cli/eval", "api-reference/cli/init", "api-reference/cli/tail", { diff --git a/pipecat/evals/agent-self-improvement.mdx b/pipecat/evals/agent-self-improvement.mdx new file mode 100644 index 00000000..1932373f --- /dev/null +++ b/pipecat/evals/agent-self-improvement.mdx @@ -0,0 +1,103 @@ +--- +title: "Agent Self-Improvement" +description: "Close the loop: let an AI coding assistant write agent code, run evals, and iterate until they pass." +--- + +Evals do more than catch regressions. They turn agent quality into a signal that an AI coding assistant can read, which changes how you build: instead of asking an assistant to "improve the prompt" and judging the result by hand, you describe the desired behavior as a scenario and let the assistant iterate until the eval passes. + +## The loop + +1. **Describe the behavior as a scenario.** A scenario file is an executable specification: the conversation, the expected events, and the criteria a response must meet. +2. **The assistant changes the agent.** A prompt edit, a new tool, a pipeline change. +3. 
**The assistant runs the evals.** One command, either against a running agent (`pipecat eval run`) or letting the suite spawn the agent itself (`pipecat eval suite`). +4. **The assistant reads the result.** A non-zero exit code, a per-assertion failure message ("turn 1 expectation 0 (llm_response): judge said no: ..."), and a full decision trace in `.eval.log`. +5. **Repeat until green.** + +Steps 2 through 5 need no human in the loop. You review the final diff with the evidence that it works attached. + +## Why this works well for coding assistants + +The framework was built to be driven by tools, not just humans: + +- **One command, one exit code.** `pipecat eval run scenarios/*.yaml` exits `0` on success and `1` on failure, so an assistant knows mechanically whether it's done. +- **Plain-text output when piped.** Outside a terminal the CLI streams one result line per scenario instead of rendering a live dashboard, which is exactly what an assistant running shell commands sees. +- **Actionable failures.** Failures name the turn, the expectation, and the reason, including what the judge said. The `.eval.log` decision trace shows every event the harness observed, so "why did this fail" is answerable from files. +- **Suites are self-contained.** `pipecat eval suite` spawns the agents itself, so an autonomous loop doesn't need to manage processes: edit, run one command, read the result. +- **Text mode is fast and cheap.** Iterating on prompts and logic skips STT and TTS entirely, so an assistant can afford to run the evals after every change. + +## Setting up your project + +Keep scenarios in the repo next to the agent and tell your assistant how to run them. For example, in your project's `CLAUDE.md` or `AGENTS.md`: + +```markdown +## Behavioral evals + +Evals live in `scenarios/`. To verify any change to the agent's behavior: + +1. Start the agent: `python bot.py -t eval` (serves ws://localhost:7860) +2. 
Run the evals: `pipecat eval run scenarios/*.yaml` + +The command exits non-zero on failure and prints each failed assertion. +Each scenario writes a decision trace to `.eval.log`; read it +to understand a failure before changing code. + +When you add or change agent behavior, add or update a scenario in +`scenarios/` to cover it. +``` + +With that in place, a request like this becomes fully verifiable: + +> Add a `get_order_status` tool to the agent and make sure it gets called when the user asks where their order is. Add a scenario for it and run the evals until they pass. + +The assistant writes the tool, writes the scenario (a `function_call` assertion plus a judged response), runs `pipecat eval run`, reads any failure, and fixes its own work. + +## Evals as acceptance criteria + +You can also run the loop in the other direction: write the scenario first, watch it fail, and hand the failure to the assistant. The scenario is the spec, and "make this pass" is the task. + +```yaml order_status.yaml +name: order_status + +turns: + - user: "Where's my order? The number is 12345." + expect: + - event: function_call + calls: + - name: get_order_status + args: { order_id: "12345" } + - event: response + eval: "tells the user the status of their order" +``` + +This is test-driven development for agent behavior, with the judge LLM absorbing the fuzziness that makes conversational output hard to assert on with string matching. + +## Guardrails + +A few practices keep autonomous loops honest: + +- **Review scenario changes like code.** An assistant that can edit scenarios can also weaken them. Failing evals should usually be fixed in the agent, not in the scenario. +- **Keep a regression set.** As behaviors accumulate, so should scenarios. Run the full set (or a suite) before merging, not just the scenario being worked on. +- **Gate merges in CI.** `pipecat eval suite manifest.yaml` in CI makes "the evals pass" a property of the branch, whoever (or whatever) wrote it. 
See [Eval Suites](/pipecat/evals/suites). +- **Use audio mode for the final check.** Iterate in text mode for speed, then run the audio variants before release to cover the full STT, LLM, and TTS path. + +## Next steps + + + + Give your coding assistant access to Pipecat docs and source context. + + + + The full scenario format your assistant will be writing. + + diff --git a/pipecat/evals/library.mdx b/pipecat/evals/library.mdx new file mode 100644 index 00000000..f619a127 --- /dev/null +++ b/pipecat/evals/library.mdx @@ -0,0 +1,168 @@ +--- +title: "Using the Library" +description: "Run, build, and orchestrate evals from Python with the pipecat.evals API." +--- + +Everything the `pipecat eval` CLI does is available as a library under `pipecat.evals`. Use it to run evals from your own test runner (pytest, a CI script, a custom dashboard), to build scenarios in code instead of YAML, or to customize pieces like the judge LLM. + +## Running a scenario + +`EvalScenario.load()` parses a scenario file, and `EvalSession.from_scenario()` builds a ready-to-run session, constructing the judge, user speech, and transcriber the scenario calls for: + +```python +import asyncio + +from pipecat.evals.harness import EvalSession +from pipecat.evals.scenario import EvalScenario + + +async def main(): + scenario = EvalScenario.load("scenarios/capital_question.yaml") + session = EvalSession.from_scenario(scenario, "ws://localhost:7860") + result = await session.run() + + if result.passed: + print(f"PASS ({result.duration_ms}ms)") + else: + for failure in result.failures: + print(f" {failure}") + + +asyncio.run(main()) +``` + +The agent must already be running with its eval transport (`python bot.py -t eval`), just as with `pipecat eval run`. + +### The result + +`run()` returns an `EvalResult`: + +| Field | Description | +| --------------- | ------------------------------------------------------------------------------------------- | +| `scenario_name` | Name of the scenario that ran. 
| +| `passed` | Whether every assertion passed. | +| `failures` | The failed assertions, each with the turn index, expectation index, event name, and reason. | +| `duration_ms` | Wall-clock time the run took. | +| `events_seen` | Every semantic event observed, for diagnostics. | +| `debug_log` | The harness's timestamped decision trace (what the CLI writes to `.eval.log`). | +| `skipped` | Set (with a reason) when the scenario was not run; such a result is neither pass nor fail. | + +This maps cleanly onto a pytest test: + +```python +import pytest + +from pipecat.evals.harness import EvalSession +from pipecat.evals.scenario import EvalScenario + + +@pytest.mark.asyncio +async def test_capital_question(): + scenario = EvalScenario.load("scenarios/capital_question.yaml") + result = await EvalSession.from_scenario(scenario, "ws://localhost:7860").run() + assert result.passed, "\n".join(str(f) for f in result.failures) +``` + +## Building scenarios in code + +Scenarios are plain dataclasses, so you can construct them programmatically, generating turns from a dataset, parameterizing a template, or skipping YAML entirely: + +```python +from pipecat.evals.scenario import EvalExpectation, EvalScenario, EvalTurn + +scenario = EvalScenario( + name="capital_question", + turns=[ + EvalTurn( + user="What is the capital of Germany?", + expect=[ + EvalExpectation( + event="llm_response", + eval="the response says the capital of Germany is Berlin", + ) + ], + ) + ], +) +``` + + + The modality-agnostic `response` event is resolved while parsing YAML. When + constructing scenarios in code, use `llm_response` for text mode directly (or + `response` only when you also configure audio judging). + + +## Customizing the judge + +`from_scenario()` builds the judge from the scenario's `judge:` block, but you can inject your own. 
`EvalJudge` works with any Pipecat LLM service backed by an OpenAI-compatible API: + +```python +import os + +from pipecat.evals.harness import EvalSession +from pipecat.evals.judge import EvalJudge +from pipecat.services.openai.llm import OpenAILLMService + +llm = OpenAILLMService( + api_key=os.environ["OPENAI_API_KEY"], + settings=OpenAILLMService.Settings(model="gpt-4o-mini"), +) + +session = EvalSession.from_scenario( + scenario, + "ws://localhost:7860", + judge=EvalJudge(llm), +) +``` + +The same injection points exist for the user's synthesized voice (`speech=`, wrapping any `TTSService` in an `EvalSpeech`) and the transcriber used for the agent's spoken audio (`transcriber=`, wrapping any `STTService` in an `EvalTranscriber`). The wrapped services can be local models or HTTP-based; WebSocket-streaming services are rejected, since they need a running pipeline to manage their connection lifecycle. + +## Observing progress + +Pass `on_progress` to get a callback as each turn and expectation resolves, which is how the CLI implements its `--verbose` output: + +```python +from pipecat.evals.harness import EvalSession, EvalTurnProgress + + +def show(p: EvalTurnProgress): + print(f"turn {p.turn_index} [{p.status}] {p.event_name} {p.detail}") + + +session = EvalSession.from_scenario(scenario, url, on_progress=show) +``` + +## Orchestrating suites + +`EvalManifest` and `EvalSuite` are the library behind `pipecat eval suite`: the suite spawns each agent with its eval transport on its own port, runs its scenarios, and executes several runs concurrently: + +```python +import asyncio +from pathlib import Path + +from pipecat.evals.suite import EvalManifest, EvalSuite + + +async def main(): + manifest = EvalManifest.load("manifest.yaml") + suite = EvalSuite(manifest) + + # Optionally narrow the runs, like the CLI's -p / -s flags. 
+ suite.filter(pattern="support") + + await suite.run( + Path("eval-runs/logs"), + on_update=lambda run: print(run.bot, run.scenario, run.status), + ) + + for run in suite.runs: + verdict = run.error or ("passed" if run.result and run.result.passed else "failed") + print(f"{run.bot} / {run.scenario}: {verdict}") + + +asyncio.run(main()) +``` + +Each run is mutated in place as it executes (`status`, `result`, `error`, `duration_ms`), so a live display can render directly from `suite.runs`. + +`EvalManifest.load()` accepts keyword overrides for every manifest value (`concurrency`, `base_port`, `spawn`, `scenarios_dir`, and so on), mirroring the CLI flags. diff --git a/pipecat/evals/overview.mdx b/pipecat/evals/overview.mdx new file mode 100644 index 00000000..ec8d774e --- /dev/null +++ b/pipecat/evals/overview.mdx @@ -0,0 +1,116 @@ +--- +title: "Pipecat Evals" +sidebarTitle: "Overview" +description: "Behavioral testing for your agents: scripted conversations, semantic assertions, and an LLM judge." +--- + +Pipecat Evals is the framework's built-in system for testing agent behavior. You describe a conversation and the behavior you expect, and Pipecat runs it against your real agent (the same pipeline, the same services, the same code) and tells you whether the expectation still holds. + +```yaml capital_question.yaml +name: capital_question + +turns: + - user: "What is the capital of Germany?" + expect: + - event: response + eval: "the response says the capital of Germany is Berlin" +``` + +```bash +pipecat eval run capital_question.yaml +``` + +## Why evals matter + +Voice agents are probabilistic systems. The same agent can answer differently run to run, and a prompt tweak, a model upgrade, or a service swap can quietly break behavior that used to work: a function that no longer gets called, context that stops carrying across turns, an interruption that derails the conversation. 
Manual testing catches some of this, but it's slow, unrepeatable, and impractical to run on every change. + +Evals make agent behavior testable the way unit tests make code testable: + +- **Regression safety**: run your scenarios after every prompt, model, or pipeline change and catch breakage before users do. +- **Fast iteration**: text-mode evals skip STT and TTS entirely, so a full conversation test runs in seconds with no audio service cost. +- **Semantic assertions**: an LLM judge checks meaning ("the response says the capital is Berlin"), not exact strings, so tests don't break when wording changes. +- **A feedback signal for AI coding assistants**: evals give a coding assistant a command it can run and a pass/fail result it can read, closing the loop between writing agent code and verifying it. See [Agent Self-Improvement](/pipecat/evals/agent-self-improvement). + +Pipecat itself relies on this framework: before every release, an eval suite drives 100+ example agents end to end. + +## How it works + +Pipecat Evals has two halves: + +1. **The eval transport.** Your agent runs unchanged with the eval transport. If your agent uses `create_transport()` and the development runner, this is already built in: start it with `-t eval` and it hosts a local WebSocket server speaking RTVI, instead of connecting to Daily, WebRTC, or telephony. + +2. **The eval harness.** The harness connects to that transport as an RTVI client, plays the scenario's user turns (as text, or as synthesized speech in audio mode), collects the events your agent emits, and asserts on them in order: transcriptions, LLM responses, spoken output, function calls, and timing. + +When a scenario asserts on meaning rather than exact text, a **judge LLM** evaluates the agent's response against a natural-language criterion. The judge runs locally with [Ollama](https://ollama.com) by default, or against OpenAI or any OpenAI-compatible endpoint. 
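The in-order assertion described above can be pictured as a subsequence match over the stream of observed events. The scenario reference notes that expected events must arrive in the order listed, while the agent may emit other events in between. The following is a minimal sketch of that idea only, not the harness's actual implementation; `first_unmatched` is a hypothetical helper name introduced here for illustration.

```python
from typing import Iterable, Optional, Sequence


def first_unmatched(expected: Sequence[str], observed: Iterable[str]) -> Optional[str]:
    """Return the first expected event that never arrives in order, or None if all match.

    Hypothetical illustration of in-order matching; not part of pipecat.evals.
    """
    stream = iter(observed)
    for name in expected:
        # Consume the shared stream until this expected event shows up.
        # Unrelated events in between are skipped, but order is preserved
        # because the iterator never rewinds.
        if not any(event == name for event in stream):
            return name
    return None


observed = ["llm_started", "function_call", "llm_response"]

# Expected order matches the observed order, with an extra event ignored.
print(first_unmatched(["function_call", "llm_response"], observed))

# Same events, wrong order: the second expectation can no longer be satisfied.
print(first_unmatched(["llm_response", "function_call"], observed))
```

Because a single iterator is shared across all expectations, an event that has already been passed over cannot satisfy a later expectation, which is what makes the check order-sensitive.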
+ +### Text and audio modes + +Every scenario runs in one of two modes: + +| Mode | User input | Agent output | Best for | +| ------------------ | -------------------------------------------------------- | --------------------------------------------------------------- | --------------------------------------------------------------- | +| **Text** (default) | Sent as text, bypassing the STT | LLM text; TTS is skipped automatically | Fast, cheap iteration on prompts, logic, and function calling | +| **Audio** | Synthesized by a TTS the harness runs (local by default) | Real synthesized speech, transcribed by an STT the harness runs | True end-to-end coverage of the full STT, LLM, and TTS pipeline | + +Text mode exercises your agent's actual pipeline and context handling while skipping the audio services, so it costs nothing in TTS or STT usage and runs fast. Audio mode synthesizes the user's voice, streams it through your agent's real STT, and transcribes the agent's actual spoken audio for judging, catching issues that only surface with real speech (turn detection, homophones, barge-in). + +## What you can test + +- **Response content**: substring checks (`text_contains`) or semantic judging (`eval`) of the agent's replies. +- **Multi-turn context**: verify the agent remembers earlier turns. +- **Function calling**: assert that specific tools were called, with specific arguments. +- **Interruptions**: barge in mid-response and verify the agent recovers (`send_after`). +- **Latency**: per-event budgets with `within_ms`. +- **Vision**: serve an image when the agent requests one and judge its description. + +## YAML or Python + +Scenarios are YAML files, so they're easy to write, review, and share. Everything is also available as a library: load and run scenarios programmatically, build them in code, inject a custom judge, or orchestrate whole suites from your own tooling. See [Using the Library](/pipecat/evals/library). 
+ +## Requirements + +- **Pipecat CLI**: the `pipecat eval` commands ship with the CLI extra: `pip install "pipecat-ai[cli]"`. The same commands are available as `python -m pipecat.evals`. +- **A judge LLM** (for `eval:` assertions): Ollama by default (`ollama pull gemma2:9b`), or point the scenario's `judge:` block at OpenAI or any OpenAI-compatible endpoint. +- **Audio services** (audio mode only): the harness needs a TTS to synthesize the user's voice and an STT to transcribe the agent's speech. Both can be local models or HTTP-based services; the defaults are local (Kokoro and Moonshine or Whisper, installed with `pip install "pipecat-ai[kokoro,moonshine]"` or `pip install "pipecat-ai[kokoro,whisper]"`), which download once on first use and run with no keys and no per-run cost. WebSocket-streaming services aren't supported here, which keeps the harness simple. +- **Your agent's own credentials**: the agent under test is your real agent, so it needs the same service API keys it normally would. + +## Next steps + + + + Run your first eval against an existing agent in a few minutes. + + + + The full scenario format: turns, expectations, modalities, and the judge. + + + + Spawn multiple agents and run many scenarios concurrently from a manifest. + + + + Close the loop: let an AI coding assistant write, run, and fix against + evals. + + diff --git a/pipecat/evals/quickstart.mdx b/pipecat/evals/quickstart.mdx new file mode 100644 index 00000000..fcafb325 --- /dev/null +++ b/pipecat/evals/quickstart.mdx @@ -0,0 +1,149 @@ +--- +title: "Evals Quickstart" +sidebarTitle: "Quickstart" +description: "Run your first behavioral eval against an existing agent." +--- + +This guide takes an existing agent, starts it with the eval transport, and runs a two-turn scenario against it. Total time: a few minutes. 
+ +## Prerequisites + +- A working Pipecat agent that uses `create_transport()` and the development runner (the standard pattern from the [quickstart](/pipecat/get-started/quickstart) and all Pipecat examples), with its usual service API keys in `.env`. +- The Pipecat CLI: `pip install "pipecat-ai[cli]"`. +- A judge LLM. Either: + - **Ollama** (local, the default): install [Ollama](https://ollama.com) and run `ollama pull gemma2:9b`, or + - **OpenAI**: set `OPENAI_API_KEY` and point the scenario's `judge:` block at it (shown below). + + + + If your agent uses `create_transport()`, it supports the eval transport with a one-line addition to its `transport_params`: + + ```python + from pipecat.transports.websocket.server import WebsocketServerParams + + transport_params = { + "eval": lambda: WebsocketServerParams( + audio_in_enabled=True, + audio_out_enabled=True, + ), + # ... your other transports (daily, webrtc, twilio, ...) + } + ``` + + Then start the agent with `-t eval`: + + ```bash + python bot.py -t eval + ``` + + ``` + 🚀 Bot ready! (eval transport on ws://localhost:7860) + ``` + + Instead of connecting to Daily or WebRTC, the agent now hosts a local WebSocket server and waits for the eval harness to connect. Nothing else in the agent changes: same pipeline, same services, same event handlers. + + + The harness talks to your agent over RTVI. `PipelineWorker` adds an + `RTVIProcessor` and `RTVIObserver` automatically, so the standard agent + setup needs no extra wiring. All Pipecat example agents already include + the `"eval"` transport entry. + + + + + + A scenario is a YAML file describing a scripted conversation and the behavior you expect. Save this as `scenarios/capital_question.yaml`: + + + + ```yaml + name: capital_question + + turns: + # The agent greets on connect; wait for the greeting before speaking. 
+ - expect: + - event: response + eval: "the bot opens the conversation with a greeting or an offer to help" + + - user: "What is the capital of Germany?" + expect: + - event: response + eval: "the response says the capital of Germany is Berlin" + ``` + + + ```yaml + name: capital_question + + judge: + eval: + service: openai + model: gpt-4o-mini + + turns: + # The agent greets on connect; wait for the greeting before speaking. + - expect: + - event: response + eval: "the bot opens the conversation with a greeting or an offer to help" + + - user: "What is the capital of Germany?" + expect: + - event: response + eval: "the response says the capital of Germany is Berlin" + ``` + + + + Each turn optionally sends a user utterance and lists the events expected in response. The `eval:` field is a natural-language criterion checked by the judge LLM, so the test passes whether the agent says "Berlin is the capital of Germany" or "That would be Berlin!". + + This scenario runs in **text mode** (the default): the user turn is sent as text and the agent's TTS is skipped automatically, so the whole conversation costs nothing in audio services and finishes in seconds. + + + + + With the agent still running, run the scenario from another terminal: + + ```bash + pipecat eval run scenarios/capital_question.yaml + ``` + + The harness connects to `ws://localhost:7860` (override with `--bot-url`), drives the conversation, and reports the result. Pass `-v` to watch each turn resolve: + + ``` + turn 0 → (observe) + ✓ llm_response — "Hello! How can I help you today?" + turn 1 → "What is the capital of Germany?" + ✓ llm_response — "The capital of Germany is Berlin." + + ✓ ws://localhost:7860 capital_question (3402ms) + + 1/1 passed · 3.4s + ``` + + The command exits `0` when everything passes and `1` otherwise, so it slots directly into scripts and CI. 
Each scenario also writes a decision trace to `.eval.log`, which shows every event the harness saw and why each assertion passed or failed. + + + + + Change the criterion to something false, for example `"the response says the capital of Germany is Madrid"`, and run again: + + ``` + ✗ ws://localhost:7860 capital_question + + Failed (1): + ✗ ws://localhost:7860 capital_question + • turn 1 expectation 0 (llm_response): judge said no: the reply says the capital is Berlin, not Madrid + + 0/1 passed, 1 failed · 4.1s + ``` + + A failing eval tells you which turn, which expectation, and why. That message (plus the `.eval.log` trace) is what you, or your AI coding assistant, iterate against. + + + + +## Where to go next + +- Learn the full scenario format, including multi-turn conversations, function call assertions, interruptions, latency budgets, and audio mode, in [Writing Scenarios](/pipecat/evals/scenarios). +- Have many scenarios or agents? Let Pipecat spawn the agents for you with [Eval Suites](/pipecat/evals/suites). +- Want your coding assistant to run these for you? See [Agent Self-Improvement](/pipecat/evals/agent-self-improvement). diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx new file mode 100644 index 00000000..ae839fbf --- /dev/null +++ b/pipecat/evals/scenarios.mdx @@ -0,0 +1,266 @@ +--- +title: "Writing Scenarios" +description: "The scenario file format: turns, expectations, assertions, and audio mode." +--- + +A scenario is a YAML file describing a scripted conversation and the events you expect your agent to emit. This page covers the full format. If you haven't run a scenario yet, start with the [quickstart](/pipecat/evals/quickstart). 
+ +## Anatomy of a scenario + +```yaml +name: multi_turn # required: the eval's name + +judge: # optional: judge modality and LLM (defaults shown below) + eval: + service: ollama + model: gemma2:9b + +turns: # required: the conversation, in order + - user: "My name is Alex, and I'm planning a trip to Italy." + expect: + - event: response + eval: "acknowledges the user's message (the name Alex and/or the trip to Italy)" + + - user: "Remind me — what's my name and where am I going?" + expect: + - event: response + eval: "recalls that the user's name is Alex and the destination is Italy" +``` + +Each turn optionally sends a user utterance (`user:`) and lists the events expected in response (`expect:`). Expected events must arrive in the order listed, but the agent may emit other events in between, so you don't have to enumerate everything it does. + +A turn without a `user:` field is observation-only: the harness just waits for the expected events. This is how you test agent-first behavior like an on-connect greeting: + +```yaml +turns: + # No user input: just wait for the agent to speak first. + - expect: + - event: response + eval: "the bot opens the conversation with a greeting or an offer to help" +``` + +## Events + +Scenarios assert on a small set of semantic events, mapped from the RTVI messages the agent emits: + +| Event | Meaning | +| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `response` | The agent's reply. In audio mode this is a transcription of the agent's actual synthesized speech; in text mode it resolves to `llm_response`. Prefer this for content checks. | +| `llm_response` | The LLM's text output for the turn. Available in both modes. | +| `tts_response` | The text the TTS reports speaking, one segment at a time. Audio mode only. | +| `llm_started` | The LLM began generating a response. 
| +| `function_call` | The LLM called a function. | +| `user_transcription` | The agent's STT finalized a transcription of the user. Audio mode only. | +| `user_started_speaking` | The agent's VAD detected the start of user speech. Audio mode only. | +| `user_stopped_speaking` | The agent's VAD detected the end of user speech. Audio mode only. | + + + Use `response` for the agent's reply unless you have a reason not to. It's + modality-agnostic: the same scenario judges LLM text in text mode and the + transcription of real spoken audio in audio mode, so one file covers both. + + +## Assertions + +Each entry in `expect:` names an event and, optionally, asserts on its content or timing. + +### Semantic judging with `eval` + +The `eval:` field is a natural-language criterion that the event's text must satisfy, decided by the judge LLM: + +```yaml +- user: "What's 2 plus 2?" + expect: + - event: response + eval: "the response says the answer is four" +``` + +The judge sees the whole conversation so far, so it can resolve terse or context-dependent replies (like "That's four"). It also understands that audio-mode responses come from a speech-to-text pass and judges intended meaning rather than exact spelling, so "for" transcribed instead of "four" still passes. + +The judge handles interim replies gracefully: if the agent says "Let me check on that." before the real answer, the harness keeps accumulating response text and re-judges until the criterion is met or the time budget runs out. + +`eval:` only makes sense on the agent's text output (`response`, `llm_response`, `tts_response`). + +### Substring checks with `text_contains` + +For exact content, `text_contains` does a plain substring check, with no judge round-trip: + +```yaml +- user: "What is the capital of France?" + expect: + - event: response + text_contains: "Paris" +``` + +### Latency budgets with `within_ms` + +`within_ms` bounds how long after the turn's user send the event may arrive. 
All of a turn's expectations share that one anchor: + +```yaml +- user: "What is the capital of France?" + expect: + - event: llm_started + within_ms: 2000 # the LLM must start responding within 2s + - event: response + text_contains: "Paris" +``` + +When omitted, an expectation defaults to a generous 60 second budget (configurable with `--timeout`), so timing is only asserted when you ask for it. + +### Function calls + +A `function_call` expectation asserts that the turn invoked one or more tools. List the expected calls under `calls:`; they're matched by name in any order, and the expectation passes once all are found: + +```yaml +- user: "What's the weather in San Francisco? And recommend a restaurant." + expect: + - event: function_call + calls: + - name: get_current_weather + args: { location: "San Francisco" } + - name: get_restaurant_recommendation + - event: response + eval: "describes the weather and recommends a restaurant" +``` + +`args` is a subset check: every listed key/value must be present in the call's arguments, and extra arguments are ignored. A single expected call can use the `name:`/`args:` shorthand directly on the expectation, and a bare `function_call` with neither just asserts that some call happened. + +## Interruptions + +`send_after:` schedules a turn's user send relative to a prior event, which is how you script barge-in tests: + +```yaml +turns: + - user: "Tell me a long, detailed story about the history of Paris." + expect: + - event: llm_started + + # Interrupt 2 seconds after the agent starts its long answer. + - user: "Actually, never mind that — what's the capital of Japan?" + send_after: + event: llm_started + delay_ms: 2000 + expect: + - event: response + eval: "the response says the capital of Japan is Tokyo, instead of continuing the Paris story" +``` + +## Audio mode + +By default everything runs in text mode: user turns are sent as text and the agent's TTS is skipped, including any on-connect greeting. 
That's the fastest and cheapest way to test conversational logic. Two top-level blocks switch a scenario to real speech: + +### Speak the user's turns with `user:` + +```yaml +user: + modality: audio + speech: + service: kokoro # local TTS, no API key, no per-run cost + voice: af_heart + sample_rate: 16000 +``` + +With `modality: audio`, each `user:` utterance is synthesized by a TTS the harness runs and streamed into your agent's pipeline at real-time cadence, exercising its VAD, turn detection, and STT exactly as a live microphone would. Synthesized audio is cached across runs, so repeated turns don't re-synthesize. The built-in services are `kokoro` (a local model, the recommended default) and `cartesia` (HTTP) when you want a cloud voice. + +### Judge the agent's actual speech with `judge:` + +```yaml +judge: + modality: audio + transcription: + service: moonshine # STT for the agent's audio (or: whisper) + model: small-streaming + eval: + service: ollama # the judge LLM + model: gemma2:9b +``` + +With `judge.modality: audio`, the agent speaks for real. The harness captures its synthesized audio, transcribes it with the configured STT, and the `response` event becomes that transcription, so the judge evaluates what a user would actually have heard. This is the true end-to-end check: STT in, LLM in the middle, TTS out. The built-in transcribers are `moonshine` and `whisper`, both local models. + +The `judge.eval:` block selects the judge LLM in either modality: `ollama` (the default, `gemma2:9b`), `openai`, or any OpenAI-compatible endpoint via `endpoint:`. + + + The harness's TTS and STT can be local models or HTTP-based services. + WebSocket-streaming services aren't supported (they need a running pipeline to + manage their connection lifecycle, and keeping them out keeps the evals code + simple). To use a service beyond the built-ins, both blocks accept a + `factory:` escape hatch, a dotted path to a callable returning the service, + e.g. 
`speech.factory: "my_pkg.make_tts"` or `transcription.factory: + "my_pkg.make_stt"`. You can also construct the services yourself and inject + them via the [library](/pipecat/evals/library). + + +### Sharing config across scenarios + +Any value can be pulled from another file with `!include`, resolved relative to the scenario file. This keeps per-scenario noise down when a whole directory of scenarios shares the same audio setup: + +```yaml +name: capital_question + +user: !include user_audio.yaml +judge: !include judge_audio.yaml + +turns: + - user: "What is the capital of Germany?" + expect: + - event: response + eval: "the response says the capital of Germany is Berlin" +``` + +## Other top-level fields + +### Seed the conversation context with `reset:` + +Before driving the turns, the harness resets the agent's LLM context. By default the context is cleared; provide `reset:` to seed it with messages instead, which lets a scenario start mid-conversation: + +```yaml +reset: + - role: developer + content: "The user has already introduced themselves as Alex." + - role: assistant + content: "Nice to meet you, Alex! How can I help?" +``` + +### Per-scenario settings with `fixtures:` + +A free-form mapping for run settings. `bot_url` overrides the URL the harness connects to for this scenario: + +```yaml +fixtures: + bot_url: ws://localhost:7861 +``` + +### Vision turns with `image:` + +A turn may register an image (a path relative to the scenario file). When a vision agent requests a user image during the turn, the eval transport serves it: + +```yaml +turns: + - user: "What do you see in this image?" + image: assets/cat.jpg + expect: + - event: response + eval: "the response describes a cat" +``` + +## Next steps + + + + Run many scenarios against many agents concurrently from a manifest. + + + + Load, build, and run scenarios from Python instead of YAML. 
+ + diff --git a/pipecat/evals/suites.mdx b/pipecat/evals/suites.mdx new file mode 100644 index 00000000..5dc6327b --- /dev/null +++ b/pipecat/evals/suites.mdx @@ -0,0 +1,120 @@ +--- +title: "Eval Suites" +description: "Spawn agents and run many scenarios concurrently from a single manifest." +--- + +`pipecat eval run` tests scenarios against an agent you started yourself. A **suite** goes one step further: you list agents and scenarios in a manifest, and `pipecat eval suite` spawns each agent with its eval transport on its own port, runs its scenarios, tears it down, and aggregates the results, several runs at a time. + +Suites are the right tool when you have more than one agent, more than a handful of scenarios, or want a single command for CI. Pipecat's own release evals are a manifest with 100+ example agents plus this command. + +## The manifest + +```yaml manifest.yaml +concurrency: 4 # how many runs execute at once +runs_dir: eval-runs # logs + recordings go to // +record: false # record conversation audio (audio-mode scenarios) +scenarios_dir: scenarios # scenario names resolve to /.yaml + +# How to start each agent. {python}, {bot}, and {port} are substituted per run. +spawn: "{python} {bot} -t eval --port {port}" + +suite: + - bot: bots/support-agent.py + scenarios: [greeting, capital_question, multi_turn] + - bot: bots/sales-agent.py + scenarios: [greeting, weather_function_call] + - bot: bots/vision-agent.py + runner_body: scenarios/vision-body.json # optional --runner-body data + scenarios: [vision_describe] +``` + +Paths in the manifest (`bots_dir`, `scenarios_dir`, `runs_dir`, the `bot:` entries) resolve relative to the manifest file, so a manifest is portable: check it into your repo and run it from anywhere. + +Scenarios are reusable across agents. One `greeting` scenario can cover every agent in the suite. + + + An optional `runner_body:` points at a JSON file passed to the agent as + `--runner-body`. 
It supplies session data the agent would normally receive in + a `/start` request body (for example, a vision agent's image path). + + +## Running a suite + +```bash +pipecat eval suite manifest.yaml +``` + +In a terminal, a live dashboard shows each run's status, a running tally, and total time. When piped (in CI, or driven by a coding assistant), it streams one plain result line per run instead. The command exits `0` only if every run passes. + +Useful flags: + +```bash +pipecat eval suite manifest.yaml -p support # only bots whose path contains "support" +pipecat eval suite manifest.yaml -s greeting # only the greeting scenario +pipecat eval suite manifest.yaml -c 8 # 8 runs at a time +pipecat eval suite manifest.yaml -n nightly # output to eval-runs/nightly/ +pipecat eval suite manifest.yaml -a # record conversation audio +pipecat eval suite manifest.yaml -d # save full per-pipeline debug logs +``` + +Everything except the `suite:` list can live in the manifest or be passed on the command line (the command line wins), so a manifest can be as minimal as a `suite:` list. + +## Run output + +Each invocation writes to `//` (a timestamp when `-n` is omitted): + +``` +eval-runs/20260610_142200/ + logs/ + bots_support-agent.py__greeting.log # the agent process output + bots_support-agent.py__greeting.eval.log # the harness's decision trace + bots_support-agent.py__greeting.debug.log # per-pipeline harness logs (-d only) + recordings/ + bots_support-agent.py__greeting.wav # conversation audio (record: true or -a) +``` + +When a run fails, start with the `.eval.log` decision trace: it's a timestamped record of every event the harness saw, what it matched, what the judge said, and why an assertion failed. The agent's own log sits next to it. + +## Testing one agent with many scenarios + +If you just want to run a batch of scenarios against an agent you already have running, you don't need a manifest. 
`pipecat eval run` accepts multiple scenario files and shares the suite's dashboard and tally: + +```bash +pipecat eval run scenarios/*.yaml --bot-url ws://localhost:7860 +``` + +By default the agent is left running afterward so it can serve more evals; pass `--stop-bot` to shut it down when the batch finishes. + +## Suites in CI + +The exit code makes suites CI-ready with no extra glue: + +```yaml +# e.g. GitHub Actions +- name: Run behavioral evals + run: pipecat eval suite manifest.yaml +``` + +For deterministic, key-free CI runs, prefer text-mode scenarios and an OpenAI-compatible judge endpoint you control. Audio-mode scenarios work in CI too, but need the harness's TTS and STT services available (local models by default, which also need more CPU). + +## Next steps + + + + Orchestrate suites programmatically with `EvalManifest` and `EvalSuite`. + + + + Let an AI coding assistant run your suite and iterate until it's green. + + diff --git a/pipecat/fundamentals/evaluations/overview.mdx b/pipecat/fundamentals/evaluations/overview.mdx index 791c22c4..f0c25870 100644 --- a/pipecat/fundamentals/evaluations/overview.mdx +++ b/pipecat/fundamentals/evaluations/overview.mdx @@ -8,36 +8,36 @@ description: "Test and improve your voice AI agents from local prompt iteration Building a voice AI agent is only half the challenge. You also need to know it handles real conversations reliably. A good evaluation strategy progresses through two phases: -1. **Local testing**: Iterate on your LLM prompts quickly without needing live audio services, reducing cost and tightening the feedback loop during development. +1. **Local behavioral testing**: Run scripted conversations against your real agent with Pipecat's built-in evals, catching regressions on every prompt, model, or pipeline change. 2. **Production evaluation**: Automated simulations and observability for deployed agents, catching regressions and tracking quality over time with real user traffic. 
Starting locally and layering in production tooling as your agent matures gives you the fastest path to a reliable, well-tested agent. -## Local prompt testing +## Local behavioral testing with Pipecat Evals -Before investing in full end-to-end simulations, focus on getting your LLM prompts right. Pipecat's architecture makes it straightforward to test your agent's conversational logic without running STT or TTS services, saving both time and cost during development. +Pipecat ships a built-in evals framework: you describe a conversation and the behavior you expect in a YAML scenario, and the eval harness drives your real agent through it, asserting on responses, function calls, interruptions, and latency, with an LLM judge for semantic checks. -The most efficient way to iterate on prompts is to bypass audio entirely and send text directly to your LLM pipeline. This lets you validate conversational logic, function calling, and response quality in seconds rather than minutes. +```yaml +name: capital_question -You can configure your pipeline to accept text input instead of audio by replacing STT with a transcript-based input: - -```python -from pipecat.frames.frames import TranscriptionFrame - -# Send a simulated user utterance directly into the pipeline -frame = TranscriptionFrame( - text="I'd like to schedule an appointment for tomorrow at 3pm", - user_id="test-user", - timestamp=0, -) +turns: + - user: "What is the capital of Germany?" + expect: + - event: response + eval: "the response says the capital of Germany is Berlin" ``` -This approach lets you: +Scenarios run in text mode by default (no STT or TTS, so they're fast and free to iterate on) or in full audio mode for end-to-end coverage. Suites spawn multiple agents and run many scenarios concurrently, and the whole framework is also usable as a Python library. 
-- Test prompt variations rapidly without waiting for audio processing -- Validate function calling behavior with specific user inputs -- Build repeatable test cases for edge cases and failure modes -- Run tests in CI without audio infrastructure + + Behavioral testing for your agents: scripted conversations, semantic + assertions, and an LLM judge. + ## Production evaluation @@ -78,14 +78,15 @@ Several platforms offer simulation testing and production monitoring for voice A AI-native simulation and evaluation platform for voice agents, trusted by QA, Engineering, Operations, AI, and Executive teams. - - Simulation, observability, and evaluation platform with native Pipecat Cloud integration. Supports no-code API, WebSocket, and telephony testing. - + + Simulation, observability, and evaluation platform with native Pipecat Cloud + integration. Supports no-code API, WebSocket, and telephony testing. + Date: Wed, 10 Jun 2026 21:43:51 -0700 Subject: [PATCH 02/18] Clarify that within_ms deadlines consume the turn's shared budget --- pipecat/evals/scenarios.mdx | 2 ++ 1 file changed, 2 insertions(+) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index ae839fbf..5341a9c3 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -107,6 +107,8 @@ For exact content, `text_contains` does a plain substring check, with no judge r When omitted, an expectation defaults to a generous 60 second budget (configurable with `--timeout`), so timing is only asserted when you ask for it. +Because every deadline is measured from the send, time spent matching earlier expectations counts against later ones. In the example above, if `llm_started` arrives at 1.5 seconds, the `response` (with the default 60 second budget) has 58.5 seconds left, and a turn that stalls completely fails within a single budget rather than one per expectation. + ### Function calls A `function_call` expectation asserts that the turn invoked one or more tools. 
List the expected calls under `calls:`; they're matched by name in any order, and the expectation passes once all are found: From 6f891d523c3b5b0f19040b64b89df0c9640e450e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Wed, 10 Jun 2026 21:51:18 -0700 Subject: [PATCH 03/18] Add a factory: subsection with a code example for custom eval services --- pipecat/evals/scenarios.mdx | 46 ++++++++++++++++++++++++++++++++----- 1 file changed, 40 insertions(+), 6 deletions(-) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index 5341a9c3..d946b17d 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -183,15 +183,49 @@ The `judge.eval:` block selects the judge LLM in either modality: `ollama` (the The harness's TTS and STT can be local models or HTTP-based services. - WebSocket-streaming services aren't supported (they need a running pipeline to + WebSocket-streaming services aren't supported: they need a running pipeline to manage their connection lifecycle, and keeping them out keeps the evals code - simple). To use a service beyond the built-ins, both blocks accept a - `factory:` escape hatch, a dotted path to a callable returning the service, - e.g. `speech.factory: "my_pkg.make_tts"` or `transcription.factory: - "my_pkg.make_stt"`. You can also construct the services yourself and inject - them via the [library](/pipecat/evals/library). + simple. +### Custom services with `factory:` + +To use a TTS or STT beyond the built-ins, both blocks accept a `factory:` escape hatch: a dotted path to a callable that receives the block's mapping and the resolved sample rate, and returns the service. 
Any extra keys you put in the block are passed through to your factory: + +```yaml +user: + modality: audio + speech: + factory: "my_evals.services.make_tts" + voice: luna # available to your factory as speech_cfg["voice"] + +judge: + modality: audio + transcription: + factory: "my_evals.services.make_stt" +``` + +```python my_evals/services.py +import os + +from pipecat.services.fal.stt import FalSTTService +from pipecat.services.rime.tts import RimeHttpTTSService + + +def make_tts(speech_cfg, sample_rate): + return RimeHttpTTSService( + api_key=os.environ["RIME_API_KEY"], + settings=RimeHttpTTSService.Settings(voice=speech_cfg["voice"]), + sample_rate=sample_rate, + ) + + +def make_stt(transcription_cfg, sample_rate): + return FalSTTService(api_key=os.environ["FAL_KEY"]) +``` + +For a fully custom setup (your own caching, a pre-built service instance), construct `EvalSpeech` or `EvalTranscriber` directly and inject them through the [library](/pipecat/evals/library). + ### Sharing config across scenarios Any value can be pulled from another file with `!include`, resolved relative to the scenario file. This keeps per-scenario noise down when a whole directory of scenarios shares the same audio setup: From bf036b1604ad2db9f6a347eb1a9d1a318cea217f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Wed, 10 Jun 2026 21:53:04 -0700 Subject: [PATCH 04/18] Remove fixtures: documentation (field is being removed from the framework) --- api-reference/cli/eval.mdx | 3 +-- pipecat/evals/scenarios.mdx | 9 --------- 2 files changed, 1 insertion(+), 11 deletions(-) diff --git a/api-reference/cli/eval.mdx b/api-reference/cli/eval.mdx index 0c261a5a..e6f931c6 100644 --- a/api-reference/cli/eval.mdx +++ b/api-reference/cli/eval.mdx @@ -28,8 +28,7 @@ pipecat eval run [OPTIONS] SCENARIOS... **Options:** - WebSocket URL of the agent's eval transport. Overridden per-scenario by - `fixtures.bot_url`. + WebSocket URL of the agent's eval transport. 
diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index d946b17d..d9d3bb1b 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -257,15 +257,6 @@ reset: content: "Nice to meet you, Alex! How can I help?" ``` -### Per-scenario settings with `fixtures:` - -A free-form mapping for run settings. `bot_url` overrides the URL the harness connects to for this scenario: - -```yaml -fixtures: - bot_url: ws://localhost:7861 -``` - ### Vision turns with `image:` A turn may register an image (a path relative to the scenario file). When a vision agent requests a user image during the turn, the eval transport serves it: From d08920852fdb849789154d62627acb62a267581e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 08:48:51 -0700 Subject: [PATCH 05/18] Use uv commands for installs and runs; order eval after init in CLI nav --- docs.json | 2 +- pipecat/evals/overview.mdx | 4 ++-- pipecat/evals/quickstart.mdx | 4 ++-- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs.json b/docs.json index d4eab342..8e7659bb 100644 --- a/docs.json +++ b/docs.json @@ -826,8 +826,8 @@ { "group": "Commands", "pages": [ - "api-reference/cli/eval", "api-reference/cli/init", + "api-reference/cli/eval", "api-reference/cli/tail", { "group": "cloud", diff --git a/pipecat/evals/overview.mdx b/pipecat/evals/overview.mdx index ec8d774e..991d543e 100644 --- a/pipecat/evals/overview.mdx +++ b/pipecat/evals/overview.mdx @@ -69,9 +69,9 @@ Scenarios are YAML files, so they're easy to write, review, and share. Everythin ## Requirements -- **Pipecat CLI**: the `pipecat eval` commands ship with the CLI extra: `pip install "pipecat-ai[cli]"`. The same commands are available as `python -m pipecat.evals`. +- **Pipecat CLI**: the `pipecat eval` commands ship with the CLI extra: `uv tool install "pipecat-ai[cli]"`. The same commands are available as `python -m pipecat.evals`. 
- **A judge LLM** (for `eval:` assertions): Ollama by default (`ollama pull gemma2:9b`), or point the scenario's `judge:` block at OpenAI or any OpenAI-compatible endpoint. -- **Audio services** (audio mode only): the harness needs a TTS to synthesize the user's voice and an STT to transcribe the agent's speech. Both can be local models or HTTP-based services; the defaults are local (Kokoro and Moonshine or Whisper, installed with `pip install "pipecat-ai[kokoro,moonshine]"` or `pip install "pipecat-ai[kokoro,whisper]"`), which download once on first use and run with no keys and no per-run cost. WebSocket-streaming services aren't supported here, which keeps the harness simple. +- **Audio services** (audio mode only): the harness needs a TTS to synthesize the user's voice and an STT to transcribe the agent's speech. Both can be local models or HTTP-based services; the defaults are local (Kokoro and Moonshine or Whisper, installed with `uv add "pipecat-ai[kokoro,moonshine]"` or `uv add "pipecat-ai[kokoro,whisper]"`), which download once on first use and run with no keys and no per-run cost. WebSocket-streaming services aren't supported here, which keeps the harness simple. - **Your agent's own credentials**: the agent under test is your real agent, so it needs the same service API keys it normally would. ## Next steps diff --git a/pipecat/evals/quickstart.mdx b/pipecat/evals/quickstart.mdx index fcafb325..6c17e8e3 100644 --- a/pipecat/evals/quickstart.mdx +++ b/pipecat/evals/quickstart.mdx @@ -9,7 +9,7 @@ This guide takes an existing agent, starts it with the eval transport, and runs ## Prerequisites - A working Pipecat agent that uses `create_transport()` and the development runner (the standard pattern from the [quickstart](/pipecat/get-started/quickstart) and all Pipecat examples), with its usual service API keys in `.env`. -- The Pipecat CLI: `pip install "pipecat-ai[cli]"`. +- The Pipecat CLI: `uv tool install "pipecat-ai[cli]"`. - A judge LLM. 
Either: - **Ollama** (local, the default): install [Ollama](https://ollama.com) and run `ollama pull gemma2:9b`, or - **OpenAI**: set `OPENAI_API_KEY` and point the scenario's `judge:` block at it (shown below). @@ -33,7 +33,7 @@ This guide takes an existing agent, starts it with the eval transport, and runs Then start the agent with `-t eval`: ```bash - python bot.py -t eval + uv run bot.py -t eval ``` ``` From b2e9f61c3db0f431222d2479fab802ab723a17e5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 09:00:09 -0700 Subject: [PATCH 06/18] Remove the pipecat tail CLI reference page Drop the tail page and its nav entry, repoint the old /cli/tail redirect to the CLI overview, and replace the overview's tail references with the eval command (which was missing from that page). --- api-reference/cli/overview.mdx | 12 ++++---- api-reference/cli/tail.mdx | 51 ---------------------------------- docs.json | 3 +- 3 files changed, 7 insertions(+), 59 deletions(-) delete mode 100644 api-reference/cli/tail.mdx diff --git a/api-reference/cli/overview.mdx b/api-reference/cli/overview.mdx index 3c0f7a1f..09156852 100644 --- a/api-reference/cli/overview.mdx +++ b/api-reference/cli/overview.mdx @@ -19,11 +19,11 @@ description: "Command-line tool for scaffolding, deploying, and monitoring Pipec Push your bots to production with one command - Watch real-time logs, conversations, and metrics + Test your agents with scripted scenarios and an LLM judge @@ -51,7 +51,7 @@ pipecat --version **[`pipecat init`](/api-reference/cli/init)** - Scaffold new projects with interactive setup -**[`pipecat tail`](/api-reference/cli/tail)** - Monitor sessions in real-time with a terminal dashboard +**[`pipecat eval`](/api-reference/cli/eval)** - Run behavioral evals against your agents **[`pipecat cloud`](/api-reference/cli/cloud/auth)** - Deploy and manage bots on Pipecat Cloud @@ -62,7 +62,7 @@ View help for any command: ```bash pipecat --help pipecat 
init --help -pipecat tail --help +pipecat eval --help pipecat cloud --help ``` diff --git a/api-reference/cli/tail.mdx b/api-reference/cli/tail.mdx deleted file mode 100644 index ea56df05..00000000 --- a/api-reference/cli/tail.mdx +++ /dev/null @@ -1,51 +0,0 @@ ---- -title: tail -description: "A terminal dashboard for monitoring Pipecat sessions in real-time" ---- - -**Tail** is a terminal dashboard for monitoring your Pipecat sessions in real-time with logs, conversations, metrics, and audio levels all in one place. - -With Tail you can: - -- 📜 Follow system logs in real time -- 💬 Track conversations as they happen -- 🔊 Monitor user and agent audio levels -- 📈 Keep an eye on service metrics and usage -- 🖥️ Run locally as a pipeline runner or connect to a remote session - -**Usage:** - -```shell -pipecat tail [OPTIONS] -``` - -**Options:** - - - WebSocket URL to connect to. Defaults to `ws://localhost:9292`. - - -## How to Use Tail - -- Add `pipecat-ai-cli` to your project's dependencies. 
- -- Update your Pipecat code to include the `TailObserver`: - - ```python - from pipecat_cli.tail import TailObserver - - task = PipelineWorker( - pipeline, - observers=[TailObserver()] - ) - ``` - -- Start the Tail app separately: - - ```bash - # Connect to local session (default) - pipecat tail - - # Connect to remote session - pipecat tail --url wss://my-bot.example.com - ``` diff --git a/docs.json b/docs.json index 8e7659bb..3688e396 100644 --- a/docs.json +++ b/docs.json @@ -828,7 +828,6 @@ "pages": [ "api-reference/cli/init", "api-reference/cli/eval", - "api-reference/cli/tail", { "group": "cloud", "pages": [ @@ -2321,7 +2320,7 @@ }, { "source": "/cli/tail", - "destination": "/api-reference/cli/tail" + "destination": "/api-reference/cli/overview" }, { "source": "/cli/cloud/agent", From 78ef52cae289604fcddfa917e28d442d8b9c6092 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 09:41:34 -0700 Subject: [PATCH 07/18] Point Agent Self-Improvement next steps forward to production evaluation --- pipecat/evals/agent-self-improvement.mdx | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/pipecat/evals/agent-self-improvement.mdx b/pipecat/evals/agent-self-improvement.mdx index 1932373f..e81fea22 100644 --- a/pipecat/evals/agent-self-improvement.mdx +++ b/pipecat/evals/agent-self-improvement.mdx @@ -34,7 +34,7 @@ Keep scenarios in the repo next to the agent and tell your assistant how to run Evals live in `scenarios/`. To verify any change to the agent's behavior: -1. Start the agent: `python bot.py -t eval` (serves ws://localhost:7860) +1. Start the agent: `uv run bot.py -t eval` (serves ws://localhost:7860) 2. Run the evals: `pipecat eval run scenarios/*.yaml` The command exits non-zero on failure and prints each failed assertion. @@ -93,11 +93,12 @@ A few practices keep autonomous loops honest: - The full scenario format your assistant will be writing. 
+ Layer in simulation platforms and observability once your agent is + deployed. From 57b273c0985055a5a8959de7e577738064ecd0ea Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 09:47:54 -0700 Subject: [PATCH 08/18] Note the Ollama gemma2:9b judge default in the quickstart scenario step --- pipecat/evals/quickstart.mdx | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/pipecat/evals/quickstart.mdx b/pipecat/evals/quickstart.mdx index 6c17e8e3..079cf455 100644 --- a/pipecat/evals/quickstart.mdx +++ b/pipecat/evals/quickstart.mdx @@ -98,6 +98,12 @@ This guide takes an existing agent, starts it with the eval transport, and runs This scenario runs in **text mode** (the default): the user turn is sent as text and the agent's TTS is skipped automatically, so the whole conversation costs nothing in audio services and finishes in seconds. + + Ollama with `gemma2:9b` is the default judge, which is why the first tab + has no `judge:` block. To use a different judge LLM, add a `judge.eval:` + block as in the OpenAI tab. + + From d63c217e6ae66a77c5a8b37c138082760a5d58ea Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 09:58:27 -0700 Subject: [PATCH 09/18] Make text mode explicit in Writing Scenarios and document the modality default --- pipecat/evals/quickstart.mdx | 2 +- pipecat/evals/scenarios.mdx | 20 +++++++++++++++++--- 2 files changed, 18 insertions(+), 4 deletions(-) diff --git a/pipecat/evals/quickstart.mdx b/pipecat/evals/quickstart.mdx index 079cf455..34126f22 100644 --- a/pipecat/evals/quickstart.mdx +++ b/pipecat/evals/quickstart.mdx @@ -150,6 +150,6 @@ This guide takes an existing agent, starts it with the eval transport, and runs ## Where to go next -- Learn the full scenario format, including multi-turn conversations, function call assertions, interruptions, latency budgets, and audio mode, in [Writing Scenarios](/pipecat/evals/scenarios). 
+- Learn the full scenario format, including multi-turn conversations, function call assertions, interruptions, latency budgets, and text vs audio modes, in [Writing Scenarios](/pipecat/evals/scenarios). - Have many scenarios or agents? Let Pipecat spawn the agents for you with [Eval Suites](/pipecat/evals/suites). - Want your coding assistant to run these for you? See [Agent Self-Improvement](/pipecat/evals/agent-self-improvement). diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index d9d3bb1b..fa5c79d4 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -1,6 +1,6 @@ --- title: "Writing Scenarios" -description: "The scenario file format: turns, expectations, assertions, and audio mode." +description: "The scenario file format: turns, expectations, assertions, and text vs audio modes." --- A scenario is a YAML file describing a scripted conversation and the events you expect your agent to emit. This page covers the full format. If you haven't run a scenario yet, start with the [quickstart](/pipecat/evals/quickstart). @@ -147,9 +147,23 @@ turns: eval: "the response says the capital of Japan is Tokyo, instead of continuing the Paris story" ``` -## Audio mode +## Text and audio modes -By default everything runs in text mode: user turns are sent as text and the agent's TTS is skipped, including any on-connect greeting. That's the fastest and cheapest way to test conversational logic. Two top-level blocks switch a scenario to real speech: +The user side and the judge side each have their own modality, set with `modality:` in the top-level `user:` and `judge:` blocks. When `modality:` isn't specified (or a block is omitted entirely), it defaults to `text`. The two sides are independent: you can drive the agent with text while judging its real speech, or speak to it and judge the LLM text. 
+ +### Text mode (the default) + +A scenario with no `user:` or `judge:` block runs entirely in text mode, equivalent to spelling out: + +```yaml +user: + modality: text + +judge: + modality: text +``` + +User turns are sent as text, bypassing the agent's STT, and the agent's TTS is skipped automatically, including any on-connect greeting. No audio flows on either side, so this is the fastest and cheapest way to test prompts, conversational logic, and function calling: no audio service cost, and a multi-turn scenario finishes in seconds. ### Speak the user's turns with `user:` From 1881baa473b64449736616425f837896d7f0f535 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 10:04:34 -0700 Subject: [PATCH 10/18] Clarify the text/audio modes section: block vs per-turn user, defaults --- pipecat/evals/scenarios.mdx | 22 ++++++++++++++++------ 1 file changed, 16 insertions(+), 6 deletions(-) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index fa5c79d4..a096182d 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -149,7 +149,17 @@ turns: ## Text and audio modes -The user side and the judge side each have their own modality, set with `modality:` in the top-level `user:` and `judge:` blocks. When `modality:` isn't specified (or a block is omitted entirely), it defaults to `text`. The two sides are independent: you can drive the agent with text while judging its real speech, or speak to it and judge the LLM text. +Two top-level blocks control a scenario's modalities, and each has its own `modality:` field: + +- `user:` sets how each turn's utterance is delivered to the agent: sent as text, bypassing its STT (`modality: text`), or synthesized into real speech (`modality: audio`). +- `judge:` sets what the judge evaluates: the agent's LLM text, with its TTS skipped (`modality: text`), or a transcription of its actual spoken audio (`modality: audio`). 
+ +When `modality:` isn't specified, or a block is omitted entirely, it defaults to `text`. The two sides are also independent: you can drive the agent with text while judging its real speech, or speak to it and judge the LLM text. + + + The top-level `user:` block only configures delivery. Each turn's `user:` + field is the utterance itself, and is written the same way in both modes. + ### Text mode (the default) @@ -163,9 +173,9 @@ judge: modality: text ``` -User turns are sent as text, bypassing the agent's STT, and the agent's TTS is skipped automatically, including any on-connect greeting. No audio flows on either side, so this is the fastest and cheapest way to test prompts, conversational logic, and function calling: no audio service cost, and a multi-turn scenario finishes in seconds. +User turns are sent as text, bypassing the agent's STT, and the agent's TTS is skipped automatically, including any on-connect greeting. No audio flows on either side, so this is the fastest and cheapest way to test prompts, conversational logic, and function calling: no audio service cost, and a multi-turn scenario finishes in seconds. The only service the harness itself needs is the judge LLM, which defaults to Ollama with `gemma2:9b` when no `judge.eval:` is specified. -### Speak the user's turns with `user:` +### Audio input with `user:` ```yaml user: @@ -176,9 +186,9 @@ user: sample_rate: 16000 ``` -With `modality: audio`, each `user:` utterance is synthesized by a TTS the harness runs and streamed into your agent's pipeline at real-time cadence, exercising its VAD, turn detection, and STT exactly as a live microphone would. Synthesized audio is cached across runs, so repeated turns don't re-synthesize. The built-in services are `kokoro` (a local model, the recommended default) and `cartesia` (HTTP) when you want a cloud voice. 
+With `user.modality: audio`, each turn's utterance is synthesized by a TTS the harness runs and streamed into your agent's pipeline at real-time cadence, exercising its VAD, turn detection, and STT exactly as a live microphone would. Synthesized audio is cached across runs, so repeated turns don't re-synthesize. The `speech:` block (the TTS service and voice) is required in audio mode; the built-in services are `kokoro` (a local model, the recommended default) and `cartesia` (HTTP) when you want a cloud voice. -### Judge the agent's actual speech with `judge:` +### Audio judging with `judge:` ```yaml judge: @@ -191,7 +201,7 @@ judge: model: gemma2:9b ``` -With `judge.modality: audio`, the agent speaks for real. The harness captures its synthesized audio, transcribes it with the configured STT, and the `response` event becomes that transcription, so the judge evaluates what a user would actually have heard. This is the true end-to-end check: STT in, LLM in the middle, TTS out. The built-in transcribers are `moonshine` and `whisper`, both local models. +With `judge.modality: audio`, the agent speaks for real. The harness captures its synthesized audio, transcribes it with the configured STT, and the `response` event becomes that transcription, so the judge evaluates what a user would actually have heard. This is the true end-to-end check: STT in, LLM in the middle, TTS out. The `transcription:` block is required in audio mode; the built-in transcribers are `moonshine` and `whisper` (the default when `service:` is omitted), both local models. The `judge.eval:` block selects the judge LLM in either modality: `ollama` (the default, `gemma2:9b`), `openai`, or any OpenAI-compatible endpoint via `endpoint:`. 
From c5beb501cdc7de72444b9d6b3946f224d25b9709 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 10:06:30 -0700 Subject: [PATCH 11/18] Move the judge default into a note in the text mode section --- pipecat/evals/scenarios.mdx | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index a096182d..66f39058 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -173,7 +173,13 @@ judge: modality: text ``` -User turns are sent as text, bypassing the agent's STT, and the agent's TTS is skipped automatically, including any on-connect greeting. No audio flows on either side, so this is the fastest and cheapest way to test prompts, conversational logic, and function calling: no audio service cost, and a multi-turn scenario finishes in seconds. The only service the harness itself needs is the judge LLM, which defaults to Ollama with `gemma2:9b` when no `judge.eval:` is specified. +User turns are sent as text, bypassing the agent's STT, and the agent's TTS is skipped automatically, including any on-connect greeting. No audio flows on either side, so this is the fastest and cheapest way to test prompts, conversational logic, and function calling: no audio service cost, and a multi-turn scenario finishes in seconds. + + + In text mode the harness runs no audio services; the judge LLM is the only + service it needs. When `judge.eval:` isn't specified, the judge defaults to + Ollama with `gemma2:9b`. 
+ ### Audio input with `user:` From b34a45b4d0084097f6720e4ac27940396a74a687 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 10:07:33 -0700 Subject: [PATCH 12/18] Move built-in speech and transcription services into notes --- pipecat/evals/scenarios.mdx | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index 66f39058..c1b64d6c 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -192,7 +192,12 @@ user: sample_rate: 16000 ``` -With `user.modality: audio`, each turn's utterance is synthesized by a TTS the harness runs and streamed into your agent's pipeline at real-time cadence, exercising its VAD, turn detection, and STT exactly as a live microphone would. Synthesized audio is cached across runs, so repeated turns don't re-synthesize. The `speech:` block (the TTS service and voice) is required in audio mode; the built-in services are `kokoro` (a local model, the recommended default) and `cartesia` (HTTP) when you want a cloud voice. +With `user.modality: audio`, each turn's utterance is synthesized by a TTS the harness runs and streamed into your agent's pipeline at real-time cadence, exercising its VAD, turn detection, and STT exactly as a live microphone would. Synthesized audio is cached across runs, so repeated turns don't re-synthesize. The `speech:` block (the TTS service and voice) is required in audio mode. + + + The built-in speech services are `kokoro`, a local model and the recommended + default, and `cartesia` (HTTP) when you want a cloud voice. + ### Audio judging with `judge:` @@ -207,7 +212,12 @@ judge: model: gemma2:9b ``` -With `judge.modality: audio`, the agent speaks for real. The harness captures its synthesized audio, transcribes it with the configured STT, and the `response` event becomes that transcription, so the judge evaluates what a user would actually have heard. 
This is the true end-to-end check: STT in, LLM in the middle, TTS out. The `transcription:` block is required in audio mode; the built-in transcribers are `moonshine` and `whisper` (the default when `service:` is omitted), both local models. +With `judge.modality: audio`, the agent speaks for real. The harness captures its synthesized audio, transcribes it with the configured STT, and the `response` event becomes that transcription, so the judge evaluates what a user would actually have heard. This is the true end-to-end check: STT in, LLM in the middle, TTS out. The `transcription:` block is required in audio mode. + + + The built-in transcribers are `moonshine` and `whisper`, both local models. + When `transcription.service:` is omitted, it defaults to `whisper`. + The `judge.eval:` block selects the judge LLM in either modality: `ollama` (the default, `gemma2:9b`), `openai`, or any OpenAI-compatible endpoint via `endpoint:`. From 2c460c17e09cc48f2c2937a77607f9cf1f166965 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 10:09:22 -0700 Subject: [PATCH 13/18] Move the local/HTTP service constraint note into the factory section --- pipecat/evals/scenarios.mdx | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index c1b64d6c..a0914139 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -221,13 +221,6 @@ With `judge.modality: audio`, the agent speaks for real. The harness captures it The `judge.eval:` block selects the judge LLM in either modality: `ollama` (the default, `gemma2:9b`), `openai`, or any OpenAI-compatible endpoint via `endpoint:`. - - The harness's TTS and STT can be local models or HTTP-based services. - WebSocket-streaming services aren't supported: they need a running pipeline to - manage their connection lifecycle, and keeping them out keeps the evals code - simple. 
- - ### Custom services with `factory:` To use a TTS or STT beyond the built-ins, both blocks accept a `factory:` escape hatch: a dotted path to a callable that receives the block's mapping and the resolved sample rate, and returns the service. Any extra keys you put in the block are passed through to your factory: @@ -264,6 +257,13 @@ def make_stt(transcription_cfg, sample_rate): return FalSTTService(api_key=os.environ["FAL_KEY"]) ``` + + The service your factory returns must be a local model or an HTTP-based + service. WebSocket-streaming services aren't supported: they need a running + pipeline to manage their connection lifecycle, and keeping them out keeps the + evals code simple. + + For a fully custom setup (your own caching, a pre-built service instance), construct `EvalSpeech` or `EvalTranscriber` directly and inject them through the [library](/pipecat/evals/library). ### Sharing config across scenarios From 1178ad138b7e7ba5f49603735f927261d05775c1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 10:10:18 -0700 Subject: [PATCH 14/18] Move vision turns out of top-level fields; image: is a per-turn field --- pipecat/evals/scenarios.mdx | 30 ++++++++++++++---------------- 1 file changed, 14 insertions(+), 16 deletions(-) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index a0914139..85727b29 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -147,6 +147,19 @@ turns: eval: "the response says the capital of Japan is Tokyo, instead of continuing the Paris story" ``` +## Vision turns + +A turn may register an image with `image:` (a path relative to the scenario file). When a vision agent requests a user image during the turn, the eval transport serves it: + +```yaml +turns: + - user: "What do you see in this image?" 
+ image: assets/cat.jpg + expect: + - event: response + eval: "the response describes a cat" +``` + ## Text and audio modes Two top-level blocks control a scenario's modalities, and each has its own `modality:` field: @@ -283,9 +296,7 @@ turns: eval: "the response says the capital of Germany is Berlin" ``` -## Other top-level fields - -### Seed the conversation context with `reset:` +## Seed the conversation context with `reset:` Before driving the turns, the harness resets the agent's LLM context. By default the context is cleared; provide `reset:` to seed it with messages instead, which lets a scenario start mid-conversation: @@ -297,19 +308,6 @@ reset: content: "Nice to meet you, Alex! How can I help?" ``` -### Vision turns with `image:` - -A turn may register an image (a path relative to the scenario file). When a vision agent requests a user image during the turn, the eval transport serves it: - -```yaml -turns: - - user: "What do you see in this image?" - image: assets/cat.jpg - expect: - - event: response - eval: "the response describes a cat" -``` - ## Next steps From f6a2a4d8591c802c5e25717d2075f2ab9eedad8e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 10:15:57 -0700 Subject: [PATCH 15/18] Restructure modes section block-first: user input and judging, each with text and audio --- pipecat/evals/scenarios.mdx | 36 +++++++++++------------------------- 1 file changed, 11 insertions(+), 25 deletions(-) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index 85727b29..86f299b6 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -169,32 +169,18 @@ Two top-level blocks control a scenario's modalities, and each has its own `moda When `modality:` isn't specified, or a block is omitted entirely, it defaults to `text`. The two sides are also independent: you can drive the agent with text while judging its real speech, or speak to it and judge the LLM text. 
+A scenario with neither block runs entirely in text mode. No audio flows on either side, so this is the fastest and cheapest way to test prompts, conversational logic, and function calling: no audio service cost, and a multi-turn scenario finishes in seconds. The judge LLM is the only service the harness itself needs (Ollama with `gemma2:9b` by default). + The top-level `user:` block only configures delivery. Each turn's `user:` field is the utterance itself, and is written the same way in both modes. -### Text mode (the default) - -A scenario with no `user:` or `judge:` block runs entirely in text mode, equivalent to spelling out: - -```yaml -user: - modality: text +### User input with `user:` -judge: - modality: text -``` +**Text (the default).** Each turn's utterance is sent to the agent as text, bypassing its STT. This needs no configuration; it's equivalent to `user: { modality: text }`. -User turns are sent as text, bypassing the agent's STT, and the agent's TTS is skipped automatically, including any on-connect greeting. No audio flows on either side, so this is the fastest and cheapest way to test prompts, conversational logic, and function calling: no audio service cost, and a multi-turn scenario finishes in seconds. - - - In text mode the harness runs no audio services; the judge LLM is the only - service it needs. When `judge.eval:` isn't specified, the judge defaults to - Ollama with `gemma2:9b`. - - -### Audio input with `user:` +**Audio.** Each turn's utterance is synthesized by a TTS the harness runs and streamed into your agent's pipeline at real-time cadence, exercising its VAD, turn detection, and STT exactly as a live microphone would. Synthesized audio is cached across runs, so repeated turns don't re-synthesize. 
The `speech:` block (the TTS service and voice) is required: ```yaml user: @@ -205,14 +191,16 @@ user: sample_rate: 16000 ``` -With `user.modality: audio`, each turn's utterance is synthesized by a TTS the harness runs and streamed into your agent's pipeline at real-time cadence, exercising its VAD, turn detection, and STT exactly as a live microphone would. Synthesized audio is cached across runs, so repeated turns don't re-synthesize. The `speech:` block (the TTS service and voice) is required in audio mode. - The built-in speech services are `kokoro`, a local model and the recommended default, and `cartesia` (HTTP) when you want a cloud voice. -### Audio judging with `judge:` +### Judging with `judge:` + +**Text (the default).** The agent's TTS is skipped automatically, including any on-connect greeting, and the judge evaluates the LLM's text output. Fast and silent; equivalent to `judge: { modality: text }`. + +**Audio.** The agent speaks for real. The harness captures its synthesized audio, transcribes it with the configured STT, and the `response` event becomes that transcription, so the judge evaluates what a user would actually have heard. This is the true end-to-end check: STT in, LLM in the middle, TTS out. The `transcription:` block is required: ```yaml judge: @@ -225,14 +213,12 @@ judge: model: gemma2:9b ``` -With `judge.modality: audio`, the agent speaks for real. The harness captures its synthesized audio, transcribes it with the configured STT, and the `response` event becomes that transcription, so the judge evaluates what a user would actually have heard. This is the true end-to-end check: STT in, LLM in the middle, TTS out. The `transcription:` block is required in audio mode. - The built-in transcribers are `moonshine` and `whisper`, both local models. When `transcription.service:` is omitted, it defaults to `whisper`. 
-The `judge.eval:` block selects the judge LLM in either modality: `ollama` (the default, `gemma2:9b`), `openai`, or any OpenAI-compatible endpoint via `endpoint:`. +In either modality, the `judge.eval:` block selects the judge LLM: `ollama` (the default, `gemma2:9b`), `openai`, or any OpenAI-compatible endpoint via `endpoint:`. ### Custom services with `factory:` From 0ac397c3ef9cf1436a4bd8ad298702298f3b66b0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 10:17:12 -0700 Subject: [PATCH 16/18] Show the text-mode equivalents as YAML blocks --- pipecat/evals/scenarios.mdx | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index 86f299b6..63217772 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -178,7 +178,12 @@ A scenario with neither block runs entirely in text mode. No audio flows on eith ### User input with `user:` -**Text (the default).** Each turn's utterance is sent to the agent as text, bypassing its STT. This needs no configuration; it's equivalent to `user: { modality: text }`. +**Text (the default).** Each turn's utterance is sent to the agent as text, bypassing its STT. This needs no configuration; it's equivalent to: + +```yaml +user: + modality: text +``` **Audio.** Each turn's utterance is synthesized by a TTS the harness runs and streamed into your agent's pipeline at real-time cadence, exercising its VAD, turn detection, and STT exactly as a live microphone would. Synthesized audio is cached across runs, so repeated turns don't re-synthesize. The `speech:` block (the TTS service and voice) is required: @@ -198,7 +203,12 @@ user: ### Judging with `judge:` -**Text (the default).** The agent's TTS is skipped automatically, including any on-connect greeting, and the judge evaluates the LLM's text output. Fast and silent; equivalent to `judge: { modality: text }`. 
+**Text (the default).** The agent's TTS is skipped automatically, including any on-connect greeting, and the judge evaluates the LLM's text output. Fast and silent; equivalent to: + +```yaml +judge: + modality: text +``` **Audio.** The agent speaks for real. The harness captures its synthesized audio, transcribes it with the configured STT, and the `response` event becomes that transcription, so the judge evaluates what a user would actually have heard. This is the true end-to-end check: STT in, LLM in the middle, TTS out. The `transcription:` block is required: From 0b5342a69bd2576ed7b3ea05dabfbea63cd3d9a1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 10:19:20 -0700 Subject: [PATCH 17/18] Moonshine, not whisper, is the default transcriber --- pipecat/evals/scenarios.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/pipecat/evals/scenarios.mdx b/pipecat/evals/scenarios.mdx index 63217772..58911980 100644 --- a/pipecat/evals/scenarios.mdx +++ b/pipecat/evals/scenarios.mdx @@ -225,7 +225,7 @@ judge: The built-in transcribers are `moonshine` and `whisper`, both local models. - When `transcription.service:` is omitted, it defaults to `whisper`. + When `transcription.service:` is omitted, it defaults to `moonshine`. In either modality, the `judge.eval:` block selects the judge LLM: `ollama` (the default, `gemma2:9b`), `openai`, or any OpenAI-compatible endpoint via `endpoint:`. From c71f636d60ec8e6fdf6f2e459e85c088ce2ab1f0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Aleix=20Conchillo=20Flaqu=C3=A9?= Date: Thu, 11 Jun 2026 14:25:56 -0700 Subject: [PATCH 18/18] Fold the Fundamentals Evaluations group into Evals Move the third-party platform pages (Bluejay, Cekura, Coval) under a Third-party Platforms subgroup in the Evals tab group, absorb the old evaluations overview's production-evaluation content into the Evals overview, and add redirects for the old URLs. 
--- docs.json | 39 +++-- pipecat/evals/agent-self-improvement.mdx | 2 +- pipecat/evals/overview.mdx | 48 ++++++ .../platforms}/bluejay.mdx | 6 +- .../platforms}/cekura.mdx | 0 .../evaluations => evals/platforms}/coval.mdx | 0 pipecat/fundamentals/evaluations/overview.mdx | 155 ------------------ 7 files changed, 79 insertions(+), 171 deletions(-) rename pipecat/{fundamentals/evaluations => evals/platforms}/bluejay.mdx (97%) rename pipecat/{fundamentals/evaluations => evals/platforms}/cekura.mdx (100%) rename pipecat/{fundamentals/evaluations => evals/platforms}/coval.mdx (100%) delete mode 100644 pipecat/fundamentals/evaluations/overview.mdx diff --git a/docs.json b/docs.json index 3688e396..e8d3f7a1 100644 --- a/docs.json +++ b/docs.json @@ -80,15 +80,6 @@ "pipecat/fundamentals/saving-transcripts", "pipecat/fundamentals/recording-audio", "pipecat/fundamentals/metrics", - { - "group": "Evaluations", - "pages": [ - "pipecat/fundamentals/evaluations/overview", - "pipecat/fundamentals/evaluations/bluejay", - "pipecat/fundamentals/evaluations/cekura", - "pipecat/fundamentals/evaluations/coval" - ] - }, "pipecat/fundamentals/voicemail", "pipecat/fundamentals/ivr", "pipecat/fundamentals/custom-frame-processor", @@ -110,7 +101,15 @@ "pipecat/evals/scenarios", "pipecat/evals/suites", "pipecat/evals/library", - "pipecat/evals/agent-self-improvement" + "pipecat/evals/agent-self-improvement", + { + "group": "Third-party Platforms", + "pages": [ + "pipecat/evals/platforms/bluejay", + "pipecat/evals/platforms/cekura", + "pipecat/evals/platforms/coval" + ] + } ] }, { @@ -1356,11 +1355,27 @@ }, { "source": "/guides/fundamentals/evaluations/overview", - "destination": "/pipecat/fundamentals/evaluations/overview" + "destination": "/pipecat/evals/overview" }, { "source": "/guides/fundamentals/evaluations/bluejay", - "destination": "/pipecat/fundamentals/evaluations/bluejay" + "destination": "/pipecat/evals/platforms/bluejay" + }, + { + "source": 
"/pipecat/fundamentals/evaluations/overview", + "destination": "/pipecat/evals/overview" + }, + { + "source": "/pipecat/fundamentals/evaluations/bluejay", + "destination": "/pipecat/evals/platforms/bluejay" + }, + { + "source": "/pipecat/fundamentals/evaluations/cekura", + "destination": "/pipecat/evals/platforms/cekura" + }, + { + "source": "/pipecat/fundamentals/evaluations/coval", + "destination": "/pipecat/evals/platforms/coval" }, { "source": "/examples", diff --git a/pipecat/evals/agent-self-improvement.mdx b/pipecat/evals/agent-self-improvement.mdx index e81fea22..7869d14c 100644 --- a/pipecat/evals/agent-self-improvement.mdx +++ b/pipecat/evals/agent-self-improvement.mdx @@ -96,7 +96,7 @@ A few practices keep autonomous loops honest: title="Production Evaluation" icon="chart-line" iconType="duotone" - href="/pipecat/fundamentals/evaluations/overview" + href="/pipecat/evals/overview#production-evaluation" > Layer in simulation platforms and observability once your agent is deployed. diff --git a/pipecat/evals/overview.mdx b/pipecat/evals/overview.mdx index 991d543e..0a6aaa8d 100644 --- a/pipecat/evals/overview.mdx +++ b/pipecat/evals/overview.mdx @@ -74,6 +74,54 @@ Scenarios are YAML files, so they're easy to write, review, and share. Everythin - **Audio services** (audio mode only): the harness needs a TTS to synthesize the user's voice and an STT to transcribe the agent's speech. Both can be local models or HTTP-based services; the defaults are local (Kokoro and Moonshine or Whisper, installed with `uv add "pipecat-ai[kokoro,moonshine]"` or `uv add "pipecat-ai[kokoro,whisper]"`), which download once on first use and run with no keys and no per-run cost. WebSocket-streaming services aren't supported here, which keeps the harness simple. - **Your agent's own credentials**: the agent under test is your real agent, so it needs the same service API keys it normally would. 
+## Production evaluation + +Pipecat Evals is built for development: fast, local, repeatable, and run on every change. Once your agent is deployed, third-party evaluation platforms complement it with testing and monitoring at production scale: + +- **Simulations**: scripted or AI-driven test calls over API, WebSocket, or telephony, exercising multi-turn flows, edge cases, and real phone-network conditions before they reach users. +- **Observability**: continuous evaluation of live traffic, with automated quality scoring of calls and transcripts, and metrics tracked over time to catch quality drift. + + + + AI-native simulation and evaluation platform for voice agents, trusted by + QA, Engineering, Operations, AI, and Executive teams. + + + + Simulation, observability, and evaluation platform with native Pipecat Cloud + integration. Supports no-code API, WebSocket, and telephony testing. + + + + Automated testing and monitoring platform with native Pipecat integration + for WebRTC and text-based testing, plus support for Mock Tools, Custom Dynamic + Variables, and more. + + + + + Building an evaluation integration for Pipecat? We welcome contributions to + this page. Open a PR on the [docs + repository](https://github.com/pipecat-ai/docs). + + +Pipecat's other building blocks feed into any evaluation workflow: [Metrics](/pipecat/fundamentals/metrics) for TTFB, processing time, and usage; [Saving Transcripts](/pipecat/fundamentals/saving-transcripts) for offline analysis; [OpenTelemetry](/api-reference/server/utilities/opentelemetry) for latency traces; and [Observers](/api-reference/server/utilities/observers/observer-pattern) for custom instrumentation. 
+ ## Next steps diff --git a/pipecat/fundamentals/evaluations/bluejay.mdx b/pipecat/evals/platforms/bluejay.mdx similarity index 97% rename from pipecat/fundamentals/evaluations/bluejay.mdx rename to pipecat/evals/platforms/bluejay.mdx index 5bfdd1a6..1efbd879 100644 --- a/pipecat/fundamentals/evaluations/bluejay.mdx +++ b/pipecat/evals/platforms/bluejay.mdx @@ -140,12 +140,12 @@ OpenAIInstrumentor().instrument(tracer_provider=tracer_provider) - Learn about evaluation strategies for Pipecat agents. + Pipecat's built-in behavioral testing framework. - Behavioral testing for your agents: scripted conversations, semantic - assertions, and an LLM judge. - - -## Production evaluation - -Once your prompts are solid and you've validated the local experience, production evaluation tools help you scale testing and monitor quality across real deployments. This is where evaluation platforms come in. - -### Simulations - -Automated test conversations exercise your agent's behavior across scenarios, edge cases, and failure modes before they reach users. Simulation platforms can connect to your agent via API, WebSocket, or telephony to run scripted or AI-driven test calls. - -Key things to test with simulations: - -- **Multi-turn flows**: Verify your agent handles complete conversation paths correctly -- **Edge cases**: Test interruptions, unexpected input, silence, and barge-in -- **Telephony behavior**: End-to-end testing over real phone networks catches issues that only surface in production call conditions -- **Regressions**: Run simulation suites before each deployment to catch breaking changes - -### Observability - -Continuous evaluation of live calls lets you catch regressions, track quality over time, and close the loop between what you test and what users experience. 
Common approaches include: - -- Submitting call recordings and transcripts for automated quality scoring -- Tracking evaluation metrics over time to detect quality drift -- Using [OpenTelemetry traces](/api-reference/server/utilities/opentelemetry) to monitor latency and execution flow - -Together, simulations and observability form a feedback loop: simulations validate changes before deployment, and observability surfaces issues that inform your next round of tests. - -### Evaluation platforms - -Several platforms offer simulation testing and production monitoring for voice AI agents: - - - - AI-native simulation and evaluation platform for voice agents, trusted by QA, Engineering, Operations, AI, and Executive teams. - - - - Simulation, observability, and evaluation platform with native Pipecat Cloud - integration. Supports no-code API, WebSocket, and telephony testing. - - - - Automated testing and monitoring platform with native Pipecat Integration for WebRTC/Text based testing and support for Mock Tools, Custom Dynamic Variables and more! - - - - - Building an evaluation integration for Pipecat? We welcome contributions to - this page. Open a PR on the [docs - repository](https://github.com/pipecat-ai/docs). 
- - -## Pipecat's built-in tools - -Pipecat provides several building blocks that feed into any evaluation workflow: - -- **[Pipecat Evals](/pipecat/evals/overview)**: Built-in behavioral testing with scripted scenarios, an LLM judge, and multi-agent eval suites -- **[Metrics](/pipecat/fundamentals/metrics)**: Built-in TTFB, processing time, and usage tracking for LLM and TTS services -- **[Saving transcripts](/pipecat/fundamentals/saving-transcripts)**: Capture conversation transcripts for offline analysis and evaluation -- **[OpenTelemetry](/api-reference/server/utilities/opentelemetry)**: Export traces to any OTel-compatible backend for latency and performance monitoring -- **[Observers](/api-reference/server/utilities/observers/observer-pattern)**: Monitor frame flow without modifying the pipeline, useful for custom instrumentation - -## Next steps - - - - Monitor performance and LLM/TTS usage with Pipecat's built-in metrics. - - - - Capture conversation transcripts to use with evaluation tools. - - - - Export traces for performance monitoring and debugging. - - - - Build custom processors for evaluation-specific instrumentation. - -