From d1ea843294953f8520ffd5df862d8b65d17e4a1d Mon Sep 17 00:00:00 2001
From: Paulo Lacerda
Date: Tue, 9 Jun 2026 10:28:43 -0300
Subject: [PATCH] docs: require rubric gate in prompt tutorial

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 docs/tutorial-prompt-agent-quickstart.md | 66 ++++++++++++++++++++----
 1 file changed, 56 insertions(+), 10 deletions(-)

diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
index 6ffd680..1d69b98 100644
--- a/docs/tutorial-prompt-agent-quickstart.md
+++ b/docs/tutorial-prompt-agent-quickstart.md
@@ -455,6 +455,11 @@ later.
 
 Create the small JSONL dataset that matches the Travel Agent behavior:
 
+> **Copilot assist:** If you want help expanding or reviewing these rows, ask
+> Copilot to use `/skills agentops-dataset`. The skill can propose additional
+> edge cases, check that each row has `input` and `expected`, and keep the
+> criteria written as reviewable behavior instead of exact answer strings.
+
 ```powershell
 New-Item -ItemType Directory -Force .agentops\data | Out-Null
 @'
@@ -813,6 +818,11 @@ Create a small set of **synthetic multi-turn test cases**. These rows are not
 claiming that the agent already said the assistant turns verbatim. They define
 a controlled conversation scenario you want the next response to handle.
 
+> **Copilot assist:** You can also ask `/skills agentops-dataset` to draft these
+> conversation scenarios. Ask it for synthetic multi-turn rows that keep the
+> conversation summary in `input`, preserve the structured turns in `messages`,
+> and write `expected` as acceptance criteria.
+
 Keep the important conversation context inside `input`, because that is the
 field AgentOps maps to the azd `query`. Also keep `messages` beside it so the
 dataset has the same shape as future trace-derived rows and release evidence can
@@ -889,18 +899,54 @@ conversation experience itself:
 4. Select or upload the conversation dataset you want Foundry to evaluate.
 5. Run the evaluation and review the result in Foundry.
 
-Do not block the tutorial on copying portal URLs back into `agentops.yaml`.
-When Foundry exposes a stable automation path for this preview scope, AgentOps
-should capture that evaluation evidence automatically instead of asking you to
-paste links by hand.
-
 Reference: [Run evaluations from the Microsoft Foundry portal](https://learn.microsoft.com/azure/foundry/how-to/evaluate-generative-ai-app#create-an-evaluation).
 
-If your Foundry project already has a real rubric evaluator, add it later as an
-advanced hardening step: declare `rubrics:` in `agentops.yaml`, bind thresholds
-only to metric names that appear in the azd run output, and regenerate the recipe
-with `agentops eval init --force`. Do not use placeholder rubric names in the
-quickstart path.
+### Add the Travel Agent rubric gate
+
+Now make the rubric part of the release gate. Use the rubric evaluator that you
+created or selected in Foundry / azd for this Travel Agent project. Do not invent
+placeholder evaluator names: the value in `rubrics[].evaluator` must match the
+real evaluator name that the azd run can execute, and thresholds must bind to
+metric names that appear in the azd output.
+
+Add the rubric metadata and thresholds to `agentops.yaml`. Replace every
+`<...>` value with the evaluator and metric names from your Foundry / azd rubric
+run before you save the file:
+
+```yaml
+rubrics:
+  - name: travel-concierge-quality
+    evaluator:
+    description: Scores the Travel Agent against the intended product behavior.
+    dimensions:
+      - name:
+        description: Completes the user's travel-planning goal across the conversation.
+        weight: 0.5
+      - name:
+        description: Carries user constraints such as kids, budget, duration, and pace.
+        weight: 0.3
+      - name:
+        description: Avoids claiming live bookings, confirmations, or prices it cannot verify.
+        weight: 0.2
+
+thresholds:
+  : ">=4"
+  : ">=4"
+  : ">=4"
+```
+
+Regenerate the azd recipe so the rubric evaluator is included in the backend
+run, then run the gate again:
+
+```powershell
+agentops eval init --force
+agentops eval run
+```
+
+When this passes, the release gate has both the conversation-context dataset and
+the Travel Agent rubric thresholds. If a metric name is wrong, AgentOps will keep
+the run from passing the configured threshold gate because the threshold cannot
+bind to an emitted metric.
 
 ### Add ASSERT evidence to the release proof
 
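
The binding rule in the new rubric section (a threshold only counts when its metric name appears in the azd output) can be illustrated with a minimal Python sketch. It is not AgentOps code and makes no claim about how AgentOps implements the gate. The function `evaluate_gate`, the rule grammar it parses, and the metric name `goal_completion` are all hypothetical and chosen only for illustration.

```python
import operator
import re

# Hypothetical illustration only: this is NOT the AgentOps implementation.
# It models threshold rules such as ">=4" checked against emitted metrics.
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
_RULE = re.compile(r"^\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def evaluate_gate(thresholds, emitted):
    """Return (passed, problems) for threshold rules checked against emitted metrics.

    thresholds: dict mapping metric name -> rule string, e.g. {"goal_completion": ">=4"}
    emitted:    dict mapping metric name -> numeric score reported by an evaluation run
    """
    problems = []
    for metric, rule in thresholds.items():
        match = _RULE.match(rule)
        if match is None:
            problems.append(f"{metric}: cannot parse rule {rule!r}")
            continue
        if metric not in emitted:
            # An unbound threshold fails the gate instead of being skipped.
            problems.append(f"{metric}: no emitted metric with this name")
            continue
        op, limit = match.group(1), float(match.group(2))
        if not _OPS[op](emitted[metric], limit):
            problems.append(f"{metric}: {emitted[metric]} does not satisfy {rule}")
    return (not problems, problems)


if __name__ == "__main__":
    # "goal_completion" is a made-up metric name, not one taken from the tutorial.
    passed, problems = evaluate_gate({"goal_completion": ">=4"}, {"goal_completion": 4.5})
    print(passed, problems)
```

In this sketch a misspelled metric name is reported as a failure rather than silently ignored, which mirrors the behavior the patch describes for a threshold that cannot bind to an emitted metric.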