Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
90 changes: 75 additions & 15 deletions docs/tutorial-prompt-agent-quickstart.md
Original file line number Diff line number Diff line change
Expand Up @@ -101,6 +101,14 @@ Two ideas to internalize:
per-environment deploy artifact `foundry-agent.json`. When you ask
"is the same prompt running in sandbox, dev, and prod?", you compare
SHAs, not version numbers. The version numbers will differ.
4. **PRs evaluate a candidate; `agentops.yaml` pins the rolling
baseline.** When a PR changes the prompt content,
`prompt_deploy stage` creates a new candidate Foundry version in the
target environment and evaluates *that*, while the Release Evidence
`Target` field continues to show whatever `agentops.yaml` pins. The
two intentionally diverge on every prompt-changing PR; the pin only
moves when you explicitly promote a new baseline by bumping
`agentops.yaml`.

The longer walkthrough of that identity story is in step 13, when you
have a real `foundry-agent.json` artifact to open.
Expand All @@ -114,7 +122,7 @@ have a real `foundry-agent.json` artifact to open.
| Promote the prompt to git | Editor | Copy validated instructions into `.agentops/prompts/travel-agent.md`. | The CI gate reads this file. |
| First green PR + dev deploy | GitHub Actions + Foundry dev project | Push prompt, open PR, watch CI auto-bootstrap the first version of `travel-agent` in dev from `prompt_agent_bootstrap` (the dev project is still empty at this point), evaluate it, run Doctor; merge; deploy lands in dev. | Owns the gate, the bootstrap-on-first-deploy, the threshold decision, the Doctor blocking step, the deploy artifact, and the release evidence. |
| Force a regression | Editor + GitHub Actions | Edit the prompt to a worse version, push, observe BOTH eval threshold failure AND Doctor regression CRITICAL. | Catches the regression at PR time, not after merge. |
| Fix and redeploy | Editor + GitHub Actions | Restore prompt, push, PR green, merge, deploy. | Records the recovery. |
| Fix and redeploy | Editor + GitHub Actions | Push an improved prompt that restores the guardrails and adds a small refinement, PR re-runs green, merge, deploy records a brand-new candidate version. | Records the recovery and makes the candidate-vs-baseline split visible: a new Foundry version is evaluated while `agentops.yaml` (and therefore the Release Evidence `Target`) is unchanged. |
| Review readiness | AgentOps Doctor + Cockpit | Check CI, eval, telemetry, evidence, and links. | Turns scattered signals into release blockers, warnings, evidence files, and next actions. |

## 1. Create a clean workspace and install AgentOps
Expand Down Expand Up @@ -1038,7 +1046,12 @@ post-deploy production alert.

## 15. Fix and redeploy

Restore the prompt to the good version:
Push an *improved* prompt that re-establishes the guardrails and adds
one small refinement. The refinement is narrow on purpose — it
re-establishes the structure so the eval threshold passes and Doctor's
regression findings clear, while still being a real content change so
`prompt_deploy stage` creates a new candidate version in dev (instead
of reusing the existing one).

```powershell
@'
Expand All @@ -1050,31 +1063,75 @@ Help users plan short leisure trips. Always include:
- practical notes about budget, transit, weather, or booking constraints;
- a reminder that you cannot make live reservations or purchases.

Ask one clarifying question only when the destination, duration, or
traveler preference is missing. Do not invent booking confirmations,
prices, or availability.
When you mention monetary amounts, include the currency code when it is
known (for example, USD 120 or EUR 35). Ask one clarifying question
only when the destination, duration, or traveler preference is missing.
Do not invent booking confirmations, prices, or availability.
'@ | Set-Content -Encoding utf8 .agentops\prompts\travel-agent.md
```

Commit and push:

```powershell
git add .agentops\prompts\travel-agent.md
git commit -m "Restore travel agent prompt"
git commit -m "Improve travel agent prompt: restore guardrails, require currency codes"
git push
```

The same PR re-runs (no new PR needed). The eval should pass again and
Doctor's regression findings should clear because the candidate's
metrics return to the rolling baseline. Merge. The dev deploy workflow
records the restored prompt as the dev deployment with a new
`foundry-agent.json` artifact that has the SHA of the recovered prompt.
The same PR re-runs (no new PR is opened). Eval thresholds should pass
again and Doctor's regression findings should clear, because the
candidate's metrics return to the rolling baseline. Merge. The dev
deploy workflow then re-runs and uploads a fresh
`foundry-agent-dev-deployment` artifact whose `foundry-agent.json`
looks like this — note the field values:

```json
{
"action": "created",
"agent_name": "travel-agent",
"source_agent": "travel-agent:2",
"candidate_agent": "travel-agent:N",
"source_version": "2",
"candidate_version": "N",
"prompt_sha256": "<new sha — different from the byte-identical baseline>",
"git_sha": "<your fix commit>",
...
}
```

`N` is whatever number Foundry assigns next in *your* dev project; it
will be higher than the previous candidate. The interesting thing is
the split:

- **`candidate_agent: travel-agent:N`** is what CI actually evaluated
and recorded as the dev deployment. It changes on every PR that
changes the prompt content.
- The Release Evidence card on Cockpit and the `agent:` line in
`agentops.yaml` still show **`travel-agent:2`**. That value is the
*declarative pin* — the rolling baseline your team agreed is "what
production should be aligned to". It does not move just because a PR
flowed through; it moves only when you explicitly bump
`agentops.yaml` to promote a new baseline.
- Durable cross-environment identity lives in `prompt_sha256` +
`git_sha`, not in the version number. Foundry numbers are
environment-local.

> **Recording / re-recording note.** If you already pushed a
> byte-identical restore in an earlier take, `prompt_deploy stage`
> reported `action: "reused"` and `candidate_agent: travel-agent:2`
> (no new version was created because the file matched
> `travel-agent:2`'s instructions in Foundry exactly). Push one more
> commit with the improved prompt above and the stage step will report
> `action: "created"` and a new `candidate_agent: travel-agent:N` —
> which is the state the rest of the tutorial assumes.

The learning loop is the point: the prompt source of truth is in git,
the PR workflow exercises it as a candidate in dev, Doctor catches
regressions that thresholds alone miss, and the merge promotes through
the deploy workflow. None of those gates require the developer to
remember to look at a dashboard.
remember to look at a dashboard, and the candidate-vs-pin split is what
keeps "what we evaluated this PR" honest about being separate from
"what we declare as the baseline".

## 16. Brief observability checkout (Foundry side)

Expand Down Expand Up @@ -1176,9 +1233,12 @@ You are done when:
by the eval threshold gate and once by Doctor's
`--severity-fail critical` step. You can explain that either gate is
sufficient on its own.
- You restored the prompt, the PR returned to green, the merge ran the
dev deploy again, and the new `foundry-agent.json` shows the recovered
prompt's SHA.
- You pushed an improved prompt that re-established the guardrails. The
PR returned to green, the merge ran the dev deploy again, and the new
`foundry-agent.json` shows `action: created` with a higher
`candidate_agent` version number and the new prompt's SHA — while
`agentops.yaml` (and therefore the Release Evidence `Target`) still
pins the same baseline.
- `agentops doctor --evidence-pack` writes
`.agentops/release/latest/evidence.md`, and the GitHub run summary
shows its Doctor finding summary.
Expand Down
Loading