20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,26 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres

## [Unreleased]

### Added
- **`agentops assert run` orchestrates the open-source ASSERT framework.**
AgentOps now invokes the `assert-ai` CLI as an active CI step instead of only
consuming pre-generated artifacts via `assert_path:`. A new `assert:` block in
`agentops.yaml` (`config`, `results_dir`, `suite`, `run_id`,
`fail_on_violations`) drives subprocess invocation, locates the run output
under `<results_dir>/<suite>/<run>/`, parses `metrics.json` and
`scores.jsonl`, and writes a normalized summary at `.agentops/assert/latest.json`
that the release evidence pack ingests automatically. The command exits with
code 2 when any policy dimension reports violations.
- **`agentops redteam run` orchestrates Foundry's AI Red Teaming agent (PyRIT).**
AgentOps now invokes `azure.ai.evaluation.red_team.RedTeam` against the
configured target (Azure OpenAI deployment, Foundry prompt agent, or HTTP
endpoint) and normalizes the per-category and per-strategy attack outcomes.
A new `redteam:` block in `agentops.yaml` (`target`, `risk_categories`,
`attack_strategies`, `num_objectives`, `fail_on_attack_success_rate`)
controls the scan; results land at `.agentops/redteam/latest.json` so the
evidence pack picks them up via `redteam_path:` automatically. The command
exits with code 2 when the attack-success-rate exceeds the configured threshold.

## [0.3.13] - 2026-06-09

### Fixed
84 changes: 51 additions & 33 deletions README.md
@@ -1,7 +1,9 @@
<h1 align="center">AgentOps Accelerator</h1>

<p align="center">
Answer the release question for Microsoft Foundry agents: can we ship it, and where is the proof?
<b>Open-source framework and CLI for continuous evaluation, safety testing, and release readiness of Microsoft Foundry agents.</b>
<br/>
Can we ship it, and where is the proof?
</p>

<p align="center">
@@ -19,25 +21,52 @@ Answer the release question for Microsoft Foundry agents: can we ship it, and wh

## Overview

AgentOps Accelerator helps teams turn Foundry agent work into a clear release
decision. Foundry is the agent control plane; AgentOps turns Foundry signals and
repo checks into repeatable gates, Doctor readiness, release evidence, and
trace-driven regression loops.

The project enables:

- Local and CI execution for release gates
- Foundry prompt agent, Foundry hosted endpoint, HTTP/JSON agent, and raw model targets
- Auto-selected evaluators for RAG, tools, and model quality
- Stable `results.json` for automation
- PR-friendly `report.md`
- Baseline comparison for regression detection
- Doctor checks for repo, CI/CD, telemetry, landing zones, and Foundry setup
- Release evidence packs for promotion review
- Optional `azd ai agent eval` execution with Rubric/custom metric binding
- ASSERT, ACS, and red-team governance evidence references
- Trace promotion into regression datasets
- Cockpit navigation for AgentOps, Foundry, and Azure Monitor
**AgentOps Accelerator is an open-source framework and CLI that standardizes
continuous evaluation, safety testing, and release readiness for enterprise AI
agents — with Microsoft Foundry as the agent runtime.**

It is an *orchestrator*, not a reimplementation. AgentOps wires together the
tools you already use — Foundry Evaluations, `azd ai agent eval`, the
open-source ASSERT framework, the PyRIT-backed AI Red Teaming agent, Azure
Monitor / Application Insights, and your CI/CD platform — into a single
repeatable release loop:

1. **Evaluate** the agent against datasets, rubrics, and policies — locally or
in the cloud — using auto-selected evaluators for RAG, tool use, model
quality, and safety.
2. **Probe** the agent with adversarial inputs by orchestrating ASSERT
(`agentops assert run`) and the Foundry/PyRIT Red Teaming agent
(`agentops redteam run`) as active CI steps.
3. **Diagnose** repo, telemetry, landing zone, and Foundry readiness with
`agentops doctor`.
4. **Gate** the release with a deterministic exit-code contract that PRs and
pipelines can rely on.
5. **Prove** the release with a stable evidence pack (`evidence.json` +
`evidence.md`) that bundles eval results, ASSERT verdicts, red-team
findings, telemetry readiness, and Doctor findings for promotion review.
6. **Learn from production** by promoting reviewed traces into regression
datasets that feed the next eval cycle.

The output is a clear answer to two questions reviewers actually ask:
**can we ship it, and where is the proof?**

### Core outputs

| Artifact | Produced by | Audience |
|---|---|---|
| `results.json` | `agentops eval run` | CI / automation |
| `report.md` | `agentops eval run` | PR reviewers |
| `.agentops/assert/latest.json` | `agentops assert run` | Evidence pack, CI gate |
| `.agentops/redteam/latest.json` | `agentops redteam run` | Evidence pack, CI gate |
| `evidence.json` / `evidence.md` | `agentops doctor --evidence-pack` | Release approver |
| Cockpit (localhost) | `agentops cockpit` | Engineer reviewing readiness |

### Exit-code contract

- `0` — execution succeeded and all gates passed
- `2` — execution succeeded but a threshold, ASSERT violation, red-team rate,
or Doctor severity gate failed
- `1` — runtime or configuration error
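
A CI wrapper can act on this contract by mapping the process exit code to a release decision. The following is a minimal sketch, not part of AgentOps: `classify_exit_code` and `run_gate` are hypothetical helper names, and treating any code other than `0` or `2` as an error is an assumption made for illustration.

```python
import subprocess
import sys

# Meanings taken from the exit-code contract above.
EXIT_MEANINGS = {
    0: "passed",       # execution succeeded and all gates passed
    2: "gate_failed",  # execution succeeded but a gate failed
    1: "error",        # runtime or configuration error
}


def classify_exit_code(code: int) -> str:
    """Map an exit code to a decision label.

    Assumption: any code outside the documented contract is treated as an error.
    """
    return EXIT_MEANINGS.get(code, "error")


def run_gate(command: list[str]) -> str:
    """Run a command without raising on non-zero exit and classify the result."""
    completed = subprocess.run(command, check=False)
    return classify_exit_code(completed.returncode)


if __name__ == "__main__":
    # Stand-in child process that exits with code 2, simulating a failed gate.
    print(run_gate([sys.executable, "-c", "import sys; sys.exit(2)"]))
```

Keeping `2` distinct from `1` lets a pipeline block the release on a failed gate while routing runtime errors to a retry or an infrastructure alert.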

## AgentOps and Microsoft Foundry

@@ -50,26 +79,15 @@ ship/no-ship workflow.
|---|---|---|
| Build and version | Foundry portal, Foundry SDK/Toolkit, `microsoft-foundry` skill, azd | Pins the exact candidate in `agentops.yaml` and generates the PR/release gate around it |
| Evaluate and compare | Foundry Evaluations, `azd ai agent eval`, Rubric evaluator, and official CI actions/extensions | Keeps datasets and thresholds in the repo, records evidence, normalizes azd/Rubric outputs, and provides local/fallback runs for non-prompt targets |
| Probe safety | ASSERT framework, PyRIT-backed AI Red Teaming agent | Runs both as active CI steps via `agentops assert run` and `agentops redteam run`, normalizes verdicts, and gates the pipeline |
| Observe and investigate | Foundry Monitor, Traces, Azure Monitor, App Insights | Surfaces deep links, telemetry readiness, Doctor findings, and Cockpit navigation |
| Decide release | Branch protection, environments, approvals | Packages `evidence.json` / `evidence.md` for promotion review |
| Govern controls | ASSERT, ACS, Foundry Guardrails, Foundry red-team scans | References reviewed artifacts by path/hash/status without executing or applying the external controls |
| Govern controls | ACS, Foundry Guardrails | References reviewed artifacts by path/hash/status without executing or applying the external controls |
| Improve from production | Production traces and Foundry datasets | Promotes reviewed trace learnings into regression candidates |

The rhythm is simple: build and operate the agent in Foundry, keep the release
contract in the repo, and let AgentOps connect the two into a clean review loop.

Core outputs:

- `results.json` (machine-readable)
- `report.md` (human-readable)
- `evidence.json` / `evidence.md` (from `agentops doctor --evidence-pack`)

Exit code contract:

- `0` execution succeeded and all thresholds passed
- `2` execution succeeded but one or more thresholds failed
- `1` runtime or configuration error

## Quickstart

### 1) Install
138 changes: 110 additions & 28 deletions docs/tutorial-prompt-agent-quickstart.md
@@ -1019,47 +1019,127 @@ bind to an emitted metric. Open `.agentops/results/latest/results.json` to see
which rubric metric names actually appeared in the azd output; that is the
authoritative list of values you can put under `thresholds:`.

### Add ASSERT evidence to the release proof
### Add ASSERT and Red Team to the release gate

The normal AgentOps flow proves the release with evaluation results, Doctor
findings, workflow runs, and release evidence. ASSERT fits into that same release
proof as a governed artifact: run ASSERT in the tool or process your team uses
for policy checks, keep the reviewed policy or result summary in the repo or CI
artifact store, and point AgentOps at it.
findings, workflow runs, and release evidence. Two release-readiness signals
deserve to run inside the same loop:

AgentOps does not execute ASSERT. It records the artifact path, status, and
SHA-256 hash so Doctor and the evidence pack can show reviewers which ASSERT
evidence was used for the release. Store only approved metadata in the repo; keep
raw adversarial prompts, secrets, customer data, and detailed scan payloads in
the approved secure system.
- **ASSERT** (open-source `assert-ai`) — turns natural-language policies into
executable behavior tests (prompt injection, jailbreak, hallucination, PII
leak, unauthorized tool use). Repo: <https://github.com/responsibleai/ASSERT>.
- **AI Red Teaming** (Foundry agent, PyRIT-backed) — generates adversarial
prompts across risk categories (violence, hate, self-harm, sexual) and applies
attack strategies (base64, rot13, morse) to surface safety regressions before
users do. Docs:
<https://learn.microsoft.com/azure/ai-foundry/concepts/ai-red-teaming-agent>.

AgentOps does NOT reimplement either. It orchestrates them as active CI steps,
gates the pipeline on their results, and writes normalized JSON summaries that
the evidence pack ingests automatically.

#### 10a. Run ASSERT against the Travel Agent

Install ASSERT and scaffold a minimal eval config:

```powershell
New-Item -ItemType Directory -Force .agentops\governance | Out-Null
pip install assert-ai

New-Item -ItemType Directory -Force .\assert | Out-Null
@'
# ASSERT evidence

Status: reviewed
Source: <link-to-approved-assert-run-or-policy>
Scope: Travel Agent release readiness
Notes: ASSERT execution remains in the owning ASSERT workflow; AgentOps records
this artifact as release evidence only.
'@ | Set-Content -Encoding utf8 .agentops\governance\assert-evidence.md
suite_id: travel-agent-v1
run_id: ci-tutorial
target:
  type: azure_openai
  deployment: gpt-4o-mini
dimensions:
  - prompt_injection
  - pii_leak
  - jailbreak
num_cases_per_dimension: 5
'@ | Set-Content -Encoding utf8 .\assert\eval_config.yaml
```

Then reference it from `agentops.yaml`:
Add the `assert:` block to `agentops.yaml`:

```yaml
assert_path: .agentops/governance/assert-evidence.md
assert:
  config: ./assert/eval_config.yaml
  fail_on_violations: true
```

Run it through AgentOps:

```powershell
agentops assert run
```

When you later run:
What AgentOps does for you:

1. Verifies `assert-ai` is installed.
2. Invokes `assert-ai run --config ./assert/eval_config.yaml`.
3. Locates the run output under `artifacts/results/<suite>/<run>/`.
4. Parses `metrics.json` and `scores.jsonl` for per-dimension verdicts.
5. Writes a normalized summary at `.agentops/assert/latest.json`.
6. Exits non-zero (code 2) when ASSERT reports any policy violation, unless
you pass `--no-gate` or set `assert.fail_on_violations: false`.
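
Steps 4 to 6 above can be pictured with a small sketch. It is illustrative only and does not reproduce the AgentOps implementation: the `scores.jsonl` field names `dimension` and `violation`, the shape of the returned summary, and the helper names `summarize_scores` and `assert_gate_exit_code` are all assumptions, since the ASSERT output schema is not shown here.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_scores(lines: Iterable[str]) -> dict:
    """Aggregate JSON Lines score records into per-dimension counts.

    Assumption: each record has a `dimension` string and an optional
    boolean `violation` field. Both names are hypothetical.
    """
    cases = Counter()
    violations = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL file
        record = json.loads(line)
        dimension = record["dimension"]
        cases[dimension] += 1
        if record.get("violation", False):
            violations[dimension] += 1
    return {
        "total_cases": sum(cases.values()),
        "total_violations": sum(violations.values()),
        "dimensions": {
            name: {"cases": cases[name], "violations": violations[name]}
            for name in sorted(cases)
        },
    }


def assert_gate_exit_code(summary: dict, fail_on_violations: bool = True) -> int:
    """Return 2 when gating is enabled and any violation was recorded, else 0."""
    if fail_on_violations and summary["total_violations"] > 0:
        return 2
    return 0
```

In this sketch, passing `fail_on_violations=False` mirrors the effect of `assert.fail_on_violations: false`: the summary is still produced, but the gate no longer fails the run.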

#### 10b. Run the AI Red Teaming agent

Install Foundry's Red Team SDK (it ships as the `redteam` extra of `azure-ai-evaluation`):

```powershell
pip install "azure-ai-evaluation[redteam]"
```

Add the `redteam:` block to `agentops.yaml`:

```yaml
redteam:
  target:
    model_deployment: gpt-4o-mini
  risk_categories: [violence, hate_unfairness, self_harm, sexual]
  attack_strategies: [base64, rot13, morse]
  num_objectives: 5
  fail_on_attack_success_rate: 0.2  # fail if >20% of attacks succeed
```

Run it:

```powershell
agentops redteam run
```

What AgentOps does for you:

1. Verifies the `RedTeam` Python API is importable.
2. Resolves the target (deployment / agent / endpoint) from the YAML.
3. Calls `RedTeam.scan(...)` with the configured risk categories, strategies,
and objective count.
4. Aggregates per-category and per-strategy attack-success-rate.
5. Writes a normalized summary at `.agentops/redteam/latest.json` plus the
raw SDK payload at `.agentops/redteam/raw_summary.json`.
6. Exits non-zero (code 2) when overall attack-success-rate exceeds
`fail_on_attack_success_rate`, unless you pass `--no-gate`.
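
The aggregation and gating in steps 4 and 6 come down to simple arithmetic, sketched below. This is not the AgentOps or SDK code: the `AttackResult` record and the helpers `success_rate`, `rates_by`, and `redteam_gate_exit_code` are hypothetical names, and reading "exceeds" as a strict greater-than comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackResult:
    """One attack attempt (hypothetical record shape for illustration)."""
    risk_category: str
    attack_strategy: str
    succeeded: bool


def success_rate(results: list[AttackResult]) -> float:
    """Fraction of attacks that succeeded; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(r.succeeded for r in results) / len(results)


def rates_by(results: list[AttackResult], key: str) -> dict[str, float]:
    """Group results by an attribute name and compute each group's success rate."""
    groups: dict[str, list[AttackResult]] = {}
    for result in results:
        groups.setdefault(getattr(result, key), []).append(result)
    return {name: success_rate(group) for name, group in sorted(groups.items())}


def redteam_gate_exit_code(results: list[AttackResult], threshold: float) -> int:
    """Return 2 when the overall rate strictly exceeds the threshold, else 0."""
    return 2 if success_rate(results) > threshold else 0
```

For example, one successful attack out of four gives an overall rate of 0.25, which fails a threshold of 0.2 but passes a threshold of 0.25 under the strict comparison assumed here.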

> **Heads-up.** Both commands hit live Azure services. Run them against a
> non-production deployment and budget for the cost of the configured
> objective count.

#### 10c. Pull both into the release evidence pack

Both runners write to well-known paths the evidence pack already auto-discovers
(via `assert_path` and `redteam_path` resolution). When you produce the
evidence pack:

```powershell
agentops doctor --workspace . --evidence-pack
```

the release evidence includes the ASSERT path, status, and SHA-256 hash without
claiming that AgentOps executed ASSERT.
`evidence.json` and `evidence.md` now include the suite and run IDs, total
cases, violation counts, attack-success-rate, and SHA-256 hashes for both
artifacts. The verdicts themselves come from ASSERT and PyRIT; AgentOps owns
only orchestration, normalization, and gating.
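
A reviewer who wants to confirm that an artifact matches the hash cited in the evidence pack can recompute the SHA-256 digest independently. The sketch below uses only the Python standard library; `sha256_of_file` is a hypothetical helper name, and the assumption is that the cited hash is a plain hex digest of the file's bytes.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Paths taken from the tutorial; they exist only after both runners have run.
    for artifact in (
        Path(".agentops/assert/latest.json"),
        Path(".agentops/redteam/latest.json"),
    ):
        if artifact.exists():
            print(artifact, sha256_of_file(artifact))
```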

## 11. Generate the PR + dev deploy workflows

@@ -1646,10 +1726,12 @@ You are done when:
- `agentops doctor --evidence-pack` writes
`.agentops/release/latest/evidence.md`, and the GitHub run summary
shows its Doctor finding summary.
- Optional governance artifacts are either absent (no Doctor noise) or wired as
evidence-only paths in `agentops.yaml` (`assert_path`, `acs_path`,
`redteam_path`) so the evidence pack can cite their hash/status without
claiming AgentOps executed ASSERT, applied ACS, or ran red-team scans.
- Optional safety runners are either skipped (no Doctor noise) or wired in:
`assert:` to run `agentops assert run`, and `redteam:` to run
`agentops redteam run`. Both write normalized JSON under `.agentops/` that
the evidence pack ingests automatically. Existing evidence-only references
(`assert_path`, `acs_path`, `redteam_path`), which record only hash and
status, are still honored.
- Cockpit opens and links the repo-side readiness view back to Foundry
for both sandbox and dev.
