Sample #7 in the Embr POC fleet. Demonstrates an agent + LLM-as-a-judge loop running entirely in user code on Embr — every user turn is scored by a second Foundry agent on relevance, groundedness, and safety, then surfaced via the chat UI, a JSON API, a CSV export, and OpenTelemetry GenAI–shaped JSON-Lines on stdout.
- Worker agent: a normal Foundry agent with `lookup_fact` and `do_math` function tools (Embr Corp facts + safe arithmetic).
- Judge agent: a separate Foundry agent (no tools) prompted to score the worker's reply on three axes and return strict JSON.
- Eval store: bounded `deque(maxlen=500)` in process memory.
- OTel-shaped logs: JSON-Lines on stdout matching the OpenTelemetry GenAI semantic conventions for both inference and evaluation events.
- Chat UI: per-turn badges (`🎯 R:4 📚 G:5 ✅ Safe`), expandable rationale, rolling-20 averages at the top.
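The "safe arithmetic" behind `do_math` is not shown in this README. One common way to get it is to walk the expression's syntax tree and allow only numeric literals and a small set of operators, instead of calling `eval`. The sketch below assumes that approach; the name `safe_eval` and the operator whitelist are illustrative, not taken from the sample.

```python
import ast
import operator

# Whitelisted operators (an illustrative choice); anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, / over numeric literals without calling eval()."""

    def _walk(node: ast.AST) -> float:
        # Only plain int/float literals are accepted (bool and str are not).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_walk(node.left), _walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_walk(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _walk(ast.parse(expression, mode="eval").body)
```

With this sketch, `safe_eval("17 * 23 + 7")` returns `398`, while names, attribute access, and function calls raise `ValueError`.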
This sample exists to show what a customer must build themselves today to get production-quality LLM evals on Embr. Each item below is a missing primitive.
Every customer rolls their own worker→judge→store→export pipeline. `embr.yaml` should accept something like:

eval:
  judge:
    provider: foundry-agent
    agentId: judge-v1
  metrics: [relevance, groundedness, safety]
  emitOtel: true

…and Embr would orchestrate the judge call automatically, expose a built-in eval dashboard in the portal, and emit OTel telemetry without app changes.
This sample writes structured JSON lines like:

{"event":"gen_ai.evaluation.score","ts":"2025-…","gen_ai.evaluation.metric":"relevance","gen_ai.evaluation.score":4,"trace.id":"…"}

Kusto receives them in the `Message` column as opaque strings. An Embr-shipped OTel GenAI auto-detector / Kusto update policy could surface them as first-class columns and power dashboards/alerts on `gen_ai.evaluation.score`.
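For reference, a line of that shape can be produced with the standard library alone. This is a minimal sketch: the field names come from the example line above, while the function signature and the optional `out` parameter are assumptions rather than the sample's actual `emit_eval_score`.

```python
import json
import sys
from datetime import datetime, timezone
from typing import Optional, TextIO


def emit_eval_score(metric: str, score: int, trace_id: str,
                    out: Optional[TextIO] = None) -> str:
    """Write one evaluation event as a single JSON line and return that line."""
    stream = out if out is not None else sys.stdout
    line = json.dumps(
        {
            "event": "gen_ai.evaluation.score",
            "ts": datetime.now(timezone.utc).isoformat(),
            "gen_ai.evaluation.metric": metric,
            "gen_ai.evaluation.score": score,
            "trace.id": trace_id,
        },
        separators=(",", ":"),  # compact output, one event per line
    )
    stream.write(line + "\n")
    stream.flush()  # avoid events sitting in a buffer if the process exits
    return line
```

Keeping each event on exactly one line is what lets a log pipeline treat stdout as JSON-Lines.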
app/store.py is a deque(maxlen=500) that wipes on restart. There is no
embr.evalStore (Cosmos-backed, scoped per-environment) primitive. A real
customer must bring their own Cosmos / Postgres / log ingestion.
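To make the limitation concrete, here is a rough sketch of what a deque-backed store of this kind looks like. The class and method names (other than `add`, which appears in the flow diagram) and the record field names are assumptions based on the CSV columns in this README, not a copy of `app/store.py`.

```python
from collections import deque
from statistics import fmean
from typing import Any, Deque, Dict, List


class EvalStore:
    """Bounded in-memory eval log: oldest records fall off, all are lost on restart."""

    def __init__(self, maxlen: int = 500) -> None:
        self._records: Deque[Dict[str, Any]] = deque(maxlen=maxlen)

    def add(self, record: Dict[str, Any]) -> None:
        self._records.append(record)

    def recent(self, limit: int = 50) -> List[Dict[str, Any]]:
        """Return up to `limit` records, newest first."""
        if limit <= 0:
            return []
        return list(self._records)[-limit:][::-1]

    def stats(self, window: int = 20) -> Dict[str, float]:
        """Rolling averages over the last `window` records."""
        tail = list(self._records)[-window:]
        if not tail:
            return {"count": 0}
        return {
            "count": len(tail),
            "relevance": fmean(r["relevance"] for r in tail),
            "groundedness": fmean(r["groundedness"] for r in tail),
            "safetyPassRate": fmean(1.0 if r["safety"] == "pass" else 0.0 for r in tail),
        }
```

Nothing here survives a process restart or is shared between replicas, which is the gap a managed store would close.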
Both the worker and judge use the same ClientSecretCredential. With managed
identity (a separate platform ask) and Embr-issued per-agent identities, a
customer could grant the judge narrower permissions than the worker (e.g.
read-only on the project, no thread-creation rights). Today, no such scoping exists.
To be alerted when safetyPassRate drops below 0.95 over the last hour, the
customer must wire up their own alerting via Application Insights metric alerts,
Kusto alert rules, or a third-party service. A native `alerts:` block in embr.yaml would close this gap.
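Until such a block exists, the check itself has to live in customer code. A minimal sketch of the rule described above, assuming each eval record carries an ISO-8601 `ts` string and a `safety` value of `"pass"` or `"fail"` (as in the CSV export); the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Mapping, Optional


def _parse_ts(ts: str) -> datetime:
    # Accept a trailing "Z" on Python versions where fromisoformat() does not.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def safety_pass_rate(records: Iterable[Mapping[str, str]], now: datetime,
                     window: timedelta = timedelta(hours=1)) -> Optional[float]:
    """Fraction of records inside the window whose safety is "pass".

    `now` must be timezone-aware. Returns None when the window is empty.
    """
    cutoff = now - window
    recent = [r for r in records if _parse_ts(r["ts"]) >= cutoff]
    if not recent:
        return None
    return sum(r["safety"] == "pass" for r in recent) / len(recent)


def should_alert(records: Iterable[Mapping[str, str]], now: datetime,
                 threshold: float = 0.95) -> bool:
    """True when the last hour's pass rate is below the threshold."""
    rate = safety_pass_rate(records, now)
    return rate is not None and rate < threshold
```

An empty window returns `None` rather than 0.0 so that a quiet hour does not trigger an alert; whether that is the right policy is a design choice for the customer.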
POST /api/chat { message, threadId? }
     │
     ▼
┌──────────┐  tools: lookup_fact, do_math
│  worker  │ ── runs in app, resolves tool calls locally
└────┬─────┘
     │ assistant_text
     ▼
┌──────────┐  fresh thread per turn
│  judge   │ ── { relevance, groundedness, safety, rationale }
└────┬─────┘
     │
     ├─► store.add(record)     (in-memory deque, restart-wipes)
     ├─► emit_inference(...)   (gen_ai.client.inference.operation.details)
     ├─► emit_eval_score(...)  (gen_ai.evaluation.score × 3)
     └─► return { response, eval, threadId, runIds }
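Because the judge is an LLM, its "strict JSON" reply still needs validating before it reaches the store. The sketch below shows one defensive way to do that. It assumes integer scores from 1 to 5 for relevance and groundedness and a `"pass"`/`"fail"` string for safety, which is inferred from the badge and CSV examples in this README rather than confirmed; `Verdict` and `parse_verdict` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    relevance: int
    groundedness: int
    safety: str
    rationale: str


def parse_verdict(raw: str) -> Verdict:
    """Parse and validate the judge's reply; raise ValueError on anything off-spec."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    for axis in ("relevance", "groundedness"):
        value = data.get(axis)
        # Assumed scale: integers 1..5 (bool is excluded by the exact type check).
        if type(value) is not int or not 1 <= value <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5, got {value!r}")
    if data.get("safety") not in ("pass", "fail"):
        raise ValueError(f"safety must be 'pass' or 'fail', got {data.get('safety')!r}")
    return Verdict(
        relevance=data["relevance"],
        groundedness=data["groundedness"],
        safety=data["safety"],
        rationale=str(data.get("rationale", "")),
    )
```

A caller can catch `ValueError` to retry the judge or record the turn as unscored instead of failing the chat request.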
| Method | Path | Description |
|---|---|---|
| GET | `/` | Chat UI |
| GET | `/health` | `{status:"ok"}` |
| POST | `/api/chat` | `{message, threadId?}` → run worker + judge, return verdict |
| GET | `/api/evals?limit=50` | Recent eval records (newest first) + stats |
| GET | `/api/evals/stats` | Rolling averages |
| GET | `/api/evals.csv` | Full eval log as CSV (download) |
| GET | `/api/diag` | Config / agents-ready / OTel emit count |
cd embr-foundry-eval-loop-sample
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export FOUNDRY_PROJECT_ENDPOINT="https://<name>.services.ai.azure.com/api/projects/<project>"
export FOUNDRY_MODEL_DEPLOYMENT="gpt-4o-mini" # or any deployment in your project
# Either run `az login`, or set:
# AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET
uvicorn app.main:app --reload --port 8000

Then open http://localhost:8000.
EMBR_TOKEN is not needed — this is a Foundry-only sample.
Optional: set FOUNDRY_JUDGE_MODEL_DEPLOYMENT to a different deployment than
the worker's, to demonstrate cross-model evaluation.
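The wording above suggests that the judge falls back to the worker's deployment when the optional variable is unset. That fallback is an assumption about the sample's behavior; under it, the lookup could be as small as this (`resolve_deployments` is a hypothetical helper):

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_deployments(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (worker_deployment, judge_deployment) from environment variables."""
    env = os.environ if env is None else env
    worker = env["FOUNDRY_MODEL_DEPLOYMENT"]  # required; KeyError if missing
    # Assumed fallback: an unset or empty judge variable reuses the worker's deployment.
    judge = env.get("FOUNDRY_JUDGE_MODEL_DEPLOYMENT") or worker
    return worker, judge
```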
embr quickstart deploy embr-devs/embr-foundry-eval-loop-sample -i 120233234
embr variables set FOUNDRY_PROJECT_ENDPOINT "https://…/api/projects/…"
embr variables set FOUNDRY_MODEL_DEPLOYMENT "gpt-4o-mini"
embr variables set AZURE_TENANT_ID "…"
embr variables set AZURE_CLIENT_ID "…"
embr variables set AZURE_CLIENT_SECRET "…" --secret

- Open the chat UI.
- Send 5 questions covering each tool path:
  - How many departments does Embr Corp have? (expects `lookup_fact`)
  - What is 17 * 23 + 7? (expects `do_math`)
  - What's Embr Corp's mission? (expects `lookup_fact`)
  - What's the capital of France? (no tool; judge should mark this as groundedness 5 if answered, but flag fabrication if hallucinated about Embr Corp)
  - Roll a d20. (no matching tool; worker should refuse / note the limitation)
- Confirm each assistant turn shows three badges.
- Click a badge row to expand the judge rationale.
- Download `/api/evals.csv` and confirm a row like:

  id,ts,user_input,worker_output,relevance,groundedness,safety,rationale,latency_worker_ms,latency_judge_ms,run_id_worker,run_id_judge
  8d3f…,2025-…Z,How many departments does Embr Corp have?,Embr Corp has 7 departments…,5,5,pass,Directly answers using a verified fact.,2148,1607,run_AAA…,run_BBB…

- On Embr, query Kusto:

  GlobalControlPlaneEvents
  | where AksPodName startswith "embr-foundry-eval-loop-sample"
  | where Message has "gen_ai.evaluation.score"
  | take 20

  Today these events appear only as raw strings in the `Message` column, which is exactly the platform finding this sample is designed to surface.
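The CSV checked in the steps above can be produced from in-memory records with the standard `csv` module. This is a sketch, not the sample's actual export code: the column list is copied from the header row shown above, and `records_to_csv` is a hypothetical name.

```python
import csv
import io
from typing import Any, Iterable, Mapping

# Column order copied from the header row in the verification steps.
CSV_COLUMNS = [
    "id", "ts", "user_input", "worker_output", "relevance", "groundedness",
    "safety", "rationale", "latency_worker_ms", "latency_judge_ms",
    "run_id_worker", "run_id_judge",
]


def records_to_csv(records: Iterable[Mapping[str, Any]]) -> str:
    """Serialize eval records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells; unknown keys are dropped.
        writer.writerow({col: record.get(col, "") for col in CSV_COLUMNS})
    return buf.getvalue()
```

Using `csv.DictWriter` rather than joining on commas matters here because user input and worker output routinely contain commas, quotes, and newlines, which the writer quotes correctly.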
# embr.yaml (hypothetical)
platform: python
platformVersion: "3.12"
eval:
  judge:
    provider: foundry-agent
    agentId: judge-v1
    model: gpt-4o-mini
  metrics: [relevance, groundedness, safety]
  store: managed   # Embr-managed Cosmos-backed eval log
  emitOtel: true   # Auto-promote gen_ai.* stdout to first-class telemetry
alerts:
  - metric: safety
    condition: passRate < 0.95
    window: 1h

With this, the worker code stays exactly the same: Embr handles judge orchestration, eval persistence, OTel promotion, and alerting.