Embr × Foundry — Eval Loop Sample

Sample #7 in the Embr POC fleet. Demonstrates an agent + LLM-as-a-judge loop running entirely in user code on Embr. Each worker reply is scored by a second Foundry agent on relevance, groundedness, and safety, and the result is surfaced through the chat UI, a JSON API, a CSV export, and OpenTelemetry GenAI-shaped JSON Lines on stdout.

What this demonstrates

Worker agent — a normal Foundry agent with lookup_fact and do_math function tools (Embr Corp facts + safe arithmetic).
Judge agent — a separate Foundry agent (no tools) prompted to score the worker's reply on three axes and return strict JSON.
Eval store — bounded deque(maxlen=500) in process memory.
OTel-shaped logs — JSON-Lines on stdout matching the OpenTelemetry GenAI semantic conventions for both inference and evaluation events.
Chat UI — per-turn badges (🎯 R:4 📚 G:5 ✅ Safe), expandable rationale, rolling-20 averages at the top.
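
Because the judge is asked to return strict JSON, the app has to validate that reply before trusting it. The following is a minimal sketch of such a check, not the sample's actual code: the names `Verdict` and `parse_verdict`, the 1 to 5 score range, and the "pass"/"fail" safety vocabulary are assumptions inferred from the badges and CSV row shown in this README.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    relevance: int
    groundedness: int
    safety: str  # "pass" or "fail" (assumed vocabulary)
    rationale: str


def parse_verdict(raw: str) -> Verdict:
    """Validate the judge's reply; raise ValueError on anything malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    scores = {}
    for axis in ("relevance", "groundedness"):
        value = data.get(axis)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5, got {value!r}")
        scores[axis] = value

    safety = data.get("safety")
    if safety not in ("pass", "fail"):
        raise ValueError(f"safety must be 'pass' or 'fail', got {safety!r}")

    return Verdict(
        scores["relevance"],
        scores["groundedness"],
        safety,
        str(data.get("rationale", "")),
    )
```

Rejecting a malformed verdict outright, rather than coercing it, keeps bad judge output from silently skewing the rolling averages.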

Embr platform findings (the whole point)

This sample exists to show what a customer must build themselves today to get production-quality LLM evals on Embr. Each item below is a missing primitive.

1. No first-class eval primitive

Every customer rolls their own worker→judge→store→export pipeline. embr.yaml should accept something like:

eval:
  judge:
    provider: foundry-agent
    agentId: judge-v1
  metrics: [relevance, groundedness, safety]
  emitOtel: true

…and Embr would orchestrate the judge call automatically, expose a built-in eval dashboard in the portal, and emit OTel telemetry without app changes.

2. No OTel GenAI parsing in `GlobalControlPlaneEvents`

This sample writes structured JSON lines like:

{"event":"gen_ai.evaluation.score","ts":"2025-…","gen_ai.evaluation.metric":"relevance","gen_ai.evaluation.score":4,"trace.id":"…"}

Kusto receives them in the Message column as opaque strings. An Embr-shipped OTel GenAI auto-detector / Kusto update policy could surface them as first-class columns and power dashboards/alerts on gen_ai.evaluation.score.
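
For reference, an emitter that produces lines of this shape can be very small. This is a sketch under assumptions: the field names are copied from the example line above, but the function signatures (including that of `emit_eval_score`, which the architecture diagram names without showing its parameters) and the helper `format_eval_score` are hypothetical.

```python
import json
import sys
from datetime import datetime, timezone


def format_eval_score(metric: str, score, trace_id: str) -> str:
    """Build one JSON line using the field names from the example above."""
    record = {
        "event": "gen_ai.evaluation.score",
        "ts": datetime.now(timezone.utc).isoformat(),
        "gen_ai.evaluation.metric": metric,
        "gen_ai.evaluation.score": score,
        "trace.id": trace_id,
    }
    # json.dumps never emits a raw newline, so each event stays on one line.
    return json.dumps(record, separators=(",", ":"))


def emit_eval_score(metric: str, score, trace_id: str, stream=None) -> None:
    """Write the line to stdout (or a supplied stream) and flush immediately."""
    out = stream if stream is not None else sys.stdout
    out.write(format_eval_score(metric, score, trace_id) + "\n")
    out.flush()
```

Flushing after every event matters when stdout is the only transport: a buffered line that never leaves the process is an eval score that never reaches Kusto.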

3. No eval-store primitive

app/store.py is a deque(maxlen=500) that wipes on restart. There is no embr.evalStore (Cosmos-backed, scoped per-environment) primitive. A real customer must bring their own Cosmos / Postgres / log ingestion.
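
To make the limitation concrete, here is a rough sketch of what a store of this kind looks like. It is not the contents of app/store.py: the class name `EvalStore`, its method names other than `add`, and the record keys are assumptions chosen to match the CSV columns and the rolling-20 averages mentioned elsewhere in this README.

```python
from collections import deque
from statistics import mean
from threading import Lock


class EvalStore:
    """Bounded in-memory eval log; the oldest records fall off when full."""

    def __init__(self, maxlen: int = 500):
        self._records = deque(maxlen=maxlen)
        self._lock = Lock()

    def add(self, record: dict) -> None:
        with self._lock:
            self._records.append(record)

    def recent(self, limit: int = 50) -> list:
        """Return up to `limit` records, newest first."""
        with self._lock:
            return list(self._records)[::-1][:limit]

    def stats(self, window: int = 20) -> dict:
        """Average the numeric axes and compute the safety pass rate
        over the newest `window` records."""
        rows = self.recent(window)
        if not rows:
            return {"count": 0}
        return {
            "count": len(rows),
            "relevance": mean(r["relevance"] for r in rows),
            "groundedness": mean(r["groundedness"] for r in rows),
            "safetyPassRate": sum(r["safety"] == "pass" for r in rows) / len(rows),
        }
```

Everything here lives in one process: a restart, a scale-out to two replicas, or a 501st record all lose data, which is the gap a managed eval store would close.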

4. No "judge model" concept in agent identity

Both the worker and judge use the same ClientSecretCredential. With managed identity (a separate platform ask) and Embr-issued per-agent identities, a customer could grant the judge narrower permissions than the worker, e.g. read-only on the project with no thread-creation rights. Today no such scoping is available.

5. No alert primitive

To be notified when safetyPassRate drops below 0.95 over the last hour, the customer must wire their own alerting via Application Insights metric alerts, Kusto alert rules, or a third-party service. A native alerts: block in embr.yaml would close this gap.
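
The check itself is simple; the missing piece is somewhere to run it and a channel to notify through. A minimal sketch of the rule, assuming each eval record carries a timezone-aware `ts` datetime and a `safety` value of "pass" or "fail" (both the record shape and the function names are hypothetical):

```python
from datetime import datetime, timedelta, timezone


def safety_pass_rate(records, now, window=timedelta(hours=1)):
    """Return the pass rate over records newer than `now - window`,
    or None when the window contains no records."""
    cutoff = now - window
    recent = [r for r in records if r["ts"] >= cutoff]
    if not recent:
        return None
    return sum(r["safety"] == "pass" for r in recent) / len(recent)


def should_alert(records, now, threshold=0.95):
    """True only when there is data in the window and the rate is below threshold."""
    rate = safety_pass_rate(records, now)
    return rate is not None and rate < threshold
```

Returning None for an empty window, instead of 0.0 or 1.0, avoids firing (or suppressing) an alert purely because the app was idle.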

Architecture

POST /api/chat { message, threadId? }
        │
        ▼
   ┌──────────┐    tools: lookup_fact, do_math
   │  worker  │ ── runs in app, resolves tool calls locally
   └────┬─────┘
        │ assistant_text
        ▼
   ┌──────────┐    fresh thread per turn
   │  judge   │ ── { relevance, groundedness, safety, rationale }
   └────┬─────┘
        │
        ├─► store.add(record)       (in-memory deque, restart-wipes)
        ├─► emit_inference(...)     (gen_ai.client.inference.operation.details)
        ├─► emit_eval_score(...)    (gen_ai.evaluation.score × 3)
        └─► return { response, eval, threadId, runIds }
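
The flow in the diagram can be sketched as a single function. This is an illustration under assumptions, not the sample's handler: the worker and judge are passed in as plain callables (`run_worker`, `run_judge`) so that no Foundry SDK call is invented here, `sinks` stands in for `store.add` and the emit functions, and `threadId` / `runIds` are omitted because they come from the real Foundry runs.

```python
import time
import uuid
from typing import Callable


def handle_turn(
    message: str,
    run_worker: Callable[[str], str],
    run_judge: Callable[[str, str], dict],
    sinks: list,
) -> dict:
    """Run the worker, then the judge, fan the record out to every sink,
    and return the API payload."""
    t0 = time.perf_counter()
    reply = run_worker(message)
    t1 = time.perf_counter()
    verdict = run_judge(message, reply)  # the judge sees question and answer
    t2 = time.perf_counter()

    record = {
        "id": uuid.uuid4().hex,
        "user_input": message,
        "worker_output": reply,
        **verdict,
        "latency_worker_ms": round((t1 - t0) * 1000),
        "latency_judge_ms": round((t2 - t1) * 1000),
    }
    for sink in sinks:  # e.g. store.add, a log emitter
        sink(record)
    return {"response": reply, "eval": verdict}
```

Note that the judge call sits on the request path in this shape, so its latency is added to every chat response; a platform-run judge could score asynchronously instead.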

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | `/` | Chat UI |
| GET | `/health` | `{status:"ok"}` |
| POST | `/api/chat` | `{message, threadId?}` → run worker + judge, return verdict |
| GET | `/api/evals?limit=50` | Recent eval records (newest first) + stats |
| GET | `/api/evals/stats` | Rolling averages |
| GET | `/api/evals.csv` | Full eval log as CSV (download) |
| GET | `/api/diag` | Config / agents-ready / OTel emit count |

Quickstart (local)

cd embr-foundry-eval-loop-sample
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export FOUNDRY_PROJECT_ENDPOINT="https://<name>.services.ai.azure.com/api/projects/<project>"
export FOUNDRY_MODEL_DEPLOYMENT="gpt-4o-mini"   # or any deployment in your project
# Either run `az login`, or set:
#   AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET
uvicorn app.main:app --reload --port 8000

Then open http://localhost:8000.

EMBR_TOKEN is not needed — this is a Foundry-only sample.

Optional: set FOUNDRY_JUDGE_MODEL_DEPLOYMENT to a different deployment than the worker's, to demonstrate cross-model evaluation.

Deploy to Embr

embr quickstart deploy embr-devs/embr-foundry-eval-loop-sample -i 120233234
embr variables set FOUNDRY_PROJECT_ENDPOINT "https://…/api/projects/…"
embr variables set FOUNDRY_MODEL_DEPLOYMENT "gpt-4o-mini"
embr variables set AZURE_TENANT_ID "…"
embr variables set AZURE_CLIENT_ID "…"
embr variables set AZURE_CLIENT_SECRET "…" --secret

Verify

Open the chat UI.
Send 5 questions covering each tool path:
- How many departments does Embr Corp have? (expects lookup_fact)
- What is 17 * 23 + 7? (expects do_math)
- What's Embr Corp's mission? (expects lookup_fact)
- What's the capital of France? (no tool; the judge should score groundedness 5 for a correct answer, but flag fabrication if the worker hallucinates something about Embr Corp)
- Roll a d20. (no matching tool; the worker should refuse or note the limitation)
Confirm each assistant turn shows three badges.
Click a badge row to expand the judge rationale.

Download /api/evals.csv and confirm a row like:

id,ts,user_input,worker_output,relevance,groundedness,safety,rationale,latency_worker_ms,latency_judge_ms,run_id_worker,run_id_judge
8d3f…,2025-…Z,How many departments does Embr Corp have?,Embr Corp has 7 departments…,5,5,pass,Directly answers using a verified fact.,2148,1607,run_AAA…,run_BBB…
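
An export in this format can be produced with the standard library alone. The sketch below assumes eval records are dicts keyed by the column names in the header above; the function name `records_to_csv` and the constant `CSV_COLUMNS` are hypothetical.

```python
import csv
import io

# Column order copied from the header row shown above.
CSV_COLUMNS = [
    "id", "ts", "user_input", "worker_output",
    "relevance", "groundedness", "safety", "rationale",
    "latency_worker_ms", "latency_judge_ms",
    "run_id_worker", "run_id_judge",
]


def records_to_csv(records) -> str:
    """Serialize eval records to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=CSV_COLUMNS,
        extrasaction="ignore",  # drop keys that are not export columns
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)  # missing keys become empty cells
    return buf.getvalue()
```

Using `csv.DictWriter` rather than joining strings by hand means user questions and rationales containing commas, quotes, or newlines are quoted correctly.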

On Embr, query Kusto:
```
GlobalControlPlaneEvents
| where AksPodName startswith "embr-foundry-eval-loop-sample"
| where Message has "gen_ai.evaluation.score"
| take 20
```
Today these events appear only as raw strings in the Message column — exactly the platform finding this sample is designed to surface.
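
Until the platform parses these lines, any consumer has to do the promotion itself. As a rough illustration of that work, the sketch below takes exported Message strings and pulls out the score events; the function name `extract_eval_scores` is hypothetical, and the field names are the ones this sample writes.

```python
import json


def extract_eval_scores(messages):
    """Yield (metric, score, trace_id) for each message that is a
    gen_ai.evaluation.score event; skip everything else."""
    for message in messages:
        try:
            event = json.loads(message)
        except (TypeError, ValueError):
            continue  # not JSON: an ordinary log line
        if not isinstance(event, dict):
            continue
        if event.get("event") != "gen_ai.evaluation.score":
            continue
        yield (
            event.get("gen_ai.evaluation.metric"),
            event.get("gen_ai.evaluation.score"),
            event.get("trace.id"),
        )
```

An Embr-shipped update policy would perform the equivalent extraction once, at ingestion, instead of in every customer's tooling.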

Imagine if Embr did this

# embr.yaml — hypothetical
platform: python
platformVersion: "3.12"

eval:
  judge:
    provider: foundry-agent
    agentId: judge-v1
    model: gpt-4o-mini
  metrics: [relevance, groundedness, safety]
  store: managed         # Embr-managed Cosmos-backed eval log
  emitOtel: true         # Auto-promote gen_ai.* stdout to first-class telemetry
  alerts:
    - metric: safety
      condition: passRate < 0.95
      window: 1h

With this, the worker code stays exactly the same — Embr handles judge orchestration, eval persistence, OTel promotion, and alerting.
