embr-devs/embr-foundry-eval-loop-sample

Embr × Foundry — Eval Loop Sample

Sample #7 in the Embr POC fleet. Demonstrates an agent + LLM-as-a-judge loop running entirely in user code on Embr: for every user turn, a second Foundry agent scores the worker's reply on relevance, groundedness, and safety, and the result is surfaced via the chat UI, a JSON API, a CSV export, and OpenTelemetry GenAI–shaped JSON Lines on stdout.

What this demonstrates

  • Worker agent — a normal Foundry agent with lookup_fact and do_math function tools (Embr Corp facts + safe arithmetic).
  • Judge agent — a separate Foundry agent (no tools) prompted to score the worker's reply on three axes and return strict JSON.
  • Eval store — bounded deque(maxlen=500) in process memory.
  • OTel-shaped logs — JSON-Lines on stdout matching the OpenTelemetry GenAI semantic conventions for both inference and evaluation events.
  • Chat UI — per-turn badges (🎯 R:4 📚 G:5 ✅ Safe), expandable rationale, rolling-20 averages at the top.
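Because the judge is prompted to return strict JSON, its reply has to be validated before it can be stored or rendered as badges. The following is a minimal sketch of that check, not the sample's actual code: the function name `parse_verdict` is hypothetical, and the 1 to 5 integer scale and the `pass`/`fail` safety values are assumptions inferred from the badge and CSV examples in this README.

```python
import json

SCORE_KEYS = ("relevance", "groundedness")  # assumed to be integers from 1 to 5


def parse_verdict(raw: str) -> dict:
    """Validate the judge's JSON reply; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    verdict = {}
    for key in SCORE_KEYS:
        score = data.get(key)
        # bool is a subclass of int, so it is rejected explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {score!r}")
        verdict[key] = score

    safety = data.get("safety")
    if safety not in ("pass", "fail"):  # assumed vocabulary
        raise ValueError(f"safety must be 'pass' or 'fail', got {safety!r}")
    verdict["safety"] = safety
    verdict["rationale"] = str(data.get("rationale", ""))
    return verdict
```

Rejecting a malformed verdict outright, rather than coercing it, keeps bad judge output from silently skewing the rolling averages.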

Embr platform findings (the whole point)

This sample exists to show what a customer must build themselves today to get production-quality LLM evals on Embr. Each item below is a missing primitive.

1. No first-class eval primitive

Every customer rolls their own worker→judge→store→export pipeline. embr.yaml should accept something like:

eval:
  judge:
    provider: foundry-agent
    agentId: judge-v1
  metrics: [relevance, groundedness, safety]
  emitOtel: true

…and Embr would orchestrate the judge call automatically, expose a built-in eval dashboard in the portal, and emit OTel telemetry without app changes.

2. No OTel GenAI parsing in GlobalControlPlaneEvents

This sample writes structured JSON lines like:

{"event":"gen_ai.evaluation.score","ts":"2025-…","gen_ai.evaluation.metric":"relevance","gen_ai.evaluation.score":4,"trace.id":""}

Kusto receives them in the Message column as opaque strings. An Embr-shipped OTel GenAI auto-detector / Kusto update policy could surface them as first-class columns and power dashboards/alerts on gen_ai.evaluation.score.
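For reference, a line of that shape can be produced with nothing but the standard library. This is a hedged sketch rather than the sample's implementation: the helper names `format_eval_event` and `emit_eval_score` (the latter borrowed from the architecture diagram) have assumed signatures, and the ISO 8601 UTC timestamp format is an assumption, since the example above elides the `ts` value.

```python
import json
import sys
from datetime import datetime, timezone


def format_eval_event(metric: str, score, trace_id: str = "") -> str:
    """Build one JSON-Lines record using the keys shown in the example above."""
    event = {
        "event": "gen_ai.evaluation.score",
        "ts": datetime.now(timezone.utc).isoformat(),  # assumed timestamp format
        "gen_ai.evaluation.metric": metric,
        "gen_ai.evaluation.score": score,
        "trace.id": trace_id,
    }
    # Compact separators keep each event on a single short line.
    return json.dumps(event, separators=(",", ":"))


def emit_eval_score(metric: str, score, trace_id: str = "", stream=None) -> None:
    """Write one event line to stdout (or another stream) and flush it."""
    out = sys.stdout if stream is None else stream
    out.write(format_eval_event(metric, score, trace_id) + "\n")
    out.flush()
```

Flushing after every line matters when stdout is the transport, because a buffered event that never leaves the process is invisible to the log pipeline.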

3. No eval-store primitive

app/store.py is a deque(maxlen=500) that is wiped on restart. There is no embr.evalStore (Cosmos-backed, scoped per-environment) primitive. A real customer must bring their own Cosmos / Postgres / log ingestion.
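To make the limitation concrete, here is a rough sketch of what a bounded in-memory store of this kind might look like. It is an illustration under assumptions, not a copy of app/store.py: the class name `EvalStore`, its method names, the record keys, and the `avgRelevance`/`avgGroundedness` field names are all hypothetical.

```python
from collections import deque
from statistics import mean


class EvalStore:
    """Bounded in-memory eval log; the oldest records are evicted first."""

    def __init__(self, maxlen: int = 500):
        self._records = deque(maxlen=maxlen)

    def add(self, record: dict) -> None:
        self._records.append(record)

    def recent(self, limit: int = 50) -> list:
        """Return up to `limit` records, newest first."""
        return list(reversed(self._records))[:limit]

    def stats(self, window: int = 20) -> dict:
        """Rolling averages over the newest `window` records."""
        recent = self.recent(window)
        if not recent:
            return {"count": 0}
        return {
            "count": len(recent),
            "avgRelevance": mean(r["relevance"] for r in recent),
            "avgGroundedness": mean(r["groundedness"] for r in recent),
            "safetyPassRate": sum(r["safety"] == "pass" for r in recent) / len(recent),
        }
```

Everything here lives in one process, so a restart or a second replica starts from an empty log, which is the gap a managed eval store would fill.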

4. No "judge model" concept in agent identity

Both the worker and judge use the same ClientSecretCredential. With managed identity (a separate platform ask) and Embr-issued per-agent identities, a customer could grant the judge narrower permissions than the worker, e.g. read-only on the project with no thread-creation rights. Today, no such scoping is possible.

5. No alert primitive

If safetyPassRate < 0.95 over the last hour, the customer must wire their own alerting via Application Insights metric alerts, Kusto alert rules, or a third-party service. A native alerts: block in embr.yaml would close this gap.
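The check itself is small; the work is in hosting, scheduling, and routing it. A minimal sketch of the threshold logic, assuming each eval record carries a timezone-aware `ts` datetime and a `safety` value of `pass` or `fail` (the function names and record shape are hypothetical):

```python
from datetime import datetime, timedelta, timezone


def safety_pass_rate(records, now: datetime, window: timedelta = timedelta(hours=1)):
    """Fraction of records inside the window that passed, or None if there are none."""
    cutoff = now - window
    recent = [r for r in records if r["ts"] >= cutoff]
    if not recent:
        return None
    return sum(r["safety"] == "pass" for r in recent) / len(recent)


def should_alert(records, now: datetime, threshold: float = 0.95,
                 window: timedelta = timedelta(hours=1)) -> bool:
    """True when the pass rate over the window falls below the threshold."""
    rate = safety_pass_rate(records, now, window)
    # No data in the window is treated as "no alert" in this sketch.
    return rate is not None and rate < threshold
```

Whether an empty window should stay silent or raise its own "no data" alert is a policy decision that a platform-level alerts block would also need to expose.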

Architecture

POST /api/chat { message, threadId? }
        │
        ▼
   ┌──────────┐    tools: lookup_fact, do_math
   │  worker  │ ── runs in app, resolves tool calls locally
   └────┬─────┘
        │ assistant_text
        ▼
   ┌──────────┐    fresh thread per turn
   │  judge   │ ── { relevance, groundedness, safety, rationale }
   └────┬─────┘
        │
        ├─► store.add(record)       (in-memory deque, restart-wipes)
        ├─► emit_inference(...)     (gen_ai.client.inference.operation.details)
        ├─► emit_eval_score(...)    (gen_ai.evaluation.score × 3)
        └─► return { response, eval, threadId, runIds }
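The flow in the diagram can be sketched as a single function with the two agent calls injected as callables. This is an illustrative outline only: `handle_turn`, `run_worker`, and `run_judge` are hypothetical names, the Foundry SDK calls are deliberately left out, and thread IDs, run IDs, and OTel emission are omitted for brevity.

```python
import time
from typing import Callable


def handle_turn(
    message: str,
    run_worker: Callable[[str], str],        # hypothetical: returns the assistant text
    run_judge: Callable[[str, str], dict],   # hypothetical: returns the parsed verdict
    records: list,
) -> dict:
    """Run the worker, then the judge, record the result, and return the verdict."""
    t0 = time.perf_counter()
    reply = run_worker(message)
    t1 = time.perf_counter()
    verdict = run_judge(message, reply)  # the judge sees both the question and the reply
    t2 = time.perf_counter()

    record = {
        "user_input": message,
        "worker_output": reply,
        **verdict,
        "latency_worker_ms": round((t1 - t0) * 1000),
        "latency_judge_ms": round((t2 - t1) * 1000),
    }
    records.append(record)  # stands in for store.add(record)
    return {"response": reply, "eval": verdict}
```

Because the judge runs inline, its latency is added to every chat response; a platform-orchestrated judge could run out of band instead.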

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | / | Chat UI |
| GET | /health | {status:"ok"} |
| POST | /api/chat | {message, threadId?} → run worker + judge, return verdict |
| GET | /api/evals?limit=50 | Recent eval records (newest first) + stats |
| GET | /api/evals/stats | Rolling averages |
| GET | /api/evals.csv | Full eval log as CSV (download) |
| GET | /api/diag | Config / agents-ready / OTel emit count |
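As an example of what sits behind one of these endpoints, the CSV export can be built with the standard `csv` module. This is a sketch under assumptions: the column order is taken from the sample row in the Verify section, while the function name `evals_to_csv` and the dict-based record shape are hypothetical.

```python
import csv
import io

# Column order as shown in the sample CSV row in the Verify section.
CSV_COLUMNS = [
    "id", "ts", "user_input", "worker_output", "relevance", "groundedness",
    "safety", "rationale", "latency_worker_ms", "latency_judge_ms",
    "run_id_worker", "run_id_judge",
]


def evals_to_csv(records) -> str:
    """Serialize eval records (dicts) to CSV text with a header row."""
    buf = io.StringIO()
    # Unknown keys are dropped; missing keys become empty cells.
    writer = csv.DictWriter(buf, fieldnames=CSV_COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

Using `csv.DictWriter` rather than joining strings by hand means commas, quotes, and newlines in user input or judge rationales are quoted correctly.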

Quickstart (local)

cd embr-foundry-eval-loop-sample
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export FOUNDRY_PROJECT_ENDPOINT="https://<name>.services.ai.azure.com/api/projects/<project>"
export FOUNDRY_MODEL_DEPLOYMENT="gpt-4o-mini"   # or any deployment in your project
# Either run `az login`, or set:
#   AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET
uvicorn app.main:app --reload --port 8000

Then open http://localhost:8000.

EMBR_TOKEN is not needed — this is a Foundry-only sample.

Optional: set FOUNDRY_JUDGE_MODEL_DEPLOYMENT to a different deployment than the worker's, to demonstrate cross-model evaluation.
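One plausible way to resolve the two deployment names is shown below. It assumes that the judge falls back to the worker's deployment when the optional variable is unset or empty; that fallback and the helper name `resolve_deployments` are assumptions, not confirmed behavior of the sample.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_deployments(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (worker_deployment, judge_deployment) from environment variables."""
    env = os.environ if env is None else env
    worker = env["FOUNDRY_MODEL_DEPLOYMENT"]  # required; raises KeyError if missing
    # Assumed fallback: the judge reuses the worker's deployment when unset or empty.
    judge = env.get("FOUNDRY_JUDGE_MODEL_DEPLOYMENT") or worker
    return worker, judge
```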

Deploy to Embr

embr quickstart deploy embr-devs/embr-foundry-eval-loop-sample -i 120233234
embr variables set FOUNDRY_PROJECT_ENDPOINT "https://…/api/projects/…"
embr variables set FOUNDRY_MODEL_DEPLOYMENT "gpt-4o-mini"
embr variables set AZURE_TENANT_ID ""
embr variables set AZURE_CLIENT_ID ""
embr variables set AZURE_CLIENT_SECRET "" --secret

Verify

  1. Open the chat UI.

  2. Send 5 questions covering each tool path:

    • How many departments does Embr Corp have? (expects lookup_fact)
    • What is 17 * 23 + 7? (expects do_math)
    • What's Embr Corp's mission? (expects lookup_fact)
    • What's the capital of France? (no tool; the judge should score groundedness 5 for a correct answer, but flag fabrication if the reply hallucinates facts about Embr Corp)
    • Roll a d20. (no matching tool — worker should refuse / note the limitation)
  3. Confirm each assistant turn shows three badges.

  4. Click a badge row to expand the judge rationale.

  5. Download /api/evals.csv and confirm a row like:

    id,ts,user_input,worker_output,relevance,groundedness,safety,rationale,latency_worker_ms,latency_judge_ms,run_id_worker,run_id_judge
    8d3f…,2025-…Z,How many departments does Embr Corp have?,Embr Corp has 7 departments…,5,5,pass,Directly answers using a verified fact.,2148,1607,run_AAA…,run_BBB…
  6. On Embr, query Kusto:

    GlobalControlPlaneEvents
    | where AksPodName startswith "embr-foundry-eval-loop-sample"
    | where Message has "gen_ai.evaluation.score"
    | take 20

    Today these events appear only as raw strings in the Message column — exactly the platform finding this sample is designed to surface.
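Until the platform parses these events, any consumer has to decode the Message strings itself. A rough sketch of that client-side step, assuming the raw Message values have already been exported as a list of strings (the function name `extract_eval_scores` is hypothetical):

```python
import json


def extract_eval_scores(messages):
    """Yield (metric, score) pairs from raw log lines, skipping everything else."""
    for msg in messages:
        if "gen_ai.evaluation.score" not in msg:
            continue  # cheap substring pre-filter, like the Kusto `has` clause
        try:
            event = json.loads(msg)
        except json.JSONDecodeError:
            continue  # the text matched but the line is not a JSON event
        if isinstance(event, dict) and event.get("event") == "gen_ai.evaluation.score":
            yield event.get("gen_ai.evaluation.metric"), event.get("gen_ai.evaluation.score")
```

Every dashboard or alert built on these logs would need the same decoding step, which is why promotion to first-class columns is listed as a platform ask.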

Imagine if Embr did this

# embr.yaml — hypothetical
platform: python
platformVersion: "3.12"

eval:
  judge:
    provider: foundry-agent
    agentId: judge-v1
    model: gpt-4o-mini
  metrics: [relevance, groundedness, safety]
  store: managed         # Embr-managed Cosmos-backed eval log
  emitOtel: true         # Auto-promote gen_ai.* stdout to first-class telemetry
  alerts:
    - metric: safety
      condition: passRate < 0.95
      window: 1h

With this, the worker code stays exactly the same — Embr handles judge orchestration, eval persistence, OTel promotion, and alerting.

About

Embr POC #7: agent + LLM-as-judge eval loop with OTel-shaped GenAI traces
