Evaluations Workshop — from Fundamentals to Production

A hands-on, progressive introduction to evaluating LLM applications, built around a single example app: an HR Onboarding agent. You learn the app once, then spend the rest of the workshop on evaluation technique — from a first single-turn eval all the way to monitoring evals in production.

Stack: Python + LangChain (create_agent) + LangSmith.

The workshop runs in two sessions:

Session 1 — Foundations (Modules 1–4): what an eval is, single-turn and agent evals, and gating a PR with an offline suite.
Session 2 — Evals for Production (Modules 5–7): online evals on live traces, improving evaluators with annotation + few-shot, and monitoring production for quality drift.

Who this is for

Development teams meeting evaluations for the first time. We start with concepts and build up. No prior eval experience assumed; basic Python and comfort with the command line are enough.

The arc

SESSION 1 — Foundations                                           │ SESSION 2 — Evals for Production
Module 1         Module 2              Module 3         Module 4  │ Module 5         Module 6               Module 7
Fundamentals  -> Single-turn evals  -> Agent evals   -> Evals in  │ Online evals  -> Improving evals     -> Production
(the 4 parts)    (judge the answer)    (the process)    CI (gate) │ (live traces)    (annotate+few-shot)    monitor (drift)

Session 1 — Foundations

Module	You learn to…	Evaluators introduced
1 — Fundamentals	Name the 4 parts of any eval; run one end-to-end.	first deterministic check
2 — Single-turn	Judge a single answer.	deterministic (facts, shape validation) + LLM-judge (correctness, groundedness, tone)
3 — Agent evals	Judge the trajectory and tool calls, not just the answer.	trajectory (exact / required / forbidden / efficiency), tool-args, LLM trajectory judge
4 — CI	Gate a build on eval results.	pytest per-example gate + aggregate threshold gate + GitHub Actions
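
To make the Module 3 row concrete, here is a minimal sketch of the four deterministic trajectory checks (exact, required, forbidden, efficiency) as pure functions over a list of tool names. The function names and the tool names are hypothetical, chosen for illustration; the workshop's own evaluators in module_3_agent_evals/ may be shaped differently.

```python
from typing import Sequence


def exact_match(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """True only if the agent called exactly the expected tools, in order."""
    return list(actual) == list(expected)


def required_called(actual: Sequence[str], required: Sequence[str]) -> bool:
    """True if every required tool appears at least once (order ignored)."""
    return set(required) <= set(actual)


def forbidden_avoided(actual: Sequence[str], forbidden: Sequence[str]) -> bool:
    """True if none of the forbidden tools were called."""
    return not (set(forbidden) & set(actual))


def within_budget(actual: Sequence[str], max_steps: int) -> bool:
    """Efficiency check: True if the agent used at most max_steps tool calls."""
    return len(actual) <= max_steps


# Hypothetical tool names, for illustration only.
trajectory = ["lookup_policy", "provision_account", "order_equipment"]
expected = ["provision_account", "order_equipment"]

results = {
    "exact": exact_match(trajectory, expected),
    "required": required_called(trajectory, expected),
    "forbidden": forbidden_avoided(trajectory, ["delete_account"]),
    "efficiency": within_budget(trajectory, max_steps=4),
}
print(results)
```

In this example the agent made one extra call, so the exact check fails while the required, forbidden, and efficiency checks pass. That gap is why it helps to score a trajectory on more than one criterion.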

Session 2 — Evals for Production

Module	You learn to…	What's new
5 — Online evals	Tell offline experiments from online evals; score live traces with no ground truth.	reference-free evaluators, scoring traces + writing feedback, the data flywheel
6 — Improving evals	Align an LLM judge to your humans.	annotation queues, "evaluate the evaluator", few-shot judge alignment
7 — Production CI	Monitor production for quality drift.	scheduled monitor, baseline/drift alerting, scheduled GitHub Actions
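
As a rough illustration of the baseline/drift idea in Module 7, the sketch below compares the mean of recent evaluator scores against a stored baseline and flags drift when the drop exceeds a tolerance. The function name, the 0.05 tolerance, and the sample scores are assumptions for this example, not values taken from module_7_production_ci/.

```python
from statistics import mean


def detect_drift(baseline: float, recent_scores: list[float], tolerance: float = 0.05) -> dict:
    """Compare the mean of recent scores to a stored baseline.

    Drift is flagged when the recent mean falls more than `tolerance`
    below the baseline. Improvements never count as drift.
    """
    if not recent_scores:
        raise ValueError("no recent scores to compare")
    current = mean(recent_scores)
    drop = baseline - current
    return {
        "baseline": baseline,
        "current": current,
        "drop": drop,
        "drifted": drop > tolerance,
    }


# Illustrative numbers only.
healthy = detect_drift(baseline=0.90, recent_scores=[0.90, 0.88])
degraded = detect_drift(baseline=0.90, recent_scores=[0.9, 0.8, 0.7, 0.8])
print(healthy["drifted"], degraded["drifted"])
```

A scheduled job could run a check like this and alert (or fail the workflow) whenever `drifted` is true.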

The same HR agent (hr_agent/) is the system-under-test in every module.

The example app — HR Onboarding agent

The app lives in hr_agent/. It is a tool-calling agent that:

answers HR policy/benefits questions (single-turn material), and
performs onboarding actions — provisioning accounts, ordering equipment, scheduling orientation (multi-step trajectory material).

Tools are deterministic (they read static mock data), so the same input always produces the same tool output. That reproducibility is what makes evaluation meaningful — you measure the model's behavior, not flaky downstream systems.
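
A deterministic tool can be as simple as a lookup over static data. The sketch below is hypothetical: the tool name, the fields, and the numbers are invented for illustration and are not the contents of hr_agent/.

```python
# Hypothetical mock data; the real mock data lives in hr_agent/.
MOCK_PTO_POLICY = {
    "full_time": {"pto_days": 20, "carryover_days": 5},
    "part_time": {"pto_days": 10, "carryover_days": 0},
}


def get_pto_policy(employment_type: str) -> dict:
    """Return the PTO policy for an employment type from static data.

    No network, clock, or randomness is involved, so the same input
    always yields the same output.
    """
    try:
        # Copy so callers cannot mutate the shared mock data.
        return dict(MOCK_PTO_POLICY[employment_type])
    except KeyError:
        return {"error": f"unknown employment type: {employment_type}"}


first = get_pto_policy("full_time")
second = get_pto_policy("full_time")
assert first == second  # identical input, identical tool output
```

Because the tool output is fixed, any change in an eval score between two runs can be attributed to the model or the prompt rather than to the tool.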

Setup

# 1. Python 3.10+ and a virtualenv
python -m venv .venv && source .venv/bin/activate

# 2. Install
pip install -r requirements.txt

# 3. Configure keys
cp .env.example .env      # then edit: LANGSMITH_API_KEY + ANTHROPIC_API_KEY

You need:

a LangSmith API key (smith.langchain.com → Settings → API Keys), and
an Anthropic API key (or set WORKSHOP_MODEL=openai:gpt-4o-mini and an OPENAI_API_KEY — see config.py / .env.example).
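
If you are unsure which keys your configuration needs, a small check like the following can help. It is only a sketch: it assumes WORKSHOP_MODEL uses the `provider:model` form shown above and that Anthropic is the provider when the variable is unset. The `missing_keys` helper is hypothetical and not part of config.py.

```python
# Assumed mapping from provider prefix to the API key that provider needs.
PROVIDER_KEYS = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def missing_keys(env: dict[str, str]) -> list[str]:
    """List the environment variables still missing for the configured model."""
    required = ["LANGSMITH_API_KEY"]
    # WORKSHOP_MODEL looks like "provider:model"; default to Anthropic when unset.
    provider = env.get("WORKSHOP_MODEL", "anthropic:").split(":", 1)[0]
    if provider not in PROVIDER_KEYS:
        raise ValueError(f"unrecognised provider in WORKSHOP_MODEL: {provider!r}")
    required.append(PROVIDER_KEYS[provider])
    return [name for name in required if not env.get(name)]


# Example with a literal dict; pass dict(os.environ) to check your real shell.
example_env = {"LANGSMITH_API_KEY": "set", "WORKSHOP_MODEL": "openai:gpt-4o-mini"}
print(missing_keys(example_env))
```

Here the LangSmith key is present but the OpenAI key is not, so the function reports `OPENAI_API_KEY` as the one thing left to set.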

Smoke test (no keys needed)

The deterministic evaluators are pure functions, so each file can run its own self-test without any API keys:

python module_2_single_turn/deterministic_evals.py
python module_3_agent_evals/trajectory_evals.py
python module_3_agent_evals/tool_evals.py
python module_5_online_evals/reference_free_evals.py
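
For a sense of what "self-test as pure functions" can look like, here is a minimal sketch of two deterministic evaluators (a fact check and a shape check) with a built-in self-test. The function names and the JSON answer shape are assumptions for this example; the workshop files may check different things.

```python
import json


def contains_facts(answer: str, facts: list[str]) -> bool:
    """Fact check: every expected fact string appears in the answer (case-insensitive)."""
    lowered = answer.lower()
    return all(fact.lower() in lowered for fact in facts)


def valid_shape(answer: str, required_fields: list[str]) -> bool:
    """Shape check: the answer parses as a JSON object containing every required field."""
    try:
        payload = json.loads(answer)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(field in payload for field in required_fields)


def self_test() -> None:
    """Exercise each evaluator on one passing and one failing input."""
    assert contains_facts("You get 20 days of PTO.", ["20 days"])
    assert not contains_facts("You get some PTO.", ["20 days"])
    assert valid_shape('{"answer": "ok", "sources": []}', ["answer", "sources"])
    assert not valid_shape("not json", ["answer"])


if __name__ == "__main__":
    self_test()
    print("deterministic evaluators: OK")
```

Neither function calls a model or a network service, which is why a smoke test of this kind needs no keys.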

Run a module (keys needed)

# Session 1 — offline experiments
python module_1_fundamentals/01_first_eval.py
python module_2_single_turn/run_eval.py
python module_3_agent_evals/run_eval.py
python module_4_ci/ci_gate.py --suite agent

# Session 2 — production evals
python module_5_online_evals/production_traffic.py   # create live traces
python module_5_online_evals/score_traces.py         # online-eval loop
python module_6_improving_evals/judge_alignment.py   # zero-shot vs few-shot
python module_7_production_ci/monitor.py             # drift vs baseline

Each script prints a link to (or the name of) the experiment or project so you can open it in LangSmith.
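
The Module 4 gate turns eval results into a pass/fail signal for CI. A minimal sketch of an aggregate threshold gate, assuming per-example scores between 0 and 1 and a hypothetical threshold of 0.8 (module_4_ci/ci_gate.py may compute and report this differently):

```python
def aggregate_gate(scores: list[float], threshold: float = 0.8) -> int:
    """Return a process exit code: 0 if the mean score meets the threshold, else 1."""
    if not scores:
        # An empty result set is treated as a failure rather than a silent pass.
        print("no scores to gate on")
        return 1
    pass_rate = sum(scores) / len(scores)
    print(f"pass rate {pass_rate:.2f} (threshold {threshold:.2f})")
    return 0 if pass_rate >= threshold else 1


# Illustrative scores only: 4 of 5 examples passed, then 2 of 4.
passing_code = aggregate_gate([1, 1, 1, 0, 1])
failing_code = aggregate_gate([1, 1, 0, 0])

# In a CI script you would finish with: sys.exit(aggregate_gate(scores))
```

A non-zero exit code is what makes a CI step fail, so returning 1 below the threshold is enough to block a build.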

Repo map

hr_agent/                  # the system under test (agent + tools + mock data)
# --- Session 1: Foundations ---
module_1_fundamentals/     # concepts + first eval
module_2_single_turn/      # deterministic + LLM-judge evaluators
module_3_agent_evals/      # trajectory + tool evaluators
module_4_ci/               # pytest gate, aggregate gate, GitHub Actions
# --- Session 2: Evals for Production ---
module_5_online_evals/     # reference-free evals + scoring live traces
module_6_improving_evals/  # annotation queues + few-shot judge alignment
module_7_production_ci/    # scheduled drift monitor + baseline
.github/workflows/evals.yml         # offline gate (PR)
.github/workflows/online-evals.yml  # online monitor (scheduled)
config.py                  # model + LangSmith config (one place to swap providers)
