Eval-discipline at solo-founder scale. Agentic AI systems calibrated by LLM-as-judge,
gated by Kendall-tau, verified across vendors — because shipping without evals is shipping bluff.
| Repo | What it is | First file to open |
|---|---|---|
| ai-eval-toolkit | LLM-as-judge calibration: Cohen's kappa, Kendall-tau, regression gates | eval_tk/calibrate.py |
| agentic-eval-harness | Eval-gated Claude Code runner with cross-vendor decision-support gates | src/aeh/eval_gate/judging.py |
| ai-eval-atlas | A self-taught engineer's structured map of the AI evaluation field | techniques/kendall-tau-agreement.md |
| ai-engineer-best-practices | Field-tested patterns for shipping LLM systems — prompts, evals, agents, observability | docs/eval-rubric.md |
| learn-ai-eval | The Eval Codex — Claude-tutored AI-eval learning engine | README |
The hero line above was selected by a 3-vendor LLM-as-judge ensemble (Anthropic / OpenAI / Google) before this README shipped. Three candidates were scored on this criterion:
Which hero line is most likely to make a hiring manager for an AI LLM/Agents Eval Expert role say "this person should get a callback" within a 5-second skim?
WINNER: Candidate B · CONFIDENCE: HIGH · Margin +1.00 over runner-up · Min tau +0.50 · Dissent: none
| Cand | claude | gpt | gemini | mean |
|------|--------|-----|--------|------|
| A | 7 | 8 | 7 | 7.33 |
| B * | 8 | 8 | 9 | 8.33 |
| C | 7 | 7 | 5 | 6.33 |
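The verdict figures can be re-derived from the score table with a few lines of Python. This is a minimal sketch, not the code in the repos: it assumes "Min tau" means the minimum pairwise Kendall tau-b (the tie-corrected variant) across the three judges, and `kendall_tau_b` is a hand-rolled helper written for illustration. Under that assumption the numbers match the verdict line; the full receipt remains the source of truth.

```python
from itertools import combinations
from math import sqrt

# Scores copied from the table above: judge -> {candidate: score}.
SCORES = {
    "claude": {"A": 7, "B": 8, "C": 7},
    "gpt": {"A": 8, "B": 8, "C": 7},
    "gemini": {"A": 7, "B": 9, "C": 5},
}


def kendall_tau_b(x, y):
    """Kendall tau-b for two equal-length score lists (illustrative helper)."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to no term
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + ties_x) * (untied + ties_y))
    # Assumption: return 0.0 when a judge gives every candidate the same score.
    return (concordant - discordant) / denom if denom else 0.0


candidates = sorted(SCORES["claude"])
means = {c: sum(s[c] for s in SCORES.values()) / len(SCORES) for c in candidates}
ranked = sorted(candidates, key=means.get, reverse=True)
margin = means[ranked[0]] - means[ranked[1]]

# Agreement gate: the weakest pairwise rank agreement between any two judges.
min_tau = min(
    kendall_tau_b(
        [SCORES[a][c] for c in candidates],
        [SCORES[b][c] for c in candidates],
    )
    for a, b in combinations(SCORES, 2)
)

print(f"winner={ranked[0]} margin={margin:+.2f} min_tau={min_tau:+.2f}")
# prints: winner=B margin=+1.00 min_tau=+0.50
```

The lowest agreement comes from the claude/gpt pair, where each judge ties a different pair of candidates.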
Run on claude-opus-4-7 · gpt-5.5 · gemini-3.1-pro-preview, temperature=0, per-judge position-bias shuffle, content-hash seed for reproducibility. Methodology lives in ai-engineer-best-practices (the portable cross-vendor-judges variant of the score MCP tool).
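One way a per-judge position-bias shuffle with a content-hash seed can work is sketched below. It is an assumption-laden illustration, not the implementation in ai-engineer-best-practices: the function name `presentation_order`, the choice of SHA-256, and the candidate texts are all hypothetical. The idea is that the seed is derived from the candidate text plus the judge name, so each judge may see a different order while any rerun on the same content reproduces it exactly.

```python
import hashlib
import random


def presentation_order(candidates, judge):
    """Return candidate ids in the order shown to one judge (hypothetical helper).

    candidates: dict mapping candidate id -> candidate text.
    judge: judge name, mixed into the seed so each judge gets its own order.
    """
    # Hash the content, not the clock: same inputs always give the same seed.
    payload = "\n".join(f"{cid}:{candidates[cid]}" for cid in sorted(candidates))
    digest = hashlib.sha256(f"{payload}\n{judge}".encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")

    order = sorted(candidates)
    random.Random(seed).shuffle(order)  # local RNG, global random state untouched
    return order


# Placeholder texts, not the real hero-line candidates.
demo = {"A": "first candidate", "B": "second candidate", "C": "third candidate"}
orders = {j: presentation_order(demo, j) for j in ("claude", "gpt", "gemini")}
for judge, order in orders.items():
    print(judge, order)
```

Storing each judge's order next to its raw response is what lets scores be mapped back to the original candidate ids afterwards.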
📄 Full receipt (prompts, raw responses, shuffle mappings, seeds, parsed scores): verdicts/2026-05-22-hero-line.json — audit it yourself.
If you'd ship a hero line without judging it, we should probably talk about evals.
Specific files worth opening — actual code, not vaporware:
- Eval rubric (production format) → ai-engineer-best-practices/docs/eval-rubric.md
- Recorded judge responses (fixture) → ai-engineer-best-practices/tests/fixtures/recorded_judge_responses.json
- Kappa + bias module → ai-eval-toolkit/eval_tk/bias.py
- Judging parity test → agentic-eval-harness/tests/test_judging_parity.py
- Atlas: LLM-as-judge basics → ai-eval-atlas/techniques/llm-as-judge-basics.md
Claude Agent SDK · MCP · Anthropic SDK · Cohen's kappa · Kendall-tau · MT-Bench · LLM-as-judge · Python · TypeScript
- LinkedIn → https://www.linkedin.com/in/mikeilog
- Website → https://mikeilog.com
- Email → https://mikeilog.com
cooperation FTW · located on Earth




