Mike Ilog Mike-E-Log

Mike Ilog · AI LLM/Agents Eval Engineer

Eval-discipline at solo-founder scale. Agentic AI systems calibrated by LLM-as-judge,
gated by Kendall-tau, verified across vendors — because shipping without evals is shipping bluff.

Now (public, eval-coded)

| Repo | What it is | First file to open |
|------|------------|--------------------|
| ai-eval-toolkit | LLM-as-judge calibration: Cohen's kappa, Kendall-tau, regression gates | `eval_tk/calibrate.py` |
| agentic-eval-harness | Eval-gated Claude Code runner with cross-vendor decision-support gates | `src/aeh/eval_gate/judging.py` |
| ai-eval-atlas | A self-taught engineer's structured map of the AI evaluation field | `techniques/kendall-tau-agreement.md` |
| ai-engineer-best-practices | Field-tested patterns for shipping LLM systems: prompts, evals, agents, observability | `docs/eval-rubric.md` |
| learn-ai-eval | The Eval Codex: Claude-tutored AI-eval learning engine | README |

Cross-vendor verdict on this README

The hero line above was selected by a 3-vendor LLM-as-judge ensemble (Anthropic / OpenAI / Google) before this README shipped. Each judge scored three candidate lines against a single criterion:

Which hero line is most likely to make a hiring manager for an AI LLM/Agents Eval Expert role say "this person should get a callback" within a 5-second skim?

WINNER: Candidate B   CONFIDENCE: HIGH   Margin +1.00 over runner-up   Min tau +0.50   Dissent: none

| Cand | claude | gpt | gemini | mean |
|------|--------|-----|--------|------|
| A    |      7 |   8 |      7 | 7.33 |
| B *  |      8 |   8 |      9 | 8.33 |
| C    |      7 |   7 |      5 | 6.33 |
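
The summary line can be re-derived from the table alone. The sketch below is not the harness code: it assumes the margin is the gap between the two highest means and that "Min tau" is the lowest pairwise Kendall tau-b (the tie-corrected variant) between judges. Under those assumptions it reproduces the reported +1.00 and +0.50; the function and variable names are illustrative only.

```python
from itertools import combinations
from math import sqrt

# Scores copied from the table above: judge -> candidate -> score.
scores = {
    "claude": {"A": 7, "B": 8, "C": 7},
    "gpt":    {"A": 8, "B": 8, "C": 7},
    "gemini": {"A": 7, "B": 9, "C": 5},
}
candidates = ["A", "B", "C"]

def kendall_tau_b(x, y):
    """Kendall tau-b for two equal-length score lists (assumed variant, handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")

# Mean score per candidate across the three judges.
means = {c: sum(s[c] for s in scores.values()) / len(scores) for c in candidates}
ranked = sorted(candidates, key=means.get, reverse=True)
winner, runner_up = ranked[0], ranked[1]
margin = means[winner] - means[runner_up]

# Lowest agreement over every pair of judges.
min_tau = min(
    kendall_tau_b([scores[a][c] for c in candidates],
                  [scores[b][c] for c in candidates])
    for a, b in combinations(scores, 2)
)

print(f"winner={winner} margin={margin:+.2f} min_tau={min_tau:+.2f}")
# prints: winner=B margin=+1.00 min_tau=+0.50
```

With only three candidates, tau can take very few distinct values, so the minimum is a coarse agreement signal rather than a precise one.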

Run on claude-opus-4-7 · gpt-5.5 · gemini-3.1-pro-preview, temperature=0, per-judge position-bias shuffle, content-hash seed for reproducibility. Methodology lives in ai-engineer-best-practices (the portable cross-vendor-judges variant of the score MCP tool).
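
One way a content-hash seed and a per-judge shuffle can fit together is sketched below. This is an assumption about the general technique, not the code in ai-engineer-best-practices: `content_seed`, `presentation_order`, and `unshuffle` are hypothetical names, and the candidate texts are placeholders (the real ones are in the receipt).

```python
import hashlib
import random

# Hypothetical placeholder texts; the real candidate lines are in the receipt JSON.
CANDIDATES = {
    "A": "placeholder hero line A",
    "B": "placeholder hero line B",
    "C": "placeholder hero line C",
}

def content_seed(candidates, judge):
    """Derive a deterministic integer seed from the candidate texts and the judge name."""
    payload = "\n".join(f"{cid}:{text}" for cid, text in sorted(candidates.items()))
    digest = hashlib.sha256(f"{judge}|{payload}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def presentation_order(candidates, judge):
    """Shuffle candidate ids for one judge; identical inputs always give the same order."""
    order = sorted(candidates)
    random.Random(content_seed(candidates, judge)).shuffle(order)
    return order

def unshuffle(order, scores_by_position):
    """Map scores returned in presentation order back to the original candidate ids."""
    return dict(zip(order, scores_by_position))

for judge in ("claude", "gpt", "gemini"):
    print(judge, presentation_order(CANDIDATES, judge))
```

Because the seed depends only on the candidate texts and the judge name, rerunning the same comparison reproduces the same shuffle, while editing any candidate changes it. Including the judge name gives each judge its own order, so a shared position preference is less likely to favour one candidate across the whole ensemble.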

📄 Full receipt (prompts, raw responses, shuffle mappings, seeds, parsed scores): verdicts/2026-05-22-hero-line.json — audit it yourself.

If you'd ship a hero line without judging it, we should probably talk about evals.

Receipts

Specific files worth opening — actual code, not vaporware:

Eval rubric (production format) → ai-engineer-best-practices/docs/eval-rubric.md
Recorded judge responses (fixture) → ai-engineer-best-practices/tests/fixtures/recorded_judge_responses.json
Kappa + bias module → ai-eval-toolkit/eval_tk/bias.py
Judging parity test → agentic-eval-harness/tests/test_judging_parity.py
Atlas: LLM-as-judge basics → ai-eval-atlas/techniques/llm-as-judge-basics.md

Stack

Claude Agent SDK · MCP · Anthropic SDK · Cohen's kappa · Kendall-tau · MT-Bench · LLM-as-judge · Python · TypeScript

Contact

LinkedIn → https://www.linkedin.com/in/mikeilog
Website → https://mikeilog.com
Email → https://mikeilog.com

cooperation FTW · located on Earth
