Mike-E-Log/README.md


Mike Ilog · AI LLM/Agents Eval Engineer

Eval-discipline at solo-founder scale. Agentic AI systems calibrated by LLM-as-judge,
gated by Kendall-tau, verified across vendors — because shipping without evals is shipping bluff.


Now (public, eval-coded)

| Repo | What it is | First file to open |
|------|------------|--------------------|
| ai-eval-toolkit | LLM-as-judge calibration: Cohen's kappa, Kendall-tau, regression gates | eval_tk/calibrate.py |
| agentic-eval-harness | Eval-gated Claude Code runner with cross-vendor decision-support gates | src/aeh/eval_gate/judging.py |
| ai-eval-atlas | A self-taught engineer's structured map of the AI evaluation field | techniques/kendall-tau-agreement.md |
| ai-engineer-best-practices | Field-tested patterns for shipping LLM systems: prompts, evals, agents, observability | docs/eval-rubric.md |
| learn-ai-eval | The Eval Codex: Claude-tutored AI-eval learning engine | README |

Cross-vendor verdict on this README

The hero line above was selected by a 3-vendor LLM-as-judge ensemble (Anthropic / OpenAI / Google) before this README shipped. Three candidates were scored against this criterion:

Which hero line is most likely to make a hiring manager for an AI LLM/Agents Eval Expert role say "this person should get a callback" within a 5-second skim?

WINNER: Candidate B   CONFIDENCE: HIGH   Margin +1.00 over runner-up   Min tau +0.50   Dissent: none

| Cand | claude | gpt | gemini | mean |
|------|--------|-----|--------|------|
| A    |      7 |   8 |      7 | 7.33 |
| B *  |      8 |   8 |      9 | 8.33 |
| C    |      7 |   7 |      5 | 6.33 |
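
The summary line can be re-derived from the score table. The sketch below is not the author's pipeline; it assumes "Margin" is the gap between the top two mean scores and that "Min tau" is the lowest pairwise Kendall tau-b between judges (an assumption that happens to reproduce the +0.50 shown above). The names `SCORES` and `kendall_tau_b` are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# Scores copied from the table above: judge -> candidate -> score.
SCORES = {
    "claude": {"A": 7, "B": 8, "C": 7},
    "gpt":    {"A": 8, "B": 8, "C": 7},
    "gemini": {"A": 7, "B": 9, "C": 5},
}
CANDIDATES = ["A", "B", "C"]


def kendall_tau_b(x, y):
    """Kendall tau-b for two equal-length score lists, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        ties_x += dx == 0  # pair tied in the first judge's scores
        ties_y += dy == 0  # pair tied in the second judge's scores
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denom = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


# Mean score per candidate, then the gap between first and second place.
means = {c: mean(s[c] for s in SCORES.values()) for c in CANDIDATES}
ranked = sorted(CANDIDATES, key=means.get, reverse=True)
winner, runner_up = ranked[0], ranked[1]
margin = means[winner] - means[runner_up]

# Assumed definition of "Min tau": worst agreement over all judge pairs.
min_tau = min(
    kendall_tau_b(
        [SCORES[a][c] for c in CANDIDATES],
        [SCORES[b][c] for c in CANDIDATES],
    )
    for a, b in combinations(SCORES, 2)
)

print(f"winner={winner} margin={margin:+.2f} min_tau={min_tau:+.2f}")
# prints: winner=B margin=+1.00 min_tau=+0.50
```

With only three candidates, ties dominate the statistic: the claude/gpt pair has one concordant pair and one tie on each side, which is where the 0.50 comes from.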

Run on claude-opus-4-7 · gpt-5.5 · gemini-3.1-pro-preview, temperature=0, per-judge position-bias shuffle, content-hash seed for reproducibility. Methodology lives in ai-engineer-best-practices (the portable cross-vendor-judges variant of the score MCP tool).
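
A per-judge position-bias shuffle with a content-hash seed could look roughly like the sketch below. It illustrates the idea only, under the assumption that the seed is derived from the judge name plus the candidate texts; the real scheme is in the receipt and the methodology repo. `content_seed`, `shuffled_order`, and the placeholder candidate texts are hypothetical.

```python
import hashlib
import random


def content_seed(*parts: str) -> int:
    """Derive a stable integer seed from the given strings."""
    # A separator keeps ("a", "b") distinct from ("ab",).
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shuffled_order(candidates: dict[str, str], judge: str) -> list[str]:
    """Return candidate labels in the order shown to one judge."""
    labels = sorted(candidates)  # fixed starting order before shuffling
    seed = content_seed(judge, *(candidates[k] for k in labels))
    rng = random.Random(seed)    # local RNG, global random state untouched
    rng.shuffle(labels)
    return labels


# Placeholder texts; the actual candidate lines live in the receipt file.
candidates = {
    "A": "candidate A text",
    "B": "candidate B text",
    "C": "candidate C text",
}
for judge in ("claude", "gpt", "gemini"):
    print(judge, shuffled_order(candidates, judge))
```

Because the seed depends only on the inputs, re-running the judgment on the same candidates presents them in the same order, while editing any candidate text changes the shuffle.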

📄 Full receipt (prompts, raw responses, shuffle mappings, seeds, parsed scores): verdicts/2026-05-22-hero-line.json — audit it yourself.

If you'd ship a hero line without judging it, we should probably talk about evals.


Receipts

Specific files worth opening — actual code, not vaporware:


Stack

Claude Agent SDK · MCP · Anthropic SDK · Cohen's kappa · Kendall-tau · MT-Bench · LLM-as-judge · Python · TypeScript


Contact

cooperation FTW · located on Earth

Pinned

  1. ai-eval-toolkit

    Eval toolkit for LLM-as-judge calibration: Cohen's kappa, Kendall-tau, regression gates.

    Python

  2. agentic-eval-harness

    Eval-gated runner driving Claude Code through phases with cross-vendor decision-support gates.

    Python 1

  3. ai-eval-atlas

    A self-taught engineer's structured map of the AI evaluation field.

    1

  4. ai-engineer-best-practices

    Field-tested patterns for shipping LLM systems: prompts, evals, agents, observability.

    Python

  5. learn-ai-eval

    The Eval Codex: Claude-tutored AI-eval learning engine. Build eval expertise via guided practice.

    HTML

  6. ai-capability-atlas

    An honest, curated map of what AI can actually do: the tool to reach for per capability (open or closed), plus a Goal → Stack Cookbook of effectiveness-graded stacks.

    Python 1