CheckLLM: Reproducible Agent-Trajectory Evaluation

A deterministic, judge-free metric for scoring agent tool-call trajectories -- with AUROC 0.93 against synthetic ground truth and ~1500x faster than DeepEval's ToolCorrectnessMetric.

pip install checkllm

from checkllm.metrics.trajectory_metric import TrajectoryMetric

# Expected plan vs. what the agent actually did
expected = ["search", "fetch", "parse", "respond"]
actual = ["search", "fetch", "parse", "fetch", "respond"]

metric = TrajectoryMetric(expected_trajectory=expected)
sub = metric.compute_subscores(actual)

print(f"ordering   {sub.ordering:.2f}")   # 0.80
print(f"loops      {sub.loops:.2f}")      # 1.00
print(f"coverage   {sub.coverage:.2f}")   # 1.00
print(f"unexpected {sub.unexpected:.2f}") # 1.00
print(f"overall    {sub.overall:.2f}")    # 0.92

No judge LLM. No API key. Bit-identical scores across runs. See the 10-minute tutorial for the full walkthrough.

Why CheckLLM?

Deterministic -- no judge LLM, no API cost, bit-identical scores across runs and machines.
Composite -- 4-axis trajectory scoring (ordering, loops, coverage, unexpected), AUROC 0.93 [0.91, 0.94] on 500 trajectories.
OTel-compatible -- ingest traces from any agent framework via OpenTelemetry GenAI semantic conventions.

Beyond trajectory evaluation, CheckLLM also ships the broader testing suite the project has always provided:

Zero learning curve -- if you know pytest, you know checkllm. Just add a check parameter.
39 free deterministic checks run instantly with zero API calls. No API key needed to start.
72 LLM-as-judge metrics -- hallucination, faithfulness, trajectory, per-turn, dual-judge, and more.
151 red team vulnerability types with 25 attack strategies -- the most comprehensive adversarial testing suite available.
17 compliance frameworks -- OWASP LLM/API/Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, HIPAA, GDPR, and more.
Same checks everywhere -- use them in tests, CI, and production guardrails.

Quickstart

Install

pip install checkllm
checkllm init --use-case rag  # generates a tailored test file

1. Deterministic checks (free, instant)

def test_basic_quality(check):
    output = my_llm("Summarize this article.")

    check.contains(output, "key finding")
    check.max_tokens(output, limit=200)
    check.no_pii(output)
    check.is_json(output)
    check.gleu(output, reference="Expected summary text.", threshold=0.5)
    check.chrf(output, reference="Expected summary text.", threshold=0.4)
    check.latency_check(start_time, end_time, max_ms=3000)
    check.cost_check(input_tokens=500, output_tokens=200, model="gpt-4o", max_cost=0.05)

2. LLM-as-judge (deeper evaluation)

def test_rag_quality(check):
    output = my_rag("What causes climate change?")
    context = retrieve_context("climate change")

    check.hallucination(output, context=context)
    check.faithfulness(output, context=context)
    check.relevance(output, query="What causes climate change?")
    check.toxicity(output)

3. Fluent chaining

def test_with_chaining(check):
    output = my_llm("Explain quantum physics simply.")

    check.that(output) \
        .contains("quantum") \
        .max_tokens(200) \
        .has_no_pii() \
        .scores_above("relevance", 0.8, query="quantum physics")

4. Production guardrails

from checkllm import Guard, CheckSpec

guard = Guard(checks=[
    CheckSpec(check_type="no_pii"),
    CheckSpec(check_type="max_tokens", params={"limit": 500}),
    CheckSpec(check_type="toxicity"),
])

result = guard.validate(llm_output)
if not result.valid:
    result.raise_on_failure()

5. YAML-based evaluation

# checkllm.yaml
description: "Customer support chatbot evaluation"
judge:
  backend: openai
  model: gpt-4o

prompts:
  - "You are a helpful support agent. Answer: {{query}}"

tests:
  - vars:
      query: "How do I return an item?"
    assert:
      - type: contains
        value: "return policy"
      - type: relevance
        threshold: 0.8
      - type: no_pii
      - type: max_tokens
        value: 500

settings:
  budget: 5.0

checkllm eval-yaml checkllm.yaml

How checkllm compares

Independent benchmark, not just feature counts. On the public competitor leaderboard (docs/benchmarks/competitor-comparison.md) checkllm holds rank 1 on every published row against DeepEval and promptfoo: halubench/hallucination 0.783, ragtruth/hallucination 0.663, ragtruth/faithfulness 0.754, ragtruth/context_relevance 0.565, and truthfulqa/answer_relevancy 0.546 (ROC-AUC, gpt-4o-mini judge, 200 source rows per slice). Methodology is in docs/benchmarks/methodology.md; raw scores ship in benchmarks/competitor_comparison/.

Feature comparison

Feature	checkllm	DeepEval	Ragas	promptfoo
pytest native	Yes	Wrapper	No	No
Free deterministic checks	39	Limited	Limited	Yes
LLM-as-judge metrics	72	~50	~40	Custom
Red team vulnerability types	151	40+	0	100+
Attack strategies	25	10+	0	30+
Compliance frameworks	17	3	0	10+
Multi-provider judges	15+ backends	13+	~6	50+
Consensus judging	7 strategies	No	Dual-judge	No
Production guardrails	Built-in	No	No	API
Cost control & budgets	Built-in	No	No	Caching
Knowledge Graph synthesis	Full pipeline	No	Yes	No
Multilingual prompts	20 languages	No	Yes	No
Prompt optimization	4 algorithms	4	2	No
YAML config evaluation	Yes	No	No	Yes
Streaming evaluation	Token-by-token	No	No	No
Regression detection	Statistical (p-values)	No	No	No
DPO export	Yes	No	No	No
Telemetry / phoning home	None	PostHog + Sentry	None	Telemetry
Independence	Fully independent	YC-backed	YC-backed	OpenAI-owned

All metrics by category

RAG Evaluation (14 metrics)

hallucination faithfulness faithfulness_hhem context_relevance context_entity_recall contextual_precision contextual_recall answer_completeness groundedness nonllm_context_precision nonllm_context_recall quoted_spans_alignment nv_context_relevance nv_response_groundedness

General Quality (12 metrics)

relevance coherence fluency consistency correctness factual_correctness sentiment toxicity bias summarization nv_answer_accuracy prompt_alignment

Completeness & Instruction Following (5 metrics)

response_completeness instruction_following instruction_completeness conversation_completeness topic_adherence

Agent & Tool Evaluation (12 metrics)

task_completion tool_accuracy tool_call_f1 plan_adherence plan_quality step_efficiency knowledge_retention goal_accuracy trajectory_goal_success trajectory_tool_sequence trajectory_step_count trajectory_tool_args_match

Per-Turn Conversation (3 metrics)

turn_relevancy turn_faithfulness turn_coherence

Multimodal (6 metrics)

image_relevance image_helpfulness image_coherence text_to_image image_editing image_reference

Structured Output (4 metrics)

code_correctness sql_equivalence comparative_quality datacompy_score

Role & Safety (3 metrics)

role_adherence role_violation non_advice

MCP & Tool-Specific (3 metrics)

mcp_use mcp_task_completion multi_turn_mcp_use

Specialized (3 metrics)

g_eval noise_sensitivity rubric

Deterministic Checks (39, zero API cost)

contains not_contains starts_with ends_with regex exact_match exact_match_strict min_tokens max_tokens min_words max_words min_chars max_chars min_sentences max_sentences is_json json_schema is_xml is_yaml is_html no_pii language readability similarity bleu rouge_l meteor gleu chrf latency_check cost_check string_distance perplexity is_valid_python is_url has_url word_count char_count sentence_count

Red teaming & adversarial testing

from checkllm.redteam import RedTeamer, VulnerabilityType
from checkllm.redteam_strategies import StrategyType

red = RedTeamer()
report = await red.scan(
    target=my_llm_function,
    vulnerability_types=[
        VulnerabilityType.PROMPT_INJECTION,
        VulnerabilityType.JAILBREAK,
        VulnerabilityType.PII_LEAKAGE,
        VulnerabilityType.DATA_EXFILTRATION,
    ],
    strategies=[StrategyType.BASE64, StrategyType.CRESCENDO, StrategyType.PERSONA],
    attacks_per_type=5,
)
print(report.summary())
print(report.risk_summary())  # CVSS severity breakdown

151 vulnerability types across 12 categories: prompt injection, jailbreak, PII leakage, harmful content, encoding attacks, privilege escalation, agentic AI attacks, brand & reputation, industry compliance, and more.

25 attack strategies: BASE64, ROT13, HEX, LEETSPEAK, MORSE, HOMOGLYPH, CRESCENDO (multi-turn escalation), JAILBREAK_TREE, JAILBREAK_META, JAILBREAK_COMPOSITE, BEST_OF_N, PERSONA, HYPOTHETICAL, ROLEPLAY, LAYER (composable chaining), and more.

Coding agent security

from checkllm.redteam_coding_agents import CodingAgentScanner

scanner = CodingAgentScanner(judge=judge)
report = await scanner.scan(target=my_coding_agent)
# Tests: repo prompt injection, sandbox escape, secret leakage, verifier sabotage

Compliance frameworks

from checkllm.compliance_frameworks import ComplianceScanner, ComplianceFramework

scanner = ComplianceScanner(judge=judge)
report = await scanner.scan(
    target=my_llm,
    frameworks=[
        ComplianceFramework.OWASP_LLM_TOP10,
        ComplianceFramework.OWASP_AGENTIC_TOP10,
        ComplianceFramework.EU_AI_ACT,
        ComplianceFramework.HIPAA,
    ],
)
print(report.summary())

17 frameworks: OWASP LLM Top 10, OWASP API Top 10, OWASP Agentic Top 10, MITRE ATLAS, EU AI Act, ISO 42001, NIST AI RMF, NIST CSF, HIPAA, GDPR, PCI-DSS, SOC2, ISO 27001, COPPA, FERPA, CCPA, DoD AI Ethics.

Knowledge Graph test generation

from checkllm.knowledge_graph import KGTestGenerator, EntityExtractor, SimilarityBuilder

gen = KGTestGenerator(judge=judge)
samples = await gen.generate(
    documents=["doc1 text...", "doc2 text..."],
    num_samples=50,
    synthesizers={"single_hop": 0.4, "multi_hop_abstract": 0.3, "multi_hop_specific": 0.3},
    personas=5,
)
cases = gen.to_cases(samples)

Build a knowledge graph from your documents, then generate diverse test cases with single-hop, multi-hop abstract, and multi-hop specific queries. Supports persona variation, query styles (web search, misspelled, conversational), and configurable complexity.

Multilingual evaluation

from checkllm.multilingual import PromptAdapter, detect_language

adapter = PromptAdapter(judge=judge)
translated = await adapter.adapt(template=my_prompt, target_language="ja")
adapter.save_translations("translations/ja.json")

lang = detect_language("Esto es un texto en espanol.")  # "es"

Supports 20+ languages with automatic prompt adaptation. Language detection uses Unicode character-range analysis with LLM fallback.

Prompt optimization

from checkllm.optimize import create_optimizer

optimizer = create_optimizer("miprov2", judge=judge)  # or "genetic", "copro", "simba"
result = await optimizer.optimize(
    prompt="Summarize this document.",
    test_cases=my_test_cases,
    metric_fn=my_metric,
    num_candidates=10,
)
print(f"Improved from {result.initial_score:.2f} to {result.best_score:.2f}")

Four optimization algorithms: Genetic (evolutionary), MIPROv2 (instruction + demonstration), COPRO (failure-driven iterative), SIMBA (similarity-based adaptation).

Multi-provider judges

from checkllm import create_judge

judge = create_judge("openai", model="gpt-4o")
judge = create_judge("anthropic", model="claude-sonnet-4-6")
judge = create_judge("gemini", model="gemini-2.0-flash")
judge = create_judge("ollama", model="llama3.1")       # Free, local
judge = create_judge("litellm", model="any-model")     # 100+ models
judge = create_judge("deepseek")
judge = create_judge("groq")
judge = create_judge("fireworks")

Auto-detection: set OPENAI_API_KEY, ANTHROPIC_API_KEY, or have Ollama running -- checkllm picks the best judge automatically.

Consensus judging

from checkllm import ConsensusJudge

judges = [("gpt4", gpt4_judge), ("claude", claude_judge), ("gemini", gemini_judge)]
consensus = ConsensusJudge(judges, strategy="majority")  # or mean, median, unanimous, min, max, weighted

Cost control

checkllm estimate tests/              # See costs before running
checkllm run tests/ --budget 5.0      # Cap spend at $5
checkllm run tests/ --dry-run         # Estimate without executing

Configuration

# pyproject.toml
[tool.checkllm]
judge_backend = "auto"
judge_model = "gpt-4o"
default_threshold = 0.8
budget = 10.0
cache_enabled = true
engine = "auto"

CLI

Command	Description
`checkllm init`	Scaffold a project (`--use-case`, `--ci`)
`checkllm run`	Run tests (`--budget`, `--dry-run`, `--snapshot`)
`checkllm eval-yaml`	Run YAML-based evaluation
`checkllm estimate`	Estimate costs before running
`checkllm watch`	Re-run on file changes
`checkllm report`	Generate HTML report
`checkllm snapshot`	Save baseline for regression detection
`checkllm diff`	Compare snapshots
`checkllm history`	View run history and trends
`checkllm list-metrics`	Show all available checks and metrics
`checkllm cache`	Manage judge response cache
`checkllm dashboard`	Launch web dashboard

Framework integrations

# LangChain
from checkllm.integrations.langchain import CheckllmCallbackHandler
chain.invoke(input, config={"callbacks": [CheckllmCallbackHandler(checks=["no_pii"])]})

# CrewAI
from checkllm.integrations.crewai import CheckllmCrewCallback

# OpenAI Agents SDK
from checkllm.integrations.openai_agents import CheckllmRunHandler

# Claude Agent SDK
from checkllm.integrations.claude_agents import CheckllmAgentHandler

# PydanticAI
from checkllm.integrations.pydantic_ai import CheckllmResultValidator

# LlamaIndex
from checkllm.integrations.llama_index import CheckllmCallbackHandler

Custom metrics

from checkllm import metric, CheckResult

@metric("brevity")
def brevity_check(output: str, max_words: int = 50, **kwargs) -> CheckResult:
    words = len(output.split())
    return CheckResult(
        passed=words <= max_words,
        score=min(1.0, max_words / max(words, 1)),
        reasoning=f"{words} words (limit: {max_words})",
        cost=0.0, latency_ms=0, metric_name="brevity",
    )

Citing CheckLLM

If you use CheckLLM's trajectory metric in academic work, please cite the companion paper:

@article{dejesus2026checkllm,
  title        = {{CheckLLM}: Reproducible Agent-Trajectory Evaluation at Scale},
  author       = {de Jesus, Javier},
  journal      = {arXiv preprint arXiv:XXXX.YYYYY},
  year         = {2026},
  doi          = {10.5281/zenodo.PLACEHOLDER},
  url          = {https://github.com/javierdejesusda/checkllm}
}

The arXiv ID and Zenodo DOI placeholders will be replaced once the paper-v1 tag is cut. See CITATION.cff for the canonical citation metadata.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 204 Commits
.github		.github
benchmarks		benchmarks
docs		docs
editor/vscode		editor/vscode
examples		examples
paper		paper
scripts		scripts
src/checkllm		src/checkllm
tests		tests
vscode-checkllm		vscode-checkllm
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
SECURITY.md		SECURITY.md
mkdocs.yml		mkdocs.yml
pyproject.toml		pyproject.toml
requirements.lock		requirements.lock

Folders and files

Latest commit

History

Repository files navigation

CheckLLM: Reproducible Agent-Trajectory Evaluation

Why CheckLLM?

Quickstart

Install

1. Deterministic checks (free, instant)

2. LLM-as-judge (deeper evaluation)

3. Fluent chaining

4. Production guardrails

5. YAML-based evaluation

How checkllm compares

Feature comparison

All metrics by category

RAG Evaluation (14 metrics)

General Quality (12 metrics)

Completeness & Instruction Following (5 metrics)

Agent & Tool Evaluation (12 metrics)

Per-Turn Conversation (3 metrics)

Multimodal (6 metrics)

Structured Output (4 metrics)

Role & Safety (3 metrics)

MCP & Tool-Specific (3 metrics)

Specialized (3 metrics)

Deterministic Checks (39, zero API cost)

Red teaming & adversarial testing

Coding agent security

Compliance frameworks

Knowledge Graph test generation

Multilingual evaluation

Prompt optimization

Multi-provider judges

Consensus judging

Cost control

Configuration

CLI

Framework integrations

Custom metrics

Citing CheckLLM

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 6

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages