A curated reading list on rubric-based evaluation, reward modeling, and post-training for large models.
Rubrics turn expert judgment into structured criteria, auditable LLM judges, and trainable reward signals.
- [2026-06] Our survey paper, Structuring Human Objectives: A Survey of Rubrics for Evaluation, Alignment, and Agentic AI, is now available: PDF.
Papers with publicly released code or project resources are marked with inline [[Code](...)] or [[Proj](...)] links. Entries without verified repositories omit that link.
Contributions are welcome. If you find missing papers, inaccurate classifications, or newly released code, feel free to update this list.
In the context of LLM evaluation and alignment, a rubric is a structured set of criteria for judging open-ended model outputs. Instead of asking a human or LLM judge for one vague preference, rubrics decompose quality into explicit dimensions, scoring rules, and evidence requirements.
Rubrics make subjective judgment more inspectable:
- What to judge: relevance, factuality, completeness, safety, reasoning quality, style, or domain-specific standards.
- How to judge: score levels, checklists, pairwise criteria, evidence anchors, or weighted dimensions.
- How to use the judgment: evaluation reports, LLM-as-a-Judge protocols, reward models, preference tuning, policy optimization, and curriculum learning.
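The three questions above map naturally onto a small data structure. The following is a minimal sketch, not a schema taken from any paper in this list; the class names `Criterion` and `Rubric`, their fields, and the example criteria are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    """One rubric dimension: what to judge and how to score it."""
    name: str            # what to judge, e.g. "factuality"
    description: str     # instruction shown to the human or LLM judge
    max_score: int = 4   # how to judge: integer levels 0..max_score
    weight: float = 1.0  # relative importance when scores are aggregated


@dataclass
class Rubric:
    """A weighted collection of criteria for one task or task family."""
    task: str
    criteria: list[Criterion] = field(default_factory=list)

    def validate(self, scores: dict[str, int]) -> None:
        """Check that a judge returned one in-range score per criterion."""
        for c in self.criteria:
            if c.name not in scores:
                raise ValueError(f"missing score for criterion: {c.name}")
            if not 0 <= scores[c.name] <= c.max_score:
                raise ValueError(f"score out of range for criterion: {c.name}")


# Hypothetical rubric for illustration only.
rubric = Rubric(
    task="open-ended QA",
    criteria=[
        Criterion("relevance", "Does the answer address the question?"),
        Criterion("factuality", "Are the claims correct?", weight=2.0),
        Criterion("safety", "Is the answer free of harmful advice?", max_score=1),
    ],
)
rubric.validate({"relevance": 3, "factuality": 4, "safety": 1})
```

Keeping the description, score range, and weight together on each criterion is one way to make a judge's output checkable before it is used in an evaluation report or as a reward.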
Figure 1. Rubrics convert coarse feedback into fine-grained, inspectable reward signals.
| Feedback style | Typical signal | Best fit | Main limitation |
|---|---|---|---|
| RLHF / model-based preference | "Output A is better than output B." | Open-ended comparison | Coarse and hard to inspect |
| RLVR / rule-based reward | Format is correct, answer matches, reasoning token appears, list structure exists | Verifiable tasks | Too rigid for subjective or open-ended tasks |
| Rubric-based feedback | Relevance, completeness, clarity, safety, each scored separately | Open-ended evaluation and training | Requires careful design and calibration |
Rubrics are the middle layer: more structured than model-only preference, more flexible than hard rules.
"In this new era, evaluation becomes more important than training."
- Shunyu Yao, The Second Half (2025)
As large models move from closed-form QA to open-ended reasoning, agents, multimodal generation, and professional domains, progress is increasingly bottlenecked by evaluation and feedback design. Training can optimize only what the system can measure, and many important tasks cannot be reduced to a single scalar reward.
| Rubrics help answer | Why it matters |
|---|---|
| What counts as good behavior? | They define explicit criteria, scoring boundaries, and failure modes. |
| How can expert judgment scale? | They convert tacit standards into reusable evaluation instructions and datasets. |
| How can LLM-as-a-Judge become less opaque? | Judges can be required to expose criteria, evidence, scores, and rationales. |
| How does evaluation become training signal? | Rubric-level feedback can supervise SFT, preference tuning, policy optimization, reward modeling, and curriculum learning. |
Rubrics therefore act as a bridge between human standards and machine-optimizable signals. They are not merely annotation templates; they are a control surface for evaluation, reward modeling, and post-training.
Figure 2. The number of rubric-related papers has grown rapidly, suggesting increasing research attention to structured evaluation and reward design.
This trend suggests that rubric-based methods are becoming an important direction for large-model alignment as evaluation, reward modeling, and post-training move toward more structured and auditable feedback.
Evaluation is no longer only a post-hoc metric. It is becoming part of the infrastructure of AI systems:
Expert Standards → Rubrics → Evaluation Signals → Rewards → Training Dynamics
Rubrics are therefore not just for judging model outputs. They provide a way to automate parts of expert feedback: experts define criteria, models apply them at scale, and failures reveal where the rubric or judge must be revised. In this sense, evaluation becomes an executable form of domain knowledge.
For the query:
How can cities encourage more people to use public transport?
a rubric does not directly ask "which answer is better?" It decomposes the judgment:
| Component | What the judge checks |
|---|---|
| Relevance | Does the answer address public transport adoption rather than unrelated urban issues? |
| Clarity | Is the answer easy to understand and well organized? |
| Completeness | Does it cover affordability, convenience, infrastructure, reliability, and incentives? |
| Safety / fairness | Does it avoid harmful, biased, or exclusionary suggestions? |
This makes the reward more interpretable, decomposable, and actionable.
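To make the decomposition concrete, the sketch below combines per-component scores into one scalar while keeping the breakdown. The weights and the judge scores are illustrative assumptions, not values from any paper or dataset, and a weighted mean is only one possible aggregation rule.

```python
# Hypothetical weights for the four components in the table above.
WEIGHTS = {"relevance": 0.3, "clarity": 0.2, "completeness": 0.3, "safety_fairness": 0.2}


def rubric_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-criterion scores, each expected in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


# Scores a judge might assign to one answer (illustrative values, not real data).
scores = {"relevance": 1.0, "clarity": 0.5, "completeness": 0.4, "safety_fairness": 1.0}
reward = rubric_reward(scores, WEIGHTS)

# The per-criterion breakdown shows where the answer lost the most credit.
weakest = min(scores, key=scores.get)
```

With these numbers the reward is about 0.72 and the weakest component is `completeness`, which is the kind of actionable signal a single preference label would not provide.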
Figure 3. Rubric construction paradigms for large model alignment.
| Strategy | Core idea | When it is useful |
|---|---|---|
| Expert direct annotation | Experts write criteria explicitly. | High-stakes domains and seed rubrics |
| Induction from expert QA annotations | Criteria are extracted from annotated examples. | Scaling expert knowledge beyond manual templates |
| Distillation from teacher demonstrations | Rubrics are derived from high-quality model outputs. | Bootstrapping scalable reward signals |
Together, these strategies show how rubric construction moves from manual specification toward data-driven induction and model-driven distillation.
This repository is organized as a conceptual map of rubric-related research. We group papers by the role rubrics play in the large-model pipeline.
This organization helps show rubrics not only as evaluation tools, but also as structured interfaces connecting expert standards, feedback data, reward signals, training objectives, and deployment-time assessment.
| Section | Role in the repository |
|---|---|
| Data | Covers how rubrics are collected, generated, refined, and organized into reusable supervision signals through human annotation, synthetic generation, expert labeling, and rubric datasets. |
| Training | Summarizes how rubric-level judgments can be transformed into SFT data, preference objectives, RL rewards, curriculum signals, and self-improvement loops. |
| Evaluation | Connects rubrics to LLM-as-a-judge protocols, benchmark design, calibration, reliability analysis, and robustness checks, where explicit and auditable criteria are especially important. |
| Applications | Shows how rubric-based methods extend beyond text QA to multimodal tasks, agent systems, and professional domains that require domain-specific standards. |
Overall, this structure follows the lifecycle of rubric-based large-model alignment:
Define criteria → collect or generate rubric data → train with rubric signals → evaluate with structured judges → apply in domain-specific tasks
Rubrics provide a structured layer for connecting data, training, evaluation, and applications.
Browse the reading list
Papers with publicly released code are marked with π.
This section collects work on how rubrics are created, refined, and validated before use in evaluation or training.
Expert-based annotation relies on domain specialists or expert-designed protocols when reliable rubrics require professional standards or task-specific expertise.
These works study tasks where expert knowledge is necessary to define what counts as a correct, safe, complete, or high-quality answer.
- π [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Proj]
- π [arXiv 2026.01] PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [Code]
- π [arXiv 2025.11] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning [Proj]
- π [ACL 26] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
- π [ICLR 26] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [Code]
- π [arXiv 2025.06] Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability [Proj]
- π [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Proj]
These works release or organize expert-provided rubrics, checklists, or evaluation criteria as reusable assets for judging model outputs.
- π [arXiv 2026.03] XpertBench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
- π [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
- π [arXiv 2025.12] The AI Consumer Index (ACE) [Proj]
- π [ICLR 26] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [Proj]
- π [arXiv 2025.09] The AI Productivity Index: APEX-v1-extended [Proj]
- π [ICLR 26] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists [Proj]
Model-based annotation uses LLMs or automated pipelines to construct rubric criteria at scale, reducing reliance on fully manual rubric authoring.
Naive generation methods ask models to produce criteria or checklists directly from the task, prompt, answer, or context, without grounding them in preference pairs or iterative human correction.
- π [arXiv 2026.03] Qworld: Question-Specific Evaluation Criteria for LLMs [Proj]
- π [arXiv 2026.03] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation [Data]
- π [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]
- π [ACL 26] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Model]
Pairs-grounded methods infer criteria from preferences, comparisons, or contrastive response pairs, converting implicit relative judgments into explicit rubric dimensions.
- π [arXiv 2026.03] CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling [Code]
- π [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation [Model]
- π [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
- [arXiv 2025.10] Online Rubrics Elicitation from Pairwise Comparisons
- π [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
- π [arXiv 2025.06] AutoRule: Reasoning Chain-of-Thought Extracted Rule-Based Rewards Improve Preference Learning [Code]
- [ACL Findings 2025] CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling
Iterative refinement methods repeatedly revise rubrics using feedback, disagreement, scoring errors, or self-reflection so that the criteria better match intended judgments.
- π [ICLR 26] OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation [Code]
- [arXiv 2026.03] Confusion-Aware Rubric Optimization for LLM-based Automated Grading
- [ICLR 26] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
- [arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
- π [ACL 26] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation [Code]
- π [ICLR 26] RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks [Proj]
Human-AI collaboration treats rubric construction as a joint process where humans guide, inspect, or correct model-generated criteria rather than fully delegating rubric design.
- π [arXiv 2026.04] XpertBench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
- π [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
- π [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
- π [ICLR 25] WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild [Code]
This section covers methods that turn rubrics into training signals for models, reward models, evaluators, and agents.
Pre-training work would use rubric-like supervision before task-specific alignment, preparing models for later rubric-guided evaluation or optimization.
- No papers were retained in this category after full-text review.
Post-training work applies rubrics after base-model training, using them for supervised fine-tuning, reward modeling, reinforcement learning, or self-improvement.
Rubrics can guide supervised fine-tuning by filtering examples, weighting samples, generating rationales, or teaching models to follow explicit criteria.
- π [arXiv 2026.01] P-Check: Advancing Personalized Reward Models via Learning to Generate Dynamic Checklists [Code]
- π [ICLR 26] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [Code]
- π [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
- π [ICLR 26] RM-R1: Reward Modeling as Reasoning [Proj]
Preference-reward methods use rubrics or criteria to structure preference data and train reward models, rather than directly optimizing a hand-written rubric score.
- π [arXiv 2026.05] Rubric-based On-policy Distillation [Code]
- π [arXiv 2026.04] C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences [Code]
- [arXiv 2026.04] Visual Preference Optimization with Rubric Rewards
- π [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning [Code]
- π [arXiv 2026.02] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric [Code]
- π [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
- π [ICML-W] Configurable Preference Tuning with Rubric-Guided Synthetic Data [Code]
- π [NeurIPS 25] Checklists Are Better Than Reward Models For Aligning Language Models [Code]
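One simple way to picture the preference-reward setting described at the top of this subsection is to turn rubric scores for several responses to the same prompt into (chosen, rejected) pairs. This is a minimal sketch under stated assumptions; the function name `build_preference_pairs` and the `margin` threshold are hypothetical and do not reproduce the method of any listed paper.

```python
from itertools import combinations


def build_preference_pairs(scored, margin=0.1):
    """Turn rubric-scored responses into (chosen, rejected) pairs.

    `scored` is a list of (response_text, rubric_score) tuples for ONE prompt.
    Pairs whose score gap is below `margin` are skipped as too ambiguous.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored, 2):
        if abs(score_a - score_b) < margin:
            continue
        chosen, rejected = (resp_a, resp_b) if score_a > score_b else (resp_b, resp_a)
        pairs.append({"chosen": chosen, "rejected": rejected})
    return pairs


# Illustrative scores, not real data.
scored = [("answer A", 0.9), ("answer B", 0.5), ("answer C", 0.45)]
pairs = build_preference_pairs(scored)
```

Here the B/C pair is dropped because the two scores are nearly tied, leaving two pairs in which `answer A` is chosen. The resulting pairs could then feed a reward model or a preference-tuning objective.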
Direct-reward RL directly optimizes policies using rubric scores, checklist outcomes, verifier judgments, or criterion-level feedback as rewards.
Rubric judgment pattern methods ask a judge, verifier, or rubric model to score outputs against explicit criteria and aggregate those judgments into rewards.
- π [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
- π [arXiv 2025.11] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [Code]
- π [arXiv 2025.10] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs [Code]
- π [arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Model]
- π [arXiv 2025.08] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [Code]
- π [ICLR 26] Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation [Code]
- π [arXiv 2025.03] WritingBench: A Comprehensive Benchmark for Generative Writing [Code]
- π [ACL 24] LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts [Code]
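The judge-then-aggregate pattern described at the start of this subsection can be sketched as a weighted checklist. The judge below is a toy keyword matcher standing in for an LLM call so that the example runs on its own; `keyword_judge`, `checklist_reward`, and the checklist items are hypothetical.

```python
from typing import Callable

# A judge maps (response, criterion text) to True/False. In practice this would
# wrap an LLM call; here it is a hypothetical keyword check so the sketch runs.
Judge = Callable[[str, str], bool]


def keyword_judge(response: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: the criterion is a required keyword."""
    return criterion.lower() in response.lower()


def checklist_reward(response: str, checklist: list[tuple[str, float]], judge: Judge) -> float:
    """Fraction of total checklist weight whose criteria the judge marks satisfied."""
    total = sum(weight for _, weight in checklist)
    earned = sum(weight for criterion, weight in checklist if judge(response, criterion))
    return earned / total


# Illustrative checklist: (criterion, weight).
checklist = [("fares", 1.0), ("reliability", 1.0), ("bus lanes", 2.0)]
response = "Lower fares and dedicated bus lanes make transit more attractive."
reward = checklist_reward(response, checklist, keyword_judge)
```

The response satisfies two of the three items, which carry 3 of the 4 total weight units, so the reward is 0.75. Swapping `keyword_judge` for a model-backed judge leaves the aggregation unchanged.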
Rubric grader analysis studies the reliability, calibration, robustness, and failure modes of rubric graders used as reward or evaluation signals.
- π [arXiv 2026.05] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria [Code]
- [arXiv 2026.03] Alternating Reinforcement Learning with Contextual Rubric Rewards
- [arXiv 2026.03] Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
- [arXiv 2026.03] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
- [arXiv 2026.02] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
- π [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
- π [arXiv 2026.01] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation [Code]
- π [arXiv 2025.12] Training AI Co-Scientists Using Rubric Rewards [Data]
- π [arXiv 2025.10] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs [Code]
- π [arXiv 2025.03] Rubric Is All You Need: Improving LLM-Based Code Evaluation With Question-Specific Rubrics [Code]
- [ICLR 26] QuRL: Rubrics As Judge For Open-Ended Question Answering
- π [arXiv 2024.10] JudgeBench: A Benchmark for Evaluating LLM-based Judges [Code]
- π [arXiv 2023.06] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [Code]
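A common first check on grader reliability is chance-corrected agreement between a rubric grader and human annotators on the same criterion-level verdicts. The sketch below computes Cohen's kappa for binary labels; the label lists are made-up illustrative data, and the papers above may use other reliability measures.

```python
def cohens_kappa(labels_a: list[int], labels_b: list[int]) -> float:
    """Cohen's kappa for two raters over the same items with binary labels (0/1)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal rate of label 1.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative verdicts on eight responses for one criterion (1 = satisfied).
human = [1, 1, 0, 1, 0, 0, 1, 1]
judge = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(human, judge)
```

In this example the raters agree on 6 of 8 items (0.75), but because both mark most items as satisfied, chance agreement is already 0.53, so kappa is only about 0.47. Raw agreement alone would overstate how reliable the grader is.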
Multi-objective optimization treats rubric dimensions as separate reward components, balancing competing goals such as correctness, safety, helpfulness, and style.
- π [arXiv 2026.03] PAPO: Stabilizing Rubric Integration Training via Decoupled Advantage Normalization [Code]
- [arXiv 2026.03] Alternating Reinforcement Learning with Contextual Rubric Rewards
- [arXiv 2026.02] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
- π [arXiv 2026.01] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [Proj]
- π [arXiv 2025.11] Multidimensional Rubric-oriented Reward Model Learning via Geometric Projection Reference Constraints [Model]
- π [arXiv 2025.08] Reinforcement Learning with Rubric Anchors [Model]
- [arXiv 2025.08] Pareto Multi-Objective Alignment for Language Models
- π [arXiv 2024.06] Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [Model]
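When rubric dimensions are treated as separate reward components, their raw scales can differ, so one dimension may dominate a plain sum. One generic remedy is to standardize each dimension within a group of responses to the same prompt before combining. The sketch below illustrates that idea only; it is not the algorithm of any paper listed above, and `per_dimension_advantages` and its inputs are hypothetical.

```python
from statistics import mean, pstdev


def per_dimension_advantages(group_rewards, weights, eps=1e-8):
    """Standardize each rubric dimension across a group of sampled responses,
    then combine the standardized values with fixed weights.

    `group_rewards` is a list of dicts, one per response to the SAME prompt,
    mapping dimension name -> raw reward.
    """
    advantages = [0.0] * len(group_rewards)
    for dim, weight in weights.items():
        values = [r[dim] for r in group_rewards]
        mu, sigma = mean(values), pstdev(values)
        for i, v in enumerate(values):
            # eps avoids division by zero when all responses tie on a dimension.
            advantages[i] += weight * (v - mu) / (sigma + eps)
    return advantages


# Illustrative rewards for three sampled responses (not real data).
group = [
    {"correctness": 1.0, "style": 0.2},
    {"correctness": 0.0, "style": 0.4},
    {"correctness": 1.0, "style": 0.9},
]
advantages = per_dimension_advantages(group, {"correctness": 1.0, "style": 1.0})
```

Because each dimension is centered within the group, the combined advantages sum to approximately zero, and the third response, which is correct and scores best on style, receives the largest value.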
Credit assignment methods distribute rubric feedback from a final answer to intermediate steps, tokens, stages, or features so training receives denser supervision.
- π [arXiv 2026.04] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks [Code]
- π [arXiv 2026.03] A Rubric-Supervised Critic from Sparse Real-World Outcomes [Code]
- π [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
- π [arXiv 2026.02] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability [Proj]
- π [arXiv 2026.02] Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [Code]
- [arXiv 2026.02] Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning
- π [arXiv 2026.01] Technical Report Tongyi DeepResearch [Code]
- π [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
- π [arXiv 2025.12] Step-DeepResearch Technical Report [Code]
- π [arXiv 2025.12] TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning [Model]
- π [arXiv 2025.10] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards [Code]
- [arXiv 2025.06] Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks
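As a toy illustration of the credit-assignment idea introduced at the top of this subsection, the sketch below spreads response-level rubric credit over the steps of a trajectory by attributing each criterion to the first step that satisfies it. This attribution rule, the function `step_rewards`, and the substring-based step judge are assumptions made for illustration, not a method from the listed papers.

```python
def step_rewards(steps, criteria, satisfied):
    """Assign each criterion's weight to the first step that satisfies it.

    steps: list of step strings from one trajectory.
    criteria: dict mapping criterion name -> weight.
    satisfied: function (step_text, criterion_name) -> bool, a hypothetical
               per-step judge.
    Returns one reward per step; unmet criteria contribute nothing.
    """
    rewards = [0.0] * len(steps)
    for name, weight in criteria.items():
        for i, step in enumerate(steps):
            if satisfied(step, name):
                rewards[i] += weight
                break  # credit only the earliest satisfying step
    return rewards


# Illustrative trajectory and criteria; the judge is a plain substring check.
steps = ["search for fare data", "cite the transit report", "write summary with citation"]
criteria = {"search": 1.0, "cit": 2.0, "chart": 1.0}
rewards = step_rewards(steps, criteria, lambda step, name: name in step)
```

The result is `[1.0, 2.0, 0.0]`: the search step and the citing step each receive credit, the unmet `chart` criterion adds nothing, and the per-step rewards still sum to the response-level score for the criteria that were met.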
Agent harness methods embed rubric-based judges, verifiers, or reward models inside long-horizon agent loops to guide planning, tool use, and process quality.
- [arXiv 2026.04] SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling
- π [arXiv 2026.04] Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents [Proj]
- π [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning [Code]
- π [arXiv 2026.01] Agentic Rubrics as Contextual Verifiers for SWE Agents [Proj]
Advanced training methods make rubrics dynamic training instruments, using them as curricula, evolving objectives, or guidance signals rather than fixed scoring sheets.
Curriculum learning methods use rubric structure to order tasks, examples, or reward difficulty from easier criteria to harder ones.
- [arXiv 2026.02] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
- [arXiv 2026.02] Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling
- π [arXiv 2025.10] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training [Code]
- π [ICLR 26] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [Code]
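The easier-to-harder ordering described at the start of this subsection can be sketched by ranking criteria by how often the current model already satisfies them. The pass rates below are invented for illustration, and `curriculum_stages` is a hypothetical helper rather than a procedure from the listed papers.

```python
def curriculum_stages(pass_rates, num_stages=3):
    """Split criteria into at most `num_stages` stages, easiest first.

    pass_rates: dict mapping criterion name -> fraction of current model
                outputs that satisfy it (hypothetical pre-training estimates).
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(pass_rates, key=pass_rates.get, reverse=True)
    if not ordered:
        return []
    size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Illustrative pass rates (not measured data).
pass_rates = {
    "format": 0.9,
    "relevance": 0.8,
    "completeness": 0.5,
    "citations": 0.3,
    "counterarguments": 0.1,
}
stages = curriculum_stages(pass_rates)
```

This yields three stages: `format` and `relevance`, then `completeness` and `citations`, then `counterarguments`. A training loop could reward only the criteria from the stages unlocked so far and add harder ones as pass rates rise.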
Self-evolving learning methods let rubrics change during training, using model failures, self-play, memory, or feedback to raise or adjust standards.
- [arXiv 2026.02] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
- π [arXiv 2026.02] Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics [Proj]
- [arXiv 2026.02] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
- π [arXiv 2025.11] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [Code]
- π [arXiv 2025.10] Online Rubrics Elicitation from Pairwise Comparisons [Proj]
- π [arXiv 2025.08] Reinforcement Learning with Rubric Anchors [Model]
Hint-based learning uses rubrics as scaffolds, critiques, or guidance signals that point the learner toward missing criteria during optimization.
- [arXiv 2026.05] DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
- [arXiv 2026.05] Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
- [arXiv 2025.11] Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
- π [arXiv 2025.08] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [Code]
This section covers rubric signals constructed or adapted at test time to guide judging, verification, refinement, or response selection without updating model weights.
Inference-time rubric supervision generates or adapts criteria during deployment, using extra test-time computation to decompose, judge, and refine outputs when fixed references are unavailable.
- π [arXiv 2025.09] Language Models that Think, Chat Better [Code]
- [ICML 26] Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
- π [arXiv 2025.04] Inference-Time Scaling for Generalist Reward Modeling [Model]
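A simple instance of using rubric signals at inference time without updating weights is best-of-n selection: sample several candidates, score each against the rubric, and return the top one. The sketch below assumes a toy coverage score in place of a real rubric judge; `best_of_n`, `coverage_score`, and the topic list are hypothetical.

```python
def best_of_n(candidates, score_fn):
    """Return the candidate with the highest rubric score, plus all scores."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    scores = [score_fn(c) for c in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores


# Toy rubric: fraction of required topics mentioned (stand-in for an LLM judge).
TOPICS = ["fares", "frequency", "safety"]


def coverage_score(text: str) -> float:
    lowered = text.lower()
    return sum(topic in lowered for topic in TOPICS) / len(TOPICS)


candidates = [
    "Cut fares.",
    "Cut fares and raise frequency.",
    "Build more parking.",
]
best, scores = best_of_n(candidates, coverage_score)
```

The second candidate covers two of the three topics and is selected. Ties go to the earliest candidate, because `max` returns the first maximal index.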
This section focuses on reward-hacking failures that arise when models optimize against imperfect rubric rewards or rubric-based judges.
Rubric reward hacking work studies how models exploit loopholes, misspecified criteria, judge artifacts, or weak reward tails to obtain high rubric scores without genuinely better behavior.
- [arXiv 2026.05] Reward Hacking in Rubric-Based Reinforcement Learning
- [arXiv 2026.04] RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
- [arXiv 2026.03] Comparing Developer and LLM Biases in Code Evaluation
- π [arXiv 2026.02] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges [Code]
- [arXiv 2026.02] Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge
- [arXiv 2025.06] Robust Reward Modeling via Causal Rubrics
- [NeurIPS 24] Rule Based Rewards for Language Model Safety
Rubric-based evaluation work introduces benchmarks, datasets, or protocols that make open-ended model behavior comparable through explicit criteria and structured scoring.
Real-world task benchmarks evaluate performance in concrete domains where rubric criteria encode professional, institutional, or task-specific standards.
Medical benchmarks use rubrics to assess health-related accuracy, safety, reasoning, empathy, and usefulness under clinical or expert-informed standards.
- π [arXiv 2026.03] QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models [Code]
- [arXiv 2026.03] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
- π [arXiv 2026.02] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation [Code]
- [arXiv 2026.01] RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions
- [arXiv 2026.01] MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models
- π [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Proj]
Legal benchmarks use rubrics to evaluate legal reasoning, issue identification, evidence use, and procedural or substantive correctness.
- π [arXiv 2026.01] PLawBench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [Code]
- π [arXiv 2025.11] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics [Code]
Office labor benchmarks evaluate workplace tasks such as customer service, finance, hiring, productivity, and business workflows where outputs must satisfy operational criteria.
- [arXiv 2026.03] LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
- π [arXiv 2026.03] $OneMillion-Bench: How Far are Language Agents from Human Experts? [Data]
- π [arXiv 2026.03] XpertBench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
- [arXiv 2025.11] UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI
- π [arXiv 2025.11] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning [Proj]
- π [arXiv 2025.10] Benchmarking and Learning Real-World Customer Service Dialogue [Proj]
- π [ICLR 26] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [Code]
- [arXiv 2025.10] GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks [Data]
- π [ICLR 26] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists [Proj]
Academic benchmarks use rubrics to assess educational, scientific, or research tasks where outputs must satisfy discipline-specific correctness and presentation standards.
- π [arXiv 2026.03] PRBench: End-to-end Paper Reproduction in Physics Research [Code]
- [arXiv 2026.01] FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks [Data]
- [arXiv 2025.12] TechImage-Bench: Rubric-Based Evaluation for Technical Image Generation
- π [arXiv 2025.04] PaperBench: Evaluating AI's Ability to Replicate AI Research [Code]
Deep research benchmarks evaluate long-form research agents or reports, emphasizing evidence coverage, citation quality, logical support, completeness, and objectivity.
- π [arXiv 2026.03] MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome [Proj]
- π [arXiv 2026.01] DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report [Code]
- π [arXiv 2025.12] DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation [Code]
- π [ICLR 26] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [Proj]
- π [arXiv 2025.06] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [Code]
Creative generation benchmarks use rubrics to judge open-ended artifacts such as writing, tables, images, or videos along multiple quality dimensions.
- π [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Proj]
- π [arXiv 2026.01] UEval: A Benchmark for Unified Multimodal Generation [Proj]
- π [arXiv 2025.03] WritingBench: A Comprehensive Benchmark for Generative Writing [Code]
- π [arXiv 2024.09] HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models [Code]
General capability benchmarks test transferable abilities with rubric-based protocols, abstracting away from a single professional domain.
Agentic benchmarks evaluate planning, tool use, environment interaction, and long-horizon decision making with rubric-guided or process-aware assessment.
- [arXiv 2026.03] ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
- π [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]
- [arXiv 2025.10] TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use [Data]
- π [arXiv 2025.08] MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [Code]
- π [NeurIPS 24] AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [Proj]
Reasoning benchmarks use rubrics to assess reasoning quality when answers involve partial credit, social or moral concepts, explanations, or hard-to-verify intermediate logic.
- π [arXiv 2025.10] MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes [Proj]
- π [arXiv 2024.07] Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist [Code]
Alignment benchmarks assess whether models or judges follow intended preferences, instructions, safety constraints, and consistency principles under explicit criteria.
- [arXiv 2026.03] RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following
- π [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
- π [LREC 26] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas [Code]
- π [arXiv 2026.01] From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges [Code]
- π [ICLR 26] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [Code]
- π [ICLR 26] RM-R1: Reward Modeling as Reasoning [Proj]
- [ICLR 26] MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages [Data]
- π [arXiv 2025.11] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
- π [NeurIPS 25] XIFBench: Evaluating Large Language Models on Multilingual Instruction Following [Code]
- π [arXiv 2025.01] MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs [Code]
- π [NeurIPS 25] Generalizing Verifiable Instruction Following [Code]
- π [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
- π [EMNLP 24] SedarEval: Automated Evaluation using Self-Adaptive Rubrics [Code]
- π [arXiv 2024.10] Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [Code]
- π [ICLR 25] JudgeBench: A Benchmark for Evaluating LLM-based Judges [Code]
- π [arXiv 2024.02] A StrongREJECT for Empty Jailbreaks [Proj]
- π [arXiv 2024.01] InFoBench: Evaluating Instruction Following Ability in Large Language Models [Code]
Applications use rubrics as practical task interfaces: they guide generation, evaluation, training, or refinement in concrete systems rather than only proposing benchmarks.
Domain applications apply rubrics within specific task settings such as healthcare, writing, retrieval, deep research, code, and agent workflows.
Medical applications use rubrics to align healthcare models with clinical reasoning, patient safety, expert preferences, and domain-specific response standards.
- π [CVPR 26] OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis [Code]
- π [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
- [arXiv 2026.02] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
- [arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
- π [arXiv 2025.10] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training [Code]
- π [arXiv 2025.09] Baichuan-M2: Scaling Medical Capability with Large Verifier System [Code]
Writing and retrieval applications use rubrics to guide text generation, revision, explanation, document retrieval, and automated assessment of written outputs.
- π [arXiv 2025.09] Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval [Code]
- [arXiv 2025.08] Are Today's LLMs Ready to Explain Well-Being Concepts?
- π [ACL 24] LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts [Code]
- [ICTIR 24] Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems
DeepResearch applications use rubrics to supervise search, evidence chaining, report generation, and long-horizon research-agent optimization.
- π [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation [Model]
- π [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
- π [arXiv 2025.12] Step-DeepResearch Technical Report [Code]
- π [arXiv 2025.11] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [Code]
Code applications apply rubrics to software engineering agents, patch evaluation, code collaboration, and programming workflows.
- [arXiv 2026.03] StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
- π [arXiv 2026.01] Agentic Rubrics as Contextual Verifiers for SWE Agents [Proj]
General agentic applications use rubrics to coordinate, evaluate, or train agents across tool use, simulated worlds, interviews, and other open-ended environments.
- π [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning [Code]
- π [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
- π [arXiv 2026.01] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards [Code]
- π [arXiv 2026.01] ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking [Code]
- [arXiv 2026.01] SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models
- [arXiv 2025.12] ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment
Multimodal applications extend rubric supervision beyond text, using criteria to assess or train systems that combine language with vision, speech, or omni-modal signals.
Text-and-vision applications use rubrics for image or video generation, captioning, visual reasoning, and visual reward modeling.
- π [arXiv 2026.05] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria [Code]
- [arXiv 2026.04] Visual Preference Optimization with Rubric Rewards
- π [arXiv 2026.03] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models [Code]
- [arXiv 2026.03] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
- [arXiv 2025.11] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
- π [arXiv 2025.10] AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning [Code]
Text-and-audio applications use rubrics to evaluate or fine-tune speech-language systems across multiple raters, aspects, and quality dimensions.
- [arXiv 2026.03] Rubric-Guided Fine-tuning of SpeechLLMs for Multi-Aspect, Multi-Rater L2 Reading-Speech Assessment
- π [ACL 26] SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation [Code]
Omni-modal applications use rubric-grounded preference or reward modeling across multiple modalities within a unified training or evaluation framework.
- π [arXiv 2026.01] Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis [Code]
This project is licensed under the MIT License - see the LICENSE file for details.
If you have any questions or suggestions, please feel free to contact Hongru Xiao.



