Awesome-Rubrics

A curated reading list on rubric-based evaluation, reward modeling, and post-training for large models.
Rubrics turn expert judgment into structured criteria, auditable LLM judges, and trainable reward signals.

News

[2026-06] Our survey paper, Structuring Human Objectives: A Survey of Rubrics for Evaluation, Alignment, and Agentic AI, is now available: PDF.

Papers with publicly released code or project resources are marked with inline [[Code](...)] or [[Proj](...)] links. Entries without verified repositories omit that link.

Contributions are welcome. If you find missing papers, inaccurate classifications, or newly released code, feel free to update this list.

What are Rubrics?

In the context of LLM evaluation and alignment, a rubric is a structured set of criteria for judging open-ended model outputs. Instead of asking a human or LLM judge for one vague preference, rubrics decompose quality into explicit dimensions, scoring rules, and evidence requirements.

Rubrics make subjective judgment more inspectable:

What to judge: relevance, factuality, completeness, safety, reasoning quality, style, or domain-specific standards.
How to judge: score levels, checklists, pairwise criteria, evidence anchors, or weighted dimensions.
How to use the judgment: evaluation reports, LLM-as-a-Judge protocols, reward models, preference tuning, policy optimization, and curriculum learning.

Figure 1. Rubrics convert coarse feedback into fine-grained, inspectable reward signals.

Feedback style	Typical signal	Best fit	Main limitation
RLHF / model-based preference	"Output A is better than output B."	Open-ended comparison	Coarse and hard to inspect
RLVR / rule-based reward	Format is correct, answer matches, reasoning token appears, list structure exists	Verifiable tasks	Too rigid for subjective or open-ended tasks
Rubric-based feedback	Relevance, completeness, clarity, safety, each scored separately	Open-ended evaluation and training	Requires careful design and calibration

Rubrics are the middle layer: more structured than model-only preference, more flexible than hard rules.

Why Rubrics Matter Now

"In this new era, evaluation becomes more important than training."

Shunyu Yao, The Second Half (2025)

As large models move from closed-form QA to open-ended reasoning, agents, multimodal generation, and professional domains, progress is increasingly bottlenecked by evaluation and feedback design. Training can optimize only what the system can measure, and many important tasks cannot be reduced to a single scalar reward.

Rubrics help answer	Why it matters
What counts as good behavior?	They define explicit criteria, scoring boundaries, and failure modes.
How can expert judgment scale?	They convert tacit standards into reusable evaluation instructions and datasets.
How can LLM-as-a-Judge become less opaque?	Judges can be required to expose criteria, evidence, scores, and rationales.
How does evaluation become training signal?	Rubric-level feedback can supervise SFT, preference tuning, policy optimization, reward modeling, and curriculum learning.

Rubrics therefore act as a bridge between human standards and machine-optimizable signals. They are not merely annotation templates; they are a control surface for evaluation, reward modeling, and post-training.

Growing Research Momentum

Figure 2. The number of rubric-related papers has grown rapidly, suggesting increasing research attention to structured evaluation and reward design.

The rising trend shows that rubric-based methods are becoming an increasingly important direction for large-model alignment, especially as evaluation, reward modeling, and post-training move toward more structured and auditable feedback.

From Evaluation to Reward

Evaluation is no longer only a post-hoc metric. It is becoming part of the infrastructure of AI systems:

🧑‍⚖️ Expert Standards → 📋 Rubrics → 📊 Evaluation Signals → 🎯 Rewards → 🔁 Training Dynamics

Rubrics are therefore not just for judging model outputs. They provide a way to automate parts of expert feedback: experts define criteria, models apply them at scale, and failures reveal where the rubric or judge must be revised. In this sense, evaluation becomes an executable form of domain knowledge.

A Minimal Rubric Example

For the query:

How can cities encourage more people to use public transport?

a rubric does not directly ask "which answer is better?" It decomposes the judgment:

Component	What the judge checks
Relevance	Does the answer address public transport adoption rather than unrelated urban issues?
Clarity	Is the answer easy to understand and well organized?
Completeness	Does it cover affordability, convenience, infrastructure, reliability, and incentives?
Safety / fairness	Does it avoid harmful, biased, or exclusionary suggestions?

This makes the reward more interpretable, decomposable, and actionable.

Rubric Generation Strategies

Figure 3. Rubric construction paradigms for large model alignment.

Strategy	Core idea	When it is useful
Expert direct annotation	Experts write criteria explicitly.	High-stakes domains and seed rubrics
Induction from expert QA annotations	Criteria are extracted from annotated examples.	Scaling expert knowledge beyond manual templates
Distillation from teacher demonstrations	Rubrics are derived from high-quality model outputs.	Bootstrapping scalable reward signals

Together, these strategies show how rubric construction moves from manual specification toward data-driven induction and model-driven distillation.

Repository Map

This repository is organized as a conceptual map of rubric-related research. We group papers by the role rubrics play in the large-model pipeline.

This organization helps show rubrics not only as evaluation tools, but also as structured interfaces connecting expert standards, feedback data, reward signals, training objectives, and deployment-time assessment.

Section	Role in the repository
Data	Covers how rubrics are collected, generated, refined, and organized into reusable supervision signals through human annotation, synthetic generation, expert labeling, and rubric datasets.
Training	Summarizes how rubric-level judgments can be transformed into SFT data, preference objectives, RL rewards, curriculum signals, and self-improvement loops.
Evaluation	Connects rubrics to LLM-as-a-judge protocols, benchmark design, calibration, reliability analysis, and robustness checks, where explicit and auditable criteria are especially important.
Applications	Shows how rubric-based methods extend beyond text QA to multimodal tasks, agent systems, and professional domains that require domain-specific standards.

Overall, this structure follows the lifecycle of rubric-based large-model alignment:

Define criteria → collect or generate rubric data → train with rubric signals → evaluate with structured judges → apply in domain-specific tasks

Rubrics provide a structured layer for connecting data, training, evaluation, and applications.

Data

This section collects work on how rubrics are created, refined, and validated before use in evaluation or training.

Expert-Based Annotation

Expert-based annotation relies on domain specialists or expert-designed protocols when reliable rubrics require professional standards or task-specific expertise.

Expert Requirement

These works study tasks where expert knowledge is necessary to define what counts as a correct, safe, complete, or high-quality answer.

2026

🌟 [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Proj]
🌟 [arXiv 2026.01] PLAW BENCH: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [Code]

2025

🌟 [arXiv 2025.11] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning [Proj]
🌟 [ACL 26] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
🌟 [ICLR 26] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [Code]
🌟 [arXiv 2025.06] Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability [Proj]
🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Proj]

Expert Provider

These works release or organize expert-provided rubrics, checklists, or evaluation criteria as reusable assets for judging model outputs.

2026

🌟 [arXiv 2026.03] XpertBench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]

2025

🌟 [arXiv 2025.12] The AI Consumer Index (ACE) [Proj]
🌟 [ICLR 26] RESEARCH RUBRICS: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [Proj]
🌟 [arXiv 2025.09] The AI Productivity Index: APEX-v1-extended [Proj]
🌟 [ICLR 26] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists [Proj]

Model-Based Annotation

Model-based annotation uses LLMs or automated pipelines to construct rubric criteria at scale, reducing reliance on fully manual rubric authoring.

Naive Generation Analysis

Naive generation methods ask models to produce criteria or checklists directly from the task, prompt, answer, or context, without grounding them in preference pairs or iterative human correction.

2026

🌟 [arXiv 2026.03] Qworld: Question-Specific Evaluation Criteria for LLMs [Proj]
🌟 [arXiv 2026.03] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation [Data]
🌟 [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]

2025

🌟 [ACL 26] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Model]

Pairs-Grounded Generation

Pairs-grounded methods infer criteria from preferences, comparisons, or contrastive response pairs, converting implicit relative judgments into explicit rubric dimensions.

2026

🌟 [arXiv 2026.03] CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling [Code]
🌟 [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation [Model]

2025

🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
[arXiv 2025.10] Online Rubrics Elicitation from Pairwise Comparisons
🌟 [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
🌟 [arXiv 2025.06] AutoRule: Reasoning Chain-of-Thought Extracted Rule-Based Rewards Improve Preference Learning [Code]

2024

[ACL Findings 2025] CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling

Iterative Refinement

Iterative refinement methods repeatedly revise rubrics using feedback, disagreement, scoring errors, or self-reflection so that the criteria better match intended judgments.

2026

🌟 [ICLR 26] OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation [Code]
[arXiv 2026.03] Confusion-Aware Rubric Optimization for LLM-based Automated Grading
[ICLR 26] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
[arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
🌟 [ACL 26] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation [Code]

2025

🌟 [ICLR 26] RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks [Proj]

Human-AI Collaboration

Human-AI collaboration treats rubric construction as a joint process where humans guide, inspect, or correct model-generated criteria rather than fully delegating rubric design.

2026

🌟 [arXiv 2026.04] XpertBench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
🌟 [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]

2025

🌟 [ICLR 25] WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild [Code]

2024

🌟 [ICLR 24] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [Code]

Training

This section covers methods that turn rubrics into training signals for models, reward models, evaluators, and agents.

Pre-training

Pre-training work would use rubric-like supervision before task-specific alignment, preparing models for later rubric-guided evaluation or optimization.

No retained papers after full-text justification review.

Post-training

Post-training work applies rubrics after base-model training, using them for supervised fine-tuning, reward modeling, reinforcement learning, or self-improvement.

Rubrics for Supervised FT

Rubrics can guide supervised fine-tuning by filtering examples, weighting samples, generating rationales, or teaching models to follow explicit criteria.

2026

🌟 [arXiv 2026.01] P-Check: Advancing Personalized Reward Models via Learning to Generate Dynamic Checklists [Code]

2025

🌟 [ICLR 26] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [Code]
🌟 [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
🌟 [ICLR 26] RM-R1: Reward Modeling as Reasoning [Proj]

2023

🌟 [ICLR 24] Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [Code]

Rubrics for Preference-Reward RL

Preference-reward methods use rubrics or criteria to structure preference data and train reward models, rather than directly optimizing a hand-written rubric score.

2026

🌟 [arXiv 2026.05] Rubric-based On-policy Distillation [Code]
🌟 [arXiv 2026.04] C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences [Code]
[arXiv 2026.04] Visual Preference Optimization with Rubric Rewards
🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
🌟 [arXiv 2026.02] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric [Code]
🌟 [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]

2025

🌟 [ICML-W] Configurable Preference Tuning with Rubric-Guided Synthetic Data [Code]
🌟 [NeurIPS 25] Checklists Are Better Than Reward Models For Aligning Language Models [Code]

Rubrics for Direct-Reward RL

Direct-reward RL directly optimizes policies using rubric scores, checklist outcomes, verifier judgments, or criterion-level feedback as rewards.

Rubric Judgement Pattern

Rubric judgement pattern methods ask a judge, verifier, or rubric model to score outputs against explicit criteria and aggregate those judgments into rewards.

2025

🌟 [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
🌟 [arXiv 2025.11] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [Code]
🌟 [arXiv 2025.10] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs [Code]
🌟 [arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Model]
🌟 [arXiv 2025.08] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [Code]
🌟 [ICLR 26] Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation [Code]
🌟 [arXiv 2025.03] WritingBench: A Comprehensive Benchmark for Generative Writing [Code]

2024

🌟 [ACL 24] LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts [Code]

Rubric Grader Analysis

Rubric grader analysis studies the reliability, calibration, robustness, and failure modes of rubric graders used as reward or evaluation signals.

2026

🌟 [arXiv 2026.05] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria [Code]
[arXiv 2026.03] Alternating Reinforcement Learning with Contextual Rubric Rewards
[arXiv 2026.03] Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
[arXiv 2026.03] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
[arXiv 2026.02] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
🌟 [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
🌟 [arXiv 2026.01] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation [Code]

2025

🌟 [arXiv 2025.12] Training AI Co-Scientists Using Rubric Rewards [Data]
🌟 [arXiv 2025.10] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs [Code]
🌟 [arXiv 2025.03] Rubric Is All You Need: Improving LLM-Based Code Evaluation With Question-Specific Rubrics [Code]
[ICLR 26] QuRL: Rubrics As Judge For Open-Ended Question Answering

2024

🌟 [arXiv 2024.10] JudgeBench: A Benchmark for Evaluating LLM-based Judges [Code]

2023

🌟 [arXiv 2023.06] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [Code]

Multi-Objective Optimization

Multi-objective optimization treats rubric dimensions as separate reward components, balancing competing goals such as correctness, safety, helpfulness, and style.

2026

🌟 [arXiv 2026.03] PAPO: Stabilizing Rubric Integration Training via Decoupled Advantage Normalization [Code]
[arXiv 2026.03] Alternating Reinforcement Learning with Contextual Rubric Rewards
[arXiv 2026.02] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
🌟 [arXiv 2026.01] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [Proj]

2025

🌟 [arXiv 2025.11] Multidimensional Rubric-oriented Reward Model Learning via Geometric Projection Reference Constraints [Model]
🌟 [arXiv 2025.08] Reinforcement Learning with Rubric Anchors [Model]
[arXiv 2025.08] Pareto Multi-Objective Alignment for Language Models

2024

🌟 [arXiv 2024.06] Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [Model]

Credit Assignment

Credit assignment methods distribute rubric feedback from a final answer to intermediate steps, tokens, stages, or features so training receives denser supervision.

2026

🌟 [arXiv 2026.04] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks [Code]
🌟 [arXiv 2026.03] A Rubric-Supervised Critic from Sparse Real-World Outcomes [Code]
🌟 [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
🌟 [arXiv 2026.02] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability [Proj]
🌟 [arXiv 2026.02] Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [Code]
[arXiv 2026.02] Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning
🌟 [arXiv 2026.01] Technical Report Tongyi DeepResearch [Code]
🌟 [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]

2025

🌟 [arXiv 2025.12] Step-DeepResearch Technical Report [Code]
🌟 [arXiv 2025.12] TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning [Model]
🌟 [arXiv 2025.10] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards [Code]
[arXiv 2025.06] Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks

Agent Harness

Agent harness methods embed rubric-based judges, verifiers, or reward models inside long-horizon agent loops to guide planning, tool use, and process quality.

2026

[arXiv 2026.04] SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling
🌟 [arXiv 2026.04] Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents [Proj]
🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning [Code]
🌟 [arXiv 2026.01] Agentic Rubrics as Contextual Verifiers for SWE Agents [Proj]

2024

🌟 [COLM 24] Autonomous Evaluation and Refinement of Digital Agents [Code]

Rubrics for Advanced Training

Advanced training methods make rubrics dynamic training instruments, using them as curricula, evolving objectives, or guidance signals rather than fixed scoring sheets.

Curriculum Learning

Curriculum learning methods use rubric structure to order tasks, examples, or reward difficulty from easier criteria to harder ones.

2026

[arXiv 2026.02] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
[arXiv 2026.02] Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling

2025

🌟 [arXiv 2025.10] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training [Code]
🌟 [ICLR 26] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [Code]

Self-Evolving Learning

Self-evolving learning methods let rubrics change during training, using model failures, self-play, memory, or feedback to raise or adjust standards.

2026

[arXiv 2026.02] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
🌟 [arXiv 2026.02] Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics [Proj]
[arXiv 2026.02] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks

2025

🌟 [arXiv 2025.11] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [Code]
🌟 [arXiv 2025.10] Online Rubrics Elicitation from Pairwise Comparisons [Proj]
🌟 [arXiv 2025.08] Reinforcement Learning with Rubric Anchors [Model]

Hint-Based Learning

Hint-based learning uses rubrics as scaffolds, critiques, or guidance signals that point the learner toward missing criteria during optimization.

2026

[arXiv 2026.05] DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
[arXiv 2026.05] Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

2025

[arXiv 2025.11] Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
🌟 [arXiv 2025.08] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [Code]

Inference

This section covers rubric signals constructed or adapted at test time to guide judging, verification, refinement, or response selection without updating model weights.

Inference-Time Rubric Supervision

Inference-time rubric supervision generates or adapts criteria during deployment, using extra test-time computation to decompose, judge, and refine outputs when fixed references are unavailable.

2025

🌟 [arXiv 2025.09] Language Models that Think, Chat Better [Code]
[ICML 26] Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
🌟 [arXiv 2025.04] Inference-Time Scaling for Generalist Reward Modeling [Model]

Risk

This section focuses on reward-hacking failures that arise when models optimize against imperfect rubric rewards or rubric-based judges.

Rubric Reward Hacking

Rubric reward hacking work studies how models exploit loopholes, misspecified criteria, judge artifacts, or weak reward tails to obtain high rubric scores without genuinely better behavior.

2026

[arXiv 2026.05] Reward Hacking in Rubric-Based Reinforcement Learning
[arXiv 2026.04] RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
[arXiv 2026.03] Comparing Developer and LLM Biases in Code Evaluation
🌟 [arXiv 2026.02] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges [Code]
[arXiv 2026.02] Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge

2025

[arXiv 2025.06] Robust Reward Modeling via Causal Rubrics

2024

[NeurIPS 24] Rule Based Rewards for Language Model Safety

Evaluation

Rubric-based evaluation work introduces benchmarks, datasets, or protocols that make open-ended model behavior comparable through explicit criteria and structured scoring.

Real-World Tasks

Real-world task benchmarks evaluate performance in concrete domains where rubric criteria encode professional, institutional, or task-specific standards.

Medical

Medical benchmarks use rubrics to assess health-related accuracy, safety, reasoning, empathy, and usefulness under clinical or expert-informed standards.

2026

🌟 [arXiv 2026.03] QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models [Code]
[arXiv 2026.03] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
🌟 [arXiv 2026.02] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation [Code]
[arXiv 2026.01] RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions
[arXiv 2026.01] MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models

2025

🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Proj]

Legal

Legal benchmarks use rubrics to evaluate legal reasoning, issue identification, evidence use, and procedural or substantive correctness.

2026

🌟 [arXiv 2026.01] PLAW BENCH: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [Code]

2025

🌟 [arXiv 2025.11] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics [Code]

Office Labor

Office labor benchmarks evaluate workplace tasks such as customer service, finance, hiring, productivity, and business workflows where outputs must satisfy operational criteria.

2026

[arXiv 2026.03] LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
🌟 [arXiv 2026.03] $OneMillion-Bench: How Far are Language Agents from Human Experts? [Data]
🌟 [arXiv 2026.03] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]

2025

[arXiv 2025.11] UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI
🌟 [arXiv 2025.11] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning [Proj]
🌟 [arXiv 2025.10] Benchmarking and Learning Real-World Customer Service Dialogue [Proj]
🌟 [ICLR 26] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [Code]
[arXiv 2025.10] GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks [Data]
🌟 [ICLR 26] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists [Proj]

Academic

Academic benchmarks use rubrics to assess educational, scientific, or research tasks where outputs must satisfy discipline-specific correctness and presentation standards.

2026

🌟 [arXiv 2026.03] PRBench: End-to-end Paper Reproduction in Physics Research [Code]
[arXiv 2026.01] FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks [Data]

2025

[arXiv 2025.12] TechImage-Bench: Rubric-Based Evaluation for Technical Image Generation
🌟 [arXiv 2025.04] PaperBench: Evaluating AI's Ability to Replicate AI Research [Code]

Deep Research

Deep research benchmarks evaluate long-form research agents or reports, emphasizing evidence coverage, citation quality, logical support, completeness, and objectivity.

2026

🌟 [arXiv 2026.03] MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome [Proj]
🌟 [arXiv 2026.01] DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report [Code]

2025

🌟 [arXiv 2025.12] DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation [Code]
🌟 [ICLR 26] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [Proj]
🌟 [arXiv 2025.06] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [Code]

Creative Generation

Creative generation benchmarks use rubrics to judge open-ended artifacts such as writing, tables, images, or videos along multiple quality dimensions.

2026

🌟 [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Proj]
🌟 [arXiv 2026.01] UEval: A Benchmark for Unified Multimodal Generation [Proj]

2025

🌟 [arXiv 2025.03] WritingBench: A Comprehensive Benchmark for Generative Writing [Code]

2024

🌟 [arXiv 2024.09] HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models [Code]

General Capability Evaluation

General capability benchmarks test transferable abilities with rubric-based protocols, abstracting away from a single professional domain.

Agentic

Agentic benchmarks evaluate planning, tool use, environment interaction, and long-horizon decision making with rubric-guided or process-aware assessment.

2026

[arXiv 2026.03] ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
🌟 [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]

2025

[arXiv 2025.10] TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use [Data]
🌟 [arXiv 2025.08] MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [Code]

2024

🌟 [NeurIPS 24] AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [Proj]

Reasoning

Reasoning benchmarks use rubrics to assess reasoning quality when answers involve partial credit, social or moral concepts, explanations, or hard-to-verify intermediate logic.

2025

🌟 [arXiv 2025.10] MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes [Proj]

2024

🌟 [arXiv 2024.07] Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist [Code]

Alignment

Alignment benchmarks assess whether models or judges follow intended preferences, instructions, safety constraints, and consistency principles under explicit criteria.

2026

[arXiv 2026.03] RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following
🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
🌟 [LREC 26] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas [Code]
🌟 [arXiv 2026.01] From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges [Code]
🌟 [ICLR 26] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [Code]
🌟 [ICLR 26] RM-R1: Reward Modeling as Reasoning [Proj]
[ICLR 26] MENLO: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages [Data]

2025

🌟 [arXiv 2025.11] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
🌟 [NeurIPS 25] XIFBench: Evaluating Large Language Models on Multilingual Instruction Following [Code]
🌟 [arXiv 2025.01] MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs [Code]
🌟 [NeurIPS 25] Generalizing Verifiable Instruction Following [Code]
🌟 [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]

2024

🌟 [EMNLP 24] SedarEval: Automated Evaluation using Self-Adaptive Rubrics [Code]
🌟 [arXiv 2024.10] Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [Code]
🌟 [ICLR 25] JudgeBench: A Benchmark for Evaluating LLM-based Judges [Code]
🌟 [arXiv 2024.02] A StrongREJECT for Empty Jailbreaks [Proj]
🌟 [arXiv 2024.01] InFoBench: Evaluating Instruction Following Ability in Large Language Models [Code]

Applications

Applications use rubrics as practical task interfaces: they guide generation, evaluation, training, or refinement in concrete systems rather than only proposing benchmarks.

Domain

Domain applications apply rubrics within specific task settings such as healthcare, writing, retrieval, deep research, code, and agent workflows.

Medical

Medical applications use rubrics to align healthcare models with clinical reasoning, patient safety, expert preferences, and domain-specific response standards.

2026

🌟 [CVPR 26] OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis [Code]
🌟 [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
[arXiv 2026.02] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
[arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

2025

🌟 [arXiv 2025.10] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training [Code]
🌟 [arXiv 2025.09] Baichuan-M2: Scaling Medical Capability with Large Verifier System [Code]

Writing and Retrieval

Writing and retrieval applications use rubrics to guide text generation, revision, explanation, document retrieval, and automated assessment of written outputs.

2025

🌟 [arXiv 2025.09] Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval [Code]
[arXiv 2025.08] Are Today's LLMs Ready to Explain Well-Being Concepts?

2024

🌟 [ACL 24] LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts [Code]
[ICTIR 24] Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems

DeepResearch

DeepResearch applications use rubrics to supervise search, evidence chaining, report generation, and long-horizon research-agent optimization.

2026

🌟 [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation [Model]
🌟 [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]

2025

🌟 [arXiv 2025.12] Step-DeepResearch Technical Report [Code]
🌟 [arXiv 2025.11] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [Code]

Code

Code applications apply rubrics to software engineering agents, patch evaluation, code collaboration, and programming workflows.

2026

[arXiv 2026.03] StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
🌟 [arXiv 2026.01] Agentic Rubrics as Contextual Verifiers for SWE Agents [Proj]

General Agentic

General agentic applications use rubrics to coordinate, evaluate, or train agents across tool use, simulated worlds, interviews, and other open-ended environments.

2026

🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning [Code]
🌟 [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
🌟 [arXiv 2026.01] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards [Code]
🌟 [arXiv 2026.01] ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking [Code]
[arXiv 2026.01] SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

2025

[arXiv 2025.12] ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment

Multimodal

Multimodal applications extend rubric supervision beyond text, using criteria to assess or train systems that combine language with vision, speech, or omni-modal signals.

Text + Vision

Text-and-vision applications use rubrics for image or video generation, captioning, visual reasoning, and visual reward modeling.

2026

🌟 [arXiv 2026.05] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria [Code]
[arXiv 2026.04] Visual Preference Optimization with Rubric Rewards
🌟 [arXiv 2026.03] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models [Code]
[arXiv 2026.03] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning

2025

[arXiv 2025.11] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
🌟 [arXiv 2025.10] AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning [Code]

Text + Audio

Text-and-audio applications use rubrics to evaluate or fine-tune speech-language systems across multiple raters, aspects, and quality dimensions.

2026

[arXiv 2026.03] Rubric-Guided Fine-tuning of SpeechLLMs for Multi-Aspect, Multi-Rater L2 Reading-Speech Assessment

2025

🌟 [ACL 26] SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation [Code]

Omni-modal

Omni-modal applications use rubric-grounded preference or reward modeling across multiple modalities within a unified training or evaluation framework.

2026

🌟 [arXiv 2026.01] Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis [Code]

LICENSE

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

If you have any questions or suggestions, please feel free to contact Hongru Xiao.

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
utils		utils
README.md		README.md
Rubric_Survey_Paper.pdf		Rubric_Survey_Paper.pdf

Folders and files

Latest commit

History

Repository files navigation

Awesome-Rubrics

News

What are Rubrics?

Why Rubrics Matter Now

Growing Research Momentum

From Evaluation to Reward

A Minimal Rubric Example

Rubric Generation Strategies

Repository Map

Table of Contents

Data

Expert-Based Annotation

Expert Requirement

2026

2025

Expert Provider

2026

2025

Model-Based Annotation

Naive Generation Analysis

2026

2025

Pairs-Grounded Generation

2026

2025

2024

Iterative Refinement

2026

2025

Human-AI Collaboration

2026

2025

2024

Training

Pre-training

Post-training

Rubrics for Supervised FT

2026

2025

2023

Rubrics for Preference-Reward RL

2026

2025

Rubrics for Direct-Reward RL

Rubric Judgement Pattern

2025

2024

Rubric Grader Analysis

2026

2025

2024

2023

Multi-Objective Optimization

2026

2025

2024

Credit Assignment

2026

2025

Agent Harness

2026

2024

Rubrics for Advanced Training

Curriculum Learning

2026

2025

Self-Evolving Learning

2026

2025

Hint-Based Learning

2026

2025

Inference

Inference-Time Rubric Supervision

2025

Risk