FreedomIntelligence/Awesome-Rubrics

Awesome-Rubrics


A curated reading list on rubric-based evaluation, reward modeling, and post-training for large models.
Rubrics turn expert judgment into structured criteria, auditable LLM judges, and trainable reward signals.


News

  • [2026-06] Our survey paper, Structuring Human Objectives: A Survey of Rubrics for Evaluation, Alignment, and Agentic AI, is now available: PDF.

Taxonomy of rubric-centered supervision


Papers with publicly released code or project resources are marked with an inline [Code] or [Proj] link. Entries without a verified repository omit the link.

Contributions are welcome. If you find missing papers, inaccurate classifications, or newly released code, feel free to update this list.

What are Rubrics?

In the context of LLM evaluation and alignment, a rubric is a structured set of criteria for judging open-ended model outputs. Instead of asking a human or LLM judge for one vague preference, rubrics decompose quality into explicit dimensions, scoring rules, and evidence requirements.

Rubrics make subjective judgment more inspectable:

  • What to judge: relevance, factuality, completeness, safety, reasoning quality, style, or domain-specific standards.
  • How to judge: score levels, checklists, pairwise criteria, evidence anchors, or weighted dimensions.
  • How to use the judgment: evaluation reports, LLM-as-a-Judge protocols, reward models, preference tuning, policy optimization, and curriculum learning.

Figure 1. Rubrics convert coarse feedback into fine-grained, inspectable reward signals.

| Feedback style | Typical signal | Best fit | Main limitation |
| --- | --- | --- | --- |
| RLHF / model-based preference | "Output A is better than output B." | Open-ended comparison | Coarse and hard to inspect |
| RLVR / rule-based reward | Format is correct, answer matches, reasoning token appears, list structure exists | Verifiable tasks | Too rigid for subjective or open-ended tasks |
| Rubric-based feedback | Relevance, completeness, clarity, safety, each scored separately | Open-ended evaluation and training | Requires careful design and calibration |

Rubrics are the middle layer: more structured than model-only preference, more flexible than hard rules.

Why Rubrics Matter Now

"In this new era, evaluation becomes more important than training."

As large models move from closed-form QA to open-ended reasoning, agents, multimodal generation, and professional domains, progress is increasingly bottlenecked by evaluation and feedback design. Training can optimize only what the system can measure, and many important tasks cannot be reduced to a single scalar reward.

| Rubrics help answer | Why it matters |
| --- | --- |
| What counts as good behavior? | They define explicit criteria, scoring boundaries, and failure modes. |
| How can expert judgment scale? | They convert tacit standards into reusable evaluation instructions and datasets. |
| How can LLM-as-a-Judge become less opaque? | Judges can be required to expose criteria, evidence, scores, and rationales. |
| How does evaluation become training signal? | Rubric-level feedback can supervise SFT, preference tuning, policy optimization, reward modeling, and curriculum learning. |

Rubrics therefore act as a bridge between human standards and machine-optimizable signals. They are not merely annotation templates; they are a control surface for evaluation, reward modeling, and post-training.

Growing Research Momentum

Figure 2. The number of rubric-related papers has grown rapidly, suggesting increasing research attention to structured evaluation and reward design.

This trend suggests that rubric-based methods are becoming an important direction for large-model alignment, especially as evaluation, reward modeling, and post-training move toward more structured and auditable feedback.

From Evaluation to Reward

Evaluation is no longer only a post-hoc metric. It is becoming part of the infrastructure of AI systems:

🧑‍⚖️ Expert Standards → 📋 Rubrics → 📊 Evaluation Signals → 🎯 Rewards → 🔁 Training Dynamics

Rubrics are therefore not just for judging model outputs. They provide a way to automate parts of expert feedback: experts define criteria, models apply them at scale, and failures reveal where the rubric or judge must be revised. In this sense, evaluation becomes an executable form of domain knowledge.

A Minimal Rubric Example

For the query:

How can cities encourage more people to use public transport?

a rubric does not directly ask "which answer is better?" It decomposes the judgment:

| Component | What the judge checks |
| --- | --- |
| Relevance | Does the answer address public transport adoption rather than unrelated urban issues? |
| Clarity | Is the answer easy to understand and well organized? |
| Completeness | Does it cover affordability, convenience, infrastructure, reliability, and incentives? |
| Safety / fairness | Does it avoid harmful, biased, or exclusionary suggestions? |

This makes the reward more interpretable, decomposable, and actionable.
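
As a rough illustration, the rubric above could be encoded as a small data structure with a weighted score. The weights, the 0-4 scale, and the names `Criterion`, `RUBRIC`, and `weighted_score` are assumptions made for this sketch; none of them come from a specific paper in this list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    question: str  # what the judge checks
    weight: float  # hypothetical relative importance


# Criteria follow the example rubric; the weights are illustrative only.
RUBRIC = [
    Criterion("relevance", "Does the answer address public transport adoption?", 0.3),
    Criterion("clarity", "Is the answer easy to understand and well organized?", 0.2),
    Criterion("completeness", "Does it cover affordability, convenience, "
              "infrastructure, reliability, and incentives?", 0.3),
    Criterion("safety_fairness", "Does it avoid harmful, biased, or "
              "exclusionary suggestions?", 0.2),
]

MAX_LEVEL = 4  # assumed 0-4 scale for every criterion


def weighted_score(levels: dict[str, int]) -> float:
    """Combine per-criterion levels into a single score in [0, 1]."""
    total_weight = sum(c.weight for c in RUBRIC)
    total = 0.0
    for c in RUBRIC:
        level = levels[c.name]
        if not 0 <= level <= MAX_LEVEL:
            raise ValueError(f"{c.name}: level {level} outside 0..{MAX_LEVEL}")
        total += c.weight * level / MAX_LEVEL
    return total / total_weight


# Levels a judge might assign to one answer (made-up values).
judged = {"relevance": 4, "clarity": 3, "completeness": 2, "safety_fairness": 4}
print(round(weighted_score(judged), 3))
```

Keeping the per-criterion levels alongside the aggregate is what makes the signal inspectable: a low score can be traced to the criterion that caused it.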

Rubric Generation Strategies

Figure 3. Rubric construction paradigms for large model alignment.

| Strategy | Core idea | When it is useful |
| --- | --- | --- |
| Expert direct annotation | Experts write criteria explicitly. | High-stakes domains and seed rubrics |
| Induction from expert QA annotations | Criteria are extracted from annotated examples. | Scaling expert knowledge beyond manual templates |
| Distillation from teacher demonstrations | Rubrics are derived from high-quality model outputs. | Bootstrapping scalable reward signals |

Together, these strategies show how rubric construction moves from manual specification toward data-driven induction and model-driven distillation.

Repository Map

This repository is organized as a conceptual map of rubric-related research. We group papers by the role rubrics play in the large-model pipeline.

This organization helps show rubrics not only as evaluation tools, but also as structured interfaces connecting expert standards, feedback data, reward signals, training objectives, and deployment-time assessment.

| Section | Role in the repository |
| --- | --- |
| Data | Covers how rubrics are collected, generated, refined, and organized into reusable supervision signals through human annotation, synthetic generation, expert labeling, and rubric datasets. |
| Training | Summarizes how rubric-level judgments can be transformed into SFT data, preference objectives, RL rewards, curriculum signals, and self-improvement loops. |
| Evaluation | Connects rubrics to LLM-as-a-judge protocols, benchmark design, calibration, reliability analysis, and robustness checks, where explicit and auditable criteria are especially important. |
| Applications | Shows how rubric-based methods extend beyond text QA to multimodal tasks, agent systems, and professional domains that require domain-specific standards. |

Overall, this structure follows the lifecycle of rubric-based large-model alignment:

Define criteria → collect or generate rubric data → train with rubric signals → evaluate with structured judges → apply in domain-specific tasks

Rubrics provide a structured layer for connecting data, training, evaluation, and applications.

Papers with publicly released code are marked with 🌟.

Data

This section collects work on how rubrics are created, refined, and validated before use in evaluation or training.

Expert-Based Annotation

Expert-based annotation relies on domain specialists or expert-designed protocols when reliable rubrics require professional standards or task-specific expertise.

Expert Requirement

These works study tasks where expert knowledge is necessary to define what counts as a correct, safe, complete, or high-quality answer.

2026
  • 🌟 [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Proj]
    Expert Requirement Creative Generation
  • 🌟 [arXiv 2026.01] PLAW BENCH: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [Code]
    Expert Requirement Legal
2025
  • 🌟 [arXiv 2025.11] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning [Proj]
    Expert Requirement Office Labor
  • 🌟 [ACL 26] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
    Expert Requirement Alignment
  • 🌟 [ICLR 26] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [Code]
    Expert Requirement Office Labor
  • 🌟 [arXiv 2025.06] Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability [Proj]
    Expert Requirement
  • 🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Proj]
    Expert Requirement Medical

Expert Provider

These works release or organize expert-provided rubrics, checklists, or evaluation criteria as reusable assets for judging model outputs.

2026
  • 🌟 [arXiv 2026.03] XpertBench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
    Expert Provider Human-AI Collaboration
  • 🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
    Expert Provider Human-AI Collaboration Alignment
2025
  • 🌟 [arXiv 2025.12] The AI Consumer Index (ACE) [Proj]
    Expert Provider
  • 🌟 [ICLR 26] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [Proj]
    Expert Provider
  • 🌟 [arXiv 2025.09] The AI Productivity Index: APEX-v1-extended [Proj]
    Expert Provider
  • 🌟 [ICLR 26] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists [Proj]
    Expert Provider Office Labor

Model-Based Annotation

Model-based annotation uses LLMs or automated pipelines to construct rubric criteria at scale, reducing reliance on fully manual rubric authoring.

Naive Generation Analysis

Naive generation methods ask models to produce criteria or checklists directly from the task, prompt, answer, or context, without grounding them in preference pairs or iterative human correction.

2026
  • 🌟 [arXiv 2026.03] Qworld: Question-Specific Evaluation Criteria for LLMs [Proj]
    Naive Generation Analysis
  • 🌟 [arXiv 2026.03] RubricRAG: Towards Interpretable and Reliable LLM Evaluation via Domain Knowledge Retrieval for Rubric Generation [Data]
    Naive Generation Analysis
  • 🌟 [arXiv 2026.03] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge [Code]
    Naive Generation Analysis
2025
  • 🌟 [ACL 26] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Model]
    Naive Generation Analysis Rubric Judgement Pattern

Pairs-Grounded Generation

Pairs-grounded methods infer criteria from preferences, comparisons, or contrastive response pairs, converting implicit relative judgments into explicit rubric dimensions.

2026
  • 🌟 [arXiv 2026.03] CDRRM: Contrast-Driven Rubric Generation for Reliable and Interpretable Reward Modeling [Code]
    Pairs-Grounded Generation
  • 🌟 [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation [Model]
    Pairs-Grounded Generation DeepResearch
2025
  • 🌟 [arXiv 2025.10] From Implicit Weights to Explicit Rubrics: A Training-Free Framework for Reward Modeling [Code]
    Pairs-Grounded Generation
  • [arXiv 2025.10] Online Rubrics Elicitation from Pairwise Comparisons
    Pairs-Grounded Generation Self-Evolving Learning
  • 🌟 [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
    Pairs-Grounded Generation Rubric Judgement Pattern
  • 🌟 [arXiv 2025.06] AutoRule: Reasoning Chain-of-Thought Extracted Rule-Based Rewards Improve Preference Learning [Code]
    Pairs-Grounded Generation
2024
  • [ACL Findings 2025] CARMO: Dynamic Criteria Generation for Context Aware Reward Modelling
    Pairs-Grounded Generation

Iterative Refinement

Iterative refinement methods repeatedly revise rubrics using feedback, disagreement, scoring errors, or self-reflection so that the criteria better match intended judgments.

2026
  • 🌟 [ICLR 26] OptimSyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation [Code]
    Iterative Refinement
  • [arXiv 2026.03] Confusion-Aware Rubric Optimization for LLM-based Automated Grading
    Iterative Refinement
  • [ICLR 26] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
    Iterative Refinement Self-Evolving Learning
  • [arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
    Iterative Refinement Medical
  • 🌟 [ACL 26] RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation [Code]
    Iterative Refinement
2025
  • 🌟 [ICLR 26] RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks [Proj]
    Iterative Refinement

Human-AI Collaboration

Human-AI collaboration treats rubric construction as a joint process where humans guide, inspect, or correct model-generated criteria rather than fully delegating rubric design.

2026

  • 🌟 [arXiv 2026.04] XpertBench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
    Expert Provider Human-AI Collaboration
  • 🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
    Expert Provider Human-AI Collaboration Alignment
  • 🌟 [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
    Human-AI Collaboration Medical

2025

  • 🌟 [ICLR 25] WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild [Code]
    Human-AI Collaboration

2024

  • 🌟 [ICLR 24] FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [Code]
    Human-AI Collaboration

Training

This section covers methods that turn rubrics into training signals for models, reward models, evaluators, and agents.

Pre-training

Pre-training work would use rubric-like supervision before task-specific alignment, preparing models for later rubric-guided evaluation or optimization.

  • No papers were retained after full-text justification review.

Post-training

Post-training work applies rubrics after base-model training, using them for supervised fine-tuning, reward modeling, reinforcement learning, or self-improvement.

Rubrics for Supervised FT

Rubrics can guide supervised fine-tuning by filtering examples, weighting samples, generating rationales, or teaching models to follow explicit criteria.

2026
  • 🌟 [arXiv 2026.01] P-Check: Advancing Personalized Reward Models via Learning to Generate Dynamic Checklists [Code]
    Supervised Fine-Tuning
2025
  • 🌟 [ICLR 26] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [Code]
    Supervised Fine-Tuning Alignment
  • 🌟 [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
    Supervised Fine-Tuning Alignment
  • 🌟 [ICLR 26] RM-R1: Reward Modeling as Reasoning [Proj]
    Supervised Fine-Tuning Alignment
2023
  • 🌟 [ICLR 24] Prometheus: Inducing Fine-grained Evaluation Capability in Language Models [Code]
    Supervised Fine-Tuning

Rubrics for Preference-Reward RL

Preference-reward methods use rubrics or criteria to structure preference data and train reward models, rather than directly optimizing a hand-written rubric score.

2026
  • 🌟 [arXiv 2026.05] Rubric-based On-policy Distillation [Code]
    Preference-Reward RL
  • 🌟 [arXiv 2026.04] C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences [Code]
    Preference-Reward RL
  • [arXiv 2026.04] Visual Preference Optimization with Rubric Rewards
    Preference-Reward RL Text + Vision
  • 🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation [Code]
    Preference-Reward RL
  • 🌟 [arXiv 2026.02] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric [Code]
    Preference-Reward RL
  • 🌟 [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
    Preference-Reward RL Rubric Grader Analysis
2025
  • 🌟 [ICML-W] Configurable Preference Tuning with Rubric-Guided Synthetic Data [Code]
    Preference-Reward RL
  • 🌟 [NeurIPS 25] Checklists Are Better Than Reward Models For Aligning Language Models [Code]
    Preference-Reward RL

Rubrics for Direct-Reward RL

Direct-reward RL directly optimizes policies using rubric scores, checklist outcomes, verifier judgments, or criterion-level feedback as rewards.

Rubric Judgement Pattern

Rubric judgement pattern methods ask a judge, verifier, or rubric model to score outputs against explicit criteria and aggregate those judgments into rewards.
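
The pattern described above can be sketched as a function that applies a judge to every criterion and averages the verdicts into a scalar reward. This is a minimal sketch, not the method of any paper listed here: `rubric_reward`, the `Judge` type, and the keyword-matching `toy_judge` are hypothetical, and a real system would replace `toy_judge` with an LLM or verifier call.

```python
from typing import Callable

# A judge maps (response, criterion) to a pass/fail verdict.
# Hypothetical interface; in practice this would wrap an LLM or verifier.
Judge = Callable[[str, str], bool]


def rubric_reward(
    response: str, criteria: list[str], judge: Judge
) -> tuple[float, dict[str, bool]]:
    """Judge a response against each criterion and average the verdicts."""
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    verdicts = {c: judge(response, c) for c in criteria}
    reward = sum(verdicts.values()) / len(criteria)
    return reward, verdicts


def toy_judge(response: str, criterion: str) -> bool:
    """Placeholder judge: a criterion passes if its keyword appears."""
    return criterion.lower() in response.lower()


criteria = ["fares", "frequency", "bike lanes"]
reward, verdicts = rubric_reward(
    "Lower fares and raise bus frequency on busy routes.", criteria, toy_judge
)
print(reward, verdicts)
```

Returning the per-criterion verdicts next to the reward keeps the signal auditable; uniform averaging is only one possible aggregation, and weighted or gated schemes fit the same interface.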

2025
  • 🌟 [ICLR 26] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [Data]
    Pairs-Grounded Generation Rubric Judgement Pattern
  • 🌟 [arXiv 2025.11] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [Code]
    Rubric Judgement Pattern Self-Evolving Learning DeepResearch
  • 🌟 [arXiv 2025.10] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs [Code]
    Rubric Judgement Pattern Rubric Grader Analysis
  • 🌟 [arXiv 2025.10] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment [Model]
    Naive Generation Analysis Rubric Judgement Pattern
  • 🌟 [arXiv 2025.08] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [Code]
    Rubric Judgement Pattern Hint-Based Learning
  • 🌟 [ICLR 26] Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation [Code]
    Rubric Judgement Pattern
  • 🌟 [arXiv 2025.03] WritingBench: A Comprehensive Benchmark for Generative Writing [Code]
    Rubric Judgement Pattern Creative Generation
2024
  • 🌟 [ACL 24] LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts [Code]
    Rubric Judgement Pattern Writing & Retrieval

Rubric Grader Analysis

Rubric grader analysis studies the reliability, calibration, robustness, and failure modes of rubric graders used as reward or evaluation signals.
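
One common way to quantify grader reliability is chance-corrected agreement with human labels on the same criterion. The sketch below computes Cohen's kappa in plain Python; the function name `cohen_kappa` and the label data are made up for illustration and do not come from any paper in this section.

```python
from collections import Counter


def cohen_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Pass/fail verdicts on one criterion for eight responses (illustrative data).
human = [1, 1, 0, 1, 0, 0, 1, 0]
grader = [1, 1, 0, 0, 0, 1, 1, 0]
print(round(cohen_kappa(human, grader), 3))
```

Raw agreement here is 0.75, but half of that is expected by chance given the balanced labels, which is why a chance-corrected statistic is usually reported alongside accuracy.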

2026
  • 🌟 [arXiv 2026.05] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria [Code]
    Rubric Grader Analysis Text + Vision
  • [arXiv 2026.03] Alternating Reinforcement Learning with Contextual Rubric Rewards
    Rubric Grader Analysis Multi-Objective Optimization
  • [arXiv 2026.03] Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
    Rubric Grader Analysis
  • [arXiv 2026.03] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
    Rubric Grader Analysis
  • [arXiv 2026.02] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
    Rubric Grader Analysis Self-Evolving Learning
  • 🌟 [arXiv 2026.02] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training [Model]
    Preference-Reward RL Rubric Grader Analysis
  • 🌟 [arXiv 2026.01] RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation [Code]
    Rubric Grader Analysis
2025
  • 🌟 [arXiv 2025.12] Training AI Co-Scientists Using Rubric Rewards [Data]
    Rubric Grader Analysis
  • 🌟 [arXiv 2025.10] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs [Code]
    Rubric Judgement Pattern Rubric Grader Analysis
  • 🌟 [arXiv 2025.03] Rubric Is All You Need: Improving LLM-Based Code Evaluation With Question-Specific Rubrics [Code]
    Rubric Grader Analysis
  • [ICLR 26] QuRL: Rubrics As Judge For Open-Ended Question Answering
    Rubric Grader Analysis
2024
  • 🌟 [arXiv 2024.10] JudgeBench: A Benchmark for Evaluating LLM-based Judges [Code]
    Rubric Grader Analysis Alignment
2023
  • 🌟 [arXiv 2023.06] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [Code]
    Rubric Grader Analysis

Multi-Objective Optimization

Multi-objective optimization treats rubric dimensions as separate reward components, balancing competing goals such as correctness, safety, helpfulness, and style.
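
One simple way to keep rubric dimensions separate is to normalize each dimension within a group of sampled responses before combining them, so that a dimension with a large numeric range does not drown out the others. The sketch below is a generic illustration under that assumption; `normalize_per_dimension` and the sample scores are hypothetical, and it is not the algorithm of any specific paper listed in this section.

```python
from statistics import mean, pstdev


def normalize_per_dimension(
    group: list[dict[str, float]], eps: float = 1e-8
) -> list[float]:
    """Z-score every rubric dimension within the group, then sum per response."""
    if not group:
        raise ValueError("group must contain at least one response")
    dims = list(group[0].keys())
    stats = {
        d: (mean([r[d] for r in group]), pstdev([r[d] for r in group]))
        for d in dims
    }
    return [
        sum((r[d] - stats[d][0]) / (stats[d][1] + eps) for d in dims)
        for r in group
    ]


# Two sampled responses scored on two dimensions with different ranges.
group = [
    {"correctness": 1.0, "style": 0.2},
    {"correctness": 0.0, "style": 0.4},
]
raw_sums = [sum(r.values()) for r in group]
advantages = normalize_per_dimension(group)
print(raw_sums, [round(a, 3) for a in advantages])
```

In this toy case the raw sums favor the first response only because correctness spans a wider range; after per-dimension normalization the two responses tie, since each is better on exactly one dimension.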

2026
  • 🌟 [arXiv 2026.03] PAPO: Stabilizing Rubric Integration Training via Decoupled Advantage Normalization [Code]
    Multi-Objective Optimization
  • [arXiv 2026.03] Alternating Reinforcement Learning with Contextual Rubric Rewards
    Rubric Grader Analysis Multi-Objective Optimization
  • [arXiv 2026.02] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
    Multi-Objective Optimization Medical
  • 🌟 [arXiv 2026.01] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [Proj]
    Multi-Objective Optimization
2025
  • 🌟 [arXiv 2025.11] Multidimensional Rubric-oriented Reward Model Learning via Geometric Projection Reference Constraints [Model]
    Multi-Objective Optimization
  • 🌟 [arXiv 2025.08] Reinforcement Learning with Rubric Anchors [Model]
    Multi-Objective Optimization Self-Evolving Learning
  • [arXiv 2025.08] Pareto Multi-Objective Alignment for Language Models
    Multi-Objective Optimization
2024
  • 🌟 [arXiv 2024.06] Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts [Model]
    Multi-Objective Optimization

Credit Assignment

Credit assignment methods distribute rubric feedback from a final answer to intermediate steps, tokens, stages, or features so training receives denser supervision.
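
To make the idea concrete, the sketch below spreads each criterion's score over the trajectory steps judged responsible for it, and falls back to the final step when no attribution is available. The attribution map, the equal split, and the name `assign_step_rewards` are assumptions for illustration; the papers listed here use their own, more elaborate schemes.

```python
def assign_step_rewards(
    num_steps: int,
    criterion_scores: dict[str, float],
    attribution: dict[str, list[int]],
) -> list[float]:
    """Split each criterion's score equally across its attributed steps."""
    if num_steps <= 0:
        raise ValueError("trajectory must have at least one step")
    rewards = [0.0] * num_steps
    for name, score in criterion_scores.items():
        # Unattributed criteria are credited to the final step (assumption).
        steps = attribution.get(name) or [num_steps - 1]
        share = score / len(steps)
        for i in steps:
            rewards[i] += share
    return rewards


# A four-step agent trajectory with three rubric criteria (made-up values).
scores = {"cites_sources": 1.0, "correct_answer": 1.0, "concise": 0.5}
attribution = {"cites_sources": [1, 2], "correct_answer": [3]}
step_rewards = assign_step_rewards(4, scores, attribution)
print(step_rewards)
```

The split preserves the total reward, so the dense per-step signal sums to the same value as the original response-level rubric score.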

2026
  • 🌟 [arXiv 2026.04] Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks [Code]
    Credit Assignment
  • 🌟 [arXiv 2026.03] A Rubric-Supervised Critic from Sparse Real-World Outcomes [Code]
    Credit Assignment
  • 🌟 [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
    Credit Assignment General Agentic
  • 🌟 [arXiv 2026.02] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability [Proj]
    Credit Assignment
  • 🌟 [arXiv 2026.02] Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [Code]
    Credit Assignment
  • [arXiv 2026.02] Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning
    Credit Assignment
  • 🌟 [arXiv 2026.01] Technical Report Tongyi DeepResearch [Code]
    Credit Assignment
  • 🌟 [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
    Credit Assignment DeepResearch
2025
  • 🌟 [arXiv 2025.12] Step-DeepResearch Technical Report [Code]
    Credit Assignment DeepResearch
  • 🌟 [arXiv 2025.12] TableGPT-R1: Advancing Tabular Reasoning Through Reinforcement Learning [Model]
    Credit Assignment
  • 🌟 [arXiv 2025.10] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards [Code]
    Credit Assignment
  • [arXiv 2025.06] Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks
    Credit Assignment

Agent Harness

Agent harness methods embed rubric-based judges, verifiers, or reward models inside long-horizon agent loops to guide planning, tool use, and process quality.

2026
  • [arXiv 2026.04] SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling
    Agent Harness
  • 🌟 [arXiv 2026.04] Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents [Proj]
    Agent Harness
  • 🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning [Code]
    Agent Harness General Agentic
  • 🌟 [arXiv 2026.01] Agentic Rubrics as Contextual Verifiers for SWE Agents [Proj]
    Agent Harness Code
2024
  • 🌟 [COLM 24] Autonomous Evaluation and Refinement of Digital Agents [Code]
    Agent Harness

Rubrics for Advanced Training

Advanced training methods make rubrics dynamic training instruments, using them as curricula, evolving objectives, or guidance signals rather than fixed scoring sheets.

Curriculum Learning

Curriculum learning methods use rubric structure to order tasks, examples, or reward difficulty from easier criteria to harder ones.

2026
  • [arXiv 2026.02] RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
    Curriculum Learning
  • [arXiv 2026.02] Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling
    Curriculum Learning
2025
  • 🌟 [arXiv 2025.10] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training [Code]
    Curriculum Learning Medical
  • 🌟 [ICLR 26] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling [Code]
    Curriculum Learning

Self-Evolving Learning

Self-evolving learning methods let rubrics change during training, using model failures, self-play, memory, or feedback to raise or adjust standards.

2026
  • [arXiv 2026.02] SibylSense: Adaptive Rubric Learning via Memory Tuning and Adversarial Probing
    Rubric Grader Analysis Self-Evolving Learning
  • 🌟 [arXiv 2026.02] Reinforcing Chain-of-Thought Reasoning with Self-Evolving Rubrics [Proj]
    Self-Evolving Learning
  • [arXiv 2026.02] Rethinking Rubric Generation for Improving LLM Judge and Reward Modeling for Open-ended Tasks
    Iterative Refinement Self-Evolving Learning
2025
  • 🌟 [arXiv 2025.11] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [Code]
    Rubric Judgement Pattern Self-Evolving Learning DeepResearch
  • 🌟 [arXiv 2025.10] Online Rubrics Elicitation from Pairwise Comparisons [Proj]
    Pairs-Grounded Generation Self-Evolving Learning
  • 🌟 [arXiv 2025.08] Reinforcement Learning with Rubric Anchors [Model]
    Multi-Objective Optimization Self-Evolving Learning

Hint-Based Learning

Hint-based learning uses rubrics as scaffolds, critiques, or guidance signals that point the learner toward missing criteria during optimization.

2026
  • [arXiv 2026.05] DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
    Hint-Based Learning
  • [arXiv 2026.05] Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
    Hint-Based Learning
2025
  • [arXiv 2025.11] Reward and Guidance through Rubrics: Promoting Exploration to Improve Multi-Domain Reasoning
    Hint-Based Learning
  • 🌟 [arXiv 2025.08] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning [Code]
    Rubric Judgement Pattern Hint-Based Learning

Inference

This section covers rubric signals constructed or adapted at test time to guide judging, verification, refinement, or response selection without updating model weights.

Inference-Time Rubric Supervision

Inference-time rubric supervision generates or adapts criteria during deployment, using extra test-time computation to decompose, judge, and refine outputs when fixed references are unavailable.
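
A simple instance of this idea is best-of-N selection: sample several candidates, judge each against criteria produced at test time, and return the one that satisfies the most. The sketch below assumes the criteria are already available and uses a keyword check in place of a real judge; `best_of_n` and `covers` are hypothetical names, not an API from any listed work.

```python
from typing import Callable


def best_of_n(
    candidates: list[str],
    criteria: list[str],
    judge: Callable[[str, str], bool],
) -> tuple[str, int]:
    """Return the candidate that satisfies the most criteria, and its count."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(sum(judge(c, k) for k in criteria), c) for c in candidates]
    # max() keeps the first candidate on ties, so sampling order breaks ties.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def covers(candidate: str, criterion: str) -> bool:
    """Stand-in judge: the criterion keyword appears in the candidate."""
    return criterion in candidate.lower()


criteria = ["fare", "frequency", "safety"]
candidates = [
    "Build more parking garages downtown.",
    "Cut fares and increase frequency on key routes.",
    "Improve safety at stations.",
]
best, hits = best_of_n(candidates, criteria, covers)
print(hits, best)
```

No model weights change here; the extra test-time computation goes into generating candidates and judging them.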

2025

  • 🌟 [arXiv 2025.09] Language Models that Think, Chat Better [Code]
    Inference-Time Rubric Supervision
  • [ICML 26] Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
    Inference-Time Rubric Supervision
  • 🌟 [arXiv 2025.04] Inference-Time Scaling for Generalist Reward Modeling [Model]
    Inference-Time Rubric Supervision

Risk

This section focuses on reward-hacking failures that arise when models optimize against imperfect rubric rewards or rubric-based judges.

Rubric Reward Hacking

Rubric reward hacking work studies how models exploit loopholes, misspecified criteria, judge artifacts, or weak reward tails to obtain high rubric scores without genuinely better behavior.
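
A toy example shows how a misspecified criterion check can be gamed. The judge below is deliberately naive (it only looks for keywords) and is a made-up illustration, not a judge used by any paper in this section; `keyword_judge` and both responses are hypothetical.

```python
def keyword_judge(response: str, keywords: list[str]) -> float:
    """Naive rubric judge: fraction of criterion keywords found in the text."""
    text = response.lower()
    return sum(k in text for k in keywords) / len(keywords)


# Keywords lifted from a completeness criterion for the public transport query.
keywords = ["affordability", "convenience", "infrastructure",
            "reliability", "incentives"]

honest = "Cheaper fares and more frequent buses make transit the easier choice."
stuffed = "affordability convenience infrastructure reliability incentives"

honest_score = keyword_judge(honest, keywords)
stuffed_score = keyword_judge(stuffed, keywords)
print(honest_score, stuffed_score)
```

The keyword-stuffed string earns the maximum score while saying nothing, and the substantive answer earns zero because it paraphrases the criteria. A policy optimized against such a judge would be pushed toward the stuffed output.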

2026

  • [arXiv 2026.05] Reward Hacking in Rubric-Based Reinforcement Learning
    Rubric Reward Hacking
  • [arXiv 2026.04] RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
    Rubric Reward Hacking
  • [arXiv 2026.03] Comparing Developer and LLM Biases in Code Evaluation
    Rubric Reward Hacking
  • 🌟 [arXiv 2026.02] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges [Code]
    Rubric Reward Hacking
  • [arXiv 2026.02] Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge
    Rubric Reward Hacking

2024

  • [NeurIPS 24] Rule Based Rewards for Language Model Safety
    Rubric Reward Hacking

Evaluation

Rubric-based evaluation work introduces benchmarks, datasets, or protocols that make open-ended model behavior comparable through explicit criteria and structured scoring.

Real-World Tasks

Real-world task benchmarks evaluate performance in concrete domains where rubric criteria encode professional, institutional, or task-specific standards.

Medical

Medical benchmarks use rubrics to assess health-related accuracy, safety, reasoning, empathy, and usefulness under clinical or expert-informed standards.

2026
  • 🌟 [arXiv 2026.03] QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models [Code]
    Medical
  • [arXiv 2026.03] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?
    Medical
  • 🌟 [arXiv 2026.02] LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation [Code]
    Medical
  • [arXiv 2026.01] RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions
    Medical
  • [arXiv 2026.01] MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models
    Medical
2025
  • 🌟 [arXiv 2025.05] HealthBench: Evaluating Large Language Models Towards Improved Human Health [Proj]
    Expert Requirement Medical

Legal

Legal benchmarks use rubrics to evaluate legal reasoning, issue identification, evidence use, and procedural or substantive correctness.

2026
  • 🌟 [arXiv 2026.01] PLAW BENCH: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice [Code]
    Expert Requirement Legal
2025
  • 🌟 [arXiv 2025.11] Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics [Code]
    Legal

Office Labor

Office labor benchmarks evaluate workplace tasks such as customer service, finance, hiring, productivity, and business workflows where outputs must satisfy operational criteria.

2026
  • [arXiv 2026.03] LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
    Office Labor
  • 🌟 [arXiv 2026.03] $OneMillion-Bench: How Far are Language Agents from Human Experts? [Data]
    Office Labor
  • 🌟 [arXiv 2026.03] XpertBench: Expert Level Tasks with Rubrics-Based Evaluation [Proj]
    Office Labor
2025
  • [arXiv 2025.11] UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI
    Office Labor
  • 🌟 [arXiv 2025.11] PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning [Proj]
    Expert Requirement Office Labor
  • 🌟 [arXiv 2025.10] Benchmarking and Learning Real-World Customer Service Dialogue [Proj]
    Office Labor
  • 🌟 [ICLR 26] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge [Code]
    Expert Requirement Office Labor
  • [arXiv 2025.10] GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks [Data]
    Office Labor
  • 🌟 [ICLR 26] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists [Proj]
    Expert Provider Office Labor

Academic

Academic benchmarks use rubrics to assess educational, scientific, or research tasks where outputs must satisfy discipline-specific correctness and presentation standards.

2026
  • 🌟 [arXiv 2026.03] PRBench: End-to-end Paper Reproduction in Physics Research [Code]
    Academic
  • [arXiv 2026.01] FrontierScience: Evaluating AI's Ability to Perform Expert-Level Scientific Tasks [Data]
    Academic
2025
  • [arXiv 2025.12] TechImage-Bench: Rubric-Based Evaluation for Technical Image Generation
    Academic
  • 🌟 [arXiv 2025.04] PaperBench: Evaluating AI's Ability to Replicate AI Research [Code]
    Academic

Deep Research

Deep research benchmarks evaluate long-form research agents or reports, emphasizing evidence coverage, citation quality, logical support, completeness, and objectivity.

2026
  • 🌟 [arXiv 2026.03] MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome [Proj]
    Deep Research
  • 🌟 [arXiv 2026.01] DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report [Code]
    Deep Research
2025
  • 🌟 [arXiv 2025.12] DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation [Code]
    Deep Research
  • 🌟 [ICLR 26] ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents [Proj]
    Deep Research
  • 🌟 [arXiv 2025.06] DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents [Code]
    Deep Research

Creative Generation

Creative generation benchmarks use rubrics to judge open-ended artifacts such as writing, tables, images, or videos along multiple quality dimensions.

2026
  • 🌟 [arXiv 2026.03] PresentBench: A Fine-Grained Rubric-Based Benchmark for Slide Generation [Proj]
    Expert Requirement Creative Generation
  • 🌟 [arXiv 2026.01] UEval: A Benchmark for Unified Multimodal Generation [Proj]
    Creative Generation
2025
  • 🌟 [arXiv 2025.03] WritingBench: A Comprehensive Benchmark for Generative Writing [Code]
    Rubric Judgement Pattern Creative Generation
2024
  • 🌟 [arXiv 2024.09] HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models [Code]
    Creative Generation

General Capability Evaluation

General capability benchmarks use rubric-based protocols to test transferable abilities rather than skills tied to a single professional domain.

Agentic

Agentic benchmarks evaluate planning, tool use, environment interaction, and long-horizon decision making with rubric-guided or process-aware assessment.

2026
  • [arXiv 2026.03] ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context
    Agentic
  • 🌟 [arXiv 2026.02] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification [Code]
    Agentic
2025
  • [arXiv 2025.10] TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use [Data]
    Agentic
  • 🌟 [arXiv 2025.08] MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [Code]
    Agentic
2024
  • 🌟 [NeurIPS 24] AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents [Proj]
    Agentic

Reasoning

Reasoning benchmarks use rubrics to assess reasoning quality when answers involve partial credit, social or moral concepts, explanations, or hard-to-verify intermediate logic.

2025
  • 🌟 [arXiv 2025.10] MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes [Proj]
    Reasoning
2024
  • 🌟 [arXiv 2024.07] Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist [Code]
    Reasoning

Alignment

Alignment benchmarks assess whether models or judges follow intended preferences, instructions, safety constraints, and consistency principles under explicit criteria.

2026
  • [arXiv 2026.03] RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following
    Alignment
  • 🌟 [arXiv 2026.03] RubricBench: Aligning Model-Generated Rubrics with Human Standards [Code]
    Expert Provider Human-AI Collaboration Alignment
  • 🌟 [LREC 26] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas [Code]
    Alignment
  • 🌟 [arXiv 2026.01] From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges [Code]
    Alignment
  • 🌟 [ICLR 26] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models [Code]
    Supervised Fine-Tuning Alignment
  • 🌟 [ICLR 26] RM-R1: Reward Modeling as Reasoning [Proj]
    Supervised Fine-Tuning Alignment
  • [ICLR 26] MENLO: From Preferences to Proficiency – Evaluating and Modeling Native-like Quality Across 47 Languages [Data]
    Alignment
2025
  • 🌟 [arXiv 2025.11] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following [Code]
    Expert Requirement Alignment
  • 🌟 [NeurIPS 25] XIFBench: Evaluating Large Language Models on Multilingual Instruction Following [Code]
    Alignment
  • 🌟 [arXiv 2025.01] MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs [Code]
    Alignment
  • 🌟 [NeurIPS 25] Generalizing Verifiable Instruction Following [Code]
    Alignment
  • 🌟 [arXiv 2025.05] R3: Robust Rubric-Agnostic Reward Models [Code]
    Supervised Fine-Tuning Alignment
2024
  • 🌟 [EMNLP 24] SedarEval: Automated Evaluation using Self-Adaptive Rubrics [Code]
    Alignment
  • 🌟 [arXiv 2024.10] Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [Code]
    Alignment
  • 🌟 [ICLR 25] JudgeBench: A Benchmark for Evaluating LLM-based Judges [Code]
    Rubric Grader Analysis Alignment
  • 🌟 [arXiv 2024.02] A StrongREJECT for Empty Jailbreaks [Proj]
    Alignment
  • 🌟 [arXiv 2024.01] InFoBench: Evaluating Instruction Following Ability in Large Language Models [Code]
    Alignment

Applications

Applications use rubrics as practical task interfaces: they guide generation, evaluation, training, or refinement in concrete systems rather than only proposing benchmarks.

Domain

Domain applications apply rubrics within specific task settings such as healthcare, writing, retrieval, deep research, code, and agent workflows.

Medical

Medical applications use rubrics to align healthcare models with clinical reasoning, patient safety, expert preferences, and domain-specific response standards.

2026
  • 🌟 [CVPR 26] OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis [Code]
    Medical
  • 🌟 [arXiv 2026.02] ClinAlign: Scaling Healthcare Alignment from Clinician Preference [Code]
    Human-AI Collaboration Medical
  • [arXiv 2026.02] Quark Medical Alignment: A Holistic Multi-Dimensional Alignment and Collaborative Optimization Paradigm
    Multi-Objective Optimization Medical
  • [arXiv 2026.01] Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs
    Iterative Refinement Medical
2025
  • 🌟 [arXiv 2025.10] InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training [Code]
    Curriculum Learning Medical
  • 🌟 [arXiv 2025.09] Baichuan-M2: Scaling Medical Capability with Large Verifier System [Code]
    Medical

Writing and Retrieval

Writing and retrieval applications use rubrics to guide text generation, revision, explanation, document retrieval, and automated assessment of written outputs.

2025
  • 🌟 [arXiv 2025.09] Retro*: Optimizing LLMs for Reasoning-Intensive Document Retrieval [Code]
    Writing & Retrieval
  • [arXiv 2025.08] Are Today's LLMs Ready to Explain Well-Being Concepts?
    Writing & Retrieval
2024
  • 🌟 [ACL 24] LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts [Code]
    Rubric Judgement Pattern Writing & Retrieval
  • [ICTIR 24] Pencils Down! Automatic Rubric-based Evaluation of Retrieve/Generate Systems
    Writing & Retrieval

DeepResearch

DeepResearch applications use rubrics to supervise search, evidence chaining, report generation, and long-horizon research-agent optimization.

2026
  • 🌟 [arXiv 2026.02] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation [Model]
    Pairs-Grounded Generation DeepResearch
  • 🌟 [arXiv 2026.01] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards [Code]
    Credit Assignment DeepResearch
2025
  • 🌟 [arXiv 2025.12] Step-DeepResearch Technical Report [Code]
    Credit Assignment DeepResearch
  • 🌟 [arXiv 2025.11] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [Code]
    Rubric Judgement Pattern Self-Evolving Learning DeepResearch

Code

Code applications apply rubrics to software engineering agents, patch evaluation, code collaboration, and programming workflows.

2026
  • [arXiv 2026.03] StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning
    Code
  • 🌟 [arXiv 2026.01] Agentic Rubrics as Contextual Verifiers for SWE Agents [Proj]
    Agent Harness Code

General Agentic

General agentic applications use rubrics to coordinate, evaluate, or train agents across tool use, simulated worlds, interviews, and other open-ended environments.

2026
  • 🌟 [arXiv 2026.03] AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning [Code]
    Agent Harness General Agentic
  • 🌟 [arXiv 2026.02] CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use [Code]
    Credit Assignment General Agentic
  • 🌟 [arXiv 2026.01] Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards [Code]
    General Agentic
  • 🌟 [arXiv 2026.01] ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking [Code]
    General Agentic
  • [arXiv 2026.01] SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models
    General Agentic
2025
  • [arXiv 2025.12] ARCANE: A Multi-Agent Framework for Interpretable and Configurable Alignment
    General Agentic

Multimodal

Multimodal applications extend rubric supervision beyond text, using criteria to assess or train systems that combine language with vision, speech, or omni-modal signals.

Text + Vision

Text-and-vision applications use rubrics for image or video generation, captioning, visual reasoning, and visual reward modeling.

2026
  • 🌟 [arXiv 2026.05] Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria [Code]
    Rubric Grader Analysis Text + Vision
  • [arXiv 2026.04] Visual Preference Optimization with Rubric Rewards
    Preference-Reward RL Text + Vision
  • 🌟 [arXiv 2026.03] Rationale Matters: Learning Transferable Rubrics via Proxy-Guided Critique for VLM Reward Models [Code]
    Text + Vision
  • [arXiv 2026.03] RubiCap: Rubric-Guided Reinforcement Learning for Dense Image Captioning
    Text + Vision
2025
  • [arXiv 2025.11] RubricRL: Simple Generalizable Rewards for Text-to-Image Generation
    Text + Vision
  • 🌟 [arXiv 2025.10] AutoRubric: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning [Code]
    Text + Vision

Text + Audio

Text-and-audio applications use rubrics to evaluate or fine-tune speech-language systems across multiple raters, aspects, and quality dimensions.

2026
  • [arXiv 2026.03] Rubric-Guided Fine-tuning of SpeechLLMs for Multi-Aspect, Multi-Rater L2 Reading-Speech Assessment
    Text + Audio
2025
  • 🌟 [ACL 26] SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation [Code]
    Text + Audio

Omni-modal

Omni-modal applications use rubric-grounded preference or reward modeling across multiple modalities within a unified training or evaluation framework.

2026
  • 🌟 [arXiv 2026.01] Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis [Code]
    Omni-modal

LICENSE

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

If you have any questions or suggestions, please feel free to contact Hongru Xiao.

About

A curated list of resources (surveys, papers, benchmarks, and open-source projects) on rubrics.
