From 9352abdff5f5181c4703700b028415a2464d863e Mon Sep 17 00:00:00 2001
From: Ksenia Blokhina <kseniablokhina@MacBook-Pro-Ksenia.local>
Date: Tue, 7 Apr 2026 13:29:13 +0300
Subject: [PATCH 1/3] 246 update evaluation docs

---
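The three utility metrics documented in this patch can be read as the following minimal sketch. It is an illustration of the definitions in the text, not the code in ``usersimcrs``: the function names and the ``round_outcomes`` representation (one boolean per recommendation round, ``True`` when accepted) are hypothetical, and returning ``0.0`` for dialogues without recommendation rounds or utterances is an assumption the docs do not state.

```python
from typing import List


def success_rate(round_outcomes: List[bool]) -> float:
    """Return 1.0 if at least one recommendation was accepted, else 0.0."""
    return 1.0 if any(round_outcomes) else 0.0


def successful_recommendation_round_ratio(round_outcomes: List[bool]) -> float:
    """Return accepted recommendation rounds over all recommendation rounds."""
    if not round_outcomes:
        # Assumption: a dialogue without recommendation rounds scores 0.0.
        return 0.0
    return sum(round_outcomes) / len(round_outcomes)


def reward_per_dialogue_length(
    round_outcomes: List[bool], num_utterances: int
) -> float:
    """Return accepted recommendations over the total number of utterances."""
    if num_utterances == 0:
        # Assumption: an empty dialogue scores 0.0.
        return 0.0
    return sum(round_outcomes) / num_utterances


# Hypothetical dialogue: three recommendation rounds, two of them accepted,
# spread over twelve utterances.
outcomes = [False, True, True]
scores = (
    success_rate(outcomes),
    successful_recommendation_round_ratio(outcomes),
    reward_per_dialogue_length(outcomes, num_utterances=12),
)
```

In the documented pipeline, the accepted and rejected rounds would come from the intent labels configured in `recommendation_intent_labels`, `accept_intent_labels`, and `reject_intent_labels`; how rounds are extracted from annotations is not covered by this sketch.
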
 docs/source/evaluation.rst | 124 ++++++++++++++++++++++++++++++++++---
 1 file changed, 117 insertions(+), 7 deletions(-)

diff --git a/docs/source/evaluation.rst b/docs/source/evaluation.rst
index 7259a6ca..862f242b 100644
--- a/docs/source/evaluation.rst
+++ b/docs/source/evaluation.rst
@@ -1,13 +1,123 @@
 Evaluation
 ==========
 
-The evaluation of a conversational recommender system (CRS) is performed by first generating dialogues between the CRS and the user simulator, then computing evaluation measures on these synthetic dialogues. 
-The evaluation scripts are located in the directory `scripts/evaluation`.
+UserSimCRS evaluates conversational recommender systems (CRSs) on exported dialogues. The evaluation pipeline loads dialogues from a JSON file, computes one or more metrics, and stores the results as JSON together with the resolved configuration.
 
-Currently, we provide the following evaluation scripts:
+A default evaluation configuration is provided in `config/default/config_evaluation.yaml`.
 
-  * **Dialogue quality evaluation**: Evaluates the dialogue quality with regards to five aspects: recommendation relevance, communication style, fluency, conversational flow, and overall satisfaction. The scores for each aspect are obtained from a large language model (LLM) hosted on a Ollama server.
-  * **Satisfaction evaluation**: Evaluates the user satisfaction using a pre-trained model from DialogueKit.
-  * **Utility evaluation**: Evaluates dialogues based on user-centric utility metrics: success rate, successful recommendation round ratio, and reward-per-dialogue-length.
 
-Please refer to the documentation of each script for more details on how to run them. 
\ No newline at end of file
+Usage
+-----
+
+Run evaluation with:
+
+.. code-block:: shell
+
+    python -m usersimcrs.run_evaluation -c <path_to_config.yaml>
+
+
+Some parameters can also be overridden from the command line, for example:
+
+.. code-block:: shell
+
+    python -m usersimcrs.run_evaluation \
+      -c config/default/config_evaluation.yaml \
+      --dialogues data/datasets/moviebot/annotated_dialogues.json \
+      --metrics satisfaction success_rate \
+      --output data/evaluation/results.json
+
+
+Configuration
+-------------
+
+The evaluation configuration is defined in a YAML file. The main parameters are:
+
+  * `dialogues`: Path to the dialogues JSON file.
+  * `metrics`: List of metrics to compute.
+  * `output`: Path to the JSON file where evaluation results will be saved.
+  * `quality_aspects`: Quality aspects to evaluate when `quality` is included in `metrics`.
+  * `quality_llm_interface`: LLM interface configuration used by the quality metric.
+  * `user_nlu_config`: Configuration file used to instantiate the user-side NLU for utility metrics.
+  * `agent_nlu_config`: Configuration file used to instantiate the agent-side NLU for utility metrics.
+  * `recommendation_intent_labels`: Intent labels that mark recommendation turns.
+  * `accept_intent_labels`: Intent labels that mark acceptance.
+  * `reject_intent_labels`: Intent labels that mark rejection.
+
+
+The following metrics are currently supported:
+
+  * `quality`
+  * `satisfaction`
+  * `success_rate`
+  * `successful_recommendation_round_ratio`
+  * `reward_per_dialogue_length`
+
+
+Metric Overview
+---------------
+
+Quality
+"""""""
+
+The quality metric uses an LLM to score each dialogue aspect separately. The supported aspects are defined by ``QualityRubrics``:
+
+  * `REC_RELEVANCE`
+  * `COM_STYLE`
+  * `FLUENCY`
+  * `CONV_FLOW`
+  * `OVERALL_SAT`
+
+
+When `quality` is requested, the configuration must include `quality_llm_interface`.
+
+
+Satisfaction
+""""""""""""
+
+The satisfaction metric uses the pre-trained DialogueKit satisfaction classifier and returns one score per dialogue.
+
+
+Utility Metrics
+"""""""""""""""
+
+The utility metrics are:
+
+  * **Success rate**: Returns `1.0` if at least one recommendation was accepted in the dialogue, otherwise `0.0`.
+  * **Successful recommendation round ratio**: Returns the ratio of accepted recommendation rounds to all recommendation rounds in the dialogue.
+  * **Reward per dialogue length**: Returns the number of accepted recommendations divided by the total number of utterances in the dialogue.
+
+
+If the input dialogues are not already annotated, UserSimCRS annotates them in place using the NLU components loaded from `user_nlu_config` and `agent_nlu_config`.
+
+When any utility metric is requested, the following configuration fields are required:
+
+  * `user_nlu_config`
+  * `agent_nlu_config`
+  * `recommendation_intent_labels`
+  * `accept_intent_labels`
+  * `reject_intent_labels`
+
+
+Output
+------
+
+The evaluation script writes two files:
+
+  * The JSON result file specified by `output`.
+  * A companion metadata file with the suffix `.meta.yaml`, containing the resolved configuration.
+
+
+The result JSON contains:
+
+  * `dialogues_path`: Path to the evaluated dialogues.
+  * `metrics_requested`: List of requested metrics.
+  * `metrics`: Metric results.
+
+
+For `satisfaction` and all utility metrics, each metric entry contains:
+
+  * `per_dialogue`: Mapping from conversation ID to score.
+  * `summary_by_agent`: Aggregate statistics per agent (`count`, `min`, `max`, `mean`, `stdev`).
+
+
+For `quality`, the output is grouped by aspect. Each aspect contains its own `per_dialogue` scores and `summary_by_agent` statistics.

From 6ce1a4cdd201f6af48eb0ff0e99dce3fa7f9288e Mon Sep 17 00:00:00 2001
From: Ksenia Blokhina <kseniablokhina@MacBook-Pro-Ksenia.local>
Date: Tue, 21 Apr 2026 16:21:33 +0200
Subject: [PATCH 2/3] change docs

---
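The conditional requirements described in this patch (fields needed for `quality`, for the utility metrics, and when `annotate_dialogues` is enabled) can be summarised as a small check. This is a sketch under assumptions: ``missing_required_fields`` is a hypothetical helper, not part of ``usersimcrs``; it treats a missing key or a ``None`` value as "not provided"; and it requires `user_nlu` and `agent_nlu` whenever `annotate_dialogues` is true.

```python
from typing import Any, Dict, List

# Metric names as listed in the documentation.
UTILITY_METRICS = {
    "success_rate",
    "successful_recommendation_round_ratio",
    "reward_per_dialogue_length",
}


def missing_required_fields(config: Dict[str, Any]) -> List[str]:
    """List documented fields that the requested metrics need but lack.

    Hypothetical helper for illustration only.
    """
    metrics = set(config.get("metrics") or [])
    required: List[str] = []
    if "quality" in metrics:
        required.append("quality_llm_interface")
    if metrics & UTILITY_METRICS:
        required += [
            "recommendation_intent_labels",
            "accept_intent_labels",
            "reject_intent_labels",
        ]
    if config.get("annotate_dialogues"):
        required += ["user_nlu", "agent_nlu"]
    # Assumption: a key that is absent or set to None counts as missing.
    return [field for field in required if config.get(field) is None]


# Hypothetical, already-parsed configuration with some fields left out.
example_config = {
    "metrics": ["satisfaction", "success_rate"],
    "annotate_dialogues": True,
    "accept_intent_labels": ["ACCEPT"],
    "user_nlu": {},
}
missing = missing_required_fields(example_config)
```

For the configuration above, the check reports the two absent intent-label fields and `agent_nlu`, while `satisfaction` adds no requirements of its own.
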
 docs/source/evaluation.rst | 55 ++++++++++++++++++++++++++++----------
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/docs/source/evaluation.rst b/docs/source/evaluation.rst
index 862f242b..1aea9d7e 100644
--- a/docs/source/evaluation.rst
+++ b/docs/source/evaluation.rst
@@ -1,7 +1,7 @@
 Evaluation
 ==========
 
-UserSimCRS evaluates conversational recommender systems (CRSs) on exported dialogues. The evaluation pipeline loads dialogues from a JSON file, computes one or more metrics, and stores the results as JSON together with the resolved configuration.
+UserSimCRS evaluates conversational recommender systems (CRSs) on previously generated synthetic dialogues. The evaluation pipeline loads dialogues from a JSON file, computes one or more metrics, and stores the results as JSON together with a copy of the configuration used.
 
 A default evaluation configuration is provided in `config/default/config_evaluation.yaml`.
 
@@ -21,10 +21,12 @@ Some parameters can also be overridden from the command line, for example:
 .. code-block:: shell
 
     python -m usersimcrs.run_evaluation \
-      -c config/default/config_evaluation.yaml \
+      -c <path_to_evaluation_config.yaml> \
       --dialogues data/datasets/moviebot/annotated_dialogues.json \
       --metrics satisfaction success_rate \
-      --output data/evaluation/results.json
+      --output-dir data/evaluation
+
+Run ``python -m usersimcrs.run_evaluation -h`` for the full list of available command-line arguments. The configuration fields used by these arguments are described below.
 
 
 Configuration
@@ -34,11 +36,10 @@ The evaluation configuration is defined in a YAML file. The main parameters are:
 
   * `dialogues`: Path to the dialogues JSON file.
   * `metrics`: List of metrics to compute.
-  * `output`: Path to the JSON file where evaluation results will be saved.
+  * `output_dir`: Directory where evaluation results and metadata will be saved.
   * `quality_aspects`: Quality aspects to evaluate when `quality` is included in `metrics`.
   * `quality_llm_interface`: LLM interface configuration used by the quality metric.
-  * `user_nlu_config`: Configuration file used to instantiate the user-side NLU for utility metrics.
-  * `agent_nlu_config`: Configuration file used to instantiate the agent-side NLU for utility metrics.
+  * `annotate_dialogues`: Whether dialogues should be annotated before metric computation.
   * `recommendation_intent_labels`: Intent labels that mark recommendation turns.
   * `accept_intent_labels`: Intent labels that mark acceptance.
   * `reject_intent_labels`: Intent labels that mark rejection.
@@ -59,6 +60,8 @@ Metric Overview
 Quality
 """""""
 
+:py:class:`usersimcrs.evaluation.quality_metric.QualityMetric`
+
 The quality metric uses an LLM to score each dialogue aspect separately. The supported aspects are defined by ``QualityRubrics``:
 
   * `REC_RELEVANCE`
@@ -74,37 +77,61 @@ When `quality` is requested, the configuration must include `quality_llm_interfa
 Satisfaction
 """"""""""""
 
+:py:class:`usersimcrs.evaluation.satisfaction_metric.SatisfactionMetric`
+
 The satisfaction metric uses the pre-trained DialogueKit satisfaction classifier and returns one score per dialogue.
 
 
 Utility Metrics
 """""""""""""""
 
-The utility metrics are:
+The utility metrics capture recommendation outcomes from annotated dialogues. If the input dialogues are not already annotated, they can be annotated before evaluation by enabling `annotate_dialogues` and providing `user_nlu` and `agent_nlu` configurations. For additional context on their role in the evaluation setup, see `Bernard and Balog, 2026 <https://arxiv.org/abs/2512.04588>`_.
+
+
+Success Rate
+''''''''''''
+
+:py:class:`usersimcrs.evaluation.success_rate_metric.SuccessRateMetric`
+
+Returns `1.0` if at least one recommendation was accepted in the dialogue, otherwise `0.0`.
+
 
-  * **Success rate**: Returns `1.0` if at least one recommendation was accepted in the dialogue, otherwise `0.0`.
-  * **Successful recommendation round ratio**: Returns the ratio of accepted recommendation rounds to all recommendation rounds in the dialogue.
-  * **Reward per dialogue length**: Returns the number of accepted recommendations divided by the total number of utterances in the dialogue.
+Successful Recommendation Round Ratio
+''''''''''''''''''''''''''''''''''''''
 
+:py:class:`usersimcrs.evaluation.successful_recommendation_round_ratio_metric.SuccessfulRecommendationRoundRatioMetric`
+
+Returns the ratio of accepted recommendation rounds to all recommendation rounds in the dialogue.
+
+
+Reward per Dialogue Length
+''''''''''''''''''''''''''
+
+:py:class:`usersimcrs.evaluation.reward_per_dialogue_length_metric.RewardPerDialogueLengthMetric`
+
+Returns the number of accepted recommendations divided by the total number of utterances in the dialogue.
 
 If the input dialogues are not already annotated, UserSimCRS annotates them in place using the NLU components loaded from `user_nlu_config` and `agent_nlu_config`.
 
 When any utility metric is requested, the following configuration fields are required:
 
-  * `user_nlu_config`
-  * `agent_nlu_config`
   * `recommendation_intent_labels`
   * `accept_intent_labels`
   * `reject_intent_labels`
 
+When `annotate_dialogues` is enabled, the following configuration fields are also required:
+
+  * `user_nlu`
+  * `agent_nlu`
+
 
 Output
 ------
 
 The evaluation script writes two files:
 
-  * The JSON result file specified by `output`.
-  * A companion metadata file with the suffix `.meta.yaml`, containing the resolved configuration.
+  * `results.json` in the directory specified by `output_dir`.
+  * `config_evaluation.meta.yaml` in the same directory, containing a copy of the configuration used.
 
 
 The result JSON contains:

From e78f37d80ce7d18340c6d0bfd0cb7259badf58b4 Mon Sep 17 00:00:00 2001
From: Ksenia Blokhina <kseniablokhina@MacBook-Pro-Ksenia.local>
Date: Tue, 16 Jun 2026 09:43:13 +0200
Subject: [PATCH 3/3] update the doc, add examples

---
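As a companion to the example output added in this patch, the following sketch shows one way a consumer might walk the result structure, given that `quality` is nested one level deeper (by aspect) than the other metrics. The helper ``iter_agent_means`` is hypothetical and not part of ``usersimcrs``; it assumes only the layout shown in the documented example, and the inline JSON is a trimmed copy of that example.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def iter_agent_means(results: Dict[str, Any]) -> Iterator[Tuple[str, str, float]]:
    """Yield (metric label, agent, mean) for every summary in the results.

    Hypothetical helper: entries that carry `summary_by_agent` directly are
    treated as flat metrics; anything else is assumed to be grouped by aspect,
    as documented for `quality`.
    """
    for metric, entry in results["metrics"].items():
        if "summary_by_agent" in entry:
            groups = {metric: entry}
        else:
            groups = {f"{metric}.{aspect}": sub for aspect, sub in entry.items()}
        for label, group in groups.items():
            for agent, stats in group["summary_by_agent"].items():
                yield label, agent, stats["mean"]


# Trimmed copy of the documented example output.
example = json.loads(
    """
    {
      "metrics": {
        "satisfaction": {
          "per_dialogue": {"conv_001": 0.82},
          "summary_by_agent": {"moviebot": {"count": 1, "mean": 0.82}}
        },
        "quality": {
          "REC_RELEVANCE": {
            "per_dialogue": {"conv_001": 4.5},
            "summary_by_agent": {"moviebot": {"count": 1, "mean": 4.5}}
          }
        }
      }
    }
    """
)
rows = list(iter_agent_means(example))
```

With real output, the same function could be applied to the parsed `results.json` from the directory given by `output_dir`.
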
 docs/source/evaluation.rst | 77 ++++++++++++++++++++++++++++++++------
 1 file changed, 65 insertions(+), 12 deletions(-)

diff --git a/docs/source/evaluation.rst b/docs/source/evaluation.rst
index 1aea9d7e..aaed3401 100644
--- a/docs/source/evaluation.rst
+++ b/docs/source/evaluation.rst
@@ -64,11 +64,11 @@ Quality
 
 The quality metric uses an LLM to score each dialogue aspect separately. The supported aspects are defined by ``QualityRubrics``:
 
-  * `REC_RELEVANCE`
-  * `COM_STYLE`
-  * `FLUENCY`
-  * `CONV_FLOW`
-  * `OVERALL_SAT`
+  * `REC_RELEVANCE`: Recommendation relevance measures how closely the recommended items align with the user’s preferences and needs.
+  * `COM_STYLE`: Communication style corresponds to the conciseness and clarity of the responses.
+  * `FLUENCY`: Fluency is the degree of naturalness of the responses compared to human-generated responses.
+  * `CONV_FLOW`: Conversational flow assesses the coherence and consistency of the conversation.
+  * `OVERALL_SAT`: Overall satisfaction encapsulates the user’s holistic experience.
 
 
 When `quality` is requested, the configuration must include `quality_llm_interface`.
@@ -82,10 +82,10 @@ Satisfaction
 The satisfaction metric uses the pre-trained DialogueKit satisfaction classifier and returns one score per dialogue.
 
 
-Utility Metrics
-"""""""""""""""
+User Utility Metrics
+""""""""""""""""""""
 
-The utility metrics capture recommendation outcomes from annotated dialogues. If the input dialogues are not already annotated, they can be annotated before evaluation by enabling `annotate_dialogues` and providing `user_nlu` and `agent_nlu` configurations. For additional context on their role in the evaluation setup, see `Bernard and Balog, 2026 <https://arxiv.org/abs/2512.04588>`_.
+The user utility metrics capture recommendation outcomes from annotated dialogues. If the input dialogues are not already annotated, they can be annotated before evaluation by enabling `annotate_dialogues` and providing `user_nlu` and `agent_nlu` configurations. For additional context on their role in the evaluation setup, see `Bernard and Balog, 2025 <https://doi.org/10.1145/3767695.3769478>`_.
 
 
 Success Rate
@@ -111,9 +111,7 @@ Reward per Dialogue Length
 
 Returns the number of accepted recommendations divided by the total number of utterances in the dialogue.
 
-If the input dialogues are not already annotated, UserSimCRS annotates them in place using the NLU components loaded from `user_nlu_config` and `agent_nlu_config`.
-
-When any utility metric is requested, the following configuration fields are required:
+When any user utility metric is requested, the following configuration fields are required:
 
   * `recommendation_intent_labels`
   * `accept_intent_labels`
@@ -141,10 +139,65 @@ The result JSON contains:
   * `metrics`: Metric results.
 
 
-For `satisfaction` and all utility metrics, each metric entry contains:
+For `satisfaction` and all user utility metrics, each metric entry contains:
 
   * `per_dialogue`: Mapping from conversation ID to score.
   * `summary_by_agent`: Aggregate statistics per agent (`count`, `min`, `max`, `mean`, `stdev`).
 
 
 For `quality`, the output is grouped by aspect. Each aspect contains its own `per_dialogue` scores and `summary_by_agent` statistics.
+
+Example output structure:
+
+.. code-block:: json
+
+    {
+      "dialogues_path": "data/datasets/moviebot/annotated_dialogues.json",
+      "metrics_requested": ["satisfaction", "success_rate", "quality"],
+      "metrics": {
+        "satisfaction": {
+          "per_dialogue": {
+            "conv_001": 0.82
+          },
+          "summary_by_agent": {
+            "moviebot": {
+              "count": 1,
+              "min": 0.82,
+              "max": 0.82,
+              "mean": 0.82,
+              "stdev": 0.0
+            }
+          }
+        },
+        "success_rate": {
+          "per_dialogue": {
+            "conv_001": 1.0
+          },
+          "summary_by_agent": {
+            "moviebot": {
+              "count": 1,
+              "min": 1.0,
+              "max": 1.0,
+              "mean": 1.0,
+              "stdev": 0.0
+            }
+          }
+        },
+        "quality": {
+          "REC_RELEVANCE": {
+            "per_dialogue": {
+              "conv_001": 4.5
+            },
+            "summary_by_agent": {
+              "moviebot": {
+                "count": 1,
+                "min": 4.5,
+                "max": 4.5,
+                "mean": 4.5,
+                "stdev": 0.0
+              }
+            }
+          }
+        }
+      }
+    }