diff --git a/docs/source/evaluation.rst b/docs/source/evaluation.rst
index 7259a6ca..aaed3401 100644
--- a/docs/source/evaluation.rst
+++ b/docs/source/evaluation.rst
@@ -1,13 +1,203 @@
 Evaluation
 ==========
 
-The evaluation of a conversational recommender system (CRS) is performed by first generating dialogues between the CRS and the user simulator, then computing evaluation measures on these synthetic dialogues.
-The evaluation scripts are located in the directory `scripts/evaluation`.
+UserSimCRS evaluates conversational recommender systems (CRSs) on previously generated synthetic dialogues. The evaluation pipeline loads dialogues from a JSON file, computes one or more metrics, and stores the results as JSON together with a copy of the configuration used.
 
-Currently, we provide the following evaluation scripts:
+A default evaluation configuration is provided in `config/default/config_evaluation.yaml`.
 
- * **Dialogue quality evaluation**: Evaluates the dialogue quality with regards to five aspects: recommendation relevance, communication style, fluency, conversational flow, and overall satisfaction. The scores for each aspect are obtained from a large language model (LLM) hosted on a Ollama server.
- * **Satisfaction evaluation**: Evaluates the user satisfaction using a pre-trained model from DialogueKit.
- * **Utility evaluation**: Evaluates dialogues based on user-centric utility metrics: success rate, successful recommendation round ratio, and reward-per-dialogue-length.
 
-Please refer to the documentation of each script for more details on how to run them.
\ No newline at end of file
+Usage
+-----
+
+Run evaluation with:
+
+.. code-block:: shell
+
+    python -m usersimcrs.run_evaluation -c
+
+
+Some parameters can also be overridden from the command line, for example:
+
+.. code-block:: shell
+
+    python -m usersimcrs.run_evaluation \
+        -c \
+        --dialogues data/datasets/moviebot/annotated_dialogues.json \
+        --metrics satisfaction success_rate \
+        --output-dir data/evaluation
+
+Run ``python -m usersimcrs.run_evaluation -h`` for the full list of available command-line arguments. The configuration fields used by these arguments are described below.
+
+
+Configuration
+-------------
+
+The evaluation configuration is defined in a YAML file. The main parameters are:
+
+ * `dialogues`: Path to the dialogues JSON file.
+ * `metrics`: List of metrics to compute.
+ * `output_dir`: Directory where evaluation results and metadata will be saved.
+ * `quality_aspects`: Quality aspects to evaluate when `quality` is included in `metrics`.
+ * `quality_llm_interface`: LLM interface configuration used by the quality metric.
+ * `annotate_dialogues`: Whether dialogues should be annotated before metric computation.
+ * `recommendation_intent_labels`: Intent labels that mark recommendation turns.
+ * `accept_intent_labels`: Intent labels that mark acceptance.
+ * `reject_intent_labels`: Intent labels that mark rejection.
+
+
+The following metrics are currently supported:
+
+ * `quality`
+ * `satisfaction`
+ * `success_rate`
+ * `successful_recommendation_round_ratio`
+ * `reward_per_dialogue_length`
+
+
+Metric Overview
+---------------
+
+Quality
+"""""""
+
+:py:class:`usersimcrs.evaluation.quality_metric.QualityMetric`
+
+The quality metric uses an LLM to score each dialogue aspect separately. The supported aspects are defined by ``QualityRubrics``:
+
+ * `REC_RELEVANCE`: Recommendation relevance measures how closely the recommended items align with the user’s preferences and needs.
+ * `COM_STYLE`: Communication style corresponds to the conciseness and clarity of the responses.
+ * `FLUENCY`: Fluency is the degree of naturalness of the responses compared to human-generated responses.
+ * `CONV_FLOW`: Conversational flow assesses the coherence and consistency of the conversation.
+ * `OVERALL_SAT`: Overall satisfaction encapsulates the user’s holistic experience.
+
+
+When `quality` is requested, the configuration must include `quality_llm_interface`.
+
+
+Satisfaction
+""""""""""""
+
+:py:class:`usersimcrs.evaluation.satisfaction_metric.SatisfactionMetric`
+
+The satisfaction metric uses the pre-trained DialogueKit satisfaction classifier and returns one score per dialogue.
+
+
+User Utility Metrics
+""""""""""""""""""""
+
+The user utility metrics capture recommendation outcomes from annotated dialogues. If the input dialogues are not already annotated, they can be annotated before evaluation by enabling `annotate_dialogues` and providing `user_nlu` and `agent_nlu` configurations. For additional context on their role in the evaluation setup, see `Bernard and Balog, 2025 `_.
+
+
+Success Rate
+''''''''''''
+
+:py:class:`usersimcrs.evaluation.success_rate_metric.SuccessRateMetric`
+
+Returns `1.0` if at least one recommendation was accepted in the dialogue, otherwise `0.0`.
+
+
+Successful Recommendation Round Ratio
+''''''''''''''''''''''''''''''''''''''
+
+:py:class:`usersimcrs.evaluation.successful_recommendation_round_ratio_metric.SuccessfulRecommendationRoundRatioMetric`
+
+Returns the ratio of accepted recommendation rounds to all recommendation rounds in the dialogue.
+
+
+Reward per Dialogue Length
+''''''''''''''''''''''''''
+
+:py:class:`usersimcrs.evaluation.reward_per_dialogue_length_metric.RewardPerDialogueLengthMetric`
+
+Returns the number of accepted recommendations divided by the total number of utterances in the dialogue.
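As a rough illustration of the three user utility definitions above, the following Python sketch computes them for a single dialogue. It is not the UserSimCRS implementation: the `(participant, intent)` tuple format, the `count_rounds` and `utility_scores` helpers, the example intent labels, and the rule that a round counts as accepted when the very next utterance is a user accept are all assumptions made for this example. Reject labels are not needed for these three formulas and are left out of the sketch.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical, simplified dialogue: one (participant, intent label) pair
# per utterance. The real annotated dialogue format may differ.
Dialogue = List[Tuple[str, str]]

# Example label sets (assumed); real values would come from
# `recommendation_intent_labels` and `accept_intent_labels`.
RECOMMEND: Set[str] = {"RECOMMEND"}
ACCEPT: Set[str] = {"ACCEPT"}


def count_rounds(
    dialogue: Dialogue, recommend: Set[str], accept: Set[str]
) -> Tuple[int, int]:
    """Returns (accepted rounds, all recommendation rounds).

    Assumption: a round is one agent recommendation, and it counts as
    accepted when the next utterance is a user utterance with an accept
    intent.
    """
    rounds = 0
    accepted = 0
    for i, (participant, intent) in enumerate(dialogue):
        if participant != "AGENT" or intent not in recommend:
            continue
        rounds += 1
        if i + 1 < len(dialogue):
            next_participant, next_intent = dialogue[i + 1]
            if next_participant == "USER" and next_intent in accept:
                accepted += 1
    return accepted, rounds


def utility_scores(
    dialogue: Dialogue,
    recommend: Set[str] = RECOMMEND,
    accept: Set[str] = ACCEPT,
) -> Dict[str, float]:
    """Computes the three utility scores as defined in the text above."""
    accepted, rounds = count_rounds(dialogue, recommend, accept)
    return {
        # 1.0 if at least one recommendation was accepted, otherwise 0.0.
        "success_rate": 1.0 if accepted > 0 else 0.0,
        # Accepted rounds over all rounds; 0.0 when nothing was recommended.
        "successful_recommendation_round_ratio": (
            accepted / rounds if rounds else 0.0
        ),
        # Accepted recommendations over the total number of utterances.
        "reward_per_dialogue_length": (
            accepted / len(dialogue) if dialogue else 0.0
        ),
    }


example: Dialogue = [
    ("USER", "REVEAL"),
    ("AGENT", "RECOMMEND"),
    ("USER", "REJECT"),
    ("AGENT", "RECOMMEND"),
    ("USER", "ACCEPT"),
]
scores = utility_scores(example)
```

In the example dialogue, one of the two recommendation rounds is accepted over five utterances, so the three scores are 1.0, 0.5, and 0.2 respectively.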
+
+When any user utility metric is requested, the following configuration fields are required:
+
+ * `recommendation_intent_labels`
+ * `accept_intent_labels`
+ * `reject_intent_labels`
+
+When `annotate_dialogues` is enabled, the following configuration fields are also required:
+
+ * `user_nlu`
+ * `agent_nlu`
+
+
+Output
+------
+
+The evaluation script writes two files:
+
+ * `results.json` in the directory specified by `output_dir`.
+ * `config_evaluation.meta.yaml` in the same directory, containing a copy of the configuration used.
+
+
+The result JSON contains:
+
+ * `dialogues_path`: Path to the evaluated dialogues.
+ * `metrics_requested`: List of requested metrics.
+ * `metrics`: Metric results.
+
+
+For `satisfaction` and all user utility metrics, each metric entry contains:
+
+ * `per_dialogue`: Mapping from conversation ID to score.
+ * `summary_by_agent`: Aggregate statistics per agent (`count`, `min`, `max`, `mean`, `stdev`).
+
+
+For `quality`, the output is grouped by aspect. Each aspect contains its own `per_dialogue` scores and `summary_by_agent` statistics.
+
+Example output structure:
+
+.. code-block:: json
+
+    {
+      "dialogues_path": "data/datasets/moviebot/annotated_dialogues.json",
+      "metrics_requested": ["satisfaction", "success_rate", "quality"],
+      "metrics": {
+        "satisfaction": {
+          "per_dialogue": {
+            "conv_001": 0.82
+          },
+          "summary_by_agent": {
+            "moviebot": {
+              "count": 1,
+              "min": 0.82,
+              "max": 0.82,
+              "mean": 0.82,
+              "stdev": 0.0
+            }
+          }
+        },
+        "success_rate": {
+          "per_dialogue": {
+            "conv_001": 1.0
+          },
+          "summary_by_agent": {
+            "moviebot": {
+              "count": 1,
+              "min": 1.0,
+              "max": 1.0,
+              "mean": 1.0,
+              "stdev": 0.0
+            }
+          }
+        },
+        "quality": {
+          "REC_RELEVANCE": {
+            "per_dialogue": {
+              "conv_001": 4.5
+            },
+            "summary_by_agent": {
+              "moviebot": {
+                "count": 1,
+                "min": 4.5,
+                "max": 4.5,
+                "mean": 4.5,
+                "stdev": 0.0
+              }
+            }
+          }
+        }
+      }
+    }
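To make the relationship between `per_dialogue` and `summary_by_agent` concrete, the sketch below aggregates per-dialogue scores into the five statistics listed above using only the Python standard library. It is an illustration under assumptions, not the library's code: the `summarize_by_agent` helper, the `agent_by_dialogue` mapping, and the `other_agent` name are hypothetical. The source also does not say whether `stdev` is a sample or population standard deviation; this sketch uses the sample form and falls back to `0.0` for a single score, which is consistent with the `stdev` of `0.0` shown for `count` 1 in the example output.

```python
import statistics
from collections import defaultdict
from typing import Dict, List


def summarize_by_agent(
    per_dialogue: Dict[str, float],
    agent_by_dialogue: Dict[str, str],
) -> Dict[str, Dict[str, float]]:
    """Groups per-dialogue scores by agent and computes summary statistics.

    `agent_by_dialogue` (conversation ID -> agent name) is a hypothetical
    input; how the library determines the agent is not shown here.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for conv_id, score in per_dialogue.items():
        grouped[agent_by_dialogue[conv_id]].append(score)

    summary: Dict[str, Dict[str, float]] = {}
    for agent, scores in grouped.items():
        summary[agent] = {
            "count": len(scores),
            "min": min(scores),
            "max": max(scores),
            "mean": statistics.mean(scores),
            # statistics.stdev needs at least two values, so a single
            # score is reported as 0.0 (assumed convention).
            "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary


per_dialogue = {"conv_001": 1.0, "conv_002": 0.0, "conv_003": 1.0}
agent_by_dialogue = {
    "conv_001": "moviebot",
    "conv_002": "moviebot",
    "conv_003": "other_agent",  # hypothetical second agent
}
summary = summarize_by_agent(per_dialogue, agent_by_dialogue)
```

The same grouping would apply once per aspect for `quality`, since each aspect carries its own `per_dialogue` mapping.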