views-platform/views-evaluation

VIEWS Evaluation 📊

Part of the VIEWS Platform ecosystem for large-scale conflict forecasting.


⚠️ ATTENTION: Migration Notice (v0.4.0+)

The evaluation ontology is now more explicit and task-specific: the library distinguishes regression from classification tasks, and point from sample predictions. If your pipeline broke after updating, update your configuration dictionary to use the new keys.

Key Changes:

  • targets is now regression_targets or classification_targets.
  • metrics is now regression_point_metrics.
  • Every key that contained uncertainty now uses sample instead (reflecting that we evaluate draws/samples from a distribution).

| Legacy Key | New Canonical Key |
|---|---|
| targets | regression_targets |
| metrics | regression_point_metrics |
| regression_uncertainty_metrics | regression_sample_metrics |
| classification_uncertainty_metrics | classification_sample_metrics |

Note: Legacy keys still work but will trigger a DeprecationWarning.
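If you would rather rename the keys once than rely on the deprecation path, the mapping in the migration table can be applied mechanically. The sketch below is a standalone illustration: `migrate_legacy_config` and `LEGACY_KEY_MAP` are hypothetical names, not part of the library, and the rule that an explicit canonical key wins over a legacy one is an assumption.

```python
import warnings

# Legacy -> canonical key mapping, copied from the migration table.
LEGACY_KEY_MAP = {
    "targets": "regression_targets",
    "metrics": "regression_point_metrics",
    "regression_uncertainty_metrics": "regression_sample_metrics",
    "classification_uncertainty_metrics": "classification_sample_metrics",
}


def migrate_legacy_config(config: dict) -> dict:
    """Return a copy of `config` with legacy keys renamed (hypothetical helper)."""
    migrated = {}
    for key, value in config.items():
        new_key = LEGACY_KEY_MAP.get(key, key)
        if new_key != key:
            warnings.warn(
                f"Config key '{key}' is deprecated; use '{new_key}'.",
                DeprecationWarning,
                stacklevel=2,
            )
            # Assumption: an explicit canonical key wins over a legacy key.
            if new_key in migrated:
                continue
        migrated[new_key] = value
    return migrated


old_config = {"steps": [1, 2, 3], "targets": ["ged_sb_best"], "metrics": ["MSE"]}
new_config = migrate_legacy_config(old_config)
```

Classification pipelines would still need a manual change, since the table maps targets to regression_targets only.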


📚 Table of Contents

  1. Overview
  2. Quick Start
  3. Role in the VIEWS Pipeline
  4. Features
  5. Installation
  6. Architecture
  7. Project Structure
  8. Contributing
  9. License
  10. Acknowledgements

🧠 Overview

The VIEWS Evaluation repository provides a standardized framework for assessing time-series forecasting models used in the VIEWS conflict prediction pipeline. It ensures consistent, robust, and interpretable evaluations through metrics tailored to conflict-related data, which often exhibit right-skewness and zero-inflation.

The library is built on a three-layer architecture with a framework-agnostic NumPy core, ensuring that all mathematical evaluation logic is independent of Pandas or any other data-frame library.


🚀 Quick Start

from views_evaluation import EvaluationFrame, NativeEvaluator
import numpy as np

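# 1. Construct EvaluationFrame with NumPy arrays
#    (y_true_array, y_pred_array, times, units, origins and steps are
#    placeholders for your own arrays, with one entry per prediction row)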
ef = EvaluationFrame(
    y_true=y_true_array,
    y_pred=y_pred_array,  # shape (N, S) where S >= 1
    identifiers={'time': times, 'unit': units, 'origin': origins, 'step': steps},
    metadata={'target': 'ged_sb_best'},
)

# 2. Configure and evaluate
config = {
    "steps": [1, 2, 3, 4, 5, 6],
    "regression_targets": ["ged_sb_best"],
    "regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
}
evaluator = NativeEvaluator(config)
report = evaluator.evaluate(ef)

# 3. Access results
report.to_dataframe("step")          # pd.DataFrame
report.to_dict()                     # nested dict
report.get_schema_results("month")   # typed metrics dataclass

For the full walkthrough including input formatting and sample evaluation, see documentation/integration_guide.md.
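The Quick Start leaves the input arrays undefined. As a rough sketch of what they might look like, the snippet below builds synthetic arrays with NumPy only. Several things here are assumptions rather than documented behavior: that y_true is one-dimensional with length N, that each row is one (origin, step, unit) combination, that the forecast time equals a base month plus origin plus step, and the sizes and `base_month` value themselves. Check the integration guide for the shapes and identifier semantics the library actually expects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only.
n_origins, n_steps, n_units, n_samples = 2, 6, 3, 50

# One row per (origin, step, unit) combination.
origins, steps, units = (
    grid.ravel()
    for grid in np.meshgrid(
        np.arange(n_origins),
        np.arange(1, n_steps + 1),
        np.arange(n_units),
        indexing="ij",
    )
)

# Assumption: forecast month = hypothetical base month + origin + step.
base_month = 500
times = base_month + origins + steps

N = times.size
# Zero-heavy count data as a stand-in for conflict fatalities.
y_true_array = rng.poisson(0.5, size=N).astype(float)               # shape (N,)
y_pred_array = rng.poisson(0.5, size=(N, n_samples)).astype(float)  # shape (N, S)
```

A point prediction is then the special case S = 1, that is, a y_pred array of shape (N, 1).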


🌍 Role in the VIEWS Pipeline

VIEWS Evaluation is the official evaluation component of the VIEWS ecosystem. It measures forecasting accuracy and model robustness in a consistent way across models.

Pipeline Integration:

  1. Model Predictions →
  2. EvaluationFrame (validated NumPy container) →
  3. NativeEvaluator (metrics computation) →
  4. EvaluationReport (structured results)

Integration with Other Repositories:


✨ Features

  • Comprehensive Evaluation Framework: The NativeEvaluator provides structured, stateless evaluation of time series predictions across a 2×2 matrix of regression/classification tasks and point/sample prediction types.
  • Multiple Evaluation Schemas:
    • Step-wise evaluation: groups and evaluates predictions by the respective steps from all models.
    • Time-series-wise evaluation: evaluates predictions for each time-series.
    • Month-wise evaluation: groups and evaluates predictions at a monthly level.
  • Support for Multiple Metrics (see table below for details)
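To make the idea of an evaluation schema concrete, here is a minimal sketch of step-wise grouping: rows are partitioned by their forecast step and one score is computed per group. It illustrates the concept only and is not how NativeEvaluator is implemented; `mse_by_step` is a hypothetical helper.

```python
import numpy as np


def mse_by_step(y_true, y_pred, steps):
    """Group rows by forecast step and compute one MSE per step."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    steps = np.asarray(steps)
    results = {}
    for step in np.unique(steps):
        mask = steps == step  # rows that belong to this step
        results[int(step)] = float(np.mean((y_pred[mask] - y_true[mask]) ** 2))
    return results


y_true = [0.0, 2.0, 0.0, 4.0]
y_pred = [1.0, 2.0, 0.0, 2.0]
steps = [1, 1, 2, 2]
print(mse_by_step(y_true, y_pred, steps))  # {1: 0.5, 2: 2.0}
```

Under the same reading, the month-wise and time-series-wise schemas would use a different grouping key in place of the step.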

Available Metrics

Metrics are organized by the 2×2 evaluation matrix: task (regression / classification) × prediction type (point / sample).

Regression Point Metrics

| Metric | Key | Description | Status |
|---|---|---|---|
| Mean Squared Error | MSE | Average of squared differences | ✅ |
| Mean Squared Log Error | MSLE | MSE computed on log-transformed values | ✅ |
| Root Mean Squared Log Error | RMSLE | Square root of MSLE | ✅ |
| Earth Mover's Distance | EMD | Wasserstein distance between distributions | ✅ |
| Pearson Correlation | Pearson | Linear correlation between predictions and actuals | ✅ |
| Mean Tweedie Deviance | MTD | Tweedie deviance (configurable power), ideal for zero-inflated data | ✅ |
| Mean Prediction | y_hat_bar | Average of all predicted values (diagnostic) | ✅ |
| Magnitude Calibration Ratio | MCR_point | Ratio of predicted to actual magnitude | ✅ |
| Sinkhorn Distance | SD | Regularized optimal transport distance | ❌ |
| pseudo-Earth Mover Divergence | pEMDiv | Efficient EMD approximation | ❌ |
| Variogram | Variogram | Spatial/temporal correlation structure score | ❌ |
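For readers who want to see what the first three rows compute, here is a small NumPy sketch of MSE, MSLE and RMSLE. It is a textbook formulation, not the library's code. In particular, the log(1 + x) transform is an assumption (a common convention that keeps zeros finite); the library's exact transform is not stated here.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean of squared differences."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(diff ** 2))


def msle(y_true, y_pred):
    """MSE on log-transformed values (assumption: log1p, so zeros stay finite)."""
    log_true = np.log1p(np.asarray(y_true, dtype=float))
    log_pred = np.log1p(np.asarray(y_pred, dtype=float))
    return mse(log_true, log_pred)


def rmsle(y_true, y_pred):
    """Square root of MSLE."""
    return float(np.sqrt(msle(y_true, y_pred)))


# The same absolute miss of 10, once at zero and once at 1000.
y_true = np.array([0.0, 1000.0])
y_pred = y_true + 10.0
print(mse(y_true, y_pred))    # 100.0
print(rmsle(y_true, y_pred))  # dominated by the miss at zero
```

The log-based scores weight relative rather than absolute error, which is why they are often preferred for right-skewed, zero-inflated targets.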

Regression Sample Metrics

| Metric | Key | Description | Status |
|---|---|---|---|
| Continuous Ranked Probability Score | CRPS | Calibration and sharpness of probabilistic forecasts | ✅ |
| Threshold-Weighted CRPS | twCRPS | CRPS emphasizing values above a threshold | ✅ |
| Mean Interval Score | MIS | Prediction interval width and coverage | ✅ |
| Quantile Interval Score | QIS | Interval score at specified quantiles | ✅ |
| Coverage | Coverage | Proportion of actuals within prediction intervals | ✅ |
| Ignorance Score | Ignorance | Logarithmic scoring rule for probabilistic predictions | ✅ |
| Mean Prediction | y_hat_bar | Average of all predicted values (diagnostic) | ✅ |
| Magnitude Calibration Ratio | MCR_sample | Ratio of predicted to actual magnitude | ✅ |
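Sample metrics score a set of S draws per row instead of a single value. The sketch below uses the standard sample-based CRPS estimator, E|X - y| - 0.5 * E|X - X'|, and a simple central-interval coverage. Both are generic formulations under assumed conventions (the `alpha` parameter and the use of a central interval are assumptions), not the library's implementation.

```python
import numpy as np


def crps_from_samples(y_true, samples):
    """Sample-based CRPS, E|X - y| - 0.5 * E|X - X'|, averaged over rows.

    y_true has shape (N,), samples has shape (N, S).
    """
    y = np.asarray(y_true, dtype=float)[:, None]   # (N, 1)
    x = np.asarray(samples, dtype=float)           # (N, S)
    term_obs = np.mean(np.abs(x - y), axis=1)      # E|X - y| per row
    # All pairwise sample differences per row; memory grows as N * S^2.
    term_spread = np.mean(np.abs(x[:, :, None] - x[:, None, :]), axis=(1, 2))
    return float(np.mean(term_obs - 0.5 * term_spread))


def coverage(y_true, samples, alpha=0.1):
    """Share of actuals inside the central (1 - alpha) sample interval."""
    y = np.asarray(y_true, dtype=float)
    x = np.asarray(samples, dtype=float)
    lower = np.quantile(x, alpha / 2, axis=1)
    upper = np.quantile(x, 1 - alpha / 2, axis=1)
    return float(np.mean((y >= lower) & (y <= upper)))


print(crps_from_samples([1.0], [[0.0, 2.0]]))  # 0.5
print(crps_from_samples([1.0], [[3.0]]))       # 2.0, the absolute error
```

With a single sample the spread term vanishes and the estimator reduces to the absolute error, which is one way to see CRPS as a generalization of it.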

Classification Point Metrics

| Metric | Key | Description | Status |
|---|---|---|---|
| Average Precision | AP | Area under precision-recall curve | ✅ |
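As a rough illustration of Average Precision, the sketch below ranks rows by predicted score and averages precision@k over the ranks that hold a positive. This is the usual step-wise summary of the precision-recall curve. It handles tied scores only by sort order, which is a simplification, and it is not taken from the library.

```python
import numpy as np


def average_precision(y_true, scores):
    """AP as the mean of precision@k over the ranks k that hold a positive.

    Ties in `scores` are broken by sort order, which is a simplification.
    """
    y = np.asarray(y_true, dtype=int)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    y_sorted = y[order]
    if y_sorted.sum() == 0:
        raise ValueError("average precision is undefined without positives")
    hits = np.cumsum(y_sorted)                        # positives found so far
    precision_at_k = hits / np.arange(1, y.size + 1)  # precision at each rank
    return float(np.mean(precision_at_k[y_sorted == 1]))


# Positives sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Because it depends only on the ranking of positives, AP tends to be more informative than accuracy when positives are rare, as they are for conflict events.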

Classification Sample Metrics

| Metric | Key | Description | Status |
|---|---|---|---|
| Continuous Ranked Probability Score | CRPS | Calibration and sharpness | ✅ |
| Threshold-Weighted CRPS | twCRPS | CRPS emphasizing values above a threshold | ✅ |
| Brier Score | Brier | Accuracy of probabilistic binary predictions | ❌ |
| Jeffreys Divergence | Jeffreys | Symmetric measure of distribution difference | ❌ |

Note: Metrics marked ❌ are defined in the catalog but not yet implemented; requesting them raises a clear ValueError.
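The behavior described in the note can be pictured as a lookup that refuses both unknown and unimplemented metric keys. The sketch below is a hypothetical stand-in: `CATALOG_STATUS` and `check_requested_metrics` are invented names, and only the statuses themselves are taken from the tables above.

```python
# Hypothetical excerpt: metric key -> whether an implementation exists.
CATALOG_STATUS = {
    "MSE": True,
    "RMSLE": True,
    "CRPS": True,
    "SD": False,      # Sinkhorn Distance, listed but not implemented
    "pEMDiv": False,  # pseudo-Earth Mover Divergence, listed but not implemented
}


def check_requested_metrics(requested):
    """Fail loudly on the first metric that cannot be computed."""
    for key in requested:
        if key not in CATALOG_STATUS:
            raise ValueError(f"Unknown metric '{key}'.")
        if not CATALOG_STATUS[key]:
            raise ValueError(f"Metric '{key}' is in the catalog but not implemented yet.")
    return list(requested)


check_requested_metrics(["MSE", "CRPS"])  # passes silently
```

Failing at configuration time, before any scores are computed, is in line with the fail-loud principle the README describes.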


Configuration Schema

The NativeEvaluator accepts a configuration dictionary (EvaluationConfig TypedDict) with the following keys:

| Key | Type | Description |
|---|---|---|
| steps | List[int] | List of forecast steps to evaluate (e.g., [1, 3, 6, 12]). |
| regression_targets | List[str] | List of continuous targets (e.g., ['ged_sb_best']). |
| regression_point_metrics | List[str] | Metrics to compute for regression point predictions. |
| regression_sample_metrics | List[str] | Metrics to compute for regression sample predictions (e.g., ['CRPS']). |
| classification_targets | List[str] | List of binary targets (e.g., ['by_sb_best']). |
| classification_point_metrics | List[str] | Metrics to compute for classification probability scores. |
| classification_sample_metrics | List[str] | Metrics to compute for classification sample predictions. |
| evaluation_profile | str | Named hyperparameter profile (default: "base"). See views_evaluation/profiles/. |
| metric_hyperparameters | Dict[str, Dict] | Per-metric overrides that take precedence over the profile. |
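A TypedDict with these keys might look roughly like the sketch below. It mirrors the table only: `EvaluationConfigSketch` is a hypothetical name, the real EvaluationConfig in config_schema.py may differ, and treating every key as optional (`total=False`) is an assumption. The `unknown_keys` helper is likewise illustrative.

```python
from typing import Dict, List, TypedDict


class EvaluationConfigSketch(TypedDict, total=False):
    """Illustrative mirror of the documented keys, not the library's class."""

    steps: List[int]
    regression_targets: List[str]
    regression_point_metrics: List[str]
    regression_sample_metrics: List[str]
    classification_targets: List[str]
    classification_point_metrics: List[str]
    classification_sample_metrics: List[str]
    evaluation_profile: str
    metric_hyperparameters: Dict[str, Dict]


def unknown_keys(config: dict) -> list:
    """Return keys that are not part of the documented schema."""
    return sorted(set(config) - set(EvaluationConfigSketch.__annotations__))


# A legacy key such as "targets" shows up as unknown to the new schema.
print(unknown_keys({"steps": [1, 3], "targets": ["ged_sb_best"]}))  # ['targets']
```

A TypedDict adds no runtime validation by itself; its value is that a static type checker can flag misspelled keys before the evaluator ever runs.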

Example Configuration:

config = {
    "steps": [1, 3, 6, 12],
    "regression_targets": ["ged_sb_best"],
    "regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
    "regression_sample_metrics": ["CRPS", "twCRPS", "MIS", "Coverage"],
    "evaluation_profile": "base",  # or "hydranet_ucdp"
    "metric_hyperparameters": {
        "twCRPS": {"threshold": 10.0},  # override profile default
    },
}

  • Data Integrity Checks: Validates input arrays for shape consistency, NaN/infinity, and required identifiers.
  • Framework-Agnostic Core: All evaluation operates on pure NumPy arrays via EvaluationFrame.
  • Metric Catalog & Profiles: Hyperparameters are managed through named evaluation profiles with a Chain of Responsibility resolver (model overrides → profile → fail loud).
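The resolution order in the last bullet (model overrides, then profile, then fail loud) can be sketched in a few lines. Everything named here is hypothetical: `GENOME`, `BASE_PROFILE` and `resolve_hyperparameters` are illustrative names, and the default values are invented for the example. Only the parameter names threshold (twCRPS) and power (MTD) come from this README.

```python
# Hypothetical genome: metric key -> hyperparameter names it requires.
GENOME = {"twCRPS": ["threshold"], "MTD": ["power"], "MSE": []}

# Hypothetical profile defaults (values invented for illustration).
BASE_PROFILE = {"twCRPS": {"threshold": 1.0}, "MTD": {"power": 1.5}}


def resolve_hyperparameters(metric, overrides, profile):
    """Resolve each required hyperparameter: overrides -> profile -> fail loud."""
    resolved = {}
    for name in GENOME[metric]:
        if name in overrides.get(metric, {}):
            resolved[name] = overrides[metric][name]  # 1. explicit override
        elif name in profile.get(metric, {}):
            resolved[name] = profile[metric][name]    # 2. profile default
        else:
            # 3. no silent fallback: a missing value is an error
            raise ValueError(f"No value for '{name}' required by metric '{metric}'.")
    return resolved


overrides = {"twCRPS": {"threshold": 10.0}}
print(resolve_hyperparameters("twCRPS", overrides, BASE_PROFILE))  # {'threshold': 10.0}
print(resolve_hyperparameters("MTD", overrides, BASE_PROFILE))     # {'power': 1.5}
```

The last branch is the important design choice: a metric never runs with a guessed hyperparameter, so two runs with the same config and profile are scored the same way.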

⚙️ Installation

Prerequisites

  • Python >= 3.11

From PyPI

pip install views_evaluation

🏗 Architecture

The library follows a strict three-layer architecture (ADR-011):

Level 0: Pure Core (NumPy + SciPy only, zero framework imports)
  EvaluationFrame       Canonical data container (y_true, y_pred, identifiers)
  NativeEvaluator       Stateless evaluation engine (month/sequence/step schemas)
  MetricCatalog         Genome registry mapping metrics → functions + required params
  Profiles              Named hyperparameter sets (base, hydranet_ucdp, ...)

Level 1: Bridge / Adapter
  EvaluationFrame       Validated NumPy data container
  EvaluationReport      Results container with DataFrame/dict export

Level 2: Legacy Orchestrator
  MetricCatalog         Genome registry and parameter resolver

Key design decisions:

  • ADR-011: No Pandas/Polars imports in Level 0; the math is framework-agnostic.
  • ADR-013: Fail-loud; all structural failures raise exceptions with actionable messages and never silently degrade.
  • ADR-042: Metric catalog; each metric declares its required hyperparameters ("genome"), and values are resolved via Chain of Responsibility.

🗂 Project Structure

views-evaluation/
├── views_evaluation/
│   ├── __init__.py                        # Public API exports
│   ├── adapters/
│   │   └── __init__.py                    # Reserved for future framework bridges
│   ├── evaluation/
│   │   ├── config_schema.py               # EvaluationConfig TypedDict
│   │   ├── evaluation_frame.py            # Core data container
│   │   ├── evaluation_manager.py          # Legacy orchestrator (deprecated)
│   │   ├── evaluation_report.py           # Results container
│   │   ├── metric_catalog.py              # ADR-042 registry + resolver
│   │   ├── metrics.py                     # Typed metric dataclasses
│   │   ├── native_evaluator.py            # Core evaluation engine
│   │   └── native_metric_calculators.py   # Metric implementations
│   └── profiles/
│       ├── base.py                        # Standard hyperparameter defaults
│       └── hydranet_ucdp.py               # Domain-specific profile
├── tests/                                 # 242 tests (Green/Beige/Red)
├── documentation/
│   ├── ADRs/                              # 17 Architecture Decision Records
│   ├── CICs/                              # Class Intent Contracts
│   ├── integration_guide.md               # Full API walkthrough
│   └── evaluation_concepts.md             # Domain concepts
├── pyproject.toml
└── README.md

🤝 Contributing

We welcome contributions! Please follow the VIEWS Contribution Guidelines.


📜 License

This project is licensed under the terms of the LICENSE file.


💬 Acknowledgements

Special thanks to the VIEWS MD&D Team for their collaboration and support.
