Part of the VIEWS Platform ecosystem for large-scale conflict forecasting.
The evaluation ontology has been updated to be more explicit and task-specific. If your pipeline broke after updating, please update your configuration dictionary. The library now distinguishes between regression vs classification tasks, and point vs sample predictions.
Key Changes:
- `targets` is now `regression_targets` or `classification_targets`.
- `metrics` is now `regression_point_metrics`.
- All `uncertainty` keys have been renamed to `sample` (reflecting that we evaluate draws/samples from a distribution).
| Legacy Key | New Canonical Key |
|---|---|
| `targets` | `regression_targets` |
| `metrics` | `regression_point_metrics` |
| `regression_uncertainty_metrics` | `regression_sample_metrics` |
| `classification_uncertainty_metrics` | `classification_sample_metrics` |
Note: Legacy keys still work but will trigger a DeprecationWarning.
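The renames in the table above can be applied to an existing configuration dictionary mechanically. The following is a minimal sketch, not part of the library: `migrate_config` and `LEGACY_TO_CANONICAL` are hypothetical names, and the sketch assumes that a legacy `targets` entry holds regression targets (classification targets would need to be moved to `classification_targets` by hand).

```python
import warnings

# Legacy -> canonical key mapping, copied from the table above.
LEGACY_TO_CANONICAL = {
    "targets": "regression_targets",
    "metrics": "regression_point_metrics",
    "regression_uncertainty_metrics": "regression_sample_metrics",
    "classification_uncertainty_metrics": "classification_sample_metrics",
}


def migrate_config(config: dict) -> dict:
    """Return a copy of `config` with legacy keys renamed (hypothetical helper)."""
    migrated = {}
    for key, value in config.items():
        new_key = LEGACY_TO_CANONICAL.get(key, key)
        if new_key != key:
            warnings.warn(
                f"Config key '{key}' is deprecated; use '{new_key}' instead.",
                DeprecationWarning,
                stacklevel=2,
            )
        if new_key in migrated:
            # Refuse to guess when both spellings of the same key are present.
            raise ValueError(f"Both legacy and canonical forms of '{new_key}' were given.")
        migrated[new_key] = value
    return migrated


legacy = {"steps": [1, 2, 3], "targets": ["ged_sb_best"], "metrics": ["MSE"]}
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    updated = migrate_config(legacy)
print(updated)
```

Running a helper like this once and committing the resulting dictionary avoids relying on the deprecated keys at all.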
- Overview
- Quick Start
- Role in the VIEWS Pipeline
- Features
- Installation
- Architecture
- Project Structure
- Contributing
- License
- Acknowledgements
The VIEWS Evaluation repository provides a standardized framework for assessing time-series forecasting models used in the VIEWS conflict prediction pipeline. It ensures consistent, robust, and interpretable evaluations through metrics tailored to conflict-related data, which often exhibit right-skewness and zero-inflation.
The library is built on a three-layer architecture with a framework-agnostic NumPy core, ensuring that all mathematical evaluation logic is independent of Pandas or any other data-frame library.
from views_evaluation import EvaluationFrame, NativeEvaluator
import numpy as np
# 1. Construct EvaluationFrame with NumPy arrays
ef = EvaluationFrame(
y_true=y_true_array,
y_pred=y_pred_array, # shape (N, S) where S >= 1
identifiers={'time': times, 'unit': units, 'origin': origins, 'step': steps},
metadata={'target': 'ged_sb_best'},
)
# 2. Configure and evaluate
config = {
"steps": [1, 2, 3, 4, 5, 6],
"regression_targets": ["ged_sb_best"],
"regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
}
evaluator = NativeEvaluator(config)
report = evaluator.evaluate(ef)
# 3. Access results
report.to_dataframe("step") # pd.DataFrame
report.to_dict() # nested dict
report.get_schema_results("month") # typed metrics dataclass
For the full walkthrough, including input formatting and sample evaluation, see `documentation/integration_guide.md`.
VIEWS Evaluation is the official evaluation component of the VIEWS ecosystem: it measures the forecasting accuracy and robustness of models before their results are used downstream.
- Model Predictions →
- EvaluationFrame (validated NumPy container) →
- NativeEvaluator (metrics computation) →
- EvaluationReport (structured results)
- views-pipeline-core: Supplies preprocessed data for evaluation.
- views-models: Provides trained models to be assessed.
- views-stepshifter: Evaluates time-shifted forecasting models.
- views-hydranet: Supports spatiotemporal deep learning model evaluations.
- Comprehensive Evaluation Framework: The `NativeEvaluator` provides structured, stateless evaluation of time series predictions across a 2×2 matrix of regression/classification tasks and point/sample prediction types.
- Multiple Evaluation Schemas:
  - Step-wise evaluation: groups and evaluates predictions by the respective steps from all models.
  - Time-series-wise evaluation: evaluates predictions for each time-series.
  - Month-wise evaluation: groups and evaluates predictions at a monthly level.
- Support for Multiple Metrics (see the tables below for details)
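The three schemas differ only in how predictions are grouped before a metric is computed. The sketch below illustrates that idea in plain Python and is not the library's implementation: `evaluate_by` is a hypothetical helper, and the mapping of step-wise, time-series-wise, and month-wise evaluation to the `step`, `origin`, and `time` identifiers is an assumption based on the identifier names used in the Quick Start.

```python
from collections import defaultdict


def mse(y_true, y_pred):
    """Mean squared error over two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate_by(records, key):
    """Group records by one identifier and compute MSE per group (hypothetical helper)."""
    groups = defaultdict(lambda: ([], []))
    for rec in records:
        truths, preds = groups[rec[key]]
        truths.append(rec["y_true"])
        preds.append(rec["y_pred"])
    return {group: mse(t, p) for group, (t, p) in groups.items()}


# Two forecast sequences (origins 0 and 1), each covering two steps.
records = [
    {"time": 500, "origin": 0, "step": 1, "y_true": 0.0, "y_pred": 1.0},
    {"time": 501, "origin": 0, "step": 2, "y_true": 4.0, "y_pred": 2.0},
    {"time": 501, "origin": 1, "step": 1, "y_true": 4.0, "y_pred": 4.0},
    {"time": 502, "origin": 1, "step": 2, "y_true": 2.0, "y_pred": 5.0},
]

step_wise = evaluate_by(records, "step")      # {1: 0.5, 2: 6.5}
series_wise = evaluate_by(records, "origin")  # {0: 2.5, 1: 4.5}
month_wise = evaluate_by(records, "time")     # {500: 1.0, 501: 2.0, 502: 9.0}
print(step_wise, series_wise, month_wise)
```

The same four predictions yield different summaries under each grouping, which is why reports are requested per schema.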
Metrics are organized by the 2×2 evaluation matrix: task (regression / classification) × prediction type (point / sample).
Regression point metrics:

| Metric | Key | Description |
|---|---|---|
| Mean Squared Error | `MSE` | Average of squared differences |
| Mean Squared Log Error | `MSLE` | MSE computed on log-transformed values |
| Root Mean Squared Log Error | `RMSLE` | Square root of MSLE |
| Earth Mover's Distance | `EMD` | Wasserstein distance between distributions |
| Pearson Correlation | `Pearson` | Linear correlation between predictions and actuals |
| Mean Tweedie Deviance | `MTD` | Tweedie deviance (configurable power), ideal for zero-inflated data |
| Mean Prediction | `y_hat_bar` | Average of all predicted values (diagnostic) |
| Magnitude Calibration Ratio | `MCR_point` | Ratio of predicted to actual magnitude |
| Sinkhorn Distance | `SD` | Regularized optimal transport distance |
| pseudo-Earth Mover Divergence | `pEMDiv` | Efficient EMD approximation |
| Variogram | `Variogram` | Spatial/temporal correlation structure score |

Regression sample metrics:

| Metric | Key | Description |
|---|---|---|
| Continuous Ranked Probability Score | `CRPS` | Calibration and sharpness of probabilistic forecasts |
| Threshold-Weighted CRPS | `twCRPS` | CRPS emphasizing values above a threshold |
| Mean Interval Score | `MIS` | Prediction interval width and coverage |
| Quantile Interval Score | `QIS` | Interval score at specified quantiles |
| Coverage | `Coverage` | Proportion of actuals within prediction intervals |
| Ignorance Score | `Ignorance` | Logarithmic scoring rule for probabilistic predictions |
| Mean Prediction | `y_hat_bar` | Average of all predicted values (diagnostic) |
| Magnitude Calibration Ratio | `MCR_sample` | Ratio of predicted to actual magnitude |

Classification point metrics:

| Metric | Key | Description |
|---|---|---|
| Average Precision | `AP` | Area under precision-recall curve |

Classification sample metrics:

| Metric | Key | Description |
|---|---|---|
| Continuous Ranked Probability Score | `CRPS` | Calibration and sharpness |
| Threshold-Weighted CRPS | `twCRPS` | CRPS emphasizing values above a threshold |
| Brier Score | `Brier` | Accuracy of probabilistic binary predictions |
| Jeffreys Divergence | `Jeffreys` | Symmetric measure of distribution difference |

Note: Some metrics in these tables are defined in the catalog but not yet implemented; requesting one of them raises a clear `ValueError`.
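To make the difference between point and sample metrics concrete, the sketch below computes one of each in plain Python. It is an illustration only, not the library's implementation: the function names are hypothetical, `rmsle` assumes the common `log(1 + x)` transform so that zero counts are allowed, and `crps_from_samples` uses the standard sample-based estimate E|X - y| - 0.5 * E|X - X'| for a single observation.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) so zeros are allowed."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    msle = sum(squared_log_errors) / len(squared_log_errors)
    return math.sqrt(msle)


def crps_from_samples(samples, y_true):
    """Sample-based CRPS estimate for one observation."""
    s = len(samples)
    # Average distance between the draws and the observed value.
    error_term = sum(abs(x - y_true) for x in samples) / s
    # Half the average distance between all pairs of draws (rewards sharpness).
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (2 * s * s)
    return error_term - spread_term


print(rmsle([0.0, 3.0], [0.0, 3.0]))        # 0.0 (perfect point forecast)
print(crps_from_samples([3.0], 5.0))        # 2.0 (one draw: reduces to absolute error)
print(crps_from_samples([1.0, 3.0], 2.0))   # 0.5
```

With a single draw the CRPS estimate collapses to the absolute error, which is one way to see that a point prediction is the S = 1 special case of a sample prediction.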
The NativeEvaluator accepts a configuration dictionary (EvaluationConfig TypedDict) with the following keys:
| Key | Type | Description |
|---|---|---|
| `steps` | `List[int]` | List of forecast steps to evaluate (e.g., `[1, 3, 6, 12]`). |
| `regression_targets` | `List[str]` | List of continuous targets (e.g., `['ged_sb_best']`). |
| `regression_point_metrics` | `List[str]` | Metrics to compute for regression point predictions. |
| `regression_sample_metrics` | `List[str]` | Metrics to compute for regression sample predictions (e.g., `['CRPS']`). |
| `classification_targets` | `List[str]` | List of binary targets (e.g., `['by_sb_best']`). |
| `classification_point_metrics` | `List[str]` | Metrics to compute for classification probability scores. |
| `classification_sample_metrics` | `List[str]` | Metrics to compute for classification sample predictions. |
| `evaluation_profile` | `str` | Named hyperparameter profile (default: `"base"`). See `views_evaluation/profiles/`. |
| `metric_hyperparameters` | `Dict[str, Dict]` | Per-metric overrides that take precedence over the profile. |
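The key table above can be mirrored as a `TypedDict` to catch misspelled or legacy keys before a configuration reaches the evaluator. This is a sketch under stated assumptions rather than the library's own definition: `EvaluationConfigSketch` and `unknown_keys` are hypothetical names, and `total=False` (all keys optional) is an assumption, since the table does not say which keys are required.

```python
from typing import Dict, List, TypedDict


class EvaluationConfigSketch(TypedDict, total=False):
    """Illustrative mirror of the configuration keys listed in the table above."""

    steps: List[int]
    regression_targets: List[str]
    regression_point_metrics: List[str]
    regression_sample_metrics: List[str]
    classification_targets: List[str]
    classification_point_metrics: List[str]
    classification_sample_metrics: List[str]
    evaluation_profile: str
    metric_hyperparameters: Dict[str, Dict]


def unknown_keys(config: dict) -> list:
    """Return config keys that are not part of the sketch schema (hypothetical helper)."""
    return sorted(set(config) - set(EvaluationConfigSketch.__annotations__))


# A legacy key such as "targets" is reported as unknown.
print(unknown_keys({"steps": [1, 3], "targets": ["ged_sb_best"]}))  # ['targets']
```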
config = {
"steps": [1, 3, 6, 12],
"regression_targets": ["ged_sb_best"],
"regression_point_metrics": ["MSE", "RMSLE", "Pearson"],
"regression_sample_metrics": ["CRPS", "twCRPS", "MIS", "Coverage"],
"evaluation_profile": "base", # or "hydranet_ucdp"
"metric_hyperparameters": {
"twCRPS": {"threshold": 10.0}, # override profile default
},
}
- Data Integrity Checks: Validates input arrays for shape consistency, NaN/infinity, and required identifiers.
- Framework-Agnostic Core: All evaluation operates on pure NumPy arrays via `EvaluationFrame`.
- Metric Catalog & Profiles: Hyperparameters are managed through named evaluation profiles with a Chain of Responsibility resolver (model overrides → profile → fail loud).
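The data integrity checks listed above can be pictured as a small validation routine run before any metric is computed. The sketch below uses plain Python lists to stay dependency-free, whereas the library itself operates on NumPy arrays; `validate_inputs` is a hypothetical function, and the set of required identifiers is assumed from the keys shown in the Quick Start.

```python
import math

# Assumed from the Quick Start example; the library's actual requirements may differ.
REQUIRED_IDENTIFIERS = ("time", "unit", "origin", "step")


def validate_inputs(y_true, y_pred, identifiers):
    """Check shape consistency, finiteness, and identifiers; return (N, S)."""
    n = len(y_true)
    if n == 0:
        raise ValueError("y_true is empty.")
    if len(y_pred) != n:
        raise ValueError(f"y_pred has {len(y_pred)} rows but y_true has {n}.")
    widths = {len(row) for row in y_pred}
    if len(widths) != 1 or 0 in widths:
        raise ValueError("Every y_pred row must hold the same number of samples (S >= 1).")
    for row in [y_true, *y_pred]:
        if any(not math.isfinite(v) for v in row):
            raise ValueError("Inputs contain NaN or infinite values.")
    for name in REQUIRED_IDENTIFIERS:
        if name not in identifiers:
            raise ValueError(f"Missing required identifier '{name}'.")
        if len(identifiers[name]) != n:
            raise ValueError(f"Identifier '{name}' must have length {n}.")
    return n, widths.pop()


ids = {"time": [500, 501], "unit": [1, 1], "origin": [0, 0], "step": [1, 2]}
print(validate_inputs([0.0, 2.0], [[0.1, 0.0, 0.3], [1.5, 2.5, 2.0]], ids))  # (2, 3)
```

Each failure raises immediately with a message naming the problem, in line with the fail-loud behavior the library describes.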
- Python >= 3.11
pip install views_evaluation
The library follows a strict three-layer architecture (ADR-011):
Level 0: Pure Core (NumPy + SciPy only, zero framework imports)
  EvaluationFrame    Canonical data container (y_true, y_pred, identifiers)
  NativeEvaluator    Stateless evaluation engine (month/sequence/step schemas)
  MetricCatalog      Genome registry mapping metrics → functions + required params
  Profiles           Named hyperparameter sets (base, hydranet_ucdp, ...)
Level 1: Bridge / Adapter
  EvaluationFrame    Validated NumPy data container
  EvaluationReport   Results container with DataFrame/dict export
Level 2: Legacy Orchestrator
  MetricCatalog      Genome registry and parameter resolver
Key design decisions:
- ADR-011: No Pandas/Polars imports in Level 0; the math is framework-agnostic.
- ADR-013: Fail-loud. All structural failures raise exceptions with actionable messages and never silently degrade.
- ADR-042: Metric catalog. Each metric declares its required hyperparameters ("genome"); values are resolved via Chain of Responsibility.
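The resolution order described for the metric catalog (per-metric overrides first, then the named profile, otherwise a loud failure) can be sketched as follows. Everything here is illustrative: `resolve_hyperparameters`, the `REQUIRED_PARAMS` registry, and the profile values are hypothetical, and raising `ValueError` for an unresolved hyperparameter is an assumption about the exception type.

```python
# Hypothetical "genome": each metric declares the hyperparameters it needs.
REQUIRED_PARAMS = {
    "MSE": (),
    "twCRPS": ("threshold",),
}

# Hypothetical profile contents; real profiles live in views_evaluation/profiles/.
BASE_PROFILE = {"twCRPS": {"threshold": 1.0}}


def resolve_hyperparameters(metric, overrides, profile):
    """Resolve each required hyperparameter: overrides -> profile -> fail loud."""
    if metric not in REQUIRED_PARAMS:
        raise ValueError(f"Unknown metric '{metric}'.")
    resolved = {}
    for name in REQUIRED_PARAMS[metric]:
        # Walk the chain in priority order and stop at the first source that answers.
        for source in (overrides.get(metric, {}), profile.get(metric, {})):
            if name in source:
                resolved[name] = source[name]
                break
        else:
            raise ValueError(
                f"Metric '{metric}' requires '{name}', but neither the overrides "
                "nor the profile provide it."
            )
    return resolved


print(resolve_hyperparameters("twCRPS", {"twCRPS": {"threshold": 10.0}}, BASE_PROFILE))
# {'threshold': 10.0}  (the override wins)
print(resolve_hyperparameters("twCRPS", {}, BASE_PROFILE))
# {'threshold': 1.0}   (falls back to the profile)
```

Failing when no source supplies a value, rather than substituting a silent default, keeps evaluation results reproducible from the configuration alone.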
views-evaluation/
├── views_evaluation/
│   ├── __init__.py  # Public API exports
│   ├── adapters/
│   │   └── __init__.py  # Reserved for future framework bridges
│   ├── evaluation/
│   │   ├── config_schema.py  # EvaluationConfig TypedDict
│   │   ├── evaluation_frame.py  # Core data container
│   │   ├── evaluation_manager.py  # Legacy orchestrator (deprecated)
│   │   ├── evaluation_report.py  # Results container
│   │   ├── metric_catalog.py  # ADR-042 registry + resolver
│   │   ├── metrics.py  # Typed metric dataclasses
│   │   ├── native_evaluator.py  # Core evaluation engine
│   │   └── native_metric_calculators.py  # Metric implementations
│   └── profiles/
│       ├── base.py  # Standard hyperparameter defaults
│       └── hydranet_ucdp.py  # Domain-specific profile
├── tests/  # 242 tests (Green/Beige/Red)
├── documentation/
│   ├── ADRs/  # 17 Architecture Decision Records
│   ├── CICs/  # Class Intent Contracts
│   ├── integration_guide.md  # Full API walkthrough
│   └── evaluation_concepts.md  # Domain concepts
├── pyproject.toml
└── README.md
We welcome contributions! Please follow the VIEWS Contribution Guidelines.
This project is licensed under the LICENSE file.
Special thanks to the VIEWS MD&D Team for their collaboration and support.

