pz-challenge-v2

Scientist: denario-6 Date: 2026-06-03

DESC PZ Data Challenge — Data Description

Challenge Goal

We are participating in the LSST DESC Photometric Redshift (PZ) Data Challenge (https://pz-data-challenge.readthedocs.io/en/latest/); the deadline is July 17, 2026. The challenge asks participants to develop algorithms that estimate photometric redshift posteriors p(z), i.e. full probability distributions over redshift, for simulated LSST-like galaxies.

This is a competition. Submissions are evaluated on multiple metrics and ranked on a public leaderboard. We want to win.

What we must produce

For each of 8 test files (see below), we must produce a qp ensemble — a per-object probability distribution p(z) over redshift — in HDF5 format. The ensemble must include ancillary data with:

zmode: point estimate (mode of p(z)) for each object
object_id: matching the object IDs in the test file

We must also implement two Python functions per task set:

run_taskset_N_estimation_only(model_file, test_file, output_file): run inference using a pre-trained model
run_taskset_N_training_and_estimation(train_file, test_file, output_file): train a model from scratch and run inference

Evaluation metrics (lower is better unless noted)

Point estimate metrics (computed from zmode vs. true redshift):

Bias: median of Δ = (z_phot - z_true) / (1 + z_true) → want ~0
SigmaMAD: 1.4826 × median|Δ - median(Δ)| → want as small as possible
OutlierRate: fraction of objects with |Δ| > max(0.06, 3×σ_IQR), where σ_IQR is the interquartile range of Δ divided by 1.349 → want as small as possible

p(z) distribution metrics (measure calibration of the full posterior):

CDE Loss: Conditional Density Estimate loss (Izbicki & Lee 2017) → want as small (most negative) as possible. FlexZBoost is specifically designed to minimize this.
PIT-KS: Kolmogorov-Smirnov statistic of the PIT distribution vs. uniform
PIT-RMSE: RMSE of PIT histogram vs. uniform
PIT-KL: KL divergence of PIT distribution vs. uniform (all PIT metrics → want ~0, indicating well-calibrated posteriors)
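
For orientation, the distribution metrics can be approximated directly from gridded p(z) values. The sketch below is a NumPy-only illustration of the CDE loss (up to an estimator-independent constant), the PIT values, and the PIT-KS statistic. It assumes each p(z) is tabulated on a common grid and uses linear interpolation. The function names are illustrative and the organizers' implementation may differ in detail, so use pz_data_challenge.metrics for reported numbers.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoid-rule integral of y over x along the last axis."""
    return np.sum(0.5 * (y[..., 1:] + y[..., :-1]) * np.diff(x), axis=-1)

def cde_loss(pz, zgrid, z_true):
    """Empirical CDE loss: mean of integral p^2 dz minus 2 * mean of p(z_true)."""
    squared_term = np.mean(trapezoid(pz ** 2, zgrid))
    p_at_truth = np.array([np.interp(zt, zgrid, row) for zt, row in zip(z_true, pz)])
    return squared_term - 2.0 * np.mean(p_at_truth)

def pit_values(pz, zgrid, z_true):
    """PIT = CDF of each p(z) evaluated at the object's true redshift."""
    steps = 0.5 * (pz[:, 1:] + pz[:, :-1]) * np.diff(zgrid)
    cdf = np.concatenate([np.zeros((len(pz), 1)), np.cumsum(steps, axis=1)], axis=1)
    cdf /= cdf[:, -1:]  # renormalise so each CDF ends at exactly 1
    return np.array([np.interp(zt, zgrid, row) for zt, row in zip(z_true, cdf)])

def pit_ks(pit):
    """KS distance between the empirical PIT distribution and U(0, 1)."""
    pit = np.sort(pit)
    n = len(pit)
    upper = np.arange(1, n + 1) / n - pit
    lower = pit - np.arange(0, n) / n
    return float(max(upper.max(), lower.max()))

# Synthetic check: Gaussian posteriors whose truths are drawn from those posteriors
rng = np.random.default_rng(0)
zgrid = np.linspace(0.0, 3.0, 301)
mu = rng.uniform(0.5, 2.5, size=2000)
sigma = 0.05
pz = np.exp(-0.5 * ((zgrid[None, :] - mu[:, None]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
z_true = rng.normal(mu, sigma)
loss = cde_loss(pz, zgrid, z_true)
ks = pit_ks(pit_values(pz, zgrid, z_true))
print(loss, ks)
```

In this synthetic case the posteriors are calibrated by construction, so the PIT-KS value should be close to zero and the CDE loss negative; shifting the posteriors away from the truths worsens both.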

Computational metrics (assessed but not ranked):

Training time, model size, inference time per object, output size per object

Key references

Schmidt et al. 2020 (MNRAS 499, 1587): head-to-head benchmark of 12 photo-z methods on LSST-like data. FlexZBoost had the best CDE loss and was top-2 overall. This is the paper the challenge metrics are based on.
The RAIL Team 2025 (arXiv:2505.02928): the RAIL framework used by the challenge. FlexZBoost, BPZ-lite, and sklearn MLP are all available.
Izbicki & Lee 2017 (arXiv:1704.08095): the CDE loss — our primary optimization target for p(z) quality.

Data Files

All data files are in HDF5 format at /home/node/work/pz_challenge/data/. File naming: pz_challenge_{taskset}_{simulation}_{label}_{scenario}.hdf5
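
Because the file names follow a fixed pattern, the paths can be generated rather than typed out. The helper below is a small sketch: the name `data_path` and the choice to pass the task set as an integer are illustrative, not part of the challenge code.

```python
from pathlib import Path

DATA_DIR = Path("/home/node/work/pz_challenge/data")

def data_path(taskset, simulation, label, scenario):
    """Build the absolute path of one challenge file.

    taskset: 1 or 2; simulation: 'cardinal' or 'flagship';
    label: 'training' or 'test'; scenario: '1yr' or '10yr'.
    """
    name = f"pz_challenge_taskset_{taskset}_{simulation}_{label}_{scenario}.hdf5"
    return DATA_DIR / name

# All eight test files, in task set / simulation / depth order
ALL_TEST_FILES = [
    data_path(t, sim, "test", depth)
    for t in (1, 2)
    for sim in ("cardinal", "flagship")
    for depth in ("1yr", "10yr")
]
```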

Columns in all files

Photometry (present in all files):

mag_u_lsst, mag_g_lsst, mag_r_lsst, mag_i_lsst, mag_z_lsst, mag_y_lsst: LSST magnitudes in 6 optical bands (roughly 320–1050 nm). NaN = non-detection.
mag_u_lsst_err ... mag_y_lsst_err: photometric uncertainties for LSST bands
mag_Y_roman, mag_J_roman, mag_H_roman: Roman Space Telescope near-IR magnitudes (500–2300 nm). Extend coverage to higher redshift.
mag_Y_roman_err, mag_J_roman_err, mag_H_roman_err: Roman uncertainties
object_id: unique object identifier (int or float)
ra, dec: sky coordinates (present in both training and test files)

Labels (training files only):

redshift: true spectroscopic redshift

Spectroscopic survey selection flags (Task Set 2 training files only):

DESI_BGS, DESI_ELG_LOP, DESI_LRG: binary flags indicating DESI survey selection (Bright Galaxy Survey, Emission Line Galaxies, Luminous Red Galaxies)
VVDSf02: binary flag for VIMOS VLT Deep Survey F02 selection
DEEP2_LSST: binary flag for DEEP2 survey selection
zCOSMOS: binary flag for zCOSMOS survey selection

These flags indicate which spectroscopic survey would have targeted each object. They encode the spectroscopic selection bias and are critically useful for importance weighting in Task Set 2.

Task Set 1: Representative Training (the easier task)

Scientific setup: Training and test sets are drawn from the same distribution. Selection cut: i < 23 (bright objects only). This is the standard supervised photo-z problem.

Task Set 1 — Training files (have true redshifts)

File	N	z range	z median	i range	NaN bands
`/home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_training_1yr.hdf5`	100,000	[0.005, 2.189]	0.547	[12.9, 23.0]	u: 6.5%, g: 0.07%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_training_10yr.hdf5`	100,000	[0.005, 2.286]	0.547	[13.5, 23.0]	u: 2.2%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_1_flagship_training_1yr.hdf5`	100,000	[0.006, 2.991]	0.575	[13.3, 23.0]	u: 3.4%, g: 0.03%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_1_flagship_training_10yr.hdf5`	100,000	[0.006, 2.967]	0.572	[14.0, 23.0]	u: 1.0%

Task Set 1 — Test files (NO true redshifts — these are what we predict)

File	N	i range	NaN bands
`/home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_test_1yr.hdf5`	20,000	[15.4, 23.0]	u: 6.4%, g: 0.07%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_test_10yr.hdf5`	20,000	[14.6, 23.0]	u: 2.2%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_1_flagship_test_1yr.hdf5`	20,000	[12.9, 23.0]	u: 3.5%, g: 0.02%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_1_flagship_test_10yr.hdf5`	20,000	[12.1, 23.0]	u: 0.9%

Key properties of Task Set 1:

1yr scenario has worse photometric depth (larger errors, more NaNs in u-band) than 10yr
Cardinal and Flagship use different N-body simulations with different galaxy populations
The algorithm must generalize across both simulations and both depth scenarios
Recommended split for internal validation: 80% train / 20% validation on training files

Task Set 2: Non-Representative Training (the hard task — where we win)

Scientific setup: Training samples are spectroscopically selected (biased toward bright, easily-observable galaxies). Test samples include all objects down to i < 25.4, which is 2.4 magnitudes deeper than the i < 23 cut of Task Set 1 and fainter than the faint limit of every spectroscopic training file (i ≈ 24.3–25.1, see the tables below). Training and test are drawn from different distributions.

This is the realistic LSST scenario. Spectroscopic redshift surveys cannot observe the faint majority of the photometric sample, creating a systematic mismatch. Naive ML methods trained on biased spec-z samples fail badly at faint objects (high-z, low-surface-brightness galaxies).

Task Set 2 — Training files (have true redshifts, spec-selected)

File	N	z range	z median	i range	NaN bands
`/home/node/work/pz_challenge/data/pz_challenge_taskset_2_cardinal_training_1yr.hdf5`	100,000	[0.005, 2.286]	0.585	[14.0, 25.1]	none
`/home/node/work/pz_challenge/data/pz_challenge_taskset_2_cardinal_training_10yr.hdf5`	100,000	[0.005, 2.259]	0.595	[11.8, 24.5]	none
`/home/node/work/pz_challenge/data/pz_challenge_taskset_2_flagship_training_1yr.hdf5`	100,000	[0.007, 2.997]	0.639	[13.7, 24.5]	u: 0.2%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_2_flagship_training_10yr.hdf5`	100,000	[0.005, 2.984]	0.642	[14.3, 24.3]	u: 0.08%

Spectroscopic selection breakdown (cardinal 10yr example):

VVDSf02: 65% of training objects
DEEP2_LSST: 30%
DESI_BGS: 1.7%, DESI_LRG: 1.5%, DESI_ELG_LOP: 0.9%, zCOSMOS: 1.0%

Task Set 2 — Test files (NO true redshifts; deeper than training)

File	N	i range	NaN bands (many more — deeper survey)
`/home/node/work/pz_challenge/data/pz_challenge_taskset_2_cardinal_test_1yr.hdf5`	20,000	[15.3, 25.5]	u: 22.9%, g: 3.7%, r: 0.9%, z: 1.5%, y: 6.6%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_2_cardinal_test_10yr.hdf5`	20,000	[15.3, 25.5]	u: 11.2%, g: 0.3%, y: 0.2%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_2_flagship_test_1yr.hdf5`	20,000	[14.3, 25.5]	u: 15.3%, g: 1.4%, r: 0.5%, z: 2.0%, y: 8.4%
`/home/node/work/pz_challenge/data/pz_challenge_taskset_2_flagship_test_10yr.hdf5`	20,000	[14.5, 25.5]	u: 4.8%, g: 0.06%, y: 0.4%

Key challenges for Task Set 2:

Test objects extend to i ≈ 25.5, fainter than any training file (faint limits i ≈ 24.3–25.1), so many will be outside the training distribution entirely
Heavy NaN rates in test files (especially u-band in 1yr scenario) — algorithm must handle missing photometry gracefully
Roman near-IR bands (Y, J, H) become especially important at high redshift where LSST u-band drops out

Suggested Analysis Strategy

Primary algorithm: FlexZBoost

Based on Schmidt et al. 2020, FlexZBoost achieved the best CDE loss in a head-to-head comparison of 12 methods on LSST-like data. It is available via pz-rail-flexzboost (already installed in /opt/denario-venv). It uses gradient-boosted regression trees with conditional density estimation (FlexCode framework) to produce well-calibrated p(z) posteriors.

RAIL gotcha — mag_limits must cover all bands on both stages. When using Roman bands (Y/J/H) alongside LSST ugrizy, pass mag_limits={b: 99.0 for b in bands} to both FlexZBoostInformer.make_stage() and FlexZBoostEstimator.make_stage(). RAIL's defaults only cover LSST bands, so _process_chunk will raise KeyError: 'mag_Y_roman' at estimation time if mag_limits is passed only to the informer.
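
One way to avoid this mismatch is to build the band-dependent settings once and hand the same dictionary to both stages. The sketch below only constructs that dictionary. It assumes the installed RAIL version accepts `bands` and `err_bands` alongside `mag_limits`, and the stage names in the commented calls are hypothetical.

```python
# Column names as listed in the data description above
LSST_BANDS = [f"mag_{b}_lsst" for b in "ugrizy"]
ROMAN_BANDS = [f"mag_{b}_roman" for b in "YJH"]
bands = LSST_BANDS + ROMAN_BANDS

# Settings shared by the informer and the estimator
shared_config = {
    "bands": bands,                              # assumed RAIL option name
    "err_bands": [f"{b}_err" for b in bands],    # assumed RAIL option name
    "mag_limits": {b: 99.0 for b in bands},      # covers LSST and Roman bands
}

# Hypothetical usage (not executed here; requires pz-rail-flexzboost):
# informer = FlexZBoostInformer.make_stage(name="inform_fzb", **shared_config)
# estimator = FlexZBoostEstimator.make_stage(name="estimate_fzb", **shared_config)
```

Keeping a single dictionary means a band added later cannot be forgotten on one of the two stages.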

FlexCode gotcha — FlexCodeModel.predict() returns a tuple, not an array. If you use the lower-level flexcode.FlexCodeModel directly (instead of RAIL's FlexZBoost wrapper), model.predict(X, n_grid=N) returns (cdes, z_grid):

cdes — density array, shape (n_samples, n_grid); this is what you feed into metrics / normalization.
z_grid — 1-D redshift grid, shape (n_grid,).

Always unpack explicitly:

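import numpy as np

cdes, z_grid = model.predict(X_val, n_grid=n_grid)  # unpack (densities, grid)
z_grid = np.ravel(z_grid)                           # ensure a 1-D grid (no-op if already 1-D)
pz = np.clip(cdes, 0.0, None)                       # remove negative densities
# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever is available
trapezoid = getattr(np, "trapezoid", None) or np.trapz
norms = trapezoid(pz, z_grid, axis=1)               # integral of each row over z
norms[norms == 0] = 1.0                             # avoid division by zero
pz /= norms[:, None]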

Assigning pz = model.predict(...) directly binds pz to the whole (cdes, z_grid) tuple rather than to the density array, so downstream code sees the wrong shapes or fails with indexing errors.

Feature engineering

Use all 18 photometric features: 9 magnitudes + 9 magnitude errors across LSST ugrizy and Roman YJH. Derive color features (magnitude differences between adjacent bands): u-g, g-r, r-i, i-z, z-y, y-Y, Y-J, J-H. Replace NaN magnitudes with a sentinel value (e.g., 99.0) and include binary NaN indicator features — do not drop objects.
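
A minimal sketch of this feature construction is shown below. It assumes the magnitudes and errors have already been stacked into arrays with columns ordered blue to red (u, g, r, i, z, y, Y, J, H). Computing the colors before the sentinel replacement, so that a color involving a missing band is itself flagged with the sentinel, is a design choice of this sketch rather than a challenge requirement.

```python
import numpy as np

SENTINEL = 99.0  # placeholder value for missing photometry

def build_features(mags, errs):
    """Stack magnitudes, errors, adjacent-band colors and NaN indicators.

    mags, errs: arrays of shape (n_objects, 9), columns ordered
    u, g, r, i, z, y (LSST) then Y, J, H (Roman).
    Returns an array of shape (n_objects, 35).
    """
    mags = np.asarray(mags, dtype=float)
    errs = np.asarray(errs, dtype=float)
    missing = np.isnan(mags)                # True where a band is undetected
    colors = mags[:, :-1] - mags[:, 1:]     # u-g, g-r, ..., J-H; NaN if either band is missing
    return np.hstack([
        np.where(missing, SENTINEL, mags),
        np.where(np.isnan(errs), SENTINEL, errs),
        np.where(np.isnan(colors), SENTINEL, colors),
        missing.astype(float),              # 1 = was NaN, 0 = observed
    ])
```

The same function should be applied to training and test files so the column order matches at inference time.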

Task Set 2 — Importance weighting (key differentiator)

The spectroscopic selection flags in training files enable density ratio estimation. Use k-nearest-neighbor or kernel density ratio methods to compute per-object weights that upweight training galaxies that are similar to the (deeper) test distribution. Train FlexZBoost with these importance weights. This directly corrects for the covariate shift between training and test.
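
As an illustration of the nearest-neighbor variant, the sketch below estimates a weight for each training object by comparing how many test objects and how many training objects fall within the distance to its k-th nearest training neighbor. It is a brute-force NumPy version meant to show the idea. The function name, `k=20`, and the chunk size are assumptions; for the full 100,000-object files a KD-tree would be more practical, and features should be put on comparable scales before distances are computed.

```python
import numpy as np

def _sq_dists(a, b):
    """Pairwise squared Euclidean distances, shape (len(a), len(b))."""
    d2 = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.clip(d2, 0.0, None)  # clip tiny negatives caused by round-off

def knn_density_ratio_weights(X_train, X_test, k=20, chunk=256):
    """Estimate p_test(x) / p_train(x) at each training point.

    For every training object: take the radius to its k-th nearest training
    neighbor, count test objects inside that radius, and scale the ratio of
    counts by the ratio of sample sizes. Requires k < len(X_train).
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    n_train, n_test = len(X_train), len(X_test)
    weights = np.empty(n_train)
    for start in range(0, n_train, chunk):
        block = X_train[start:start + chunk]
        d_train = _sq_dists(block, X_train)
        # The smallest distance is the point itself, so index k is the k-th neighbor
        radius2 = np.partition(d_train, k, axis=1)[:, k]
        n_test_inside = (_sq_dists(block, X_test) <= radius2[:, None]).sum(axis=1)
        weights[start:start + chunk] = (n_test_inside / k) * (n_train / n_test)
    return weights
```

The resulting array can be passed as per-object sample weights to any learner that supports them; training objects in regions the test set rarely visits receive weights near zero.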

Internal evaluation protocol

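Since test files have no true redshifts, evaluate by holding out 20% of the training data as a validation set. Compute all challenge metrics on the validation set using pz_data_challenge.metrics, so that our numbers come from the same metric code the organizers use. For Task Set 2 the hold-out shares the spectroscopic selection of the training set, so these numbers will be optimistic relative to the deeper test files.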

Submission target

All 8 test files covered (4 for Task Set 1, 4 for Task Set 2)
All 3 subtasks implemented per task set (pre-computed estimates, trained models + estimation function, full train+estimate function)
p(z) output in qp interpolated grid format with 301 grid points over z in [0, 3] (the grid defined under "Writing qp output"), plus zmode and object_id in ancillary data

Expected Results — What the Results Report Must Contain

The results report is the primary artifact used to evaluate whether the analysis is successful. It must contain all of the following to be considered complete and ready for the paper stage. An incomplete or vague results report means we must iterate again.

1. Algorithm description

Which algorithm(s) were implemented (e.g., FlexZBoost, MDN, XGBoost)
Hyperparameters used (number of trees, basis functions, learning rate, etc.)
Feature set: which bands and derived colors were used
How NaN/missing photometry was handled
For Task Set 2: how covariate shift was corrected (e.g., importance weighting method, KNN vs. kernel density ratio)

2. Internal validation results — Task Set 1 (representative)

For each of the 4 training files (cardinal 1yr, cardinal 10yr, flagship 1yr, flagship 10yr), using an 80/20 train/validation split, report:

Metric	Target (competitive)	Baseline (sklearn MLP)
Bias	< 0.005	~0.01
SigmaMAD	< 0.02	~0.03
OutlierRate	< 0.05	~0.10
CDE Loss	as negative as possible	~-3 to -4
PIT-KS	< 0.05	~0.1
PIT-RMSE	< 0.02	~0.05
PIT-KL	< 0.05	~0.2

These targets are based on FlexZBoost performance in Schmidt et al. 2020 on comparable data. The report must include the actual numbers, not just "good" or "improved".

3. Internal validation results — Task Set 2 (non-representative)

Same 4 files, same split. Additionally report:

Metrics without importance weighting (naive baseline)
Metrics with importance weighting (our correction)
The delta between them — this quantifies our contribution

Expected degradation without correction: SigmaMAD ~2–3× worse than Task Set 1, high outlier rate at faint magnitudes (i > 23). With correction: should recover to within ~30–50% of Task Set 1 performance.

4. Metric plots (mandatory)

The report must include or reference the following plots for each scenario:

Photo-z vs. true redshift 2D histogram (zmode vs. z_true)
Bias, SigmaMAD, OutlierRate as a function of redshift (in bins)
Bias, SigmaMAD, OutlierRate as a function of i-band magnitude (in bins)
PIT histogram (should be uniform for well-calibrated posteriors)
PIT-QQ plot (should be diagonal)
CDE Loss value (a scalar, reported alongside the plots)

Plots must be saved to disk as PNG files with descriptive filenames.
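
The binned curves can be produced from a small helper like the one sketched below, which computes the three point metrics in bins of any quantity (true redshift or i-band magnitude). The function name and the tuple layout are illustrative. Evaluating the outlier threshold separately within each bin is an assumption of this sketch; a global threshold is an equally plausible reading of the metric definition.

```python
import numpy as np

def binned_point_metrics(x, zmode, z_true, edges):
    """Bias, SigmaMAD and OutlierRate of zmode in bins of x.

    Returns one tuple per bin: (lo, hi, n, bias, sigma_mad, outlier_rate).
    Empty bins are reported with n = 0 and NaN metrics.
    """
    delta = (zmode - z_true) / (1.0 + z_true)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        d = delta[(x >= lo) & (x < hi)]
        if d.size == 0:
            rows.append((lo, hi, 0, np.nan, np.nan, np.nan))
            continue
        bias = np.median(d)
        sigma_mad = 1.4826 * np.median(np.abs(d - bias))
        q25, q75 = np.percentile(d, [25, 75])
        sigma_iqr = (q75 - q25) / 1.349
        outlier_rate = np.mean(np.abs(d) > max(0.06, 3.0 * sigma_iqr))
        rows.append((lo, hi, int(d.size), bias, sigma_mad, outlier_rate))
    return rows
```

Each returned row maps directly onto one point of the bias, SigmaMAD, and OutlierRate curves, with the bin center on the horizontal axis.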

5. Comparison across scenarios

A summary table comparing all 8 scenarios (2 task sets × 2 simulations × 2 depths) side by side. This reveals:

Whether 10yr depth helps (it should — better photometry, fewer NaNs)
Whether Cardinal vs. Flagship differ systematically
Whether Task Set 2 correction is effective across both simulations

6. Computational performance

Training time per file (seconds)
Inference time per object (milliseconds)
Model file size (MB)
These are reported by the challenge organizers — we need our own numbers
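
These numbers can be collected with standard-library tools. The helpers below are a sketch; the names `timed`, `file_size_mb`, and `per_object_ms` are illustrative.

```python
import os
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def file_size_mb(path):
    """Size of a file on disk in megabytes (10^6 bytes)."""
    return os.path.getsize(path) / 1e6

def per_object_ms(seconds, n_objects):
    """Convert a total inference time into milliseconds per object."""
    return 1e3 * seconds / n_objects
```

For example, wrapping the training call in `timed(...)` gives the training time, and `per_object_ms(inference_seconds, 20000)` gives the per-object inference time for one test file.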

7. Generated output files

The report must confirm that the following files were successfully created (these are needed for the actual challenge submission):

For each of the 8 test files, a qp ensemble HDF5 file:

pz_challenge_taskset_1_cardinal_pz_estimate_1yr.hdf5
pz_challenge_taskset_1_cardinal_pz_estimate_10yr.hdf5
pz_challenge_taskset_1_flagship_pz_estimate_1yr.hdf5
pz_challenge_taskset_1_flagship_pz_estimate_10yr.hdf5
pz_challenge_taskset_2_cardinal_pz_estimate_1yr.hdf5
pz_challenge_taskset_2_cardinal_pz_estimate_10yr.hdf5
pz_challenge_taskset_2_flagship_pz_estimate_1yr.hdf5
pz_challenge_taskset_2_flagship_pz_estimate_10yr.hdf5

For each of the 8 training scenarios, a trained model file:

pz_challenge_taskset_1_cardinal_pz_model_1yr.pkl
... (etc.)

All files must pass qp validation: valid ensemble, ancillary data with zmode and object_id columns, object_ids matching the corresponding test file.

8. Failure analysis

If any scenario performs poorly (e.g., outlier rate > 0.15, CDE loss worse than baseline), the report must explain:

Which specific objects failed (redshift range, magnitude range)
Likely cause (e.g., missing u-band at low-z, training distribution mismatch)
What improvement to try in the next iteration

What makes a results report good enough to proceed to paper?

All 8 test files have valid qp output files on disk
Task Set 1 SigmaMAD < 0.025 on both simulations and both depth scenarios
Task Set 1 CDE Loss is negative and better than a random Gaussian baseline
Task Set 2 shows measurable improvement from importance weighting vs. naive
All metric plots are generated and saved
Computational timing is reported

If any of these are missing, the evaluator should request another iteration with specific guidance on what to fix.

Paths Summary

Data directory:     /home/node/work/pz_challenge/data/
Challenge repo:     /home/node/work/pz_challenge/pz_data_challenge/
Python venv:        /opt/denario-venv/bin/python  (Python 3.12, all libs installed)

All code must use absolute paths when reading data files.

Implementation Notes for the Engineer

These are practical details that must be followed to produce valid output.

Reading HDF5 data files

Read files with h5py directly — do not use the RAIL TableHandle or catalog_utils machinery (it requires a catalogs.yaml config file and is not needed here):

import h5py
import numpy as np

with h5py.File('/home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_training_10yr.hdf5', 'r') as f:
    redshift = f['redshift'][:]   # training files only
    mag_u = f['mag_u_lsst'][:]    # NaN = non-detection
    mag_g = f['mag_g_lsst'][:]
    mag_r = f['mag_r_lsst'][:]
    mag_i = f['mag_i_lsst'][:]
    mag_z = f['mag_z_lsst'][:]
    mag_y = f['mag_y_lsst'][:]
    mag_Y = f['mag_Y_roman'][:]
    mag_J = f['mag_J_roman'][:]
    mag_H = f['mag_H_roman'][:]
    # same pattern for _err columns
    object_id = f['object_id'][:]

Replace NaN values with a sentinel (e.g., 99.0) before passing to any ML model. Add a binary indicator feature for each band that had NaNs (1 = was NaN, 0 = observed).

Important: Flagship training files contain ~0.3–0.4% duplicate object_ids (known issue in the simulation). Do NOT use object_id as a unique row key for training files; use positional indices (row numbers) instead. This does not affect test files, which have fully unique object_ids.
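
A quick way to confirm the duplicate rate in a given file, and to set up a positional key instead, is sketched below with a toy array standing in for the real object_id column. The function name is illustrative.

```python
import numpy as np

def duplicate_report(object_id):
    """Return (number of rows whose id is shared with another row, fraction of all rows)."""
    object_id = np.asarray(object_id)
    _, counts = np.unique(object_id, return_counts=True)
    n_dup_rows = int(counts[counts > 1].sum())
    return n_dup_rows, n_dup_rows / len(object_id)

# Toy example: id 7 appears twice
object_id = np.array([3, 7, 7, 11])
row_index = np.arange(len(object_id))  # positional key, unique by construction
print(duplicate_report(object_id))
```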

Writing qp output (mandatory format)

The output MUST be a qp interpolated grid ensemble with ancillary data containing zmode and object_id. This exact pattern must be followed:

import qp
import numpy as np

# Define redshift grid (use same grid for all output files)
zgrid = np.linspace(0.0, 3.0, 301)  # 301 points

# pz_vals: shape (n_objects, 301) — must be non-negative, ideally normalized
# Each row is a p(z) for one object evaluated at zgrid points

ensemble = qp.interp.create_ensemble(zgrid, pz_vals)

# Compute point estimates (mode of p(z))
zmode = zgrid[np.argmax(pz_vals, axis=1)]

# Add mandatory ancillary data
ensemble.set_ancil({'zmode': zmode, 'object_id': object_id.astype(int)})

# Write to file
ensemble.write_to(output_file)

Do NOT use qp mixmod or other representations — use qp.interp only. The validation harness checks for zmode and object_id in ancil — both are required.
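
Before building the ensemble it can help to clean and check the arrays. The sketch below clips negative or non-finite densities, normalizes each row to unit integral, and verifies the ancillary arrays. Both function names are illustrative, and falling back to a flat p(z) for rows with no usable density is an assumption of this sketch rather than a challenge rule.

```python
import numpy as np

def prepare_pz(pz_vals, zgrid):
    """Return non-negative, finite p(z) rows that each integrate to 1 over zgrid."""
    pz = np.nan_to_num(np.asarray(pz_vals, dtype=float), nan=0.0, posinf=0.0, neginf=0.0)
    pz = np.clip(pz, 0.0, None)
    norms = np.sum(0.5 * (pz[:, 1:] + pz[:, :-1]) * np.diff(zgrid), axis=1)
    empty = norms <= 0
    pz[empty] = 1.0                        # flat fallback for rows with no density
    norms[empty] = zgrid[-1] - zgrid[0]    # integral of the flat fallback
    return pz / norms[:, None]

def check_submission(pz, zgrid, zmode, object_id, test_object_id):
    """Raise AssertionError if the arrays are not fit to be written."""
    assert pz.shape == (len(test_object_id), len(zgrid)), "one p(z) row per test object"
    assert np.all(np.isfinite(pz)) and np.all(pz >= 0), "p(z) must be finite and non-negative"
    assert np.array_equal(
        np.asarray(object_id).astype(int), np.asarray(test_object_id).astype(int)
    ), "object_id must match the test file row for row"
    assert np.all((zmode >= zgrid[0]) & (zmode <= zgrid[-1])), "zmode must lie on the grid"
```

Running `check_submission` immediately before the write step catches shape and ID mismatches while the inputs are still in memory.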

Model serialization

Save trained models with pickle or joblib. The model file must contain everything needed to run inference on a new test file without retraining:

import pickle

model_bundle = {
    'estimator': trained_flexzboost_or_other_model,
    'feature_cols': list_of_feature_column_names,
    'zgrid': zgrid,
    'nan_sentinel': 99.0,
    'scaler': fitted_scaler_if_used,  # or None
    'importance_weights_method': 'knn_density_ratio',  # for Task Set 2
}

with open(model_file, 'wb') as f:
    pickle.dump(model_bundle, f)

Validation split

Use a stratified split to ensure the validation set covers the full redshift range:

from sklearn.model_selection import train_test_split

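# Stratify by redshift quantile bin so every stratum is well populated
# (fixed-width bins over [0, 3] can leave a high-z bin with a single object,
# which makes train_test_split raise an error)
z_bins = np.digitize(redshift, np.quantile(redshift, np.linspace(0, 1, 21)[1:-1]))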
train_idx, val_idx = train_test_split(
    np.arange(len(redshift)), test_size=0.2, random_state=42, stratify=z_bins
)

The validation set is used ONLY for metric computation after training — never for fitting or hyperparameter selection in the same run.

Computing metrics

Use the pz_data_challenge metrics module directly on validation data:

from pz_data_challenge import metrics
import numpy as np

# point_metrics requires truth redshifts and zmode estimates
# Build a minimal test_data dict and submit_data qp ensemble from validation set
# Then call metrics functions directly:

# Bias, SigmaMAD, OutlierRate (as defined under "Evaluation metrics"):
delta = (zmode_val - z_true_val) / (1 + z_true_val)
bias = np.median(delta)
sigma_mad = 1.4826 * np.median(np.abs(delta - bias))
q25, q75 = np.percentile(delta, [25, 75])
sigma_iqr = (q75 - q25) / 1.349
outlier_rate = np.mean(np.abs(delta) > max(0.06, 3 * sigma_iqr))

# CDE Loss and PIT: use qp ensemble evaluated at true redshifts
# (the pz_data_challenge.metrics functions handle this if given a qp ensemble
# and a dict with 'redshift' and 'mag_i_lsst' keys)

Output file naming convention

All output p(z) estimate files must follow this naming convention exactly:

pz_challenge_taskset_1_cardinal_pz_estimate_1yr.hdf5
pz_challenge_taskset_1_cardinal_pz_estimate_10yr.hdf5
pz_challenge_taskset_1_flagship_pz_estimate_1yr.hdf5
pz_challenge_taskset_1_flagship_pz_estimate_10yr.hdf5
pz_challenge_taskset_2_cardinal_pz_estimate_1yr.hdf5
pz_challenge_taskset_2_cardinal_pz_estimate_10yr.hdf5
pz_challenge_taskset_2_flagship_pz_estimate_1yr.hdf5
pz_challenge_taskset_2_flagship_pz_estimate_10yr.hdf5

Model files:

pz_challenge_taskset_1_cardinal_pz_model_1yr.pkl
pz_challenge_taskset_1_cardinal_pz_model_10yr.pkl
pz_challenge_taskset_1_flagship_pz_model_1yr.pkl
pz_challenge_taskset_1_flagship_pz_model_10yr.pkl
pz_challenge_taskset_2_cardinal_pz_model_1yr.pkl
pz_challenge_taskset_2_cardinal_pz_model_10yr.pkl
pz_challenge_taskset_2_flagship_pz_model_1yr.pkl
pz_challenge_taskset_2_flagship_pz_model_10yr.pkl

Save all output files to the experiment output directory. Use absolute paths.
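
Since the estimate and model names differ from the test file name only in the label, they can be derived mechanically. The helper below is a sketch; the function name is illustrative and the output directory is whatever the experiment provides.

```python
from pathlib import Path

def output_names(test_file, output_dir):
    """Map a test file path to its (estimate file, model file) output paths."""
    name = Path(test_file).name
    if "_test_" not in name:
        raise ValueError(f"not a challenge test file: {name}")
    estimate = name.replace("_test_", "_pz_estimate_")
    model = Path(name.replace("_test_", "_pz_model_")).with_suffix(".pkl").name
    return Path(output_dir) / estimate, Path(output_dir) / model
```

Deriving names this way avoids typos across the 16 output files and keeps each output tied to the test file it was produced from.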
