Scientist: denario-6 Date: 2026-06-03
We are participating in the LSST DESC Photometric Redshift (PZ) Data Challenge (https://pz-data-challenge.readthedocs.io/en/latest/), deadline July 17 2026. The challenge asks participants to develop algorithms that estimate photometric redshifts p(z) — full probability distributions over redshift — for simulated LSST-like galaxies.
This is a competition. Submissions are evaluated on multiple metrics and ranked on a public leaderboard. We want to win.
For each of 8 test files (see below), we must produce a qp ensemble — a
per-object probability distribution p(z) over redshift — in HDF5 format.
The ensemble must include ancillary data with:
- zmode: point estimate (mode of p(z)) for each object
- object_id: matching the object IDs in the test file
We must also implement two Python functions per task set:
- run_taskset_N_estimation_only(model_file, test_file, output_file): run inference using a pre-trained model
- run_taskset_N_training_and_estimation(train_file, test_file, output_file): train a model from scratch and run inference
Point estimate metrics (computed from zmode vs. true redshift):
- Bias: median of Δ = (z_phot - z_true) / (1 + z_true) → want ~0
- SigmaMAD: 1.4826 × median|Δ - median(Δ)| → want as small as possible
- OutlierRate: fraction of |Δ| > max(0.06, 3×σ_IQR) → want as small as possible
p(z) distribution metrics (measure calibration of the full posterior):
- CDE Loss: Conditional Density Estimate loss (Izbicki & Lee 2017) → want as small (most negative) as possible. FlexZBoost is specifically designed to minimize this.
- PIT-KS: Kolmogorov-Smirnov statistic of the PIT distribution vs. uniform
- PIT-RMSE: RMSE of PIT histogram vs. uniform
- PIT-KL: KL divergence of PIT distribution vs. uniform

All PIT metrics → want ~0, indicating well-calibrated posteriors.
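To make the p(z) metrics concrete, here is a minimal sketch of how PIT values, the PIT-KS distance, and the CDE loss could be computed from densities tabulated on a redshift grid. The function names are hypothetical and this is not the organizers' implementation; pz_data_challenge.metrics remains the authoritative scorer. The sketch assumes every row of `pz` has a positive integral.

```python
import numpy as np


def trapezoid(y, x):
    """Trapezoidal integral along the last axis."""
    return np.sum(0.5 * (y[..., 1:] + y[..., :-1]) * np.diff(x), axis=-1)


def pit_values(pz, zgrid, z_true):
    """PIT_i = CDF_i(z_true_i) for densities tabulated on zgrid."""
    steps = 0.5 * (pz[:, 1:] + pz[:, :-1]) * np.diff(zgrid)
    cdf = np.concatenate([np.zeros((len(pz), 1)), np.cumsum(steps, axis=1)], axis=1)
    cdf /= cdf[:, -1:]  # force every CDF to end at exactly 1
    return np.array([np.interp(zt, zgrid, c) for zt, c in zip(z_true, cdf)])


def ks_vs_uniform(pit):
    """Two-sided KS distance between the empirical PIT CDF and U(0, 1)."""
    x = np.sort(pit)
    n = x.size
    d_plus = np.max(np.arange(1, n + 1) / n - x)
    d_minus = np.max(x - np.arange(0, n) / n)
    return float(max(d_plus, d_minus))


def cde_loss(pz, zgrid, z_true):
    """Empirical CDE loss: mean(integral of p^2 dz) - 2 * mean(p(z_true))."""
    p_at_truth = np.array([np.interp(zt, zgrid, p) for zt, p in zip(z_true, pz)])
    return float(np.mean(trapezoid(pz**2, zgrid)) - 2.0 * np.mean(p_at_truth))


# Toy check: Gaussian p(z) with truths drawn from those same Gaussians,
# so the posteriors are calibrated by construction.
rng = np.random.default_rng(0)
zgrid = np.linspace(0.0, 3.0, 301)
mu = rng.uniform(0.5, 2.5, size=2000)
sigma = 0.05
z_true = rng.normal(mu, sigma)
pz = np.exp(-0.5 * ((zgrid[None, :] - mu[:, None]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

pit = pit_values(pz, zgrid, z_true)
ks = ks_vs_uniform(pit)
loss = cde_loss(pz, zgrid, z_true)
```

On calibrated toy posteriors like these, the KS distance should be close to zero and the CDE loss clearly negative; a miscalibrated model (for example, Gaussians that are too narrow) would push the KS distance up.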
Computational metrics (assessed but not ranked):
- Training time, model size, inference time per object, output size per object
- Schmidt et al. 2020 (MNRAS 499, 1587): head-to-head benchmark of 12 photo-z methods on LSST-like data. FlexZBoost had the best CDE loss and was top-2 overall. This is the paper the challenge metrics are based on.
- The RAIL Team 2025 (arXiv:2505.02928): the RAIL framework used by the challenge. FlexZBoost, BPZ-lite, and sklearn MLP are all available.
- Izbicki & Lee 2017 (arXiv:1704.08095): the CDE loss — our primary optimization target for p(z) quality.
All data files are in HDF5 format at /home/node/work/pz_challenge/data/.
File naming: pz_challenge_{taskset}_{simulation}_{label}_{scenario}.hdf5
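The naming pattern can be expanded programmatically so that no path has to be typed by hand. This is a minimal sketch: `challenge_filename` and `iter_scenarios` are hypothetical helper names, not part of the challenge code, and the field values are the ones that appear in the file lists of this brief.

```python
from itertools import product
from pathlib import PurePosixPath

DATA_DIR = PurePosixPath("/home/node/work/pz_challenge/data")


def challenge_filename(taskset, simulation, label, scenario, ext="hdf5"):
    """Fill the pz_challenge_{taskset}_{simulation}_{label}_{scenario} pattern."""
    return f"pz_challenge_taskset_{taskset}_{simulation}_{label}_{scenario}.{ext}"


def iter_scenarios():
    """Yield the 8 (taskset, simulation, scenario) combinations."""
    return product((1, 2), ("cardinal", "flagship"), ("1yr", "10yr"))


# Absolute paths of the 8 test files
test_files = [
    str(DATA_DIR / challenge_filename(t, sim, "test", scen))
    for t, sim, scen in iter_scenarios()
]
```

The same helper covers the output and model names, e.g. `challenge_filename(2, "flagship", "pz_model", "10yr", ext="pkl")`.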
Photometry (present in all files):
- mag_u_lsst, mag_g_lsst, mag_r_lsst, mag_i_lsst, mag_z_lsst, mag_y_lsst: LSST magnitudes in 6 optical bands (320–1600 nm). NaN = non-detection.
- mag_u_lsst_err ... mag_y_lsst_err: photometric uncertainties for LSST bands
- mag_Y_roman, mag_J_roman, mag_H_roman: Roman Space Telescope near-IR magnitudes (500–2300 nm). Extend coverage to higher redshift.
- mag_Y_roman_err, mag_J_roman_err, mag_H_roman_err: Roman uncertainties
- object_id: unique object identifier (int or float)
- ra, dec: sky coordinates (present in both training and test files)
Labels (training files only):
redshift: true spectroscopic redshift
Spectroscopic survey selection flags (Task Set 2 training files only):
- DESI_BGS, DESI_ELG_LOP, DESI_LRG: binary flags indicating DESI survey selection (Bright Galaxy Survey, Emission Line Galaxies, Luminous Red Galaxies)
- VVDSf02: binary flag for VIMOS VLT Deep Survey F02 selection
- DEEP2_LSST: binary flag for DEEP2 survey selection
- zCOSMOS: binary flag for zCOSMOS survey selection
These flags indicate which spectroscopic survey would have targeted each object. They encode the spectroscopic selection bias and are critically useful for importance weighting in Task Set 2.
Scientific setup: Training and test sets are drawn from the same distribution. Selection cut: i < 23 (bright objects only). This is the standard supervised photo-z problem.
| File | N | z range | z median | i range | NaN bands |
|---|---|---|---|---|---|
| /home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_training_1yr.hdf5 | 100,000 | [0.005, 2.189] | 0.547 | [12.9, 23.0] | u: 6.5%, g: 0.07% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_training_10yr.hdf5 | 100,000 | [0.005, 2.286] | 0.547 | [13.5, 23.0] | u: 2.2% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_1_flagship_training_1yr.hdf5 | 100,000 | [0.006, 2.991] | 0.575 | [13.3, 23.0] | u: 3.4%, g: 0.03% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_1_flagship_training_10yr.hdf5 | 100,000 | [0.006, 2.967] | 0.572 | [14.0, 23.0] | u: 1.0% |
| File | N | i range | NaN bands |
|---|---|---|---|
| /home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_test_1yr.hdf5 | 20,000 | [15.4, 23.0] | u: 6.4%, g: 0.07% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_test_10yr.hdf5 | 20,000 | [14.6, 23.0] | u: 2.2% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_1_flagship_test_1yr.hdf5 | 20,000 | [12.9, 23.0] | u: 3.5%, g: 0.02% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_1_flagship_test_10yr.hdf5 | 20,000 | [12.1, 23.0] | u: 0.9% |
Key properties of Task Set 1:
- 1yr scenario has worse photometric depth (larger errors, more NaNs in u-band) than 10yr
- Cardinal and Flagship use different N-body simulations with different galaxy populations
- The algorithm must generalize across both simulations and both depth scenarios
- Recommended split for internal validation: 80% train / 20% validation on training files
Scientific setup: Training samples are spectroscopically selected (biased toward bright, easily-observable galaxies). Test samples include all objects down to i < 25.4 — 2.4 magnitudes deeper than the training cut. Training and test are drawn from different distributions.
This is the realistic LSST scenario. Spectroscopic redshift surveys cannot observe the faint majority of the photometric sample, creating a systematic mismatch. Naive ML methods trained on biased spec-z samples fail badly on faint objects (high-z, low-surface-brightness galaxies).
| File | N | z range | z median | i range | NaN bands |
|---|---|---|---|---|---|
| /home/node/work/pz_challenge/data/pz_challenge_taskset_2_cardinal_training_1yr.hdf5 | 100,000 | [0.005, 2.286] | 0.585 | [14.0, 25.1] | none |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_2_cardinal_training_10yr.hdf5 | 100,000 | [0.005, 2.259] | 0.595 | [11.8, 24.5] | none |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_2_flagship_training_1yr.hdf5 | 100,000 | [0.007, 2.997] | 0.639 | [13.7, 24.5] | u: 0.2% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_2_flagship_training_10yr.hdf5 | 100,000 | [0.005, 2.984] | 0.642 | [14.3, 24.3] | u: 0.08% |
Spectroscopic selection breakdown (cardinal 10yr example):
- VVDSf02: 65% of training objects
- DEEP2_LSST: 30%
- DESI_BGS: 1.7%, DESI_LRG: 1.5%, DESI_ELG_LOP: 0.9%, zCOSMOS: 1.0%
| File | N | i range | NaN bands (many more — deeper survey) |
|---|---|---|---|
| /home/node/work/pz_challenge/data/pz_challenge_taskset_2_cardinal_test_1yr.hdf5 | 20,000 | [15.3, 25.5] | u: 22.9%, g: 3.7%, r: 0.9%, z: 1.5%, y: 6.6% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_2_cardinal_test_10yr.hdf5 | 20,000 | [15.3, 25.5] | u: 11.2%, g: 0.3%, y: 0.2% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_2_flagship_test_1yr.hdf5 | 20,000 | [14.3, 25.5] | u: 15.3%, g: 1.4%, r: 0.5%, z: 2.0%, y: 8.4% |
| /home/node/work/pz_challenge/data/pz_challenge_taskset_2_flagship_test_10yr.hdf5 | 20,000 | [14.5, 25.5] | u: 4.8%, g: 0.06%, y: 0.4% |
Key challenges for Task Set 2:
- Test objects are 2.4 mag fainter than training — many will be outside the training distribution entirely
- Heavy NaN rates in test files (especially u-band in 1yr scenario) — algorithm must handle missing photometry gracefully
- Roman near-IR bands (Y, J, H) become especially important at high redshift where LSST u-band drops out
Based on Schmidt et al. 2020, FlexZBoost achieved the best CDE loss in a
head-to-head comparison of 12 methods on LSST-like data. It is available via
pz-rail-flexzboost (already installed in /opt/denario-venv). It uses
gradient-boosted regression trees with conditional density estimation (FlexCode
framework) to produce well-calibrated p(z) posteriors.
RAIL gotcha — mag_limits must cover all bands on both stages.
When using Roman bands (Y/J/H) alongside LSST ugrizy, pass
mag_limits={b: 99.0 for b in bands} to both FlexZBoostInformer.make_stage()
and FlexZBoostEstimator.make_stage(). RAIL's defaults only cover LSST bands,
so _process_chunk will raise KeyError: 'mag_Y_roman' at estimation time if
mag_limits is passed only to the informer.
FlexCode gotcha — FlexCodeModel.predict() returns a tuple, not an array.
If you use the lower-level flexcode.FlexCodeModel directly (instead of
RAIL's FlexZBoost wrapper), model.predict(X, n_grid=N) returns
(cdes, z_grid):
- cdes: density array, shape (n_samples, n_grid); this is what you feed into metrics / normalization.
- z_grid: 1-D redshift grid, shape (n_grid,).
Always unpack explicitly:
cdes, z_grid = model.predict(X_val, n_grid=n_grid)
pz = np.clip(cdes, 0.0, None)
norms = np.trapz(pz, z_grid, axis=1)
norms[norms == 0] = 1.0
pz /= norms[:, None]

Assigning pz = model.predict(...) directly treats the tuple as an ndarray and produces garbage shapes / indexing errors downstream.
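Because the grid returned by the model need not coincide with the common output grid, the unpacked densities may have to be resampled before they are written out. A minimal sketch, assuming linear interpolation is acceptable; `resample_to_grid` is a hypothetical helper and the Gaussian `cdes` array is only a stand-in for real model output.

```python
import numpy as np


def resample_to_grid(cdes, z_src, z_out):
    """Interpolate densities from z_src onto z_out and renormalize each row."""
    cdes = np.clip(np.asarray(cdes, dtype=float), 0.0, None)
    pz = np.vstack([np.interp(z_out, z_src, row, left=0.0, right=0.0) for row in cdes])
    # Trapezoidal norm written out by hand to avoid np.trapz / np.trapezoid version differences
    norms = np.sum(0.5 * (pz[:, 1:] + pz[:, :-1]) * np.diff(z_out), axis=1)
    norms[norms == 0] = 1.0  # leave all-zero rows untouched instead of dividing by zero
    return pz / norms[:, None]


# Stand-in for a (cdes, z_grid) pair as returned by model.predict(...)
z_src = np.linspace(0.0, 3.0, 200)
centers = np.array([0.4, 1.1, 2.0])
cdes = np.exp(-0.5 * ((z_src[None, :] - centers[:, None]) / 0.08) ** 2)

z_out = np.linspace(0.0, 3.0, 301)
pz_out = resample_to_grid(cdes, z_src, z_out)
zmode = z_out[np.argmax(pz_out, axis=1)]
```

Requesting the final grid size directly through `n_grid` avoids the interpolation step when the model's grid limits already match the output grid.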
Use all 18 photometric features: 9 magnitudes + 9 magnitude errors across LSST ugrizy and Roman YJH. Derive color features (magnitude differences between adjacent bands): u-g, g-r, r-i, i-z, z-y, y-Y, Y-J, J-H. Replace NaN magnitudes with a sentinel value (e.g., 99.0) and include binary NaN indicator features — do not drop objects.
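The feature recipe above can be sketched as follows. `build_features` is a hypothetical helper, the input is assumed to be a dict of 1-D arrays keyed by the column names listed earlier, and assigning the sentinel to the colors and errors of missing bands is an assumption of this sketch rather than something the brief prescribes.

```python
import numpy as np

BANDS = ["u_lsst", "g_lsst", "r_lsst", "i_lsst", "z_lsst", "y_lsst",
         "Y_roman", "J_roman", "H_roman"]


def build_features(columns, sentinel=99.0):
    """Return (X, names): magnitudes, errors, adjacent colors, NaN indicators."""
    mags = np.column_stack([np.asarray(columns[f"mag_{b}"], dtype=float) for b in BANDS])
    errs = np.column_stack([np.asarray(columns[f"mag_{b}_err"], dtype=float) for b in BANDS])
    missing = ~np.isfinite(mags)          # True where a band is a non-detection
    colors = mags[:, :-1] - mags[:, 1:]   # u-g, g-r, ..., J-H (NaN if a band is missing)

    # Assumption: colors and errors of missing bands get the same sentinel as magnitudes.
    mags_f = np.where(missing, sentinel, mags)
    errs_f = np.where(missing | ~np.isfinite(errs), sentinel, errs)
    colors_f = np.where(np.isfinite(colors), colors, sentinel)

    X = np.hstack([mags_f, errs_f, colors_f, missing.astype(float)])
    names = (
        [f"mag_{b}" for b in BANDS]
        + [f"mag_{b}_err" for b in BANDS]
        + [f"{a.split('_')[0]}-{b.split('_')[0]}" for a, b in zip(BANDS[:-1], BANDS[1:])]
        + [f"mag_{b}_isnan" for b in BANDS]
    )
    return X, names


# Tiny synthetic example: 3 objects, the second one undetected in u.
demo = {f"mag_{b}": np.array([22.0, 23.0, 24.0]) + 0.1 * i for i, b in enumerate(BANDS)}
demo.update({f"mag_{b}_err": np.full(3, 0.05) for b in BANDS})
demo["mag_u_lsst"] = np.array([22.0, np.nan, 24.0])
X, names = build_features(demo)
```

This yields 35 columns per object (9 magnitudes, 9 errors, 8 colors, 9 indicators), and no object is dropped.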
The spectroscopic selection flags in training files enable density ratio estimation. Use k-nearest-neighbor or kernel density ratio methods to compute per-object weights that upweight training galaxies that are similar to the (deeper) test distribution. Train FlexZBoost with these importance weights. This directly corrects for the covariate shift between training and test.
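One way to read the k-nearest-neighbor option is sketched below: for each training object, count how many test objects fall inside the radius that encloses its k nearest training neighbors. This is an illustrative brute-force version under stated assumptions (Euclidean distance in whatever feature space is passed in, no duplicate rows), with a hypothetical function name; it is not a tuned or validated weighting scheme.

```python
import numpy as np


def knn_importance_weights(X_train, X_test, k=10):
    """Nearest-neighbour density-ratio weights for training objects.

    weight_i ~ (n_test inside r_i / k) * (N_train / N_test), where r_i is the
    distance from training object i to its k-th nearest training neighbour.
    Brute-force O(N^2) distances: fine for a sketch, not for 100,000 rows.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    n_train, n_test = len(X_train), len(X_test)

    d2_tt = ((X_train[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    d2_te = ((X_train[:, None, :] - X_test[None, :, :]) ** 2).sum(axis=-1)

    # The smallest entry in each row is the object itself (distance 0), so the
    # 0-based k-th order statistic is the k-th nearest *other* training object.
    r2 = np.partition(d2_tt, k, axis=1)[:, k]
    n_inside = (d2_te <= r2[:, None]).sum(axis=1)

    w = (n_inside / k) * (n_train / n_test)
    total = w.sum()
    # Rescale to mean 1; fall back to uniform weights if nothing overlaps.
    return w * (n_train / total) if total > 0 else np.ones(n_train)


# Toy covariate shift in one dimension: the test sample sits at larger x.
rng = np.random.default_rng(1)
x_train = rng.normal(0.0, 1.0, size=(400, 1))
x_test = rng.normal(1.0, 1.0, size=(400, 1))
weights = knn_importance_weights(x_train, x_test, k=20)
```

In the toy case, training objects at large x receive larger weights because the test sample is denser there. For the full 100,000-row files a tree-based neighbor search would be needed, and whether the chosen estimator accepts per-object weights has to be checked against its own documentation; resampling the training set in proportion to the weights is a possible fallback.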
Since test files have no true redshifts, evaluate by holding out 20% of the
training data as a validation set. Compute all challenge metrics on the
validation set using pz_data_challenge.metrics. This gives us the exact same
metrics the organizers will compute.
- All 8 test files covered (4 for Task Set 1, 4 for Task Set 2)
- All 3 subtasks implemented per task set (pre-computed estimates, trained models + estimation function, full train+estimate function)
- p(z) output in qp interpolated grid format with 100 grid points over z in [0, 3], plus zmode and object_id in ancillary data
The results report is the primary artifact used to evaluate whether the analysis is successful. It must contain all of the following to be considered complete and ready for the paper stage. An incomplete or vague results report means we must iterate again.
- Which algorithm(s) were implemented (e.g., FlexZBoost, MDN, XGBoost)
- Hyperparameters used (number of trees, basis functions, learning rate, etc.)
- Feature set: which bands and derived colors were used
- How NaN/missing photometry was handled
- For Task Set 2: how covariate shift was corrected (e.g., importance weighting method, KNN vs. kernel density ratio)
For each of the 4 training files (cardinal 1yr, cardinal 10yr, flagship 1yr, flagship 10yr), using an 80/20 train/validation split, report:
| Metric | Target (competitive) | Baseline (sklearn MLP) |
|---|---|---|
| Bias | < 0.005 | ~0.01 |
| SigmaMAD | < 0.02 | ~0.03 |
| OutlierRate | < 0.05 | ~0.10 |
| CDE Loss | as negative as possible | ~-3 to -4 |
| PIT-KS | < 0.05 | ~0.1 |
| PIT-RMSE | < 0.02 | ~0.05 |
| PIT-KL | < 0.05 | ~0.2 |
These targets are based on FlexZBoost performance in Schmidt et al. 2020 on comparable data. The report must include the actual numbers, not just "good" or "improved".
Same 4 files, same split. Additionally report:
- Metrics without importance weighting (naive baseline)
- Metrics with importance weighting (our correction)
- The delta between them — this quantifies our contribution
Expected degradation without correction: SigmaMAD ~2–3× worse than Task Set 1, high outlier rate at faint magnitudes (i > 23). With correction: should recover to within ~30–50% of Task Set 1 performance.
The report must include or reference the following plots for each scenario:
- Photo-z vs. true redshift 2D histogram (zmode vs. z_true)
- Bias, SigmaMAD, OutlierRate as a function of redshift (in bins)
- Bias, SigmaMAD, OutlierRate as a function of i-band magnitude (in bins)
- PIT histogram (should be uniform for well-calibrated posteriors)
- PIT-QQ plot (should be diagonal)
- CDE Loss value
Plots must be saved to disk as PNG files with descriptive filenames.
A summary table comparing all 8 scenarios (2 task sets × 2 simulations × 2 depths) side by side. This reveals:
- Whether 10yr depth helps (it should — better photometry, fewer NaNs)
- Whether Cardinal vs. Flagship differ systematically
- Whether Task Set 2 correction is effective across both simulations
- Training time per file (seconds)
- Inference time per object (milliseconds)
- Model file size (MB)
- These are reported by the challenge organizers — we need our own numbers
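These numbers can be collected with the standard library alone. A minimal sketch with hypothetical helper names; the `sum` call and the temporary file are stand-ins for a real training or inference call and a real model file.

```python
import os
import tempfile
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def per_object_ms(elapsed_seconds, n_objects):
    """Convert a total inference time into milliseconds per object."""
    return 1000.0 * elapsed_seconds / n_objects


def file_size_mb(path):
    """File size in megabytes (1 MB = 10**6 bytes)."""
    return os.path.getsize(path) / 1e6


# Stand-in workload; in practice fn would be the training or inference call.
total, elapsed = timed(sum, range(100_000))

# Stand-in for a model file on disk.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 2_000_000)
size_mb = file_size_mb(tmp.name)
os.remove(tmp.name)
```

For a 20,000-object test file, `per_object_ms(elapsed, 20_000)` gives the per-object inference time to report.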
The report must confirm that the following files were successfully created (these are needed for the actual challenge submission):
For each of the 8 test files, a qp ensemble HDF5 file:
pz_challenge_taskset_1_cardinal_pz_estimate_1yr.hdf5
pz_challenge_taskset_1_cardinal_pz_estimate_10yr.hdf5
pz_challenge_taskset_1_flagship_pz_estimate_1yr.hdf5
pz_challenge_taskset_1_flagship_pz_estimate_10yr.hdf5
pz_challenge_taskset_2_cardinal_pz_estimate_1yr.hdf5
pz_challenge_taskset_2_cardinal_pz_estimate_10yr.hdf5
pz_challenge_taskset_2_flagship_pz_estimate_1yr.hdf5
pz_challenge_taskset_2_flagship_pz_estimate_10yr.hdf5
For each of the 8 training scenarios, a trained model file:
pz_challenge_taskset_1_cardinal_pz_model_1yr.pkl
... (etc.)
All files must pass qp validation: valid ensemble, ancillary data with zmode
and object_id columns, object_ids matching the corresponding test file.
If any scenario performs poorly (e.g., outlier rate > 0.15, CDE loss worse than baseline), the report must explain:
- Which specific objects failed (redshift range, magnitude range)
- Likely cause (e.g., missing u-band at low-z, training distribution mismatch)
- What improvement to try in the next iteration
- All 8 test files have valid qp output files on disk
- Task Set 1 SigmaMAD < 0.025 on both simulations and both depth scenarios
- Task Set 1 CDE Loss is negative and better than a random Gaussian baseline
- Task Set 2 shows measurable improvement from importance weighting vs. naive
- All metric plots are generated and saved
- Computational timing is reported
If any of these are missing, the evaluator should request another iteration with specific guidance on what to fix.
Data directory: /home/node/work/pz_challenge/data/
Challenge repo: /home/node/work/pz_challenge/pz_data_challenge/
Python venv: /opt/denario-venv/bin/python (Python 3.12, all libs installed)
All code must use absolute paths when reading data files.
These are practical details that must be followed to produce valid output.
Read files with h5py directly — do not use the RAIL TableHandle or catalog_utils machinery (it requires a catalogs.yaml config file and is not needed here):
import h5py
import numpy as np

with h5py.File('/home/node/work/pz_challenge/data/pz_challenge_taskset_1_cardinal_training_10yr.hdf5', 'r') as f:
    redshift = f['redshift'][:]      # training files only
    mag_u = f['mag_u_lsst'][:]       # NaN = non-detection
    mag_g = f['mag_g_lsst'][:]
    mag_r = f['mag_r_lsst'][:]
    mag_i = f['mag_i_lsst'][:]
    mag_z = f['mag_z_lsst'][:]
    mag_y = f['mag_y_lsst'][:]
    mag_Y = f['mag_Y_roman'][:]
    mag_J = f['mag_J_roman'][:]
    mag_H = f['mag_H_roman'][:]
    # same pattern for _err columns
    object_id = f['object_id'][:]

Replace NaN values with a sentinel (e.g., 99.0) before passing to any ML model.
Add a binary indicator feature for each band that had NaNs (1 = was NaN, 0 = observed).
Important: Flagship training files contain ~0.3–0.4% duplicate object_ids (known issue in the simulation). Do NOT use object_id as a unique row key for training files; use positional indices (row numbers) instead. This does not affect test files, which have fully unique object_ids.
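A quick way to see how many rows are affected by repeated IDs, and to fall back on positional indices, is sketched here. The `object_id` array is a hypothetical stand-in for the column read from a training file.

```python
import numpy as np

# Hypothetical IDs with one repeated value, standing in for f['object_id'][:]
object_id = np.array([101, 102, 103, 102, 104])

unique_ids, counts = np.unique(object_id, return_counts=True)
duplicated_ids = unique_ids[counts > 1]
dup_fraction = np.isin(object_id, duplicated_ids).mean()  # share of rows with a repeated ID

# Positional row indices are always unique, whatever the IDs look like.
row_index = np.arange(len(object_id))
```

Any bookkeeping for the train/validation split can then be done on `row_index` rather than on `object_id`.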
The output MUST be a qp interpolated grid ensemble with ancillary data containing
zmode and object_id. This exact pattern must be followed:
import qp
import numpy as np
# Define redshift grid (use same grid for all output files)
zgrid = np.linspace(0.0, 3.0, 301) # 301 points
# pz_vals: shape (n_objects, 301) — must be non-negative, ideally normalized
# Each row is a p(z) for one object evaluated at zgrid points
ensemble = qp.interp.create_ensemble(zgrid, pz_vals)
# Compute point estimates (mode of p(z))
zmode = zgrid[np.argmax(pz_vals, axis=1)]
# Add mandatory ancillary data
ensemble.set_ancil({'zmode': zmode, 'object_id': object_id.astype(int)})
# Write to file
ensemble.write_to(output_file)

Do NOT use qp mixmod or other representations; use qp.interp only. The validation harness checks for zmode and object_id in ancil, and both are required.
Save trained models with pickle or joblib. The model file must contain everything needed to run inference on a new test file without retraining:
import pickle

model_bundle = {
    'estimator': trained_flexzboost_or_other_model,
    'feature_cols': list_of_feature_column_names,
    'zgrid': zgrid,
    'nan_sentinel': 99.0,
    'scaler': fitted_scaler_if_used,  # or None
    'importance_weights_method': 'knn_density_ratio',  # for Task Set 2
}
with open(model_file, 'wb') as f:
    pickle.dump(model_bundle, f)

Use a stratified split to ensure the validation set covers the full redshift range:
from sklearn.model_selection import train_test_split
# Stratify by redshift bin to preserve distribution
z_bins = np.digitize(redshift, np.linspace(0, 3, 20))
train_idx, val_idx = train_test_split(
    np.arange(len(redshift)), test_size=0.2, random_state=42, stratify=z_bins
)

The validation set is used ONLY for metric computation after training, never for fitting or hyperparameter selection in the same run.
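One caveat with fixed-width bins is that a sparsely populated high-redshift bin can end up with a single object, and scikit-learn's stratified split raises an error when any class has fewer than two members. A possible hedge, sketched here with a hypothetical helper, is to stratify on equal-population (quantile) bins instead; this is a variation on the recipe above, not part of the original instructions.

```python
import numpy as np


def stratification_labels(redshift, n_bins=20):
    """Equal-population redshift bin labels for use as a `stratify` argument."""
    interior_edges = np.quantile(redshift, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(redshift, interior_edges)


# Skewed toy redshifts with a thin high-z tail
rng = np.random.default_rng(42)
redshift = np.clip(rng.lognormal(mean=-0.6, sigma=0.5, size=5000), 0.0, 3.0)

fixed_labels = np.digitize(redshift, np.linspace(0, 3, 20))
quantile_labels = stratification_labels(redshift)

# Compare how evenly the two schemes populate their bins
fixed_counts = np.bincount(fixed_labels)
quantile_counts = np.bincount(quantile_labels)
```

Passing `quantile_labels` as `stratify=` keeps the split stratified in redshift while guaranteeing that every bin is well populated.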
Use the pz_data_challenge metrics module directly on validation data:
from pz_data_challenge import metrics
import numpy as np
# point_metrics requires truth redshifts and zmode estimates
# Build a minimal test_data dict and submit_data qp ensemble from validation set
# Then call metrics functions directly:
# Bias, SigmaMAD, OutlierRate:
delta = (zmode_val - z_true_val) / (1 + z_true_val)
bias = np.median(delta)
sigma_mad = 1.4826 * np.median(np.abs(delta - bias))
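# Outlier threshold uses sigma_IQR (IQR rescaled to a Gaussian sigma), per the metric definition
q25, q75 = np.percentile(delta, [25, 75])
sigma_iqr = (q75 - q25) / 1.349
outlier_rate = np.mean(np.abs(delta) > max(0.06, 3 * sigma_iqr))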
# CDE Loss and PIT: use qp ensemble evaluated at true redshifts
# (the pz_data_challenge.metrics functions handle this if given a qp ensemble
# and a dict with 'redshift' and 'mag_i_lsst' keys)

All output p(z) estimate files must follow this naming convention exactly:
pz_challenge_taskset_1_cardinal_pz_estimate_1yr.hdf5
pz_challenge_taskset_1_cardinal_pz_estimate_10yr.hdf5
pz_challenge_taskset_1_flagship_pz_estimate_1yr.hdf5
pz_challenge_taskset_1_flagship_pz_estimate_10yr.hdf5
pz_challenge_taskset_2_cardinal_pz_estimate_1yr.hdf5
pz_challenge_taskset_2_cardinal_pz_estimate_10yr.hdf5
pz_challenge_taskset_2_flagship_pz_estimate_1yr.hdf5
pz_challenge_taskset_2_flagship_pz_estimate_10yr.hdf5
Model files:
pz_challenge_taskset_1_cardinal_pz_model_1yr.pkl
pz_challenge_taskset_1_cardinal_pz_model_10yr.pkl
pz_challenge_taskset_1_flagship_pz_model_1yr.pkl
pz_challenge_taskset_1_flagship_pz_model_10yr.pkl
pz_challenge_taskset_2_cardinal_pz_model_1yr.pkl
pz_challenge_taskset_2_cardinal_pz_model_10yr.pkl
pz_challenge_taskset_2_flagship_pz_model_1yr.pkl
pz_challenge_taskset_2_flagship_pz_model_10yr.pkl
Save all output files to the experiment output directory. Use absolute paths.