A production system for running SPINE (Scalable Particle Imaging with Neural Embeddings) reconstruction on HPC clusters with SLURM- and PBS-based batch systems.
SPINE is a deep learning-based reconstruction framework for liquid argon time projection chamber (LArTPC) detectors. This production system provides tools for running SPINE at scale on large datasets using scheduler-managed job arrays.
# Clone the repository
git clone https://github.com/DeepLearnPhysics/spine-prod.git
cd spine-prod
# Configure environment
source configure.sh
SPINE Version Control: Production jobs now run entirely from a tagged SPINE
container image. The default Shifter tag is
docker:ghcr.io/deeplearnphysics/spine:0.12.2, with the matching S3DF
Singularity image derived from the same version at
/sdf/data/neutrino/images/spine_v0-12-2.sif. This container packages SPINE,
OpT0Finder, and runtime dependencies, and jobs invoke the container-provided
spine executable directly.
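The naming pattern that links the version string, the Shifter tag, and the S3DF image path quoted above can be sketched in Python. This only illustrates the pattern; it is not the logic in configure.sh, and the helper name `container_names` and its `image_dir` default are assumptions taken from the paths shown here.

```python
def container_names(version, image_dir="/sdf/data/neutrino/images"):
    """Return (shifter_tag, sif_path) for a SPINE container version.

    Hypothetical helper: it mirrors the naming pattern quoted in this
    README and is not part of spine-prod.
    """
    if version.startswith("v"):
        raise ValueError("version must not carry a leading 'v'")
    tag = f"docker:ghcr.io/deeplearnphysics/spine:{version}"
    # Dots in the version become dashes in the .sif file name.
    sif = f"{image_dir}/spine_v{version.replace('.', '-')}.sif"
    return tag, sif


tag, sif = container_names("0.12.2")
print(tag)  # docker:ghcr.io/deeplearnphysics/spine:0.12.2
print(sif)  # /sdf/data/neutrino/images/spine_v0-12-2.sif
```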
Alternative Container Location: You can override the local .sif path or
container release before sourcing configure.sh:
export SPINE_CONTAINER_PATH=/path/to/spine_v0-12-2.sif
export SPINE_CONTAINER_VERSION=0.12.2
source configure.sh
Updating SPINE Version: Update the container version and site-local image path together:
export SPINE_CONTAINER_VERSION=0.12.2
# Default SPINE_CONTAINER_PATH becomes /sdf/data/neutrino/images/spine_v0-12-2.sif
# Detector shorthand resolves to the latest composite config
./submit.py --config infer/icarus --source /path/to/data.root
# Run on a single file
./submit.py --config infer/icarus/latest --source /path/to/data.root
# Run data-only processing on multiple files (glob)
./submit.py --config infer/icarus/latest --apply-mods data --source /path/to/data/*.root
# Run from a file list (recommended)
./submit.py --config infer/2x2/latest --source-list file_list.txt
# Use a specific resource profile
./submit.py --config infer/icarus/latest --source data/*.root --profile s3df_turing
# Process multiple files per job
./submit.py --config infer/icarus/latest --source data/*.root --files-per-task 5
# Limit parallel tasks
./submit.py --config infer/icarus/latest --source data/*.root --ntasks 50
# Run a multi-stage pipeline
./submit.py --pipeline pipelines/icarus_production_example.yaml
# Dry run (see what would be submitted)
./submit.py --config infer/icarus/latest --source test.root --dry-run
# Interactive mode (test locally without batch submission)
./submit.py --interactive --config infer/icarus/latest --source test.root
./submit.py -I --config infer/icarus/latest --source-list files.txt --task-id 2
Interactive mode (--interactive or -I) runs SPINE processing directly in your current shell without submitting to the batch scheduler. This is particularly useful for:
- Testing configurations before batch submission
- Debugging issues with immediate feedback
- Small-scale runs on login nodes (use sparingly!)
- Config validation with real execution
Interactive mode performs all the same config composition, file chunking, and environment setup as batch mode, but executes locally:
# Test a config on one file
./submit.py -I --config infer/icarus/latest --source /path/to/test.root
# Force container-backed interactive execution
./submit.py -I --interactive-runtime container --config infer/generic/latest --source test.root --set base.world_size=0
# Test with modifiers applied
./submit.py -I --config infer/icarus/latest --source test.root --apply-mods data lite
# Test a specific task from a file list (if using --files-per-task)
./submit.py -I --config infer/icarus/latest --source-list files.txt --files-per-task 5 --task-id 2
Note: Interactive mode is not supported for pipelines. Use --dry-run to preview pipeline submissions.
By default, interactive mode uses the spine executable already on PATH. If
spine is unavailable, it falls back to the configured container: first
SPINE_CONTAINER_PATH with Singularity/Apptainer if the .sif exists, then
SPINE_CONTAINER_TAG with Docker/Podman. Use --interactive-runtime local to require
the local executable, or --interactive-runtime container to force container
execution. Docker/Podman fallback requests linux/amd64 by default; override
SPINE_CONTAINER_PLATFORM if a different platform is needed. For local debugging or
batch jobs that should use an unreleased checkout, pass --spine-path /path/to/spine
to run /path/to/spine/bin/run.py (or /path/to/spine/bin/spine if present)
instead of the spine executable on PATH. The checkout root is added to
container bind paths automatically where supported.
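The fallback order described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the code in submit.py: the name `pick_runtime`, the mode label `"auto"` for the default behaviour, the return labels, and the check that an `apptainer` or `singularity` binary is present are all hypothetical.

```python
import os
import shutil


def pick_runtime(mode="auto", sif_path=None,
                 which=shutil.which, exists=os.path.exists):
    """Return 'local', 'sif', or 'oci' following the order described above.

    Hypothetical helper; `which` and `exists` can be replaced in tests.
    """
    have_local = which("spine") is not None
    if mode == "local":
        # --interactive-runtime local: the executable on PATH is required.
        if not have_local:
            raise RuntimeError("spine executable not found on PATH")
        return "local"
    if mode == "auto" and have_local:
        return "local"
    # Container execution: prefer the .sif with Singularity/Apptainer ...
    if sif_path and exists(sif_path) and (which("apptainer") or which("singularity")):
        return "sif"
    # ... then fall back to the registry tag with Docker/Podman.
    if which("docker") or which("podman"):
        return "oci"
    raise RuntimeError("no usable SPINE runtime found")
```

Calling `pick_runtime("container", os.environ.get("SPINE_CONTAINER_PATH"))` would skip the local executable even when it is on PATH, which corresponds to the `--interactive-runtime container` case.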
./submit.py --spine-path /path/to/spine -I --interactive-runtime local --config infer/generic/latest --source test.root --set base.world_size=0
On EAF, Apptainer may be provided from CVMFS rather than as apptainer on
PATH. Interactive container mode can be pointed at that executable directly,
and can also pass through the extra environment flags needed there:
export SPINE_CONTAINER_RUNTIME_BIN=/cvmfs/eaf.opensciencegrid.org/apptainer/bin/apptainer
export SPINE_CONTAINER_RUNTIME_ARGS="--env LD_PRELOAD= --env LC_ALL=C.UTF-8"
spine-prod/
├── configure.sh # Environment setup script
├── submit.py # Main submission orchestrator (NEW!)
├── README.md # This file
│
├── config/ # All SPINE configs (inference & training)
│ ├── infer/ # Inference configs (referenced as infer/...)
│ │ ├── icarus/ # ICARUS detector configs
│ │ ├── sbnd/ # SBND detector configs
│ │ ├── 2x2/ # 2x2 detector configs
│ │ ├── nd-lar/ # ND-LAr detector configs
│ │ ├── generic/ # Generic (no detector) configs
│ │ └── common/ # Shared configs
│ └── train/ # Training configs (referenced as train/...)
├── templates/ # Job templates
│ ├── profiles.yaml # Resource profiles
│ ├── job_template_s3df.sbatch
│ ├── job_template_nersc.sbatch
│ └── job_template_anl.pbs
│
├── pipelines/ # Multi-stage pipeline definitions
│ └── icarus_production_example.yaml
│
├── scripts/ # Utility scripts
├── tests/ # Test suite
└── jobs/ # Job artifacts (auto-created)
SPINE uses YAML configurations throughout. User-facing configs are either
versioned .yaml files such as infer/icarus/full_chain_co_260501.yaml or
detector shorthands such as infer/icarus or infer/icarus/latest, which
generate a composite YAML config at submission time.
infer/<detector>/
├── full_chain_*.yaml # Version-specific top-level configs
├── base/ # Base component YAMLs
├── io/ # IO component YAMLs
├── model/ # Model component YAMLs
├── post/ # Post-processing component YAMLs
└── modifier/ # Optional modifier YAMLs
# Latest composite request (generated at submission time)
infer/icarus/latest
# Latest data-only configuration
infer/icarus/latest --apply-mods data
# Latest NuMI configuration
infer/icarus/latest --apply-mods numi
# Specific version with cosmic overlay
infer/icarus/full_chain_co_260501.yaml
# Data with lite outputs
infer/icarus/full_chain_co_260501.yaml --apply-mods data lite
See individual config directories for detector-specific documentation.
Resource profiles define batch resource requirements for different use cases. Profiles are defined in templates/profiles.yaml.
Understanding the available resources on each partition helps justify the profile configurations:
| Partition | GPUs/Node | GPU Type | CPUs/Node | RAM/Node | Resources per GPU |
|---|---|---|---|---|---|
| hopper | 4 | H200 (141 GB) | 224 | 1344 GB | 56 CPUs, 336 GB |
| ampere | 4 | A100 (40 GB) | 112 | 952 GB | 28 CPUs, 238 GB |
| turing | 10 | RTX 2080 Ti (11 GB) | 40 | 160 GB | 4 CPUs, 16 GB |
| milano | 0 | - | 120 | 480 GB | - |
| roma | 0 | - | 120 | 480 GB | - |
Profile allocations are designed to:
- Hopper: Request full resources per GPU (56 CPUs × 6 GB/CPU = 336 GB per GPU)
- Ampere: Request all CPUs per GPU at 8 GB/CPU (28 CPUs × 8 GB/CPU = 224 GB per GPU, just under the 238 GB per-GPU share of node RAM)
- Turing: Request full resources per GPU (4 CPUs × 4 GB/CPU = 16 GB per GPU)
- CPU nodes: Request minimal resources (1 CPU × 4 GB = 4 GB) for flexible scheduling
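The per-GPU arithmetic behind these allocations can be checked with a short script. The node figures are copied from the S3DF partition table and the memory-per-CPU values from the s3df_* profiles in this section; the dictionary layout and the helper name `per_gpu_share` are illustrative only.

```python
# Per-node figures copied from the S3DF partition table.
PARTITIONS = {
    "hopper": {"gpus": 4, "cpus": 224, "ram_gb": 1344},
    "ampere": {"gpus": 4, "cpus": 112, "ram_gb": 952},
    "turing": {"gpus": 10, "cpus": 40, "ram_gb": 160},
}

# Memory per CPU requested by the matching s3df_* profiles.
MEM_PER_CPU_GB = {"hopper": 6, "ampere": 8, "turing": 4}


def per_gpu_share(name):
    """Return (cpus_per_gpu, ram_per_gpu_gb, requested_ram_gb)."""
    node = PARTITIONS[name]
    cpus = node["cpus"] // node["gpus"]
    ram = node["ram_gb"] / node["gpus"]
    requested = cpus * MEM_PER_CPU_GB[name]
    return cpus, ram, requested


for name in PARTITIONS:
    cpus, ram, requested = per_gpu_share(name)
    print(f"{name}: {cpus} CPUs, {ram:g} GB available, {requested} GB requested per GPU")
```

For Hopper and Turing the requested memory equals the per-GPU share; for Ampere the 8 GB/CPU setting yields 224 GB against a 238 GB share.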
| Profile | Partition | GPU Type | GPU Memory | GPUs | CPUs | Memory | Time | Use Case |
|---|---|---|---|---|---|---|---|---|
| s3df_hopper | hopper | H200 | 141 GB | 1 | 56 | 6 GB/CPU | 2h | Highest-performance GPU processing |
| s3df_ampere | ampere | A100 | 40 GB | 1 | 28 | 8 GB/CPU | 2h | High-performance GPU processing (default) |
| s3df_turing | turing | RTX 2080 Ti | 11 GB | 1 | 4 | 4 GB/CPU | 2h | Cheaper GPU inference |
| s3df_milano | milano | - | - | 0 | 1 | 4 GB/CPU | 2h | CPU-only analysis |
| s3df_roma | roma | - | - | 0 | 1 | 4 GB/CPU | 2h | CPU-only analysis |
NERSC Perlmutter is a heterogeneous system with GPU nodes in two configurations:
| Node Type | Count | GPUs/Node | GPU Type | CPUs/Node | RAM/Node | Resources per GPU |
|---|---|---|---|---|---|---|
| GPU (40GB) | 1,536 | 4 | A100 (40 GB) | 64 | 512 GB | 32 CPUs, 128 GB |
| GPU (80GB) | 256 | 4 | A100 (80 GB) | 64 | 512 GB | 32 CPUs, 128 GB |
| CPU | 3,072 | 0 | - | 128 | 512 GB | - |
Profile allocations are designed to:
- GPU nodes: Request full resources per GPU (32 CPUs × 4 GB/CPU = 128 GB per GPU)
- CPU nodes: Request minimal resources (1 CPU × 4 GB = 4 GB) for flexible scheduling
- Shared partitions: Allow partial node allocation for cost-efficient small jobs
| Profile | Partition | GPU Type | GPU Memory | GPUs | CPUs | Memory | Time | Use Case |
|---|---|---|---|---|---|---|---|---|
| nersc_gpu | gpu_ss11 | A100 | 40 GB | 1 | 32 | 4 GB/CPU | 2h | Standard GPU processing (default, best availability) |
| nersc_gpu_80gb | gpu_ss11 | A100 | 80 GB | 1 | 32 | 4 GB/CPU | 2h | High-memory GPU processing (limited availability) |
| nersc_gpu_exclusive | gpu | A100 | 40 GB | 4 | 32 | 4 GB/CPU | 2h | Full-node exclusive access (training) |
| nersc_cpu | shared | - | - | 0 | 1 | 4 GB/CPU | 2h | CPU-only analysis |
Note: The nersc_gpu profile uses 40 GB A100s by default because six times as many of those nodes are available (1,536 vs. 256), which generally means shorter queue times. Use nersc_gpu_80gb only when you specifically need more than 40 GB of GPU memory.
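The rule of thumb in the note above can be written as a tiny selector. This is a sketch of the guidance only, not the profile auto-detection in submit.py; the function name `pick_nersc_gpu_profile` and its argument are hypothetical.

```python
def pick_nersc_gpu_profile(required_gpu_mem_gb):
    """Pick a single-GPU NERSC profile from the GPU memory a job needs.

    Hypothetical helper that encodes the note above: prefer the 40 GB
    nodes and use the 80 GB nodes only when the job does not fit.
    """
    if required_gpu_mem_gb <= 0:
        raise ValueError("required GPU memory must be positive")
    if required_gpu_mem_gb <= 40:
        return "nersc_gpu"
    if required_gpu_mem_gb <= 80:
        return "nersc_gpu_80gb"
    raise ValueError("no single-GPU profile offers more than 80 GB")


print(pick_nersc_gpu_profile(24))  # nersc_gpu
print(pick_nersc_gpu_profile(60))  # nersc_gpu_80gb
```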
Profiles are auto-detected based on detector and config, or can be specified explicitly:
# Auto-detect (default)
./submit.py --config infer/icarus/latest --source data.root
# Explicit profile
./submit.py --config infer/icarus/latest --source data.root --profile s3df_turing
# ANL/Polaris using SPINE_CONTAINER_PATH from configure.sh
./submit.py --config infer/icarus/latest --source data.root --profile anl_polaris_debug
# Override specific resources
./submit.py --config infer/icarus/latest --source data.root --time 2:00:00 --cpus-per-task 8
# Override SPINE configuration values at runtime
./submit.py --config infer/generic/latest --source data.root --set base.world_size=0
# Preload model weights on the submit host before submitting
./submit.py --config infer/2x2/full_chain_240819.yaml --source data.root --profile anl_polaris_debug --preload
# Optional: preload only, useful for external production pipelines
./scripts/preload_downloads.py infer/2x2/full_chain_240819.yaml
Pipelines allow you to chain multiple processing stages with automatic dependency management.
Create a YAML file in pipelines/:
stages:
  - name: reconstruction
    config: infer/icarus/latest
    files: /path/to/raw/*.root
    profile: s3df_ampere
    ntasks: 100
    # Replace with a concrete YAML config if you need specific modifiers
  - name: analysis
    depends_on: [reconstruction]  # Wait for reconstruction to complete
    config: path/to/downstream_stage.yaml
    files: output_reco/*.h5
    profile: s3df_milano
    ntasks: 20
./submit.py --pipeline pipelines/my_pipeline.yaml
Each submission creates a timestamped directory in jobs/:
jobs/20260101_143022_spine_icarus_latest/
├── job_metadata.json # Complete job metadata
├── files_chunk_0.txt # Input file lists
├── submit_chunk_0.sbatch # Generated submission script (.sbatch or .pbs)
├── logs/ # Batch stdout/stderr
│ ├── spine_icarus_latest_12345_1.out
│ └── spine_icarus_latest_12345_1.err
└── output/ # Output files
└── spine_icarus_latest.h5
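Because each job directory name starts with a timestamp, the most recent submission can be located by sorting the names. The sketch below loads job_metadata.json from the newest directory; the helper name `latest_job_metadata` is hypothetical, and it assumes every directory name begins with a YYYYMMDD_HHMMSS prefix as in the example above.

```python
import json
from pathlib import Path


def latest_job_metadata(jobs_dir="jobs"):
    """Load job_metadata.json from the most recent job directory.

    Assumes directory names start with a YYYYMMDD_HHMMSS timestamp, so
    lexicographic order matches chronological order. Returns None when
    no job directory with metadata is found.
    """
    candidates = sorted(
        d for d in Path(jobs_dir).iterdir()
        if d.is_dir() and (d / "job_metadata.json").is_file()
    )
    if not candidates:
        return None
    with open(candidates[-1] / "job_metadata.json") as handle:
        return json.load(handle)
```

A call such as `latest_job_metadata()` from the repository root returns the parsed metadata as a dictionary, from which fields like the scheduler job IDs can be read.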
# View job status on SLURM
squeue -u $USER
# View job details on SLURM
scontrol show job <job_id>
# View job status on PBS
qstat -u $USER
# View job details on PBS
qstat -fx <job_id>
# View logs
tail -f jobs/<job_dir>/logs/spine_*.out
# Cancel job on SLURM
scancel <job_id>
# Cancel job on PBS
qdel <job_id>
Each job saves complete metadata for reproducibility:
{
  "job_name": "spine_icarus_latest",
  "detector": "icarus",
  "config": "infer/icarus/latest",
  "profile": "s3df_ampere",
  "num_files": 100,
  "job_ids": ["12345", "12346"],
  "submitted": "2026-01-01T14:30:22",
  "command": "./submit.py --config ..."
}
Pipelines can automatically clean up intermediate outputs once downstream stages complete. Add a cleanup field to any stage that produces temporary files:
stages:
  - name: reconstruction
    config: infer/icarus/latest
    files: /path/to/input/*.root
    output: output_reco
    # Clean up output_reco/ after all dependent stages finish
    cleanup:
      - output_reco
      - temp_files
  - name: analysis
    depends_on: [reconstruction]
    config: path/to/downstream_stage.yaml
    files: output_reco/*.h5
    output: output_analysis
The cleanup job:
- Only runs if downstream stages complete successfully (afterok dependency)
- Runs as a minimal resource job (1 CPU, 1 GB RAM, 10 min timeout)
- Safely checks for path existence before removal
- Logs all cleanup actions for auditing
This is especially useful for large-scale production to save disk space by removing intermediate reconstruction outputs after final analysis completes.
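The "check for existence, remove, log" behaviour listed above can be sketched as follows. This is not the cleanup script shipped with spine-prod; the function name `cleanup_paths` and the logger name are hypothetical, and the sketch only shows the safety checks.

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("cleanup")


def cleanup_paths(paths):
    """Remove each existing path and return the list of paths removed."""
    removed = []
    for raw in paths:
        path = Path(raw)
        if not path.exists():
            # Missing paths are logged and skipped rather than treated as errors.
            log.info("skip %s (not found)", path)
            continue
        if path.is_dir() and not path.is_symlink():
            shutil.rmtree(path)
        else:
            # Regular files and symlinks are unlinked, never followed.
            path.unlink()
        log.info("removed %s", path)
        removed.append(str(path))
    return removed
```

With the stage definition above, `cleanup_paths(["output_reco", "temp_files"])` would delete whichever of the two entries exist and report only those.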
# Use custom LArCV installation
./submit.py --config infer/icarus/latest --source data.root --larcv-path /path/to/larcv
# Use custom flash-matching setup
./submit.py --config infer/icarus/latest --source data.root --flashmatch-path /path/to/flashmatch
# Expose CVMFS inside the container
./submit.py --config infer/icarus/latest --source data.root --cvmfs
There is no need to pass --flashmatch. The flag is accepted only for backward
compatibility and is ignored. Use --flashmatch-path to source a custom
flash-matching setup instead.
For sites without CVMFS, point ICARUS configs at a local copy of the
icarus_data release directory before sourcing configure.sh:
export ICARUS_DATA_DIR=/path/to/icarus_data
source configure.sh
# Submit with dependency on another job
./submit.py --config path/to/downstream_stage.yaml --source output/*.h5 --dependency afterok:12345
# Process 5 files per job (reduces overhead)
./submit.py --config infer/icarus/latest --source data/*.root --files-per-task 5
# Limit concurrent tasks to 50
./submit.py --config infer/icarus/latest --source data/*.root --ntasks 50
ICARUS uses split cryostat processing with cosmic overlay:
# Standard cosmic overlay processing
./submit.py --config infer/icarus/latest --source data.root
# Data-only mode (no truth labels)
./submit.py --config infer/icarus/latest --apply-mods data --source data.root
# NuMI beam configuration
./submit.py --config infer/icarus/latest --apply-mods numi --source data.root
# Lite output (reduced file size)
./submit.py --config infer/icarus/latest --apply-mods data lite --source data.root
./submit.py --config infer/sbnd/latest --source data.root
2x2 uses higher resource requirements:
./submit.py --config infer/2x2/latest --source data.root --profile s3df_ampere
./submit.py --config infer/nd-lar/latest --source data.root
WARNING: SPINE_PROD_BASEDIR not set. Did you source configure.sh?
Solution: Source the environment:
source configure.sh
ERROR: jinja2 is required. Install with: pip install jinja2
Solution: Install Python dependencies:
pip install jinja2 pyyaml
- Check batch logs in jobs/<job_dir>/logs/
- Review job metadata in jobs/<job_dir>/job_metadata.json
- Test configuration on a single file with --dry-run
- Verify input files exist and are accessible
Solution: Use a profile with more memory or override memory:
./submit.py --config infer/icarus/latest --source data.root --profile s3df_ampere
Or override memory:
./submit.py --config infer/icarus/latest --source data.root --mem-per-cpu 16g
Solution: Request more time:
./submit.py --config infer/icarus/latest --source data.root --time 4:00:00
Always test configurations on a small sample:
# Test with dry run
./submit.py --config infer/icarus/latest --source test.root --dry-run
# Test with single file
./submit.py --config infer/icarus/latest --source test.root
- Use s3df_ampere for high-performance GPU processing (default)
- Use s3df_turing for cheaper GPU inference
- Use s3df_milano or s3df_roma for CPU-only analysis
# For many small files, batch them
./submit.py --config infer/icarus/latest --source small_files/*.root --files-per-task 10
# For large files, process individually
./submit.py --config infer/icarus/latest --source large_files/*.root --files-per-task 1
Check actual resource usage to optimize future jobs. On SLURM systems:
seff <job_id>
Job metadata is automatically saved. Keep important job directories:
# Jobs are in timestamped directories
ls -lt jobs/
Set by configure.sh:
- SPINE_PROD_BASEDIR - Base directory of this repository
- SPINE_CONFIG_PATH - Configuration search path
- ICARUS_DATA_DIR - ICARUS data release path
- SPINE_CONTAINER_VERSION - Tagged SPINE container version, without a leading v
- SPINE_CONTAINER_PATH - Singularity/Apptainer image path
- SPINE_CONTAINER_TAG - Registry image tag for Shifter-style runtimes, including docker:
- SPINE_CONTAINER_PATH_AUTO - Tracks whether SPINE_CONTAINER_PATH was auto-derived
- SPINE_CONTAINER_RUNTIME_BIN - Optional full path or command name for the Singularity/Apptainer executable used by interactive SIF execution
- SPINE_CONTAINER_RUNTIME_ARGS - Optional extra Singularity/Apptainer arguments for interactive SIF execution
- SPINE_CONTAINER_PLATFORM - Docker/Podman platform for interactive fallback
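A quick way to confirm that configure.sh has been sourced is to check for these variables before submitting. The sketch below is an assumption-laden illustration: which variables count as required (`REQUIRED_VARS`) and the helper name `missing_vars` are choices made here, not taken from submit.py.

```python
import os

# Assumed minimal set; adjust to the variables your workflow depends on.
REQUIRED_VARS = ("SPINE_PROD_BASEDIR", "SPINE_CONFIG_PATH", "SPINE_CONTAINER_VERSION")


def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_vars()
if missing:
    print("Not set: " + ", ".join(missing) + ". Did you source configure.sh?")
```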
Edit templates/profiles.yaml:
profiles:
  my_custom_profile:
    partition: my_partition
    gpus: 2
    cpus_per_task: 16
    mem_per_cpu: 8g
    time: "6:00:00"
    description: "My custom profile description"
Edit templates/profiles.yaml:
detectors:
  my_detector:
    default_profile: s3df_ampere
    configs_dir: infer/my_detector
    account: "my:account"
For contributors who need to run tests and development tools:
# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Linux/Mac
# Install SPINE (required for running parsing tests)
pip install "spine-ml @ git+https://github.com/DeepLearnPhysics/SPINE.git@main"
# Install development dependencies
pip install -r requirements-dev.txt
# Install pre-commit hooks
pre-commit install
Note: SPINE installation is only required for development if you want to run the configuration parsing tests. Production users only need to point to an existing SPINE installation via configure.sh.
# Run all tests
pytest
# Run config validation tests only
pytest tests/test_config_validation.py
# Run with coverage
pytest --cov=. --cov-report=html
This repository uses pre-commit hooks for code quality:
- check-yaml: Validates YAML syntax
- yamllint: Lints YAML files for style
- prettier: Formats YAML files
- trailing-whitespace: Removes trailing whitespace
- end-of-file-fixer: Ensures files end with newline
# Run hooks manually
pre-commit run --all-files
All configuration files are automatically validated in CI to ensure they parse correctly:
from config import load_config
config = load_config('infer/icarus/latest.yaml')
A companion tool for indexing and browsing SPINE production metadata. See spine-db for documentation.
- Issues: https://github.com/DeepLearnPhysics/spine-prod/issues
- SPINE Documentation: https://github.com/DeepLearnPhysics/spine
- Contact: SPINE development team
This software is provided under the same license as SPINE.
If you use SPINE in your research, please cite the relevant SPINE publications.
Development supported by the DOE Office of High Energy Physics and the National Science Foundation.