SPINE Production System

A production system for running SPINE (Scalable Particle Imaging with Neural Embeddings) reconstruction on HPC clusters with SLURM- and PBS-based batch systems.

Overview

SPINE is a deep learning-based reconstruction framework for liquid argon time projection chamber (LArTPC) detectors. This production system provides tools for running SPINE at scale on large datasets using scheduler-managed job arrays.

Quick Start

1. Environment Setup

# Clone the repository
git clone https://github.com/DeepLearnPhysics/spine-prod.git
cd spine-prod

# Configure environment
source configure.sh

SPINE Version Control: Production jobs now run entirely from a tagged SPINE container image. The default Shifter tag is docker:ghcr.io/deeplearnphysics/spine:0.12.2, with the matching S3DF Singularity image derived from the same version at /sdf/data/neutrino/images/spine_v0-12-2.sif. This container packages SPINE, OpT0Finder, and runtime dependencies, and jobs invoke the container-provided spine executable directly.

Alternative Container Location: You can override the local .sif path or container release before sourcing configure.sh:

export SPINE_CONTAINER_PATH=/path/to/spine_v0-12-2.sif
export SPINE_CONTAINER_VERSION=0.12.2
source configure.sh

Updating SPINE Version: Update the container version and site-local image path together:

export SPINE_CONTAINER_VERSION=0.12.2
# Default SPINE_CONTAINER_PATH becomes /sdf/data/neutrino/images/spine_v0-12-2.sif
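The relationship between the version, the image path, and the registry tag can be sketched in a few lines of Python. This is only an illustration of the naming pattern visible in the defaults above; the function name `container_defaults` and the `image_dir` parameter are hypothetical and not part of configure.sh.

```python
def container_defaults(version: str,
                       image_dir: str = "/sdf/data/neutrino/images") -> dict:
    """Derive the default image path and registry tag from a version string.

    Hypothetical helper: it mirrors the naming pattern of the documented
    defaults, not the actual logic inside configure.sh.
    """
    # 0.12.2 -> spine_v0-12-2.sif (dots become dashes, a 'v' is prefixed)
    sif_name = f"spine_v{version.replace('.', '-')}.sif"
    return {
        "SPINE_CONTAINER_VERSION": version,
        "SPINE_CONTAINER_PATH": f"{image_dir}/{sif_name}",
        "SPINE_CONTAINER_TAG": f"docker:ghcr.io/deeplearnphysics/spine:{version}",
    }


defaults = container_defaults("0.12.2")
for name, value in defaults.items():
    print(f"{name}={value}")
```

Note that the version is written without a leading v, while the image file name adds one.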

2. Basic Job Submission

# Detector shorthand resolves to the latest composite config
./submit.py --config infer/icarus --source /path/to/data.root

# Run on a single file
./submit.py --config infer/icarus/latest --source /path/to/data.root

# Run data-only processing on multiple files (glob)
./submit.py --config infer/icarus/latest --apply-mods data --source /path/to/data/*.root

# Run from a file list (recommended)
./submit.py --config infer/2x2/latest --source-list file_list.txt

3. Advanced Usage

# Use a specific resource profile
./submit.py --config infer/icarus/latest --source data/*.root --profile s3df_turing

# Process multiple files per job
./submit.py --config infer/icarus/latest --source data/*.root --files-per-task 5

# Limit parallel tasks
./submit.py --config infer/icarus/latest --source data/*.root --ntasks 50

# Run a multi-stage pipeline
./submit.py --pipeline pipelines/icarus_production_example.yaml

# Dry run (see what would be submitted)
./submit.py --config infer/icarus/latest --source test.root --dry-run

# Interactive mode (test locally without batch submission)
./submit.py --interactive --config infer/icarus/latest --source test.root
./submit.py -I --config infer/icarus/latest --source-list files.txt --task-id 2

Interactive Mode

Interactive mode (--interactive or -I) runs SPINE processing directly in your current shell without submitting to the batch scheduler. This is particularly useful for:

  • Testing configurations before batch submission
  • Debugging issues with immediate feedback
  • Small-scale runs on login nodes (use sparingly!)
  • Config validation with real execution

Interactive mode performs all the same config composition, file chunking, and environment setup as batch mode, but executes locally:

# Test a config on one file
./submit.py -I --config infer/icarus/latest --source /path/to/test.root

# Force container-backed interactive execution
./submit.py -I --interactive-runtime container --config infer/generic/latest --source test.root --set base.world_size=0

# Test with modifiers applied
./submit.py -I --config infer/icarus/latest --source test.root --apply-mods data lite

# Test a specific task from a file list (if using --files-per-task)
./submit.py -I --config infer/icarus/latest --source-list files.txt --files-per-task 5 --task-id 2
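The file chunking mentioned above can be pictured with a minimal Python sketch. It assumes that files are grouped in order into fixed-size chunks, one chunk per task; `chunk_files` is a hypothetical name, and how submit.py numbers tasks for --task-id is not specified here.

```python
def chunk_files(files, files_per_task=1):
    """Split an ordered file list into consecutive chunks, one per task.

    Hypothetical helper illustrating --files-per-task; the real
    implementation in submit.py may differ.
    """
    if files_per_task < 1:
        raise ValueError("files_per_task must be at least 1")
    return [files[i:i + files_per_task]
            for i in range(0, len(files), files_per_task)]


# Twelve input files at five files per task give three tasks.
files = [f"file_{i:02d}.root" for i in range(12)]
chunks = chunk_files(files, files_per_task=5)
for index, chunk in enumerate(chunks):
    print(index, len(chunk), chunk[0], "...", chunk[-1])
```

The last chunk is simply shorter when the file count is not a multiple of the chunk size.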

Note: Interactive mode is not supported for pipelines. Use --dry-run to preview pipeline submissions.

By default, interactive mode uses the spine executable already on PATH. If spine is unavailable, it falls back to the configured container: first SPINE_CONTAINER_PATH with Singularity/Apptainer if the .sif exists, then SPINE_CONTAINER_TAG with Docker/Podman. Use --interactive-runtime local to require the local executable, or --interactive-runtime container to force container execution. Docker/Podman fallback requests linux/amd64 by default; override SPINE_CONTAINER_PLATFORM if a different platform is needed. For local debugging or batch jobs that should use an unreleased checkout, pass --spine-path /path/to/spine to run /path/to/spine/bin/run.py (or /path/to/spine/bin/spine if present) instead of the spine executable on PATH. The checkout root is added to container bind paths automatically where supported.

./submit.py --spine-path /path/to/spine -I --interactive-runtime local --config infer/generic/latest --source test.root --set base.world_size=0
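The runtime selection order described above can be sketched as follows. This is an assumption-laden illustration, not the code in submit.py: the function `select_runtime`, the mode name "auto" for the default behavior, and the order in which apptainer/singularity and docker/podman are probed are all hypothetical.

```python
import os
import shutil


def select_runtime(mode="auto", sif_path=None, tag=None,
                   which=shutil.which, exists=os.path.exists):
    """Pick how to run spine interactively; returns (kind, executable).

    Hypothetical sketch of the documented order: local executable first,
    then a .sif image via Apptainer/Singularity, then Docker/Podman.
    """
    if mode in ("auto", "local"):
        local = which("spine")
        if local:
            return ("local", local)
        if mode == "local":
            raise RuntimeError("spine executable not found on PATH")
    # Container fallback (or forced container mode).
    if sif_path and exists(sif_path):
        for name in ("apptainer", "singularity"):
            exe = which(name)
            if exe:
                return ("sif", exe)
    if tag:
        for name in ("docker", "podman"):
            exe = which(name)
            if exe:
                return ("oci", exe)
    raise RuntimeError("no usable SPINE runtime found")


# Simulate a host without a local spine but with Apptainer installed.
fake_path = {"apptainer": "/usr/bin/apptainer"}
kind, exe = select_runtime(sif_path="/images/spine.sif",
                           tag="docker:example/spine:0.0.0",
                           which=fake_path.get,
                           exists=lambda path: True)
print(kind, exe)
```

The `which` and `exists` arguments are injected only so the sketch can be exercised without touching the real system.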

EAF Interactive Container Setup

On EAF, Apptainer may be provided from CVMFS rather than as apptainer on PATH. Interactive container mode can be pointed at that executable directly, and can also pass through the extra environment flags needed there:

export SPINE_CONTAINER_RUNTIME_BIN=/cvmfs/eaf.opensciencegrid.org/apptainer/bin/apptainer
export SPINE_CONTAINER_RUNTIME_ARGS="--env LD_PRELOAD= --env LC_ALL=C.UTF-8"

Directory Structure

spine-prod/
├── configure.sh             # Environment setup script
├── submit.py                # Main submission orchestrator (NEW!)
├── README.md                # This file
│
├── config/                  # All SPINE configs (inference & training)
│   ├── infer/               # Inference configs (referenced as infer/...)
│   │   ├── icarus/          # ICARUS detector configs
│   │   ├── sbnd/            # SBND detector configs
│   │   ├── 2x2/             # 2x2 detector configs
│   │   ├── nd-lar/          # ND-LAr detector configs
│   │   ├── generic/         # Generic (no detector) configs
│   │   └── common/          # Shared configs
│   └── train/               # Training configs (referenced as train/...)
├── templates/               # Job templates
│   ├── profiles.yaml        # Resource profiles
│   ├── job_template_s3df.sbatch
│   ├── job_template_nersc.sbatch
│   └── job_template_anl.pbs
│
├── pipelines/               # Multi-stage pipeline definitions
│   └── icarus_production_example.yaml
│
├── scripts/                 # Utility scripts
├── tests/                   # Test suite
└── jobs/                    # Job artifacts (auto-created)

Configuration System

SPINE uses YAML configurations throughout. User-facing configs are either versioned .yaml files such as infer/icarus/full_chain_co_260501.yaml or detector shorthands such as infer/icarus or infer/icarus/latest, which generate a composite YAML config at submission time.

Config Organization

infer/<detector>/
├── full_chain_*.yaml             # Version-specific top-level configs
├── base/                         # Base component YAMLs
├── io/                           # IO component YAMLs
├── model/                        # Model component YAMLs
├── post/                         # Post-processing component YAMLs
└── modifier/                     # Optional modifier YAMLs

Example: ICARUS Configurations

# Latest composite request (generated at submission time)
infer/icarus/latest

# Latest data-only configuration
infer/icarus/latest --apply-mods data

# Latest NuMI configuration
infer/icarus/latest --apply-mods numi

# Specific version with cosmic overlay
infer/icarus/full_chain_co_260501.yaml

# Data with lite outputs
infer/icarus/full_chain_co_260501.yaml --apply-mods data lite

See individual config directories for detector-specific documentation.

Resource Profiles

Resource profiles define batch resource requirements for different use cases. Profiles are defined in templates/profiles.yaml.

S3DF Node Resources

Understanding the available resources on each partition helps justify the profile configurations:

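| Partition | GPUs/Node | GPU Type            | CPUs/Node | RAM/Node | Resources per GPU |
|-----------|-----------|---------------------|-----------|----------|-------------------|
| hopper    | 4         | H200 (141 GB)       | 224       | 1344 GB  | 56 CPUs, 336 GB   |
| ampere    | 4         | A100 (40 GB)        | 112       | 952 GB   | 28 CPUs, 238 GB   |
| turing    | 10        | RTX 2080 Ti (11 GB) | 40        | 160 GB   | 4 CPUs, 16 GB     |
| milano    | 0         | -                   | 120       | 480 GB   | -                 |
| roma      | 0         | -                   | 120       | 480 GB   | -                 |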

Profile allocations are designed to:

  • Hopper: Request full resources per GPU (56 CPUs × 6 GB/CPU = 336 GB per GPU)
  • Ampere: Request nearly the full per-GPU share (28 CPUs × 8 GB/CPU = 224 GB of the 238 GB available per GPU)
  • Turing: Request full resources per GPU (4 CPUs × 4 GB/CPU = 16 GB per GPU)
  • CPU nodes: Request minimal resources (1 CPU × 4 GB = 4 GB) for flexible scheduling
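The per-GPU arithmetic behind these allocations can be checked with a short Python sketch. The node figures and request sizes are copied from the table and list above; the dictionary layout and the name `per_gpu_share` are illustrative only.

```python
# Node resources per S3DF GPU partition, as listed in the table above.
NODES = {
    "hopper": {"gpus": 4, "cpus": 224, "ram_gb": 1344},
    "ampere": {"gpus": 4, "cpus": 112, "ram_gb": 952},
    "turing": {"gpus": 10, "cpus": 40, "ram_gb": 160},
}

# Profile requests per GPU: (CPUs, GB per CPU).
REQUESTS = {"hopper": (56, 6), "ampere": (28, 8), "turing": (4, 4)}


def per_gpu_share(node):
    """Return the (CPUs, RAM in GB) available to each GPU on a node."""
    return node["cpus"] // node["gpus"], node["ram_gb"] // node["gpus"]


for name, node in NODES.items():
    cpus, ram_gb = per_gpu_share(node)
    req_cpus, gb_per_cpu = REQUESTS[name]
    requested_gb = req_cpus * gb_per_cpu
    print(f"{name}: share {cpus} CPUs / {ram_gb} GB, "
          f"request {req_cpus} CPUs / {requested_gb} GB")
```

A request fits on the node as long as the requested memory does not exceed the per-GPU share.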

Available Profiles

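| Profile     | Partition | GPU Type    | GPU Memory | GPUs | CPUs | Memory   | Time | Use Case                                  |
|-------------|-----------|-------------|------------|------|------|----------|------|-------------------------------------------|
| s3df_hopper | hopper    | H200        | 141 GB     | 1    | 56   | 6 GB/CPU | 2h   | Highest-performance GPU processing        |
| s3df_ampere | ampere    | A100        | 40 GB      | 1    | 28   | 8 GB/CPU | 2h   | High-performance GPU processing (default) |
| s3df_turing | turing    | RTX 2080 Ti | 11 GB      | 1    | 4    | 4 GB/CPU | 2h   | Cheaper GPU inference                     |
| s3df_milano | milano    | -           | -          | 0    | 1    | 4 GB/CPU | 2h   | CPU-only analysis                         |
| s3df_roma   | roma      | -           | -          | 0    | 1    | 4 GB/CPU | 2h   | CPU-only analysis                         |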

NERSC Perlmutter Node Resources

NERSC Perlmutter is a heterogeneous system with GPU nodes in two configurations:

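| Node Type   | Count | GPUs/Node | GPU Type     | CPUs/Node | RAM/Node | Resources per GPU |
|-------------|-------|-----------|--------------|-----------|----------|-------------------|
| GPU (40 GB) | 1,536 | 4         | A100 (40 GB) | 64        | 512 GB   | 32 CPUs, 128 GB   |
| GPU (80 GB) | 256   | 4         | A100 (80 GB) | 64        | 512 GB   | 32 CPUs, 128 GB   |
| CPU         | 3,072 | 0         | -            | 128       | 512 GB   | -                 |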

Profile allocations are designed to:

  • GPU nodes: Request full resources per GPU (32 CPUs × 4 GB/CPU = 128 GB per GPU)
  • CPU nodes: Request minimal resources (1 CPU × 4 GB = 4 GB) for flexible scheduling
  • Shared partitions: Allow partial node allocation for cost-efficient small jobs

NERSC Available Profiles

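| Profile             | Partition | GPU Type | GPU Memory | GPUs | CPUs | Memory   | Time | Use Case                                             |
|---------------------|-----------|----------|------------|------|------|----------|------|------------------------------------------------------|
| nersc_gpu           | gpu_ss11  | A100     | 40 GB      | 1    | 32   | 4 GB/CPU | 2h   | Standard GPU processing (default, best availability) |
| nersc_gpu_80gb      | gpu_ss11  | A100     | 80 GB      | 1    | 32   | 4 GB/CPU | 2h   | High-memory GPU processing (limited availability)    |
| nersc_gpu_exclusive | gpu       | A100     | 40 GB      | 4    | 32   | 4 GB/CPU | 2h   | Full-node exclusive access (training)                |
| nersc_cpu           | shared    | -        | -          | 0    | 1    | 4 GB/CPU | 2h   | CPU-only analysis                                    |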

Note: The nersc_gpu profile uses 40 GB A100s by default because six times as many of those nodes are available (1,536 vs. 256), which generally means shorter queue times. Use nersc_gpu_80gb only when you specifically need more than 40 GB of GPU memory.

Profile Selection

Profiles are auto-detected based on detector and config, or can be specified explicitly:

# Auto-detect (default)
./submit.py --config infer/icarus/latest --source data.root

# Explicit profile
./submit.py --config infer/icarus/latest --source data.root --profile s3df_turing

# ANL/Polaris using SPINE_CONTAINER_PATH from configure.sh
./submit.py --config infer/icarus/latest --source data.root --profile anl_polaris_debug

# Override specific resources
./submit.py --config infer/icarus/latest --source data.root --time 2:00:00 --cpus-per-task 8

# Override SPINE configuration values at runtime
./submit.py --config infer/generic/latest --source data.root --set base.world_size=0

# Preload model weights on the submit host before submitting
./submit.py --config infer/2x2/full_chain_240819.yaml --source data.root --profile anl_polaris_debug --preload

# Optional: preload only, useful for external production pipelines
./scripts/preload_downloads.py infer/2x2/full_chain_240819.yaml
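The --set option shown above takes a dotted key and a value. A minimal Python sketch of how such an assignment could be applied to a nested configuration is given below. It is an assumption about the general technique, not the parser used by submit.py; `apply_override`, the value-parsing rule, and the starting value of `world_size` are hypothetical.

```python
import ast


def apply_override(config: dict, assignment: str) -> dict:
    """Apply one 'a.b.c=value' assignment to a nested dict in place.

    Hypothetical helper: values are parsed as Python literals when
    possible (0 -> int, True -> bool) and kept as strings otherwise.
    """
    key, sep, raw = assignment.partition("=")
    if not sep or not key:
        raise ValueError(f"expected KEY=VALUE, got {assignment!r}")
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        value = raw  # plain string such as a path or a name
    *parents, leaf = key.split(".")
    node = config
    for part in parents:
        # Create intermediate sections on demand.
        node = node.setdefault(part, {})
    node[leaf] = value
    return config


cfg = {"base": {"world_size": 1}}  # illustrative starting value
apply_override(cfg, "base.world_size=0")
print(cfg)
```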

Pipeline Mode

Pipelines allow you to chain multiple processing stages with automatic dependency management.

Pipeline Definition

Create a YAML file in pipelines/:

stages:
  - name: reconstruction
    config: infer/icarus/latest
    files: /path/to/raw/*.root
    profile: s3df_ampere
    ntasks: 100
    # Replace with a concrete YAML config if you need specific modifiers
  
  - name: analysis
    depends_on: [reconstruction]  # Wait for reconstruction to complete
    config: path/to/downstream_stage.yaml
    files: output_reco/*.h5
    profile: s3df_milano
    ntasks: 20
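The dependency handling implied by depends_on can be illustrated with the standard-library graphlib module (Python 3.9+). This sketch only shows how a valid submission order follows from the stage definitions; the stages are written as Python dictionaries rather than loaded from YAML, and `submission_order` is a hypothetical name, not a function of submit.py.

```python
from graphlib import TopologicalSorter

# The two stages from the pipeline above, reduced to name and dependencies.
stages = [
    {"name": "reconstruction", "depends_on": []},
    {"name": "analysis", "depends_on": ["reconstruction"]},
]


def submission_order(stages):
    """Return stage names ordered so that dependencies come first."""
    names = {stage["name"] for stage in stages}
    graph = {}
    for stage in stages:
        deps = set(stage.get("depends_on", []))
        unknown = deps - names
        if unknown:
            raise ValueError(
                f"{stage['name']} depends on unknown stage(s): {sorted(unknown)}")
        graph[stage["name"]] = deps
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = submission_order(stages)
print(order)
```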

Submit Pipeline

./submit.py --pipeline pipelines/my_pipeline.yaml

Job Management

Job Artifacts

Each submission creates a timestamped directory in jobs/:

jobs/20260101_143022_spine_icarus_latest/
├── job_metadata.json           # Complete job metadata
├── files_chunk_0.txt          # Input file lists
├── submit_chunk_0.sbatch      # Generated submission script (.sbatch or .pbs)
├── logs/                      # Batch stdout/stderr
│   ├── spine_icarus_latest_12345_1.out
│   └── spine_icarus_latest_12345_1.err
└── output/                    # Output files
    └── spine_icarus_latest.h5
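The directory name in this example is consistent with a submission timestamp followed by the job name. A short Python sketch of that naming pattern, inferred from the example and not taken from submit.py (`job_dir_name` is a hypothetical helper):

```python
from datetime import datetime


def job_dir_name(job_name: str, submitted: datetime) -> str:
    """Build a '<YYYYMMDD>_<HHMMSS>_<job_name>' directory name."""
    return f"{submitted.strftime('%Y%m%d_%H%M%S')}_{job_name}"


name = job_dir_name("spine_icarus_latest",
                    datetime.fromisoformat("2026-01-01T14:30:22"))
print(f"jobs/{name}/")
```

Because the timestamp leads the name, a plain alphabetical listing of jobs/ is also chronological.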

Monitoring Jobs

# View job status on SLURM
squeue -u $USER

# View job details on SLURM
scontrol show job <job_id>

# View job status on PBS
qstat -u $USER

# View job details on PBS
qstat -fx <job_id>

# View logs
tail -f jobs/<job_dir>/logs/spine_*.out

# Cancel job on SLURM
scancel <job_id>

# Cancel job on PBS
qdel <job_id>

Job Metadata

Each job saves complete metadata for reproducibility:

{
  "job_name": "spine_icarus_latest",
  "detector": "icarus",
  "config": "infer/icarus/latest",
  "profile": "s3df_ampere",
  "num_files": 100,
  "job_ids": ["12345", "12346"],
  "submitted": "2026-01-01T14:30:22",
  "command": "./submit.py --config ..."
}
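Since the metadata is plain JSON, it can be inspected with a few lines of Python. The sketch below parses an inline copy of the example above; in practice the same fields would be read from jobs/<job_dir>/job_metadata.json. The `summarize` helper is illustrative only.

```python
import json

# Inline copy of the example metadata shown above.
SAMPLE = """
{
  "job_name": "spine_icarus_latest",
  "detector": "icarus",
  "config": "infer/icarus/latest",
  "profile": "s3df_ampere",
  "num_files": 100,
  "job_ids": ["12345", "12346"],
  "submitted": "2026-01-01T14:30:22",
  "command": "./submit.py --config ..."
}
"""


def summarize(meta: dict) -> str:
    """Return a one-line summary of a job metadata record."""
    return (f"{meta['job_name']} ({meta['detector']}, {meta['profile']}): "
            f"{meta['num_files']} files in {len(meta['job_ids'])} job(s)")


meta = json.loads(SAMPLE)
summary = summarize(meta)
print(summary)
```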

Automatic Cleanup of Intermediate Files

Pipelines can automatically clean up intermediate outputs once downstream stages complete. Add a cleanup field to any stage that produces temporary files:

stages:
  - name: reconstruction
    config: infer/icarus/latest
    files: /path/to/input/*.root
    output: output_reco
    # Clean up output_reco/ after all dependent stages finish
    cleanup:
      - output_reco
      - temp_files
  
  - name: analysis
    depends_on: [reconstruction]
    config: path/to/downstream_stage.yaml
    files: output_reco/*.h5
    output: output_analysis

The cleanup job:

  • Only runs if downstream stages complete successfully (afterok dependency)
  • Runs as a minimal-resource job (1 CPU, 1 GB RAM, 10-minute time limit)
  • Safely checks for path existence before removal
  • Logs all cleanup actions for auditing

This is especially useful for large-scale production to save disk space by removing intermediate reconstruction outputs after final analysis completes.
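The existence check and logging behavior listed above can be sketched in Python as follows. This is a minimal illustration of the idea, assuming cleanup paths are relative to a base directory; the `cleanup` function and its log messages are hypothetical and do not reproduce the actual cleanup job.

```python
import shutil
import tempfile
from pathlib import Path


def cleanup(paths, base):
    """Remove each listed path under base if it exists; return a log."""
    actions = []
    for rel in paths:
        target = Path(base) / rel
        if not target.exists():
            # Missing paths are skipped rather than treated as errors.
            actions.append(f"skip {rel}: not found")
            continue
        if target.is_dir():
            shutil.rmtree(target)
        else:
            target.unlink()
        actions.append(f"removed {rel}")
    return actions


# Exercise the sketch in a throwaway directory.
with tempfile.TemporaryDirectory() as base:
    (Path(base) / "output_reco").mkdir()
    log = cleanup(["output_reco", "temp_files"], base)
print(log)
```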

Advanced Features

Custom Software Paths

# Use custom LArCV installation
./submit.py --config infer/icarus/latest --source data.root --larcv-path /path/to/larcv

# Use custom flash-matching setup
./submit.py --config infer/icarus/latest --source data.root --flashmatch-path /path/to/flashmatch

# Expose CVMFS inside the container
./submit.py --config infer/icarus/latest --source data.root --cvmfs

There is no need to pass --flashmatch. The flag is accepted only for backward compatibility and is ignored. Use --flashmatch-path to source a custom flash-matching setup instead.

For sites without CVMFS, point ICARUS configs at a local copy of the icarus_data release directory before sourcing configure.sh:

export ICARUS_DATA_DIR=/path/to/icarus_data
source configure.sh

Job Dependencies

# Submit with dependency on another job
./submit.py --config path/to/downstream_stage.yaml --source output/*.h5 --dependency afterok:12345
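When a downstream submission should wait for several jobs, the scheduler-style dependency string lists the job IDs separated by colons. The sketch below builds such a string from a list of IDs, for example the job_ids recorded in job_metadata.json. The helper name `afterok` is hypothetical, and whether --dependency forwards multi-ID strings unchanged to the scheduler is an assumption.

```python
def afterok(job_ids):
    """Build an 'afterok:<id>[:<id>...]' dependency specification."""
    if not job_ids:
        raise ValueError("at least one job ID is required")
    return "afterok:" + ":".join(str(job_id) for job_id in job_ids)


dependency = afterok(["12345", "12346"])
print(dependency)
```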

Array Job Optimization

# Process 5 files per job (reduces overhead)
./submit.py --config infer/icarus/latest --source data/*.root --files-per-task 5

# Limit concurrent tasks to 50
./submit.py --config infer/icarus/latest --source data/*.root --ntasks 50

Detector-Specific Guides

ICARUS

ICARUS uses split cryostat processing with cosmic overlay:

# Standard cosmic overlay processing
./submit.py --config infer/icarus/latest --source data.root

# Data-only mode (no truth labels)
./submit.py --config infer/icarus/latest --apply-mods data --source data.root

# NuMI beam configuration
./submit.py --config infer/icarus/latest --apply-mods numi --source data.root

# Lite output (reduced file size)
./submit.py --config infer/icarus/latest --apply-mods data lite --source data.root

SBND

./submit.py --config infer/sbnd/latest --source data.root

2x2

2x2 uses higher resource requirements:

./submit.py --config infer/2x2/latest --source data.root --profile s3df_ampere

ND-LAr

./submit.py --config infer/nd-lar/latest --source data.root

Troubleshooting

Environment Not Set

WARNING: SPINE_PROD_BASEDIR not set. Did you source configure.sh?

Solution: Source the environment:

source configure.sh

Missing Dependencies

ERROR: jinja2 is required. Install with: pip install jinja2

Solution: Install Python dependencies:

pip install jinja2 pyyaml

Job Failures

  1. Check batch logs in jobs/<job_dir>/logs/
  2. Review job metadata in jobs/<job_dir>/job_metadata.json
  3. Preview the submission with --dry-run, or test the configuration on a single file in interactive mode (-I)
  4. Verify input files exist and are accessible

Out of Memory

Solution: Use a profile with more memory:

./submit.py --config infer/icarus/latest --source data.root --profile s3df_ampere

Or override memory:

./submit.py --config infer/icarus/latest --source data.root --mem-per-cpu 16g

Job Time Limit

Solution: Request more time:

./submit.py --config infer/icarus/latest --source data.root --time 4:00:00

Best Practices

1. Test Before Production

Always test configurations on a small sample:

# Test with dry run
./submit.py --config infer/icarus/latest --source test.root --dry-run

# Test with single file
./submit.py --config infer/icarus/latest --source test.root

2. Use Appropriate Profiles

  • Use s3df_ampere for high-performance GPU processing (default)
  • Use s3df_turing for cheaper GPU inference
  • Use s3df_milano or s3df_roma for CPU-only analysis

3. Optimize File Batching

# For many small files, batch them
./submit.py --config infer/icarus/latest --source small_files/*.root --files-per-task 10

# For large files, process individually
./submit.py --config infer/icarus/latest --source large_files/*.root --files-per-task 1

4. Monitor Resource Usage

Check actual resource usage to optimize future jobs. On SLURM systems:

seff <job_id>

5. Track Your Work

Job metadata is automatically saved. Keep important job directories:

# Jobs are in timestamped directories
ls -lt jobs/

Environment Variables

Set by configure.sh:

  • SPINE_PROD_BASEDIR - Base directory of this repository
  • SPINE_CONFIG_PATH - Configuration search path
  • ICARUS_DATA_DIR - ICARUS data release path
  • SPINE_CONTAINER_VERSION - Tagged SPINE container version, without a leading v
  • SPINE_CONTAINER_PATH - Singularity/Apptainer image path
  • SPINE_CONTAINER_TAG - Registry image tag for Shifter-style runtimes, including the docker: prefix
  • SPINE_CONTAINER_PATH_AUTO - Tracks whether SPINE_CONTAINER_PATH was auto-derived
  • SPINE_CONTAINER_RUNTIME_BIN - Optional full path or command name for the Singularity/Apptainer executable used by interactive SIF execution
  • SPINE_CONTAINER_RUNTIME_ARGS - Optional extra Singularity/Apptainer arguments for interactive SIF execution
  • SPINE_CONTAINER_PLATFORM - Docker/Podman platform for interactive fallback

Contributing

Adding New Profiles

Edit templates/profiles.yaml:

profiles:
  my_custom_profile:
    partition: my_partition
    gpus: 2
    cpus_per_task: 16
    mem_per_cpu: 8g
    time: "6:00:00"
    description: "My custom profile description"
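Before adding a profile, it can help to check the total memory it requests, which is cpus_per_task multiplied by mem_per_cpu. The Python sketch below does this for the example above. It assumes mem_per_cpu is always given in gigabytes with a g suffix, as in this example; `total_memory_gb` is a hypothetical helper.

```python
def total_memory_gb(profile: dict) -> int:
    """Return cpus_per_task x mem_per_cpu in GB for a profile entry."""
    mem = str(profile["mem_per_cpu"]).lower()
    if not mem.endswith("g"):
        raise ValueError(f"expected a value like '8g', got {mem!r}")
    return profile["cpus_per_task"] * int(mem[:-1])


# The example profile from above, as a Python dictionary.
profile = {
    "partition": "my_partition",
    "gpus": 2,
    "cpus_per_task": 16,
    "mem_per_cpu": "8g",
    "time": "6:00:00",
}
total = total_memory_gb(profile)
print(f"{total} GB total, {total // profile['gpus']} GB per GPU")
```

Comparing this total against the node tables in the Resource Profiles section shows whether the request can fit on the target partition.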

Adding Detector Defaults

Edit templates/profiles.yaml:

detectors:
  my_detector:
    default_profile: s3df_ampere
    configs_dir: infer/my_detector
    account: "my:account"

Development

Installation for Development

For contributors who need to run tests and development tools:

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Linux/Mac

# Install SPINE (required for running parsing tests)
pip install "spine-ml @ git+https://github.com/DeepLearnPhysics/SPINE.git@main"

# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

Note: SPINE installation is only required for development if you want to run the configuration parsing tests. Production users only need to point to an existing SPINE installation via configure.sh.

Running Tests

# Run all tests
pytest

# Run config validation tests only
pytest tests/test_config_validation.py

# Run with coverage
pytest --cov=. --cov-report=html

Pre-commit Hooks

This repository uses pre-commit hooks for code quality:

  • check-yaml: Validates YAML syntax
  • yamllint: Lints YAML files for style
  • prettier: Formats YAML files
  • trailing-whitespace: Removes trailing whitespace
  • end-of-file-fixer: Ensures files end with newline

# Run hooks manually
pre-commit run --all-files

Config Validation

All configuration files are automatically validated in CI to ensure they parse correctly:

from config import load_config
config = load_config('infer/icarus/latest.yaml')

Related Tools

Production Database (spine-db)

A companion tool for indexing and browsing SPINE production metadata. See spine-db for documentation.

Support

License

This software is provided under the same license as SPINE.

Citation

If you use SPINE in your research, please cite the relevant SPINE publications.

Acknowledgments

Development supported by the DOE Office of High Energy Physics and the National Science Foundation.
