SPINE Production System

A production system for running SPINE (Scalable Particle Imaging with Neural Embeddings) reconstruction on HPC clusters with SLURM- and PBS-based batch systems.

Overview

SPINE is a deep learning-based reconstruction framework for liquid argon time projection chamber (LArTPC) detectors. This production system provides tools for running SPINE at scale on large datasets using scheduler-managed job arrays.

Quick Start

1. Environment Setup

# Clone the repository
git clone https://github.com/DeepLearnPhysics/spine-prod.git
cd spine-prod

# Configure environment
source configure.sh

SPINE Version Control: Production jobs now run entirely from a tagged SPINE container image. The default Shifter tag is docker:ghcr.io/deeplearnphysics/spine:0.12.2, with the matching S3DF Singularity image derived from the same version at /sdf/data/neutrino/images/spine_v0-12-2.sif. This container packages SPINE, OpT0Finder, and runtime dependencies, and jobs invoke the container-provided spine executable directly.

Alternative Container Location: You can override the local .sif path or container release before sourcing configure.sh:

export SPINE_CONTAINER_PATH=/path/to/spine_v0-12-2.sif
export SPINE_CONTAINER_VERSION=0.12.2
source configure.sh

Updating SPINE Version: Update the container version and site-local image path together:

export SPINE_CONTAINER_VERSION=0.12.2
# Default SPINE_CONTAINER_PATH becomes /sdf/data/neutrino/images/spine_v0-12-2.sif
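
The relationship between the version, the image path, and the Shifter tag can be sketched in Python. This is a minimal illustration that assumes the defaults follow the naming pattern quoted above (dots replaced by hyphens, a `spine_v` prefix, images under /sdf/data/neutrino/images). The helper name `derive_container_defaults` is hypothetical and is not part of configure.sh.

```python
from pathlib import PurePosixPath

# Assumed S3DF image directory, taken from the default path quoted above.
IMAGE_DIR = PurePosixPath("/sdf/data/neutrino/images")
# Assumed registry prefix, taken from the default Shifter tag quoted above.
REGISTRY = "docker:ghcr.io/deeplearnphysics/spine"


def derive_container_defaults(version: str) -> dict:
    """Return the default .sif path and Shifter tag for a SPINE version.

    Hypothetical helper: it only mirrors the naming pattern in this README.
    """
    version = version.lstrip("v")  # the version is given without a leading v
    sif_name = f"spine_v{version.replace('.', '-')}.sif"
    return {
        "SPINE_CONTAINER_VERSION": version,
        "SPINE_CONTAINER_PATH": str(IMAGE_DIR / sif_name),
        "SPINE_CONTAINER_TAG": f"{REGISTRY}:{version}",
    }


defaults = derive_container_defaults("0.12.2")
for key, value in defaults.items():
    print(f"{key}={value}")
```

Keeping all three values derived from one version string is what makes it safe to bump the version in a single place.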

2. Basic Job Submission

# Detector shorthand resolves to the latest composite config
./submit.py --config infer/icarus --source /path/to/data.root

# Run on a single file
./submit.py --config infer/icarus/latest --source /path/to/data.root

# Run data-only processing on multiple files (glob)
./submit.py --config infer/icarus/latest --apply-mods data --source /path/to/data/*.root

# Run from a file list (recommended)
./submit.py --config infer/2x2/latest --source-list file_list.txt

3. Advanced Usage

# Use a specific resource profile
./submit.py --config infer/icarus/latest --source data/*.root --profile s3df_turing

# Process multiple files per job
./submit.py --config infer/icarus/latest --source data/*.root --files-per-task 5

# Limit parallel tasks
./submit.py --config infer/icarus/latest --source data/*.root --ntasks 50

# Run a multi-stage pipeline
./submit.py --pipeline pipelines/icarus_production_example.yaml

# Dry run (see what would be submitted)
./submit.py --config infer/icarus/latest --source test.root --dry-run

# Interactive mode (test locally without batch submission)
./submit.py --interactive --config infer/icarus/latest --source test.root
./submit.py -I --config infer/icarus/latest --source-list files.txt --task-id 2

Interactive Mode

Interactive mode (--interactive or -I) runs SPINE processing directly in your current shell without submitting to the batch scheduler. This is particularly useful for:

Testing configurations before batch submission
Debugging issues with immediate feedback
Small-scale runs on login nodes (use sparingly!)
Config validation with real execution

Interactive mode performs all the same config composition, file chunking, and environment setup as batch mode, but executes locally:

# Test a config on one file
./submit.py -I --config infer/icarus/latest --source /path/to/test.root

# Force container-backed interactive execution
./submit.py -I --interactive-runtime container --config infer/generic/latest --source test.root --set base.world_size=0

# Test with modifiers applied
./submit.py -I --config infer/icarus/latest --source test.root --apply-mods data lite

# Test a specific task from a file list (if using --files-per-task)
./submit.py -I --config infer/icarus/latest --source-list files.txt --files-per-task 5 --task-id 2

Note: Interactive mode is not supported for pipelines. Use --dry-run to preview pipeline submissions.

By default, interactive mode uses the spine executable already on PATH. If spine is unavailable, it falls back to the configured container: first SPINE_CONTAINER_PATH with Singularity/Apptainer if the .sif exists, then SPINE_CONTAINER_TAG with Docker/Podman. Use --interactive-runtime local to require the local executable, or --interactive-runtime container to force container execution. Docker/Podman fallback requests linux/amd64 by default; override SPINE_CONTAINER_PLATFORM if a different platform is needed.

For local debugging or batch jobs that should use an unreleased checkout, pass --spine-path /path/to/spine to run /path/to/spine/bin/run.py (or /path/to/spine/bin/spine if present) instead of the spine executable on PATH. The checkout root is added to container bind paths automatically where supported.

./submit.py --spine-path /path/to/spine -I --interactive-runtime local --config infer/generic/latest --source test.root --set base.world_size=0
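
The runtime fallback order described above can be sketched as a small resolver. This is an illustration only, not the code in submit.py: the function name `resolve_interactive_runtime`, the return values, and the preference of apptainer over singularity and docker over podman are all assumptions.

```python
import os
import shutil


def resolve_interactive_runtime(mode="auto", env=None,
                                which=shutil.which, exists=os.path.exists):
    """Pick how to run spine interactively; returns (kind, executable).

    Hypothetical sketch. mode is "auto", "local", or "container", loosely
    matching the --interactive-runtime choices named in this README.
    """
    if mode not in ("auto", "local", "container"):
        raise ValueError(f"unknown runtime mode: {mode}")
    env = os.environ if env is None else env

    # Step 1: a spine executable already on PATH (skipped for "container").
    if mode in ("auto", "local"):
        local = which("spine")
        if local:
            return ("local", local)
        if mode == "local":
            raise RuntimeError("spine executable not found on PATH")

    # Step 2: an existing .sif image run with Singularity/Apptainer.
    sif = env.get("SPINE_CONTAINER_PATH")
    if sif and exists(sif):
        # SPINE_CONTAINER_RUNTIME_BIN may name the executable explicitly.
        candidates = [env.get("SPINE_CONTAINER_RUNTIME_BIN"),
                      "apptainer", "singularity"]
        for name in filter(None, candidates):
            found = which(name)
            if found:
                return ("sif", found)

    # Step 3: a registry tag run with Docker/Podman.
    if env.get("SPINE_CONTAINER_TAG"):
        for name in ("docker", "podman"):
            found = which(name)
            if found:
                return ("oci", found)

    raise RuntimeError("no usable SPINE runtime found")
```

The `which` and `exists` parameters are injected so the decision logic can be exercised without any of the runtimes being installed.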

EAF Interactive Container Setup

On EAF, Apptainer may be provided from CVMFS rather than as apptainer on PATH. Interactive container mode can be pointed at that executable directly, and can also pass through the extra environment flags needed there:

export SPINE_CONTAINER_RUNTIME_BIN=/cvmfs/eaf.opensciencegrid.org/apptainer/bin/apptainer
export SPINE_CONTAINER_RUNTIME_ARGS="--env LD_PRELOAD= --env LC_ALL=C.UTF-8"

Directory Structure

spine-prod/
├── configure.sh             # Environment setup script
├── submit.py                # Main submission orchestrator (NEW!)
├── README.md                # This file
│
├── config/                  # All SPINE configs (inference & training)
│   ├── infer/               # Inference configs (referenced as infer/...)
│   │   ├── icarus/          # ICARUS detector configs
│   │   ├── sbnd/            # SBND detector configs
│   │   ├── 2x2/             # 2x2 detector configs
│   │   ├── nd-lar/          # ND-LAr detector configs
│   │   ├── generic/         # Generic (no detector) configs
│   │   └── common/          # Shared configs
│   └── train/               # Training configs (referenced as train/...)
├── templates/               # Job templates
│   ├── profiles.yaml        # Resource profiles
│   ├── job_template_s3df.sbatch
│   ├── job_template_nersc.sbatch
│   └── job_template_anl.pbs
│
├── pipelines/               # Multi-stage pipeline definitions
│   └── icarus_production_example.yaml
│
├── scripts/                 # Utility scripts
├── tests/                   # Test suite
└── jobs/                    # Job artifacts (auto-created)

Configuration System

SPINE uses YAML configurations throughout. User-facing configs are either versioned .yaml files such as infer/icarus/full_chain_co_260501.yaml or detector shorthands such as infer/icarus or infer/icarus/latest, which generate a composite YAML config at submission time.

Config Organization

infer/<detector>/
├── full_chain_*.yaml             # Version-specific top-level configs
├── base/                         # Base component YAMLs
├── io/                           # IO component YAMLs
├── model/                        # Model component YAMLs
├── post/                         # Post-processing component YAMLs
└── modifier/                     # Optional modifier YAMLs

Example: ICARUS Configurations

# Latest composite request (generated at submission time)
infer/icarus/latest

# Latest data-only configuration
infer/icarus/latest --apply-mods data

# Latest NuMI configuration
infer/icarus/latest --apply-mods numi

# Specific version with cosmic overlay
infer/icarus/full_chain_co_260501.yaml

# Data with lite outputs
infer/icarus/full_chain_co_260501.yaml --apply-mods data lite

See individual config directories for detector-specific documentation.

Resource Profiles

Resource profiles define batch resource requirements for different use cases. Profiles are defined in templates/profiles.yaml.

S3DF Node Resources

Knowing the resources available on each partition explains how the profiles are sized:

Partition	GPUs/Node	GPU Type	CPUs/Node	RAM/Node	Resources per GPU
`hopper`	4	H200 (141 GB)	224	1344 GB	56 CPUs, 336 GB
`ampere`	4	A100 (40 GB)	112	952 GB	28 CPUs, 238 GB
`turing`	10	RTX 2080 Ti (11 GB)	40	160 GB	4 CPUs, 16 GB
`milano`	0	-	120	480 GB	-
`roma`	0	-	120	480 GB	-

Profile allocations are designed to:

Hopper: Request full resources per GPU (56 CPUs × 6 GB/CPU = 336 GB per GPU)
Ampere: Request nearly full resources per GPU (28 CPUs × 8 GB/CPU = 224 GB of the 238 GB available per GPU)
Turing: Request full resources per GPU (4 CPUs × 4 GB/CPU = 16 GB per GPU)
CPU nodes: Request minimal resources (1 CPU × 4 GB = 4 GB) for flexible scheduling
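
The per-GPU arithmetic above can be checked with a few lines of Python. The numbers are copied from the node table and the profile descriptions in this README; the dictionary and function names are illustrative only.

```python
# Node resources from the table above: (GPUs/node, CPUs/node, RAM/node in GB).
NODES = {
    "hopper": (4, 224, 1344),
    "ampere": (4, 112, 952),
    "turing": (10, 40, 160),
}
# Profile requests per GPU: (CPUs, GB per CPU).
REQUESTS = {
    "hopper": (56, 6),
    "ampere": (28, 8),
    "turing": (4, 4),
}


def per_gpu_summary(partition):
    """Compare what a profile requests with one GPU's share of the node."""
    gpus, cpus, ram_gb = NODES[partition]
    req_cpus, gb_per_cpu = REQUESTS[partition]
    return {
        "requested_cpus": req_cpus,
        "requested_gb": req_cpus * gb_per_cpu,
        "share_cpus": cpus // gpus,
        "share_gb": ram_gb / gpus,
    }


for partition in NODES:
    print(partition, per_gpu_summary(partition))
```

Each request takes the full CPU share of one GPU. The memory request equals the per-GPU share on Hopper and Turing, and sits slightly below it on Ampere (224 GB requested against a 238 GB share).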

Available Profiles

Profile	Partition	GPU Type	GPU Memory	GPUs	CPUs	Memory	Time	Use Case
`s3df_hopper`	hopper	H200	141 GB	1	56	6 GB/CPU	2h	Highest-performance GPU processing
`s3df_ampere`	ampere	A100	40 GB	1	28	8 GB/CPU	2h	High-performance GPU processing (default)
`s3df_turing`	turing	RTX 2080 Ti	11 GB	1	4	4 GB/CPU	2h	Cheaper GPU inference
`s3df_milano`	milano	-	-	0	1	4 GB/CPU	2h	CPU-only analysis
`s3df_roma`	roma	-	-	0	1	4 GB/CPU	2h	CPU-only analysis

NERSC Perlmutter Node Resources

NERSC Perlmutter is a heterogeneous system with GPU nodes in two configurations:

Node Type	Count	GPUs/Node	GPU Type	CPUs/Node	RAM/Node	Resources per GPU
GPU (40GB)	1,536	4	A100 (40 GB)	64	512 GB	32 CPUs, 128 GB
GPU (80GB)	256	4	A100 (80 GB)	64	512 GB	32 CPUs, 128 GB
CPU	3,072	0	-	128	512 GB	-

Profile allocations are designed to:

GPU nodes: Request full resources per GPU (32 CPUs × 4 GB/CPU = 128 GB per GPU)
CPU nodes: Request minimal resources (1 CPU × 4 GB = 4 GB) for flexible scheduling
Shared partitions: Allow partial node allocation for cost-efficient small jobs

NERSC Available Profiles

Profile	Partition	GPU Type	GPU Memory	GPUs	CPUs	Memory	Time	Use Case
`nersc_gpu`	gpu_ss11	A100	40 GB	1	32	4 GB/CPU	2h	Standard GPU processing (default, best availability)
`nersc_gpu_80gb`	gpu_ss11	A100	80 GB	1	32	4 GB/CPU	2h	High-memory GPU processing (limited availability)
`nersc_gpu_exclusive`	gpu	A100	40 GB	4	32	4 GB/CPU	2h	Full-node exclusive access (training)
`nersc_cpu`	shared	-	-	0	1	4 GB/CPU	2h	CPU-only analysis

Note: The nersc_gpu profile uses 40 GB A100s by default because six times as many of those nodes are available (1,536 vs 256), which generally means shorter queue waits. Use nersc_gpu_80gb only when you specifically need more than 40 GB of GPU memory.

Profile Selection

Profiles are auto-detected based on detector and config, or can be specified explicitly:

# Auto-detect (default)
./submit.py --config infer/icarus/latest --source data.root

# Explicit profile
./submit.py --config infer/icarus/latest --source data.root --profile s3df_turing

# ANL/Polaris using SPINE_CONTAINER_PATH from configure.sh
./submit.py --config infer/icarus/latest --source data.root --profile anl_polaris_debug

# Override specific resources
./submit.py --config infer/icarus/latest --source data.root --time 2:00:00 --cpus-per-task 8

# Override SPINE configuration values at runtime
./submit.py --config infer/generic/latest --source data.root --set base.world_size=0

# Preload model weights on the submit host before submitting
./submit.py --config infer/2x2/full_chain_240819.yaml --source data.root --profile anl_polaris_debug --preload

# Optional: preload only, useful for external production pipelines
./scripts/preload_downloads.py infer/2x2/full_chain_240819.yaml
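
A runtime override such as `--set base.world_size=0` amounts to writing a value at a dotted path in the nested configuration. The sketch below shows that general technique; how submit.py actually parses `--set` is not documented here, so the function name `apply_override` and the integer-parsing rule are assumptions.

```python
def apply_override(config: dict, assignment: str) -> dict:
    """Apply one 'dotted.key=value' override to a nested dict, in place."""
    dotted_key, sep, raw_value = assignment.partition("=")
    if not sep or not dotted_key:
        raise ValueError(f"expected key=value, got: {assignment!r}")

    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        # Create intermediate sections that do not exist yet.
        node = node.setdefault(key, {})

    # Assumed parsing rule: integers where possible, strings otherwise.
    try:
        value = int(raw_value)
    except ValueError:
        value = raw_value
    node[leaf] = value
    return config


cfg = {"base": {"world_size": 1}}
apply_override(cfg, "base.world_size=0")
print(cfg)
```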

Pipeline Mode

Pipelines allow you to chain multiple processing stages with automatic dependency management.

Pipeline Definition

Create a YAML file in pipelines/:

stages:
  - name: reconstruction
    config: infer/icarus/latest
    files: /path/to/raw/*.root
    profile: s3df_ampere
    ntasks: 100
    # Replace with a concrete YAML config if you need specific modifiers
  
  - name: analysis
    depends_on: [reconstruction]  # Wait for reconstruction to complete
    config: path/to/downstream_stage.yaml
    files: output_reco/*.h5
    profile: s3df_milano
    ntasks: 20
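
Dependency management of this kind boils down to ordering stages so that each one is submitted after everything listed in its depends_on. A minimal sketch using the standard library follows; the stage dictionaries mirror the YAML above, while the function name `submission_order` is hypothetical and the real ordering logic in submit.py may differ.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# The same two stages as the YAML definition above, as plain dictionaries.
stages = [
    {"name": "reconstruction", "config": "infer/icarus/latest"},
    {"name": "analysis", "depends_on": ["reconstruction"],
     "config": "path/to/downstream_stage.yaml"},
]


def submission_order(stages):
    """Return stage names ordered so dependencies come first."""
    graph = {s["name"]: set(s.get("depends_on", [])) for s in stages}

    # Reject references to stages that were never defined.
    referenced = set().union(*graph.values())
    unknown = referenced - set(graph)
    if unknown:
        raise ValueError(f"unknown stages in depends_on: {sorted(unknown)}")

    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


print(submission_order(stages))
```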

Submit Pipeline

./submit.py --pipeline pipelines/my_pipeline.yaml

Job Management

Job Artifacts

Each submission creates a timestamped directory in jobs/:

jobs/20260101_143022_spine_icarus_latest/
├── job_metadata.json           # Complete job metadata
├── files_chunk_0.txt          # Input file lists
├── submit_chunk_0.sbatch      # Generated submission script (.sbatch or .pbs)
├── logs/                      # Batch stdout/stderr
│   ├── spine_icarus_latest_12345_1.out
│   └── spine_icarus_latest_12345_1.err
└── output/                    # Output files
    └── spine_icarus_latest.h5
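
Because the directory name starts with a timestamp, job directories can be sorted or filtered by submission time. The sketch below assumes the `YYYYMMDD_HHMMSS_<job_name>` pattern seen in the example above; the helper name `parse_job_dir` is hypothetical.

```python
from datetime import datetime


def parse_job_dir(name: str):
    """Split 'YYYYMMDD_HHMMSS_<job_name>' into (timestamp, job_name)."""
    # Split only twice so underscores inside the job name are preserved.
    date_part, time_part, job_name = name.split("_", 2)
    stamp = datetime.strptime(f"{date_part}_{time_part}", "%Y%m%d_%H%M%S")
    return stamp, job_name


stamp, job_name = parse_job_dir("20260101_143022_spine_icarus_latest")
print(stamp.isoformat(), job_name)
```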

Monitoring Jobs

# View job status on SLURM
squeue -u $USER

# View job details on SLURM
scontrol show job <job_id>

# View job status on PBS
qstat -u $USER

# View job details on PBS
qstat -fx <job_id>

# View logs
tail -f jobs/<job_dir>/logs/spine_*.out

# Cancel job on SLURM
scancel <job_id>

# Cancel job on PBS
qdel <job_id>

Job Metadata

Each job saves complete metadata for reproducibility:

{
  "job_name": "spine_icarus_latest",
  "detector": "icarus",
  "config": "infer/icarus/latest",
  "profile": "s3df_ampere",
  "num_files": 100,
  "job_ids": ["12345", "12346"],
  "submitted": "2026-01-01T14:30:22",
  "command": "./submit.py --config ..."
}
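
The metadata file is plain JSON, so it is easy to summarize from a script. The sketch below reads only fields shown in the example above; the function name `summarize_job` is hypothetical, and the demo writes a throwaway copy of the example rather than touching a real job directory.

```python
import json
import tempfile
from pathlib import Path


def summarize_job(job_dir):
    """Return a one-line summary from <job_dir>/job_metadata.json."""
    meta = json.loads((Path(job_dir) / "job_metadata.json").read_text())
    job_ids = ", ".join(meta["job_ids"])
    return (f"{meta['job_name']} [{meta['profile']}]: "
            f"{meta['num_files']} files, job IDs {job_ids}")


# Demo against a temporary copy of the example metadata shown above.
sample = {
    "job_name": "spine_icarus_latest",
    "detector": "icarus",
    "config": "infer/icarus/latest",
    "profile": "s3df_ampere",
    "num_files": 100,
    "job_ids": ["12345", "12346"],
    "submitted": "2026-01-01T14:30:22",
}
with tempfile.TemporaryDirectory() as job_dir:
    (Path(job_dir) / "job_metadata.json").write_text(json.dumps(sample))
    summary = summarize_job(job_dir)
print(summary)
```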

Automatic Cleanup of Intermediate Files

Pipelines can automatically clean up intermediate outputs once downstream stages complete. Add a cleanup field to any stage that produces temporary files:

stages:
  - name: reconstruction
    config: infer/icarus/latest
    files: /path/to/input/*.root
    output: output_reco
    # Clean up output_reco/ after all dependent stages finish
    cleanup:
      - output_reco
      - temp_files
  
  - name: analysis
    depends_on: [reconstruction]
    config: path/to/downstream_stage.yaml
    files: output_reco/*.h5
    output: output_analysis

The cleanup job:

Only runs if downstream stages complete successfully (afterok dependency)
Runs as a minimal-resource job (1 CPU, 1 GB RAM, 10 min timeout)
Safely checks for path existence before removal
Logs all cleanup actions for auditing

This is especially useful for large-scale production to save disk space by removing intermediate reconstruction outputs after final analysis completes.
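
The behaviour listed above (run only after success, check paths before removing, log every action) can be sketched as follows. This is not the cleanup job shipped with the repository: both function names are hypothetical, and only the `afterok:<id>:<id>` dependency syntax is standard scheduler notation.

```python
import shutil
from pathlib import Path


def afterok_dependency(job_ids):
    """Build a dependency string that fires only if all listed jobs succeed."""
    return "afterok:" + ":".join(str(job_id) for job_id in job_ids)


def cleanup_paths(paths, log=print):
    """Remove each path if it exists, logging every action taken."""
    removed = []
    for raw in paths:
        path = Path(raw)
        if not path.exists():
            # Missing paths are reported and skipped, never treated as errors.
            log(f"skip (missing): {path}")
            continue
        if path.is_dir():
            shutil.rmtree(path)
        else:
            path.unlink()
        log(f"removed: {path}")
        removed.append(str(path))
    return removed


print(afterok_dependency(["12345", "12346"]))
```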

Advanced Features

Custom Software Paths

# Use custom LArCV installation
./submit.py --config infer/icarus/latest --source data.root --larcv-path /path/to/larcv

# Use custom flash-matching setup
./submit.py --config infer/icarus/latest --source data.root --flashmatch-path /path/to/flashmatch

# Expose CVMFS inside the container
./submit.py --config infer/icarus/latest --source data.root --cvmfs

There is no need to pass --flashmatch. The flag is accepted only for backward compatibility and is ignored. Use --flashmatch-path to source a custom flash-matching setup instead.

For sites without CVMFS, point ICARUS configs at a local copy of the icarus_data release directory before sourcing configure.sh:

export ICARUS_DATA_DIR=/path/to/icarus_data
source configure.sh

Job Dependencies

# Submit with dependency on another job
./submit.py --config path/to/downstream_stage.yaml --source output/*.h5 --dependency afterok:12345

Array Job Optimization

# Process 5 files per job (reduces overhead)
./submit.py --config infer/icarus/latest --source data/*.root --files-per-task 5

# Limit concurrent tasks to 50
./submit.py --config infer/icarus/latest --source data/*.root --ntasks 50
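
The effect of these two options on the job array can be sketched in a few lines. The chunking is plain list slicing. The mapping of `--ntasks` onto a SLURM `%N` concurrency limit and the zero-based task numbering are assumptions for illustration, and both function names are hypothetical.

```python
def chunk_files(files, files_per_task=1):
    """Split the input list into consecutive chunks, one chunk per task."""
    if files_per_task < 1:
        raise ValueError("files_per_task must be at least 1")
    return [files[i:i + files_per_task]
            for i in range(0, len(files), files_per_task)]


def slurm_array_spec(num_tasks, max_parallel=None):
    """Build a SLURM --array value, e.g. '0-19%5' for 20 tasks, 5 at a time."""
    spec = f"0-{num_tasks - 1}"  # assumed zero-based task IDs
    if max_parallel:
        spec += f"%{max_parallel}"
    return spec


files = [f"data/file_{i:03d}.root" for i in range(12)]
chunks = chunk_files(files, files_per_task=5)
print(len(chunks), [len(c) for c in chunks])
print(slurm_array_spec(len(chunks), max_parallel=50))
```

Larger chunks mean fewer array tasks and less per-job startup overhead, at the cost of longer individual tasks.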

Detector-Specific Guides

ICARUS

ICARUS uses split cryostat processing with cosmic overlay:

# Standard cosmic overlay processing
./submit.py --config infer/icarus/latest --source data.root

# Data-only mode (no truth labels)
./submit.py --config infer/icarus/latest --apply-mods data --source data.root

# NuMI beam configuration
./submit.py --config infer/icarus/latest --apply-mods numi --source data.root

# Lite output (reduced file size)
./submit.py --config infer/icarus/latest --apply-mods data lite --source data.root

SBND

./submit.py --config infer/sbnd/latest --source data.root

2x2

2x2 has higher resource requirements:

./submit.py --config infer/2x2/latest --source data.root --profile s3df_ampere

ND-LAr

./submit.py --config infer/nd-lar/latest --source data.root

Troubleshooting

Environment Not Set

WARNING: SPINE_PROD_BASEDIR not set. Did you source configure.sh?

Solution: Source the environment:

source configure.sh

Missing Dependencies

ERROR: jinja2 is required. Install with: pip install jinja2

Solution: Install Python dependencies:

pip install jinja2 pyyaml

Job Failures

Check batch logs in jobs/<job_dir>/logs/
Review job metadata in jobs/<job_dir>/job_metadata.json
Preview the submission with --dry-run, or test the configuration on a single file with --interactive
Verify input files exist and are accessible

Out of Memory

Solution: Use a profile with more memory:

./submit.py --config infer/icarus/latest --source data.root --profile s3df_ampere

Or override memory:

./submit.py --config infer/icarus/latest --source data.root --mem-per-cpu 16g

Job Time Limit

Solution: Request more time:

./submit.py --config infer/icarus/latest --source data.root --time 4:00:00

Best Practices

1. Test Before Production

Always test configurations on a small sample:

# Test with dry run
./submit.py --config infer/icarus/latest --source test.root --dry-run

# Test with single file
./submit.py --config infer/icarus/latest --source test.root

2. Use Appropriate Profiles

Use s3df_ampere for high-performance GPU processing (default)
Use s3df_turing for cheaper GPU inference
Use s3df_milano or s3df_roma for CPU-only analysis

3. Optimize File Batching

# For many small files, batch them
./submit.py --config infer/icarus/latest --source small_files/*.root --files-per-task 10

# For large files, process individually
./submit.py --config infer/icarus/latest --source large_files/*.root --files-per-task 1

4. Monitor Resource Usage

Check actual resource usage to optimize future jobs. On SLURM systems:

seff <job_id>

5. Track Your Work

Job metadata is automatically saved. Keep important job directories:

# Jobs are in timestamped directories
ls -lt jobs/

Environment Variables

Set by configure.sh:

SPINE_PROD_BASEDIR - Base directory of this repository
SPINE_CONFIG_PATH - Configuration search path
ICARUS_DATA_DIR - ICARUS data release path
SPINE_CONTAINER_VERSION - Tagged SPINE container version, without a leading v
SPINE_CONTAINER_PATH - Singularity/Apptainer image path
SPINE_CONTAINER_TAG - Registry image tag for Shifter-style runtimes, including docker:
SPINE_CONTAINER_PATH_AUTO - Tracks whether SPINE_CONTAINER_PATH was auto-derived
SPINE_CONTAINER_RUNTIME_BIN - Optional full path or command name for the Singularity/Apptainer executable used by interactive SIF execution
SPINE_CONTAINER_RUNTIME_ARGS - Optional extra Singularity/Apptainer arguments for interactive SIF execution
SPINE_CONTAINER_PLATFORM - Docker/Podman platform for interactive fallback

Contributing

Adding New Profiles

Edit templates/profiles.yaml:

profiles:
  my_custom_profile:
    partition: my_partition
    gpus: 2
    cpus_per_task: 16
    mem_per_cpu: 8g
    time: "6:00:00"
    description: "My custom profile description"
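
Before adding a profile it can help to sanity-check the entry and see the total memory it requests. The sketch below works on the example profile above as a plain dictionary. Which keys are mandatory and which memory suffixes are accepted are assumptions here, and `check_profile` is a hypothetical helper, not part of this repository.

```python
# Assumed minimum set of keys, taken from the example profile above.
REQUIRED_KEYS = {"partition", "gpus", "cpus_per_task", "mem_per_cpu", "time"}


def check_profile(name, profile):
    """Validate a profile entry and return its total memory request in GB."""
    missing = REQUIRED_KEYS - set(profile)
    if missing:
        raise ValueError(f"profile {name!r} is missing: {sorted(missing)}")

    mem = str(profile["mem_per_cpu"]).lower()
    if not mem.endswith("g"):
        # Only the 'g' suffix used in the example is handled in this sketch.
        raise ValueError(f"profile {name!r}: expected mem_per_cpu like '8g'")
    return profile["cpus_per_task"] * int(mem[:-1])


my_custom_profile = {
    "partition": "my_partition",
    "gpus": 2,
    "cpus_per_task": 16,
    "mem_per_cpu": "8g",
    "time": "6:00:00",
    "description": "My custom profile description",
}
total_gb = check_profile("my_custom_profile", my_custom_profile)
print(f"my_custom_profile requests {total_gb} GB in total")
```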

Adding Detector Defaults

Edit templates/profiles.yaml:

detectors:
  my_detector:
    default_profile: s3df_ampere
    configs_dir: infer/my_detector
    account: "my:account"

Development

Installation for Development

For contributors who need to run tests and development tools:

# Create a virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate  # On Linux/Mac

# Install SPINE (required for running parsing tests)
pip install "spine-ml @ git+https://github.com/DeepLearnPhysics/SPINE.git@main"

# Install development dependencies
pip install -r requirements-dev.txt

# Install pre-commit hooks
pre-commit install

Note: Installing SPINE is only required for development if you want to run the configuration parsing tests. Production users only need the SPINE container configured via configure.sh.

Running Tests

# Run all tests
pytest

# Run config validation tests only
pytest tests/test_config_validation.py

# Run with coverage
pytest --cov=. --cov-report=html

Pre-commit Hooks

This repository uses pre-commit hooks for code quality:

check-yaml: Validates YAML syntax
yamllint: Lints YAML files for style
prettier: Formats YAML files
trailing-whitespace: Removes trailing whitespace
end-of-file-fixer: Ensures files end with newline

# Run hooks manually
pre-commit run --all-files

Config Validation

All configuration files are automatically validated in CI to ensure they parse correctly:

from config import load_config
config = load_config('infer/icarus/latest.yaml')

Related Tools

Production Database (spine-db)

A companion tool for indexing and browsing SPINE production metadata. See spine-db for documentation.

Support

Issues: https://github.com/DeepLearnPhysics/spine-prod/issues
SPINE Documentation: https://github.com/DeepLearnPhysics/spine
Contact: SPINE development team

License

This software is provided under the same license as SPINE.

Citation

If you use SPINE in your research, please cite the relevant SPINE publications.

Acknowledgments

Development supported by the DOE Office of High Energy Physics and the National Science Foundation.
