A focused PyTorch optimization repository for deep learning workloads, with practical notebooks and reusable training code. This repo is organized around essential optimization workflows, with one consolidated README.md for setup, usage, and key benchmark results.
- `notebooks/` - 7 focused optimization notebooks:
  - `01_profiling.ipynb`
  - `02_amp.ipynb`
  - `03_compile.ipynb`
  - `04_gradient_checkpointing.ipynb`
  - `05_dataloader_optimization.ipynb`
  - `06_llm_optimization.ipynb`
  - `07_advanced_optimization.ipynb`
- `scripts/train.py` - training pipeline with optimization flags
- `models/llm.py` - model implementation and optimization-aware architecture
- `utils/optimization_utils.py` - profiling and performance utilities
- `requirements.txt` - dependency list for reproducible setup

- Profiling and bottleneck identification
- Automatic Mixed Precision (AMP)
- `torch.compile()` for compiled model execution
- Gradient checkpointing for memory savings
- DataLoader performance tuning
- LLM-specific optimization strategies
- Apple Silicon M3 / MPS-specific guidance
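
Of the techniques listed above, gradient checkpointing is the one whose mechanics are least obvious from its name: activations inside a checkpointed block are dropped after the forward pass and recomputed during backward. The sketch below illustrates the idea with `torch.utils.checkpoint.checkpoint`. The `Block` and `TinyStack` classes are hypothetical stand-ins, not the contents of `models/llm.py`, and the sizes are arbitrary.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Hypothetical residual MLP block used only for illustration."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)


class TinyStack(nn.Module):
    """Stack of blocks with optional per-block checkpointing."""

    def __init__(self, dim: int = 64, depth: int = 4, use_checkpoint: bool = False) -> None:
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])
        self.use_checkpoint = use_checkpoint

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Do not keep this block's intermediate activations;
                # they are recomputed when backward reaches the block.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


torch.manual_seed(0)
model = TinyStack(use_checkpoint=False).train()
x = torch.randn(8, 64, requires_grad=True)

plain_out = model(x)
model.use_checkpoint = True
ckpt_out = model(x)

# Checkpointing changes memory use and runtime, not the forward result.
same_output = torch.allclose(plain_out, ckpt_out)
ckpt_out.sum().backward()
```

Checkpointing is only applied in training mode here, since there is nothing to recompute when no backward pass runs.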
cd ../Cuda_Optimization
python -m pip install -r requirements.txt
jupyter notebook notebooks/01_profiling.ipynb

Run training with optimizations:

python scripts/train.py --device mps --use-amp --use-checkpoint

Primary target: Apple Silicon M3 MacBook with MPS backend. This repository also works on CPU and CUDA-capable systems, but benchmarks are centered on M3 performance.

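
The training command above passes three flags: `--device`, `--use-amp`, and `--use-checkpoint`. A minimal sketch of how flags like these could be declared with `argparse` follows. It is not taken from `scripts/train.py`; the allowed device values, the defaults, and the help strings are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical flag layout; the real script may differ."""
    parser = argparse.ArgumentParser(description="Training with optimization flags")
    parser.add_argument(
        "--device",
        choices=["mps", "cuda", "cpu"],  # assumed set of backends
        default="cpu",                   # assumed default
        help="backend to run on",
    )
    parser.add_argument(
        "--use-amp",
        action="store_true",
        help="enable automatic mixed precision",
    )
    parser.add_argument(
        "--use-checkpoint",
        action="store_true",
        help="enable gradient checkpointing",
    )
    return parser


# Parse the same flags as the command shown above.
args = build_parser().parse_args(["--device", "mps", "--use-amp", "--use-checkpoint"])
print(args.device, args.use_amp, args.use_checkpoint)  # mps True True
```

Using `store_true` keeps each optimization off unless it is requested, which makes it straightforward to measure one configuration at a time.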
Measured on an M3 with a small 16M-parameter transformer model, batch size 32, sequence length 256.
| Configuration | Time/iter | Throughput | Memory | Relative speed |
|---|---|---|---|---|
| Baseline FP32 | 450 ms | 71 samples/s | 8.2 GB | 1.0x |
| AMP FP16 | 180 ms | 178 samples/s | 4.1 GB | 2.5x |
| Gradient checkpointing | 550 ms | 58 samples/s | 4.2 GB | 0.8x |
| torch.compile | 360 ms | 89 samples/s | 8.2 GB | 1.25x |
| AMP + checkpoint | 220 ms | 145 samples/s | 4.3 GB | 2.0x |
| All combined | 160 ms | 200 samples/s | 4.3 GB | 2.8x |
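
The Throughput and Relative speed columns follow from Time/iter and the batch size of 32. The short script below reproduces that arithmetic from the iteration times in the table; the function names are illustrative. Values in the table are rounded, and the Memory column is measured, so it cannot be derived this way.

```python
BATCH_SIZE = 32  # batch size stated for the benchmark

# Time per iteration in milliseconds, copied from the table.
time_per_iter_ms = {
    "Baseline FP32": 450,
    "AMP FP16": 180,
    "Gradient checkpointing": 550,
    "torch.compile": 360,
    "AMP + checkpoint": 220,
    "All combined": 160,
}


def throughput(batch_size: int, ms_per_iter: float) -> float:
    """Samples processed per second."""
    return batch_size / (ms_per_iter / 1000.0)


def speedup(baseline_ms: float, ms_per_iter: float) -> float:
    """Speed relative to the baseline (values above 1 are faster)."""
    return baseline_ms / ms_per_iter


baseline_ms = time_per_iter_ms["Baseline FP32"]
for name, ms in time_per_iter_ms.items():
    print(f"{name:<24} {throughput(BATCH_SIZE, ms):6.1f} samples/s  {speedup(baseline_ms, ms):.2f}x")
```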
- Start with profiling before changing model code.
- Use AMP for the largest single speedup on M3.
- Apply gradient checkpointing only when memory is the limiting factor.
- Prefer `torch.compile()` for inference and stable training workloads.
- Keep batch sizes moderate on M3 to avoid thermal limits.
- Use `num_workers=0` and `pin_memory=False` on Apple Silicon.

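
Two of the tips above (AMP and the DataLoader settings) can be illustrated together. This is a minimal sketch under stated assumptions, not the repository's training loop: the `pick_device` helper, the toy model, and the synthetic data are hypothetical, and the choice of `bfloat16` on CPU and `float16` elsewhere is an assumption. Autocast on the `mps` device type is only available in more recent PyTorch releases, so older installs may need to skip the autocast context on MPS.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def pick_device() -> torch.device:
    """Hypothetical helper: prefer MPS, then CUDA, then CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


device = pick_device()
# Assumption: bfloat16 for CPU autocast, float16 on MPS/CUDA.
amp_dtype = torch.bfloat16 if device.type == "cpu" else torch.float16

# Synthetic stand-in data: 128 feature vectors with 10 class labels.
features = torch.randn(128, 64)
labels = torch.randint(0, 10, (128,))
loader = DataLoader(
    TensorDataset(features, labels),
    batch_size=32,
    shuffle=True,
    num_workers=0,     # load in the main process, as the tip suggests
    pin_memory=False,  # pinned host memory targets CUDA transfers
)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

losses = []
for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad(set_to_none=True)
    # Run the forward pass in reduced precision where it is safe to do so.
    with torch.autocast(device_type=device.type, dtype=amp_dtype):
        logits = model(x)
    # Compute the loss in float32 for numerical stability.
    loss = nn.functional.cross_entropy(logits.float(), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

The sketch omits gradient scaling. float16 training on CUDA is normally paired with a gradient scaler to avoid underflow in small gradients, so a real training loop would likely need one.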
- `models/` - model definitions
- `notebooks/` - interactive learning and profiling notebooks
- `scripts/` - execution scripts
- `utils/` - helper functions for optimization and monitoring
- `requirements.txt` - required Python packages
- `README.md` - consolidated repository guide

This repository now maintains a single primary documentation source, README.md, and removes excess guide files for a cleaner, easier-to-use project layout.