schwp/lora-from-scratch

LoRA from scratch (for Maths)

A from-scratch implementation of LoRA (Low-Rank Adaptation) for large language models, without PEFT or other black-box abstractions.

Fine-tuned TinyLlama-1.1B on mathematical reasoning using 0.1% of its parameters, trained on a GTX 1650 Super (4GB VRAM) in about 12 hours.


Motivation

Most LoRA tutorials stop at calling get_peft_model(). This project goes one level deeper by implementing the adapter layer, the model injection, the training loop, and the evaluation pipeline from scratch, with no dependency on PEFT.

Three specific questions drove the implementation:

  • Does LoRA actually move the needle on a model this small? TinyLlama-1.1B sits well below the scale where mathematical reasoning typically emerges. Applying LoRA to it on a math dataset is a stress test of the method's lower bound: not a recipe for state-of-the-art results, but a controlled environment in which to observe what the adapter learns and where it fails.
  • What does LoRA cost on consumer hardware? The GTX 1650 Super has 4GB of VRAM, roughly the floor for running a 1B-class model in fp16. Understanding the real memory breakdown (base weights, optimizer state, activations, adapter matrices) and the engineering decisions needed to fit training in that budget (gradient checkpointing, mixed precision, gradient accumulation) is more transferable knowledge than running the same experiment on an A100.
  • What breaks when you implement it yourself? Libraries such as PEFT abstract away many sharp edges: the transposed (out_features, in_features) layout of nn.Linear weights, the fp32/fp16 dtype split between trainable and frozen parameters, the register_buffer vs nn.Parameter distinction, the named_modules vs named_parameters confusion. Hitting these problems directly and diagnosing them is the point.
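To make the memory question concrete, here is a back-of-the-envelope estimate of the static part of the budget. It is only a sketch: it assumes roughly 1.1e9 frozen parameters stored in fp16, the 1,126,400 adapter parameters stored in fp32, and an Adam-style optimizer keeping two fp32 buffers per trainable parameter. None of these figures were measured on the actual run, and activations are deliberately left out.

```python
# Rough static-memory estimate for LoRA training (activations excluded).
# Assumptions, not measurements: fp16 frozen weights, fp32 adapters,
# and an Adam-style optimizer with two fp32 moment buffers per trainable param.
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimate_static_memory(n_frozen, n_trainable, frozen_bytes=2, trainable_bytes=4):
    """Return bytes used by weights, adapter gradients, and optimizer state."""
    return {
        "frozen_weights": n_frozen * frozen_bytes,
        "adapter_weights": n_trainable * trainable_bytes,
        "adapter_grads": n_trainable * trainable_bytes,
        "optimizer_state": 2 * n_trainable * 4,
    }


budget = estimate_static_memory(n_frozen=1_100_000_000, n_trainable=1_126_400)
for name, size in budget.items():
    print(f"{name:>16}: {size / MIB:10.1f} MiB")
print(f"    total static: {sum(budget.values()) / GIB:.2f} GiB")
```

Under these assumptions the frozen weights dominate and the adapter, its gradients, and its optimizer state together stay below 20 MiB. Whatever remains of the 4GB has to hold the activations, which is where gradient checkpointing and a batch size of 1 come in.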

Results

Fine-tuned on MetaMathQA (10k samples), evaluated on GSM8K test set.

| Model | GSM8K accuracy | Trainable params | % of total |
|---|---|---|---|
| TinyLlama-1.1B (base) | 1.5% | 0 | 0% |
| base + LoRA (r=8, focused on Q+V) | 3.0% | 1,126,400 | 0.10% |
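The trainable-parameter count in the table can be reproduced with a few lines of arithmetic. The layer shapes below come from TinyLlama-1.1B's public configuration (22 layers, hidden size 2048, 4 key/value heads of dimension 64) and should be treated as assumptions of this sketch; the helper name lora_params is illustrative.

```python
# Shapes assumed from the public TinyLlama-1.1B config (not read from this repo).
N_LAYERS = 22
HIDDEN = 2048   # q_proj maps 2048 -> 2048
KV_DIM = 256    # 4 key/value heads x 64 dims, so v_proj maps 2048 -> 256
R = 8           # LoRA rank used in the table


def lora_params(in_features, out_features, r):
    """A has shape (r, in_features) and B has shape (out_features, r)."""
    return r * in_features + out_features * r


per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total = N_LAYERS * per_layer
print(per_layer, total)
```

With these shapes the total comes to 1,126,400, matching the table, which is about 0.10% of a 1.1B-parameter model.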

Note

The absolute accuracy is limited by model capacity and data: TinyLlama is 6x smaller than the models used in the MetaMath paper, and it was trained here on 2.5% of the recommended data. The result is a 2x relative improvement obtained by training 0.1% of the parameters on consumer hardware, not a claim about mathematical reasoning at scale.

Training on a GTX 1650 Super (4GB VRAM): ~4.5 hours for 5000 steps, ~12 hours total to reach checkpoints/lora-r8-metamath-full-v2/params_step_13500.pt

Project Structure

lora-from-scratch/
├── src/
│   ├── lora.py        # LoraLinear: the core primitive
│   ├── model.py       # apply_lora, get_lora_params, save_params
│   ├── train.py       # training loop with mixed precision + TensorBoard
│   └── eval.py        # GSM8K evaluation + comparison charts
├── checkpoints/       # adapter weights only (must be added to the model)
├── runs/              # TensorBoard logs
├── eval_results/      # evaluation JSON results
└── notebooks/         # exploration and debugging

Usage

Implementation

The whole project is built on a LoraLinear PyTorch class that wraps an nn.Linear layer: the original weights are frozen, and two trainable low-rank adapter matrices are added alongside them.

The adapters are injected by walking the module tree and replacing the targeted nn.Linear layers with LoraLinear. Because the base weights stay frozen, the adapter update can be merged into them, or unmerged again, at any time.
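The two paragraphs above can be sketched roughly as follows. This is not the code in src/lora.py or src/model.py: the constructor arguments, the merged_weight helper, the targets tuple, and the ToyAttention module are all assumptions made for illustration, and everything runs in fp32 for simplicity.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoraLinear(nn.Module):
    """Illustrative sketch: frozen base weight plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.scaling = alpha / r
        # Frozen weights are buffers: kept in state_dict, invisible to the optimizer.
        self.register_buffer("weight", base.weight.detach().clone())
        self.register_buffer("bias", None if base.bias is None else base.bias.detach().clone())
        # A is random and B is zero, so B @ A == 0 and the layer starts equal to the base.
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x):
        frozen = F.linear(x, self.weight, self.bias)
        update = F.linear(F.linear(x, self.lora_A), self.lora_B)
        return frozen + self.scaling * update

    def merged_weight(self):
        # (out, r) @ (r, in) has the same (out, in) shape as the frozen weight.
        return self.weight + self.scaling * (self.lora_B @ self.lora_A)


def apply_lora(model, targets=("q_proj", "v_proj"), r=8, alpha=16):
    """Freeze the model, then swap each targeted nn.Linear for a LoraLinear."""
    for p in model.parameters():
        p.requires_grad_(False)
    for parent in list(model.modules()):
        for name, child in list(parent.named_children()):
            if isinstance(child, nn.Linear) and name in targets:
                setattr(parent, name, LoraLinear(child, r=r, alpha=alpha))
    return model


class ToyAttention(nn.Module):
    """Tiny stand-in for an attention block, only here to exercise apply_lora."""

    def __init__(self, dim=16):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.q_proj(x) + self.k_proj(x) + self.v_proj(x)


torch.manual_seed(0)
toy = ToyAttention()
x = torch.randn(2, 16)
before = toy(x).detach()
apply_lora(toy, r=4, alpha=8)
after = toy(x).detach()
trainable = [n for n, p in toy.named_parameters() if p.requires_grad]
print(torch.allclose(before, after))  # unchanged at init because B is zero
print(trainable)
```

Only the lora_A and lora_B tensors of the replaced layers remain trainable. Merging amounts to using merged_weight() as a plain nn.Linear weight, and unmerging subtracts the same product again.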

Installation

git clone https://github.com/schwp/lora-from-scratch
cd lora-from-scratch
uv venv && uv sync

Training

python src/train.py \
  --nb-samples 10000 \
  --max-steps  5000  \
  --epochs     3     \
  --r          8     \
  --alpha      16    \
  --lr         3e-4  \
  --batch-size 1     \
  --grad-accum 8     \
  --run-name   lora-r8-metamath

Monitor training live:

tensorboard --logdir runs/

Training decisions:

  • Gradient checkpointing: intermediate activations are not all kept in memory during the forward pass; they are recomputed during the backward pass, trading extra compute for a smaller activation footprint on the GPU.
  • Warmup: the learning rate is increased gradually over the first steps for more stable training.
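As a rough illustration of the warmup decision, the sketch below ramps the learning rate linearly up to the --lr value from the training command and then holds it constant. The function name lr_at, the 100-step warmup length, and the constant phase afterwards are assumptions; the schedule actually used in src/train.py may differ.

```python
def lr_at(step, base_lr=3e-4, warmup_steps=100):
    """Linear warmup to base_lr, then constant. warmup_steps is a made-up value."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# With --batch-size 1 and --grad-accum 8, gradients from 8 micro-batches
# are accumulated before each optimizer step.
batch_size, grad_accum = 1, 8
effective_batch = batch_size * grad_accum

for step in (0, 49, 99, 100, 4999):
    print(step, f"{lr_at(step):.2e}")
print("effective batch size:", effective_batch)
```

Gradient accumulation is what keeps the effective batch size at 8 while only one sample's activations sit in VRAM at a time.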

Evaluation

# compare base model vs trained adapter
python src/eval.py \
  --lora-path checkpoints/lora-r8-metamath/params_step_5000.pt \
  --nb-samples 200
 
# evaluate all checkpoints against the base model
python src/eval.py \
  --adapter-dir checkpoints/lora-r8-metamath/ \
  --nb-samples 50

Outputs JSON results to eval_results/.

HuggingFace

The best-performing weights are available on HuggingFace.

You can use the model like any other HuggingFace model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("schwp/TinyMathLlama-1.1B")
tokenizer = AutoTokenizer.from_pretrained("schwp/TinyMathLlama-1.1B")

Key Lessons

  • The B matrix is initialized to zero, not randomly. With B = 0 the adapter's contribution is exactly zero at step 0, so the model starts out identical to the base model and adapts to the task gradually instead of being perturbed by random adapter weights.
  • Storing the frozen weights as a buffer (register_buffer) instead of an nn.Parameter keeps them out of the model's parameters: the optimizer never sees them, so no gradients or optimizer state are allocated for them. This matters when you are limited by your hardware and the size of your base model.
  • Load the adapter weights by iterating named_modules(), not named_parameters(). named_modules() yields the module objects themselves, so each LoraLinear can be located and given its own adapter matrices; named_parameters() yields only tensors.
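The buffer and named_modules() points are easy to check on a toy module. The sketch below is not the project's code: TinyAdapter and the layout of the saved dictionary are hypothetical, chosen only to show what each PyTorch iterator returns.

```python
import torch
import torch.nn as nn


class TinyAdapter(nn.Module):
    """Toy stand-in for a LoRA layer; names are illustrative."""

    def __init__(self, dim=4, r=2):
        super().__init__()
        self.register_buffer("weight", torch.randn(dim, dim))  # frozen base weight
        self.lora_A = nn.Parameter(torch.randn(r, dim))
        self.lora_B = nn.Parameter(torch.zeros(dim, r))         # zero init


model = nn.Sequential(TinyAdapter(), nn.ReLU(), TinyAdapter())

# Buffers never show up in named_parameters(), so an optimizer built from
# model.parameters() only ever receives the adapter matrices.
param_names = [n for n, _ in model.named_parameters()]
buffer_names = [n for n, _ in model.named_buffers()]

# named_modules() returns the modules themselves, which is what loading needs.
adapter_modules = {n: m for n, m in model.named_modules() if isinstance(m, TinyAdapter)}

# Hypothetical checkpoint layout: {module_name: {"lora_A": ..., "lora_B": ...}}
saved = {"0": {"lora_A": torch.ones(2, 4), "lora_B": torch.ones(4, 2)}}
with torch.no_grad():
    for name, module in adapter_modules.items():
        if name in saved:
            module.lora_A.copy_(saved[name]["lora_A"])
            module.lora_B.copy_(saved[name]["lora_B"])

print(param_names)
print(buffer_names)
print(list(adapter_modules))
```

Here only the first adapter is overwritten from the saved dictionary; the second keeps its zero-initialized B because no entry exists for it.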

Authors

Pierre SCHWEITZER (schwp)
