A from-scratch implementation of LoRA (Low-Rank Adaptation) for large language models, with no PEFT and no black-box abstractions.
TinyLlama-1.1B is fine-tuned on mathematical reasoning by training adapter weights that amount to 0.1% of its parameter count, on a GTX 1650 Super (4GB VRAM) in about 12 hours.
Most LoRA tutorials stop at calling get_peft_model(). This project goes one
level deeper, implementing the adapter layer, the model injection, the training
loop, and the evaluation pipeline from scratch, with no dependency on PEFT.
Three specific questions drove the implementation:
- Does LoRA actually move the needle on a model this small? TinyLlama-1.1B sits well below the scale where mathematical reasoning typically emerges. Applying LoRA to it on a math dataset is a stress test of the method's lower bound: not a recipe for state-of-the-art results, but a controlled environment to observe what the adapter learns and where it fails.
- What does LoRA cost on consumer hardware? The 1650 Super has 4GB of VRAM, roughly the floor for running a 1B-class model in fp16. Understanding the real memory breakdown (base weights, optimizer state, activations, adapter matrices) and the engineering decisions needed to fit training in that budget (gradient checkpointing, mixed precision, gradient accumulation) is more transferable knowledge than running the same experiment on an A100.
- What breaks when you implement it yourself? Existing libraries abstract away many sharp edges: the transposed weight layout of nn.Linear, stored as (out_features, in_features), the fp32/fp16 dtype split between frozen and trainable parameters, the register_buffer vs nn.Parameter distinction, the named_modules vs named_parameters confusion. Hitting these problems directly and diagnosing them is the point.
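The memory breakdown listed above can be estimated with back-of-the-envelope arithmetic. The sketch below is not measured from this project's runs: it assumes roughly 1.1 billion base parameters stored in fp16, adapter weights and gradients in fp32, and an Adam-style optimizer that keeps two fp32 states per trainable parameter. Activations are left out because they depend on sequence length and on gradient checkpointing. The function name and its arguments are hypothetical.

```python
GIB = 1024 ** 3

def lora_memory_estimate(base_params, trainable_params,
                         base_bytes=2, train_bytes=4, optimizer_states=2):
    """Rough VRAM estimate in bytes, excluding activations and CUDA overhead."""
    base = base_params * base_bytes                 # frozen weights (fp16 assumed)
    adapters = trainable_params * train_bytes       # LoRA A and B (fp32 assumed)
    grads = trainable_params * train_bytes          # one gradient per trainable weight
    optim = trainable_params * train_bytes * optimizer_states  # e.g. Adam's two moments
    return {"base": base, "adapters": adapters, "grads": grads, "optimizer": optim}

# 1.1e9 is an approximation of the base model size, not an exact count.
estimate = lora_memory_estimate(base_params=1_100_000_000, trainable_params=1_126_400)
for name, size in estimate.items():
    print(f"{name:>10}: {size / GIB:.3f} GiB")
print(f"{'total':>10}: {sum(estimate.values()) / GIB:.3f} GiB")
```

Under these assumptions the frozen weights take about 2 GiB while everything tied to the adapters stays under 20 MiB, so whether training fits in 4GB is mostly decided by activations.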
Fine-tuned on MetaMathQA (10k samples), evaluated on the GSM8K test set.
| Model | GSM8K accuracy | Trainable params | % of total |
|---|---|---|---|
| TinyLlama-1.1B (base) | 1.5% | 0 | 0% |
| base + LoRA (r=8, focused on Q+V) | 3.0% | 1,126,400 | 0.10% |
Note
The absolute accuracy is limited by model capacity: TinyLlama is about 6x smaller than the models used in the MetaMath paper, and it was trained here on 2.5% of the recommended data. The result is a 2x relative improvement with 0.1% of parameters trained on consumer hardware, not a claim about mathematical reasoning at scale.
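The trainable-parameter count in the table can be reproduced by hand. The sketch below assumes TinyLlama-1.1B's published shapes (hidden size 2048, 22 layers, 4 key/value heads of dimension 64, so the V projection maps 2048 to 256) and uses 1.1 billion as an approximate total. The helper name is hypothetical.

```python
def lora_param_count(in_features, out_features, r):
    """A has shape (r, in_features) and B has shape (out_features, r)."""
    return r * in_features + out_features * r

# Shapes assumed from the TinyLlama-1.1B configuration.
hidden_size = 2048
num_layers = 22
num_kv_heads = 4
head_dim = 64
r = 8

q_proj = lora_param_count(hidden_size, hidden_size, r)              # 2048 -> 2048
v_proj = lora_param_count(hidden_size, num_kv_heads * head_dim, r)  # 2048 -> 256
trainable = num_layers * (q_proj + v_proj)

print(f"{trainable:,}")                    # 1,126,400
print(f"{100 * trainable / 1.1e9:.2f}%")   # 0.10%
```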
The LoRA adapters saved in checkpoints/ were trained on a GTX 1650 Super (4GB VRAM):
~4.5 hours for 5000 steps, and ~12 hours in total to reach
checkpoints/lora-r8-metamath-full-v2/params_step_13500.pt
lora-from-scratch/
├── src/
│ ├── lora.py # LoraLinear: the core primitive
│ ├── model.py # apply_lora, get_lora_params, save_params
│ ├── train.py # training loop with mixed precision + TensorBoard
│ └── eval.py # GSM8K evaluation + comparison charts
├── checkpoints/ # adapter weights only (must be added to the model)
├── runs/ # TensorBoard logs
├── eval_results/ # evaluation JSON results
└── notebooks/ # exploration and debugging
The whole project is built on top of LoraLinear, a PyTorch class that wraps
nn.Linear layers with the LoRA adapter weights. The original model weights are
frozen, and only the added LoRA adapter matrices are trained.
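To make the wrapper concrete, here is a minimal sketch of what such a layer can look like. It is not the code in src/lora.py: the class name, attribute names, and initialization scheme are assumptions, and the fp32/fp16 dtype split is ignored. It computes y = x Wᵀ + b + (alpha / r) · x Aᵀ Bᵀ with W frozen.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoraLinearSketch(nn.Module):
    """Illustrative LoRA wrapper around an existing nn.Linear."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen base weights are stored as buffers, so they never reach the optimizer.
        self.register_buffer("weight", base.weight.detach().clone())
        bias = base.bias.detach().clone() if base.bias is not None else None
        self.register_buffer("bias", bias)
        self.scaling = alpha / r
        # A starts random and B starts at zero, so B @ A == 0 at step 0.
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))

    def forward(self, x):
        frozen = F.linear(x, self.weight, self.bias)               # x W^T + b
        update = F.linear(F.linear(x, self.lora_A), self.lora_B)   # x A^T B^T
        return frozen + self.scaling * update

torch.manual_seed(0)
base = nn.Linear(16, 4)
layer = LoraLinearSketch(base, r=2, alpha=4)
x = torch.randn(3, 16)
print(torch.allclose(layer(x), base(x)))                # True: B is zero at init
print([name for name, _ in layer.named_parameters()])   # ['lora_A', 'lora_B']
```

Only lora_A and lora_B show up as parameters, which is what keeps the trainable footprint small.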
The adapters are injected by walking through the model tree and replacing the
targeted nn.Linear layers with LoraLinear. Since the base weights are frozen,
the adapter weights can be merged into them and unmerged again.
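The injection and merge steps can be sketched on a toy module. Everything here is illustrative: TinyLora is a compact stand-in for the project's LoraLinear (bias omitted, which matches bias-free projections only), and inject_lora, its targets argument, and the ToyAttention module are hypothetical names rather than the functions in src/model.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLora(nn.Module):
    """Compact stand-in for a LoRA layer, assuming a bias-free nn.Linear."""
    def __init__(self, base: nn.Linear, r: int, alpha: float):
        super().__init__()
        self.register_buffer("weight", base.weight.detach().clone())
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r
        self.merged = False

    def forward(self, x):
        out = F.linear(x, self.weight)
        if not self.merged:  # once merged, the update already lives in self.weight
            out = out + self.scaling * F.linear(F.linear(x, self.lora_A), self.lora_B)
        return out

    @torch.no_grad()
    def merge(self):
        if not self.merged:
            self.weight += self.scaling * (self.lora_B @ self.lora_A)
            self.merged = True

    @torch.no_grad()
    def unmerge(self):
        if self.merged:
            self.weight -= self.scaling * (self.lora_B @ self.lora_A)
            self.merged = False

def inject_lora(model: nn.Module, targets=("q_proj", "v_proj"), r=8, alpha=16.0):
    """Replace every nn.Linear whose attribute name is in `targets`."""
    # Collect first, then replace, so the tree is not mutated while iterating.
    to_replace = [
        (parent, child_name, child)
        for parent in model.modules()
        for child_name, child in parent.named_children()
        if isinstance(child, nn.Linear) and child_name in targets
    ]
    for parent, child_name, child in to_replace:
        setattr(parent, child_name, TinyLora(child, r, alpha))
    return model

class ToyAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8, bias=False)
        self.k_proj = nn.Linear(8, 8, bias=False)
        self.v_proj = nn.Linear(8, 8, bias=False)

    def forward(self, x):
        return self.q_proj(x) + self.k_proj(x) + self.v_proj(x)

torch.manual_seed(0)
toy = inject_lora(ToyAttention(), r=2, alpha=4)
nn.init.normal_(toy.q_proj.lora_B)   # pretend training moved B away from zero
x = torch.randn(5, 8)
before = toy(x)
toy.q_proj.merge()
after = toy(x)
print(type(toy.k_proj).__name__, type(toy.q_proj).__name__)   # Linear TinyLora
print(torch.allclose(before, after, atol=1e-5))                # True
```

Merging folds the scaled product B A into the frozen weight, so inference costs the same as the base model. Unmerging subtracts it again, up to floating-point rounding.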
git clone https://github.com/yourhandle/lora-from-scratch
cd lora-from-scratch
uv venv && uv sync
python src/train.py \
--nb-samples 10000 \
--max-steps 5000 \
--epochs 3 \
--r 8 \
--alpha 16 \
--lr 3e-4 \
--batch-size 1 \
--grad-accum 8 \
--run-name lora-r8-metamath
Monitor training live:
tensorboard --logdir runs/
Training decisions:
- Gradient checkpointing: intermediate activations are discarded during the forward pass and recomputed during the backward pass, trading extra compute for a much smaller activation memory footprint. This lets the model stay on the GPU within the 4GB budget.
- Warmup: the learning rate is ramped up gradually over the first steps, which makes early training more stable.
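The sketch below shows how these pieces typically fit together in PyTorch on a toy network, running on CPU. It is an illustration under assumptions, not the loop in src/train.py: the linear warmup shape, the warmup length of 4 steps, the toy layers, and the squared-output loss are all made up for the example. Only the batch size of 1, the accumulation factor of 8, and the learning rate of 3e-4 mirror the command above.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def linear_warmup(step: int, warmup_steps: int) -> float:
    """Learning-rate multiplier: ramps up linearly to 1.0, then stays flat."""
    return min(1.0, (step + 1) / warmup_steps)

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
head = nn.Linear(16, 1)
params = list(block.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: linear_warmup(step, warmup_steps=4)
)

grad_accum = 8
optimizer_steps = 0
for micro_step in range(32):
    x = torch.randn(1, 16)  # batch size 1
    # Activations inside `block` are dropped after the forward pass and
    # recomputed during backward, instead of being kept in memory.
    hidden = checkpoint(block, x, use_reentrant=False)
    # Divide so the accumulated gradient is an average over the 8 micro-batches.
    loss = head(hidden).pow(2).mean() / grad_accum
    loss.backward()
    if (micro_step + 1) % grad_accum == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)
        optimizer_steps += 1

print(optimizer_steps, scheduler.get_last_lr())
```

With 32 micro-batches and an accumulation factor of 8, the optimizer steps 4 times, and the learning rate has reached its peak by the end of warmup.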
# compare base model vs trained adapter
python src/eval.py \
--lora-path checkpoints/lora-r8-metamath/params_step_5000.pt \
--nb-samples 200
# evaluate all checkpoints against the base model
python src/eval.py \
--adapter-dir checkpoints/lora-r8-metamath/ \
--nb-samples 50
Outputs JSON results to eval_results/.
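Scoring GSM8K usually comes down to comparing the final number in the generation with the reference, whose solution text ends with "#### <answer>". The sketch below shows one common way to do that. It is an assumption about how such a check can work, not a description of src/eval.py, and all function names are hypothetical.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def parse_number(text: str):
    """Return the last number in `text` as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

def gsm8k_reference(answer_field: str):
    """GSM8K reference solutions end with '#### <final answer>'."""
    return parse_number(answer_field.split("####")[-1])

def is_correct(generation: str, answer_field: str) -> bool:
    predicted = parse_number(generation)
    expected = gsm8k_reference(answer_field)
    if predicted is None or expected is None:
        return False
    return abs(predicted - expected) < 1e-6

def accuracy(generations, answer_fields) -> float:
    hits = sum(is_correct(g, a) for g, a in zip(generations, answer_fields))
    return hits / len(answer_fields)

# Made-up examples, not taken from the dataset or from this project's outputs.
refs = ["She sells 9 * 2 = 18 eggs.\n#### 18",
        "Total: 1,000 + 250 = 1,250\n#### 1,250"]
gens = ["... The answer is: 18", "So the total is 1200."]
print(accuracy(gens, refs))   # 0.5
```

Taking the last number is a lenient heuristic: it can credit a generation that merely happens to end on the right value, so stricter parsing may be preferable for reported results.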
The weights of the best-performing model are available on HuggingFace.
You can use the model like any other HuggingFace model:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("schwp/TinyMathLlama-1.1B")
tokenizer = AutoTokenizer.from_pretrained("schwp/TinyMathLlama-1.1B")
- The B matrix is initialized to zero rather than randomly. The product BA is therefore zero at the start, so the adapted model begins as an exact copy of the base model. This stabilizes training: the model adapts to the task gradually instead of starting from a random perturbation.
- The frozen weights are stored as a buffer instead of an nn.Parameter. Buffers do not belong to the model's parameters, so they are never updated by the optimizer, and no gradients or optimizer state are allocated for them. This matters when you are limited by your hardware and the size of your base model.
- The adapter weights are loaded by iterating over named_modules() instead of named_parameters(). Iterating over modules lets each saved adapter be matched to the layer that owns it, so the weights end up in the correct modules of the fine-tuned model.
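The buffer and module-walking points can be checked on a tiny example. The class below and the nested checkpoint dictionary are hypothetical. They only illustrate how PyTorch separates parameters from buffers and how named_modules() exposes whole layers rather than individual tensors.

```python
import torch
import torch.nn as nn

class FrozenPlusAdapter(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("weight", torch.randn(4, 4))   # frozen, not a parameter
        self.lora_A = nn.Parameter(torch.randn(2, 4))
        self.lora_B = nn.Parameter(torch.zeros(4, 2))        # zero init

model = nn.Sequential(nn.Linear(4, 4), FrozenPlusAdapter())

param_names = [name for name, _ in model.named_parameters()]
buffer_names = [name for name, _ in model.named_buffers()]
adapter_modules = [name for name, module in model.named_modules()
                   if isinstance(module, FrozenPlusAdapter)]

print(param_names)       # ['0.weight', '0.bias', '1.lora_A', '1.lora_B']
print(buffer_names)      # ['1.weight']
print(adapter_modules)   # ['1']

# Hypothetical checkpoint layout: one entry per adapter module name.
saved = {"1": {"lora_A": torch.ones(2, 4), "lora_B": torch.ones(4, 2)}}
with torch.no_grad():
    for name, module in model.named_modules():
        if isinstance(module, FrozenPlusAdapter) and name in saved:
            module.lora_A.copy_(saved[name]["lora_A"])
            module.lora_B.copy_(saved[name]["lora_B"])
```

The frozen weight appears in named_buffers() and in the state dict but not in named_parameters(), so an optimizer built from model.parameters() never touches it.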
- LoRA: Low-Rank Adaptation of Large Language Models - Hu et al. 2021
- MetaMath: Bootstrap Your Own Mathematical Questions - Yu et al. 2023
- GSM8K: Training Verifiers to Solve Math Word Problems - Cobbe et al. 2021
- PEFT library - HuggingFace (the production implementation this replaces)
Pierre SCHWEITZER (schwp)