Skip to content

a2Fsa2k/anuGPT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

anuGPT

A lightweight, efficient GPT-style transformer model trained from scratch using CUDA. This project implements a 9.84M parameter language model optimized for story generation, trained on the TinyStories dataset.

Model Specifications

Model Architecture

Component Specification
Parameters 9.84 Million
Layers 6
Attention Heads 8
Embedding Dimension 384
Context Length 256 tokens
Vocabulary Size 256 (byte-level)
Model Type Transformer (GPT-style)

Training Details

Dataset

  • Primary: TinyStories 1MB subset
  • Size: 1,048,576 bytes (~11,373 stories)
  • Domain: Children's short stories

Hardware & Performance

  • GPU: NVIDIA RTX 4050 Laptop (6GB VRAM)
  • Training Time: 3.5 hours
  • Tokens Processed: 528 million
  • Epochs: ~6
  • VRAM Usage: 96 MB
  • GPU Utilization: 99-100%

Optimizer

Custom Tiger optimizer with adaptive learning rate:

η = 2.0 / sqrt(log(1 + params) × params + tokens_processed)

Training Results

Metric Value
Initial Loss 1.946
Final Loss 0.060
Perplexity 1.062
Loss Reduction 97%
Iterations 528

Training Loss Curve

Benchmarks

Generation Quality

  • Coherence: 8.5/10
  • Grammar: 8.5/10
  • Story Structure: 9/10
  • Character Consistency: 8/10

Efficiency Comparison

Model Parameters Training Time Hardware
anuGPT 9.84M 3.5 hours RTX 4050 (6GB)
GPT-2 Small 124M ~weeks Multiple GPUs
TinyStories Baseline ~8M ~10 hours Similar

Model Size: 12.5× smaller than GPT-2 Small while maintaining strong performance on children's story generation.

Installation

Prerequisites

# CUDA Toolkit (tested with CUDA 11.0+)
# cuBLAS library
# GCC/G++ compiler
# Python 3.x (for utilities)

Build

git clone https://github.com/urngmi/cuda_gpt.git
cd anuGPT
make

This will compile three binaries:

  • train - Training binary
  • gpu - GPU inference binary
  • cpu - CPU inference binary (multi-threaded)

Usage

Training

Train on a text dataset:

./train datasets/tinyshakespeare.txt -o model

Options:

  • -o <file> - Output model file (default: model)
  • -i <file> - Input model file (for fine-tuning)
  • -s <size> - Vocabulary size (for non-byte-level tokenization)

The training will automatically:

  • Save checkpoints when loss improves
  • Stop when loss converges (increase > 0.02)
  • Display progress: [tokens_processed_mb] [loss] [time_per_iteration]

Inference

GPU Inference (faster):

./gpu "Once upon a time"

CPU Inference (portable):

./cpu -t 4 "Once upon a time"

Options for CPU:

  • -t <threads> - Number of threads (default: 1)

Example Output

Prompt: "Once upon a time"

Generated:
Once upon a time, there was a little boy named Tim. Tim was very 
nervous. He had lost his toy razor. He loved to pretend to shave 
like his dad. Every day, he would play with his toy razor.

Web Interface

A clean web UI is provided for interactive testing:

cd web-ui
python3 server.py

Then open http://localhost:8080 in your browser.

Features:

  • ChatGPT-style interface
  • Real-time generation with typing effect
  • Easy model testing

See web-ui/README.md for details.

Configuration

Edit config file to adjust model architecture:

const uint64_t context=256;    // Context window size
const uint64_t depth=6;        // Number of transformer layers
const uint64_t heads=8;        // Number of attention heads
const uint64_t embed=heads*48; // Embedding dimension
const uint64_t voca=256;       // Vocabulary size
const uint64_t fullbatch=1ull<<14; // Batch size in tokens

After modifying, rebuild:

make clean
make

Project Structure

anuGPT/
├── assets/
│   ├── loss_curve.png      # Training visualization
│   └── model_specs.png     # Model specifications table
├── datasets/
│   ├── tinyshakespeare.txt # Shakespeare dataset (1.1MB)
│   ├── tinystories_1mb.txt # TinyStories 1MB subset
│   ├── tinystories_10mb.txt # TinyStories 10MB subset
│   └── tinystories_100k.txt # Full TinyStories (85MB)
├── web-ui/
│   ├── index.html         # Web interface
│   ├── style.css          # UI styling
│   ├── script.js          # Frontend logic
│   ├── server.py          # Backend server
│   └── README.md          # Web UI documentation
├── config                  # Model configuration
├── train.cu               # Training implementation
├── gpu.cu                 # GPU inference
├── cpu.cpp                # CPU inference
├── model                  # Trained model weights
├── training.log           # Training metrics log
├── makefile              # Build configuration
└── README.md

Implementation Details

Features

  • BFloat16 Quantization: Reduces memory footprint without significant quality loss
  • Fused Kernels: Optimized CUDA kernels for layer normalization, attention, and FFN
  • Memory Efficient: Unified memory management with CUDA managed memory
  • Custom Attention: Causal self-attention with position embeddings
  • Adaptive Learning: Tiger optimizer with automatic learning rate scheduling

Model Components

  1. Token Embedding: Byte-level (256 vocab) embedding layer
  2. Position Embedding: Learned positional encodings
  3. Transformer Blocks (6 layers):
    • Multi-head self-attention (8 heads)
    • Layer normalization
    • Feed-forward network (384 → 1536 → 384)
    • Residual connections
  4. Output Layer: Linear projection to vocabulary

Performance Tips

For Faster Training

  1. Increase fullbatch size (if VRAM allows)
  2. Use smaller context length for shorter sequences
  3. Reduce model depth/width for faster iterations

For Better Quality

  1. Train on more data (use larger TinyStories subset)
  2. Increase context length for longer coherence
  3. Train for more epochs until convergence

Memory Optimization

  • Current configuration uses ~100MB VRAM
  • Can scale up to ~4-5GB on 6GB cards
  • Adjust batch size to utilize available memory

Training from Scratch

To reproduce the results:

# 1. Build the project
make

# 2. Train on TinyStories 1MB
./train datasets/tinystories_1mb.txt -o model

# 3. Wait ~3-4 hours for convergence

# 4. Test generation
./gpu "Once upon a time"

Expected timeline:

  • First hour: Loss drops to ~0.3
  • 2 hours: Loss ~0.15
  • 3 hours: Loss ~0.08
  • 3.5 hours: Loss ~0.06 (converged)

Fine-tuning

To fine-tune the existing model:

./train datasets/your_data.txt -i model -o model_finetuned

This will continue training from the existing checkpoint.

Technical Specifications

Compute Requirements

  • Minimum: CUDA-capable GPU with 1GB VRAM
  • Recommended: 6GB+ VRAM for comfortable training
  • Optimal: 8GB+ VRAM for larger batch sizes

Memory Breakdown

Model Parameters:    22 MB (9.84M × 2 bytes bfloat16)
Gradients:          22 MB
Optimizer States:   22 MB
Batch Activations:  30 MB (varies with batch size)
─────────────────────────
Total:              ~96 MB

Limitations

  1. Context Length: Limited to 256 tokens (~150-200 words)
  2. Vocabulary: Byte-level tokenization is simple but less efficient than BPE
  3. Domain: Optimized for children's stories; may not generalize to other domains
  4. Long-term Coherence: May lose consistency in very long generations

Future Improvements

  • Implement larger context windows (512, 1024 tokens)
  • Add BPE tokenization for better compression
  • Multi-GPU training support
  • Flash Attention implementation
  • Model quantization (INT8, INT4)
  • Streaming generation API
  • Fine-tuning on specialized domains

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • TinyStories dataset: HuggingFace
  • Inspired by Andrej Karpathy's nanoGPT
  • Tiger optimizer: Custom adaptive optimizer implementation

Contact

For questions or issues, please open an issue on GitHub.


Model Size: 19 MB
Training Cost: ~$0 (local GPU)
Inference Speed: ~45 tokens/second
Last Updated: December 2025

About

Taking wyGPT's high-performance CUDA code and making it teachable - like nanoGPT, but for GPU programming.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors