A complete end-to-end notebook that builds a GPT-2-style language model from scratch, pretrains it on the TinyStories dataset, and then fine-tunes it into a functional Interview Bot using instruction-following data.
Built in Google Colab with a T4 GPU.
| Stage | Description |
|---|---|
| 1. Data Loading | Loads the TinyStories dataset (10k–50k stories) from HuggingFace |
| 2. Tokenization | Builds a custom tokenizer, then upgrades to GPT-2's BPE tokenizer via tiktoken |
| 3. Data Pipeline | Sliding window dataset (GPTDatasetV1) with configurable stride and context length |
| 4. Embeddings | Token embeddings + positional embeddings from scratch |
| 5. Attention | Implements scaled dot-product self-attention, causal masking, and multi-head attention |
| 6. GPT Model | Full GPTModel built with transformer blocks, layer norm, and feed-forward layers |
| 7. Pretraining | Trains the GPT on TinyStories for next-token prediction |
| 8. Fine-tuning | Instruction fine-tunes the pretrained model on Alpaca-style Q&A data |
| 9. Interview Bot | Final model answers CS/Python questions in an ask() interface |
| 10. Export | Saves the fine-tuned model as gpt2_interview_bot.pth and the pretrained base as gpt2_alpaca_base.pth |
notebook/
├── Untitled4.ipynb # Main notebook (this file)
├── gpt2_interview_bot.pth # Fine-tuned model checkpoint
└── gpt2_alpaca_base.pth # Pretrained base checkpoint (pre fine-tune)
pip install datasets tiktoken torch

| Library | Purpose |
|---|---|
| torch | Model building and training |
| tiktoken | GPT-2 BPE tokenizer |
| datasets | Loading TinyStories from HuggingFace |
The GPT model is built entirely from scratch, following the GPT-2 design:
- Vocabulary size: 50,257 (GPT-2 BPE)
- Context length: 256 tokens (pretraining) / 1024 tokens (fine-tuning)
- Embedding dim: configurable (256 used for pretraining)
- Attention: Multi-head causal self-attention with dropout
- Blocks: Stacked transformer blocks with Layer Norm and GELU activation
- Output: Linear projection back to vocabulary size
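The attention step in the list above can be illustrated with a minimal single-head sketch in plain Python. This is not the notebook's implementation: it omits the learned query/key/value projections, dropout, and multiple heads, and the function name `causal_attention` is hypothetical. Restricting each token to positions `0..i` has the same effect as masking future scores with negative infinity before the softmax.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of per-token vectors (lists of floats).
    Hypothetical helper for illustration only.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: token i may only attend to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy embeddings for three tokens, reused as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)

# Changing the last token must not affect earlier outputs.
y = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]
out_changed = causal_attention(y, y, y)
```

The first token can only see itself, so its output equals its own value vector, and editing a later token leaves earlier outputs untouched.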
The model is pretrained on 10,000–50,000 short children's stories from the TinyStories dataset.
Stories are joined with <|endoftext|> separators and fed through a sliding window dataloader (GPTDatasetV1) for next-token prediction.
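As a rough illustration of the sliding window idea, the sketch below builds input/target pairs from a flat list of token IDs. It is a plain-Python stand-in, not the notebook's GPTDatasetV1; the function name `sliding_windows` and the toy IDs are assumptions.

```python
def sliding_windows(token_ids, context_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that the shifted target window still fits.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Toy token IDs standing in for a tokenized story corpus.
ids = list(range(10))
pairs = sliding_windows(ids, context_length=4, stride=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

A stride smaller than the context length produces overlapping windows (more training pairs from the same text), while a stride equal to the context length produces non-overlapping ones.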
Tokenization evolution in the notebook:
- Custom regex-based tokenizer (SimpleTokenizerV1): ~10,980 vocab tokens from TinyStories
- Upgraded to tiktoken GPT-2 BPE tokenizer: 50,257 vocab tokens
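To give a feel for the first step, here is a minimal sketch of a regex-based word-level tokenizer. It is an assumption about what such a tokenizer might look like, not the notebook's SimpleTokenizerV1; the class name `RegexTokenizer`, the split pattern, and the `<|unk|>` fallback token are all hypothetical.

```python
import re

# Split on punctuation and whitespace, keeping punctuation as separate tokens.
_SPLIT = re.compile(r'([,.:;?_!"()\']|--|\s)')


def split_text(text):
    return [piece.strip() for piece in _SPLIT.split(text) if piece.strip()]


class RegexTokenizer:
    """Hypothetical word-level tokenizer built from a training corpus."""

    def __init__(self, corpus):
        tokens = sorted(set(split_text(corpus)))
        # Special tokens: document separator and unknown-word fallback.
        tokens.extend(["<|endoftext|>", "<|unk|>"])
        self.str_to_int = {tok: i for i, tok in enumerate(tokens)}
        self.int_to_str = {i: tok for tok, i in self.str_to_int.items()}

    def encode(self, text):
        unk = self.str_to_int["<|unk|>"]
        return [self.str_to_int.get(tok, unk) for tok in split_text(text)]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that the join puts before punctuation.
        return re.sub(r'\s+([,.:;?!"()\'])', r"\1", text)


tok = RegexTokenizer("Once upon a time, there was a cat.")
ids = tok.encode("a cat.")
print(ids, "->", tok.decode(ids))
print(tok.decode(tok.encode("a dog")))  # "dog" is not in the vocabulary
```

A word-level vocabulary like this cannot represent unseen words, which is one motivation for moving to a subword scheme such as BPE.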
After pretraining, the model is fine-tuned on instruction-following data using the Alpaca prompt format:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
What is the difference between a list and a tuple in Python?
### Response:
A list is mutable while a tuple is immutable...
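A small helper along these lines could produce that prompt. The function name `format_alpaca` is hypothetical, and the blank lines between sections follow the common Alpaca template rather than anything confirmed by the notebook.

```python
def format_alpaca(instruction, response=None):
    """Build an Alpaca-style prompt; append the response for training examples."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{instruction}"
        "\n\n### Response:\n"
    )
    # At inference time the response is left empty for the model to complete.
    return prompt if response is None else prompt + response


train_text = format_alpaca(
    "What is the difference between a list and a tuple in Python?",
    "A list is mutable while a tuple is immutable...",
)
query_text = format_alpaca("What is the time complexity of binary search?")
print(train_text)
```

Using the same template for training and inference matters: the model learns to start answering right after the `### Response:` marker.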
Key fine-tuning details:
- Optimizer: AdamW (lr=5e-5, weight_decay=0.1)
- Batch size: 8
- Epochs: 1–2
- Padding: Custom collate function masks padding tokens with ignore_index=-100 so they don't contribute to loss
- Max sequence length: 1024 tokens
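The padding behaviour described above might look roughly like the following sketch. It works on plain Python lists so it runs without PyTorch; in the notebook the results would presumably be converted to tensors. The name `collate`, the choice of 50256 as the padding ID, and the decision to mask every padded target position are assumptions.

```python
PAD_ID = 50256       # GPT-2's <|endoftext|> ID, reused here as padding (assumption)
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def collate(batch, max_length=1024):
    """Pad token-ID lists to a common length and build shifted, masked targets."""
    # Truncate first; +1 because inputs and targets are offset by one token.
    batch = [seq[: max_length + 1] for seq in batch]
    width = max(len(seq) for seq in batch)
    inputs, targets = [], []
    for seq in batch:
        pad = width - len(seq)
        padded = seq + [PAD_ID] * pad
        inputs.append(padded[:-1])
        # Targets are the sequence shifted by one; padded positions are ignored.
        targets.append(seq[1:] + [IGNORE_INDEX] * pad)
    return inputs, targets


inputs, targets = collate([[1, 2, 3, 4], [5, 6]])
print(inputs)
print(targets)
```

Because -100 is the default `ignore_index` of PyTorch's cross-entropy loss, positions carrying that target value are skipped when the loss is averaged.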
After fine-tuning, use the ask() function to query the model:
questions = [
    "What is the difference between a list and a tuple in Python?",
    "Explain what a decorator is in Python.",
    "What is the time complexity of binary search?",
    "What is a deadlock in operating systems?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {ask(q)}")
    print("---")

Sample output:
Q: What is a deadlock in operating systems?
A: A deadlock is a situation in which a process tries to do more work
than it has access to doing...
Note: Because this is a small GPT-2-style model trained for a short time, answers are directionally correct but may lack precision or contain errors. Fine-tuning on more data or for more epochs should improve quality.
Save:

torch.save(model.state_dict(), "gpt2_interview_bot.pth")

Download from Colab:

from google.colab import files
files.download("gpt2_interview_bot.pth")

Reload:

model = GPTModel(BASE_CONFIG)
model.load_state_dict(torch.load("gpt2_interview_bot.pth", map_location=device))
model.eval()

Concepts covered:
- Byte Pair Encoding (BPE) tokenization
- Token + positional embeddings
- Scaled dot-product attention and causal masking
- Multi-head self-attention
- Transformer blocks with residual connections
- Next-token prediction pretraining
- Instruction fine-tuning with masked loss (padding tokens ignored)
- Greedy / EOS-aware text generation
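
As an illustration of the last item, greedy EOS-aware generation can be sketched as a loop that repeatedly picks the highest-scoring next token and stops at the end-of-text ID. This is not the notebook's code: `greedy_generate` and `toy_scores` are hypothetical names, and a toy scoring function stands in for the trained model so the example runs on its own.

```python
def greedy_generate(next_token_scores, prompt_ids, max_new_tokens,
                    context_length, eos_id=50256):
    """Greedy decoding that stops early when the EOS token is predicted."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        # Crop to the most recent tokens the model can see.
        window = ids[-context_length:]
        scores = next_token_scores(window)
        # Greedy decoding: pick the highest-scoring token ID.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        ids.append(next_id)
    # Return only the newly generated tokens.
    return ids[len(prompt_ids):]


def toy_scores(window):
    """Toy stand-in for a model: favours (last token + 1) in a 6-token vocabulary."""
    scores = [0.0] * 6
    scores[(window[-1] + 1) % 6] = 1.0
    return scores


# Token 5 plays the role of EOS in this toy vocabulary.
generated = greedy_generate(toy_scores, [1], max_new_tokens=10,
                            context_length=4, eos_id=5)
print(generated)
```

With a real model, `next_token_scores` would run a forward pass and return the logits for the final position; the stopping logic stays the same.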