GPT from Scratch — Story Pretraining & Interview Bot Fine-tuning

A complete end-to-end notebook that builds a GPT-2-style language model from scratch, pretrains it on the TinyStories dataset, and then fine-tunes it into a functional Interview Bot using instruction-following data.

Built in Google Colab with a T4 GPU.

What This Notebook Does

| Stage | Description |
|---|---|
| 1. Data Loading | Loads the TinyStories dataset (10k–50k stories) from HuggingFace |
| 2. Tokenization | Builds a custom tokenizer, then upgrades to GPT-2's BPE tokenizer via `tiktoken` |
| 3. Data Pipeline | Sliding window dataset (`GPTDatasetV1`) with configurable stride and context length |
| 4. Embeddings | Token embeddings + positional embeddings from scratch |
| 5. Attention | Implements scaled dot-product self-attention, causal masking, and multi-head attention |
| 6. GPT Model | Full `GPTModel` built with transformer blocks, layer norm, and feed-forward layers |
| 7. Pretraining | Trains the GPT on TinyStories for next-token prediction |
| 8. Fine-tuning | Instruction fine-tunes the pretrained model on Alpaca-style Q&A data |
| 9. Interview Bot | Final model answers CS/Python questions in an `ask()` interface |
| 10. Export | Saves the model as `gpt2_interview_bot.pth` and `gpt2_alpaca_base.pth` |

Project Structure

notebook/
├── Untitled4.ipynb          # Main notebook
├── gpt2_interview_bot.pth   # Fine-tuned model checkpoint
└── gpt2_alpaca_base.pth     # Pretrained base checkpoint (pre fine-tune)

Requirements

pip install datasets tiktoken torch

| Library | Purpose |
|---|---|
| `torch` | Model building and training |
| `tiktoken` | GPT-2 BPE tokenizer |
| `datasets` | Loading TinyStories from HuggingFace |

Model Architecture

The GPT model is built entirely from scratch, following the GPT-2 design:

Vocabulary size: 50,257 (GPT-2 BPE)
Context length: 256 tokens (pretraining) / 1024 tokens (fine-tuning)
Embedding dim: configurable (256 used for pretraining)
Attention: Multi-head causal self-attention with dropout
Blocks: Stacked transformer blocks with Layer Norm and GELU activation
Output: Linear projection back to vocabulary size
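
To make the attention line above concrete, here is a minimal, framework-free sketch of single-head scaled dot-product attention with a causal mask. It is not the notebook's implementation: the function name `causal_attention` is hypothetical, and the learned query/key/value projections, multiple heads, and dropout that the model uses are left out so the masking logic stays visible.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of float vectors, one vector per token.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: token i only scores positions 0..i, never the future.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Output is the attention-weighted sum of the visible value vectors.
        dim = len(values[0])
        outputs.append(
            [sum(w * values[j][c] for j, w in enumerate(weights)) for c in range(dim)]
        )
    return outputs


# Toy input: three tokens with 2-dimensional embeddings (illustrative values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(x, x, x)
print(out[0])  # the first token can only attend to itself
```

Because of the mask, changing a later token never changes the outputs for earlier positions, which is what allows the model to be trained on next-token prediction.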

Stage 1 — Pretraining on TinyStories

The model is pretrained on 10,000–50,000 short children's stories from the TinyStories dataset.

Stories are joined with `<|endoftext|>` separators and fed through a sliding-window dataloader (`GPTDatasetV1`) for next-token prediction.
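
The sliding-window idea can be sketched without any framework. The helper below is a hypothetical illustration, not the notebook's `GPTDatasetV1`: it assumes the text has already been converted to a flat list of token IDs, and it returns input/target pairs in which each target is the input shifted one token to the right.

```python
def sliding_windows(token_ids, context_length, stride):
    """Return (inputs, targets) pairs for next-token prediction."""
    pairs = []
    # Stop early enough that the shifted target window still fits.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = list(range(10))  # stand-in for real BPE token IDs
pairs = sliding_windows(token_ids, context_length=4, stride=2)
for inputs, targets in pairs:
    print(inputs, "->", targets)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [2, 3, 4, 5] -> [3, 4, 5, 6]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

A stride smaller than the context length makes consecutive windows overlap, which yields more training samples from the same text; a stride equal to the context length uses each token as an input exactly once.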

Tokenization evolution in the notebook:

Custom regex-based tokenizer (`SimpleTokenizerV1`): ~10,980 vocab tokens from TinyStories
Upgraded to `tiktoken` GPT-2 BPE tokenizer: 50,257 vocab tokens
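
As a rough picture of the first step, the sketch below builds a word-level vocabulary with a regular expression. It is an assumption-laden illustration rather than the notebook's `SimpleTokenizerV1`: the split pattern, the helper names, and the `<|unk|>` fallback for unseen words are all choices made for this example.

```python
import re


def split_tokens(text):
    """Split on whitespace and punctuation, keeping punctuation as tokens."""
    pieces = re.split(r'([,.:;?_!"()\']|--|\s)', text)
    return [p.strip() for p in pieces if p.strip()]


def build_vocab(tokens):
    """Map each unique token to an integer ID, then add two special tokens."""
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    for special in ("<|endoftext|>", "<|unk|>"):  # assumed special tokens
        vocab.setdefault(special, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert text to IDs; words outside the vocabulary map to <|unk|>."""
    unk = vocab["<|unk|>"]
    return [vocab.get(tok, unk) for tok in split_tokens(text)]


corpus = "Once upon a time, there was a little dog."
vocab = build_vocab(split_tokens(corpus))
ids = encode("a little cat.", vocab)
print(len(vocab), ids)
```

The word "cat" never appears in the toy corpus, so it falls back to `<|unk|>`. A BPE tokenizer avoids this problem by splitting unknown words into subword units, which is one practical reason for moving to GPT-2's tokenizer.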

Stage 2 — Fine-tuning as an Interview Bot

After pretraining, the model is fine-tuned on instruction-following data using the Alpaca prompt format:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is the difference between a list and a tuple in Python?

### Response:
A list is mutable while a tuple is immutable...
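
A small helper can produce this template for both training and inference. The function name `format_prompt` is hypothetical and the notebook may structure this differently; the sketch only reproduces the template text shown above, without the optional input field that some Alpaca-style datasets include.

```python
def format_prompt(instruction, response=None):
    """Build an Alpaca-style prompt; append the response for training examples."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{instruction}"
        "\n\n### Response:\n"
    )
    if response is not None:
        prompt += response
    return prompt


question = "What is the difference between a list and a tuple in Python?"

# Training example: instruction and reference answer together.
train_text = format_prompt(question, "A list is mutable while a tuple is immutable...")

# Inference prompt: ends right after the response header so the model continues it.
infer_text = format_prompt(question)
print(infer_text)
```

Using the same function at training and inference time keeps the prompt identical in both settings, which matters because a small model is sensitive to formatting differences.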

Key fine-tuning details:

Optimizer: AdamW (lr=5e-5, weight_decay=0.1)
Batch size: 8
Epochs: 1–2
Padding: A custom collate function sets padded target positions to -100, the `ignore_index` of the cross-entropy loss, so they don't contribute to the loss
Max sequence length: 1024 tokens
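
The padding and masking step can be sketched in plain Python. This is not the notebook's collate function: the name `collate`, the use of token ID 50256 (GPT-2's `<|endoftext|>`) as the padding token, and the choice to keep one end-of-text token as a real target per sequence are assumptions of this example. In the notebook the resulting lists would be torch tensors.

```python
PAD_ID = 50256        # assumption: GPT-2's <|endoftext|> ID doubles as padding
IGNORE_INDEX = -100   # target value that the cross-entropy loss skips


def collate(batch, max_length=1024):
    """Pad token-ID lists to equal length and build masked next-token targets."""
    batch = [seq[:max_length] for seq in batch]  # enforce the length cap
    # One extra slot so every sequence has an end-of-text token to predict.
    longest = max(len(seq) for seq in batch) + 1
    inputs, targets = [], []
    for seq in batch:
        padded = seq + [PAD_ID] * (longest - len(seq))
        inp = padded[:-1]   # model input
        tgt = padded[1:]    # input shifted one position to the right
        # Keep the first padding token as a real target, mask the rest.
        n_real = len(seq)
        tgt = tgt[:n_real] + [IGNORE_INDEX] * (len(tgt) - n_real)
        inputs.append(inp)
        targets.append(tgt)
    return inputs, targets


inputs, targets = collate([[1, 2, 3], [4, 5]])
print(inputs)   # [[1, 2, 3], [4, 5, 50256]]
print(targets)  # [[2, 3, 50256], [5, 50256, -100]]
```

The value -100 is the default `ignore_index` of `torch.nn.functional.cross_entropy`, so targets built this way need no extra arguments when the loss is computed.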

Using the Interview Bot

After fine-tuning, use the `ask()` function to query the model:

questions = [
    "What is the difference between a list and a tuple in Python?",
    "Explain what a decorator is in Python.",
    "What is the time complexity of binary search?",
    "What is a deadlock in operating systems?",
]

for q in questions:
    print(f"Q: {q}")
    print(f"A: {ask(q)}")
    print("---")

Sample output:

Q: What is a deadlock in operating systems?
A: A deadlock is a situation in which a process tries to do more work
   than it has access to doing...

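Note: This is a small GPT-2-style model trained for a short time, so answers tend to be fluent and on-topic but imprecise or wrong. The sample above, for example, does not match the standard definition of a deadlock (two or more processes each waiting indefinitely for a resource held by another). Fine-tuning on more data or for more epochs should improve quality.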
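
The decoding loop behind a function like `ask()` can be sketched as greedy, EOS-aware generation. The code below is a framework-free illustration, not the notebook's implementation: `generate_greedy` and `toy_model` are hypothetical names, and the model is represented by any callable that maps a list of token IDs to a list of next-token logits.

```python
def generate_greedy(next_token_logits, prompt_ids, max_new_tokens, context_size, eos_id):
    """Append the highest-scoring token each step; stop early at end-of-text."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_size:]  # crop to the model's context length
        logits = next_token_logits(window)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break  # the model signalled that the answer is complete
        ids.append(next_id)
    return ids[len(prompt_ids):]  # only the newly generated tokens


def toy_model(window):
    """Stand-in model over a 5-token vocabulary: always predicts last ID + 1."""
    logits = [0.0] * 5
    logits[(window[-1] + 1) % 5] = 1.0
    return logits


generated = generate_greedy(toy_model, [0], max_new_tokens=10, context_size=8, eos_id=4)
print(generated)  # [1, 2, 3]
```

Stopping at the end-of-text token keeps answers from running on until the token budget is exhausted. Greedy decoding is deterministic, so the same question always yields the same answer.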

Saving & Loading the Model

Save:

torch.save(model.state_dict(), "gpt2_interview_bot.pth")

Download from Colab:

from google.colab import files
files.download("gpt2_interview_bot.pth")

Reload:

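device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GPTModel(BASE_CONFIG)  # must match the config used when the checkpoint was saved
model.load_state_dict(torch.load("gpt2_interview_bot.pth", map_location=device))
model.to(device)
model.eval()  # disable dropout for inference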

Key Concepts Covered

Byte Pair Encoding (BPE) tokenization
Token + positional embeddings
Scaled dot-product attention and causal masking
Multi-head self-attention
Transformer blocks with residual connections
Next-token prediction pretraining
Instruction fine-tuning with masked loss (padding tokens ignored)
Greedy / EOS-aware text generation
