SMS Spam Classification Pipeline

An ensemble ML pipeline for SMS spam detection using 3 calibrated scikit-learn models with soft voting, served via FastAPI with PostgreSQL persistence.

Architecture

SMS Text → TF-IDF (1000 features) → Calibrated Ensemble → ham/spam
                                        ├── GaussianNB
                                        ├── LogisticRegression
                                        └── SVC (RBF kernel)

Each base model is wrapped with CalibratedClassifierCV(cv=5) for reliable probability estimates, then combined via VotingClassifier(voting='soft').
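To make the soft-voting step concrete, here is a minimal pure-Python sketch of how averaged probabilities turn into a label. The probability values are made up for illustration and are not outputs of the trained models; `soft_vote` is a hypothetical helper, not a function from this repository.

```python
from statistics import fmean

LABELS = ("ham", "spam")  # index 0 = ham, index 1 = spam


def soft_vote(model_probs):
    """Average per-class probabilities across models; return (label, mean_probs)."""
    n_classes = len(model_probs[0])
    mean_probs = [fmean(p[c] for p in model_probs) for c in range(n_classes)]
    winner = max(range(n_classes), key=mean_probs.__getitem__)
    return LABELS[winner], mean_probs


# Hypothetical calibrated [P(ham), P(spam)] outputs for one message.
example_probs = [
    [0.55, 0.45],  # GaussianNB
    [0.55, 0.45],  # LogisticRegression
    [0.05, 0.95],  # SVC (RBF kernel)
]

label, mean_probs = soft_vote(example_probs)
print(label, [round(p, 3) for p in mean_probs])
```

In this example two of the three models lean towards ham, so a majority (hard) vote would say ham, while the averaged probabilities favour spam. This is why soft voting depends on well-calibrated probabilities from each base model.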

Performance

Metric	Score
Accuracy	98.21%
Precision	98.51%
Recall	88.00%
F1-score	92.96%
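
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1-score can be recomputed from the two rows above. A small sketch (the helper name `f1_from` is only illustrative):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.9851
reported_recall = 0.8800

f1 = f1_from(reported_precision, reported_recall)
print(f"F1 = {f1:.2%}")
```

The gap between the two inputs is worth noting: almost every message flagged as spam really is spam, while 12% of spam messages in the test split go undetected.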

Dataset

SMS Spam Collection — 5,572 SMS messages (4,825 ham / 747 spam) located in Dataset/spam.csv.
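
The class counts above imply a strongly imbalanced dataset, which helps when reading the accuracy figure. A short arithmetic sketch using only the numbers stated here:

```python
ham, spam = 4825, 747
total = ham + spam

spam_share = spam / total        # fraction of messages that are spam
majority_baseline = ham / total  # accuracy of always predicting "ham"

print(f"total={total}, spam share={spam_share:.2%}, always-ham accuracy={majority_baseline:.2%}")
```

Because a classifier that always answers "ham" would already be right on most messages, precision and recall on the spam class are more informative than accuracy alone.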

Quick Start

1. Set up environment

uv venv .venv
source .venv/bin/activate
uv pip install scikit-learn pandas numpy joblib fastapi uvicorn sqlalchemy psycopg2-binary python-dotenv

2. Train the model

python train.py

This will:

Load and preprocess the dataset
Extract TF-IDF features (max 1000)
Train 3 calibrated models + ensemble
Evaluate on 20% test split
Save ensemble.pkl and vectorizer.pkl
Run verification predictions
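
To illustrate the TF-IDF step, here is a pure-Python sketch of the weighting. It assumes train.py uses scikit-learn's TfidfVectorizer with default settings apart from max_features (smoothed IDF, raw term counts, L2-normalised rows); that is an assumption, not something confirmed by this README. The functions and the three sample messages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_tfidf(docs, max_features=1000):
    """Return (vocabulary, idf) built from the most frequent terms in docs."""
    counts = Counter()    # total occurrences of each term
    doc_freq = Counter()  # number of documents containing each term
    for doc in docs:
        tokens = tokenize(doc)
        counts.update(tokens)
        doc_freq.update(set(tokens))
    vocab = sorted(term for term, _ in counts.most_common(max_features))
    n = len(docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def transform(doc, vocab, idf):
    """Map one document to an L2-normalised TF-IDF vector over vocab."""
    tf = Counter(tokenize(doc))
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


# Hypothetical messages, not taken from Dataset/spam.csv.
docs = [
    "Free prize, claim your free prize now",
    "See you at lunch",
    "Claim your lunch now",
]
vocab, idf = fit_tfidf(docs)
vector = transform(docs[0], vocab, idf)
print(len(vocab), round(idf["free"], 3), round(idf["claim"], 3))
```

Terms that appear in fewer messages ("free") receive a higher IDF than terms shared across messages ("claim"), and max_features caps the vocabulary at the most frequent terms.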

3. Evaluate the model

python evaluate.py

This prints the full test-set metrics (accuracy, precision, recall, F1, ROC-AUC), a confusion matrix, confidence statistics, misclassified samples, and batch predictions on 8 example texts. It then starts an interactive mode where you can type any message to classify it.
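
For readers unfamiliar with these metrics, the sketch below shows how a confusion matrix and ROC-AUC can be computed by hand. It is not the code in evaluate.py; the labels and scores are hypothetical, and both helper functions are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 = spam."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def roc_auc(y_true, scores):
    """Chance that a random spam message outscores a random ham one (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted spam probabilities for six messages.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.05, 0.20, 0.60, 0.40, 0.85, 0.95]
y_pred = [int(s >= 0.5) for s in scores]  # threshold at 0.5

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
auc = roc_auc(y_true, scores)
print((tn, fp, fn, tp), round(auc, 4))
```

Unlike accuracy, precision, and recall, ROC-AUC does not depend on the 0.5 threshold. It only measures how well the scores rank spam above ham.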

4. Configure PostgreSQL

Edit .env with your database credentials:

DATABASE_URL=postgresql://user:password@localhost:5432/spam_db
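
If the connection fails, it can help to check how the URL is being split into its parts. A small standard-library sketch (the helper `describe_database_url` is hypothetical and not part of database.py):

```python
from urllib.parse import urlsplit


def describe_database_url(url):
    """Break a database URL into its components without exposing the password."""
    parts = urlsplit(url)
    return {
        "driver": parts.scheme,
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
        "has_password": parts.password is not None,
    }


info = describe_database_url("postgresql://user:password@localhost:5432/spam_db")
print(info)
```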

5. Run the API

uvicorn app:app --reload

API docs at http://localhost:8000/docs

Endpoint	Method	Description
`/predict`	POST	Classify a message → saved to Postgres
`/predictions`	GET	List past predictions (filter by `?label=spam`)
`/predictions/stats`	GET	Spam/ham counts
`/health`	GET	Health check

Example:

curl -X POST http://localhost:8000/predict \
  -H 'Content-Type: application/json' \
  -d '{"text": "You won a free iPhone!"}'
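
The same request can be built from Python with the standard library. This sketch only constructs the request and the filtered listing URL. It assumes the API is reachable at http://localhost:8000 as above, and the helper names are illustrative. The shape of the response body is not documented here, so it is not parsed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # assumed local address, as in the curl example


def build_predict_request(text, base_url=BASE_URL):
    """Build the POST /predict request with a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predictions_url(label=None, base_url=BASE_URL):
    """Build the GET /predictions URL, optionally filtered by label."""
    query = f"?{urlencode({'label': label})}" if label else ""
    return f"{base_url}/predictions{query}"


request = build_predict_request("You won a free iPhone!")
print(request.get_method(), request.full_url)
print(predictions_url("spam"))
```

With the API running, passing `request` to `urllib.request.urlopen` sends it.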

6. Use the saved model directly

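import joblib

# Load the artifacts saved by train.py
model = joblib.load("ensemble.pkl")
vectorizer = joblib.load("vectorizer.pkl")

# GaussianNB requires dense input, so convert the sparse TF-IDF matrix
features = vectorizer.transform(["You won a free prize!"]).toarray()

print(model.predict(features)[0])  # 0=ham, 1=spam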

7. Run using Docker (Containerized Setup)

Because the pre-trained model files (*.pkl) are large, they are excluded from the git repository. The Docker setup uses a multi-stage build to train the models automatically inside the container and serve them securely.

Note: Ensure your .env contains your external Render PostgreSQL connection string (DATABASE_URL).

# Build and spin up the FastAPI service
docker-compose up --build -d

Stop the service when finished:

docker-compose down

Project Structure

mlops/
├── Dataset/
│   └── spam.csv          # SMS Spam Collection
├── train.py              # Training pipeline
├── evaluate.py           # Evaluation & interactive inference
├── app.py                # FastAPI application
├── database.py           # SQLAlchemy models & DB config
├── Dockerfile            # Multi-stage container builder
├── docker-compose.yml    # App orchestrator (loads .env)
├── requirements.txt      # Pinned ML dependencies
├── static/               
│   ├── index.html        # Clean B&W UI
│   └── style.css         
├── .env                  # PostgreSQL credentials (not committed)
├── .dockerignore
├── .gitignore
└── README.md

Requirements

Python 3.10+
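scikit-learn, pandas, numpy, joblib (training and inference)
fastapi, uvicorn, sqlalchemy, psycopg2-binary, python-dotenv (API and database)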
CPU only — no GPU required
