📧 Spam Email Classifier using Machine Learning

A Machine Learning and Natural Language Processing (NLP) project that automatically classifies SMS and email messages as Spam or Ham (Not Spam).

The project cleans message text with NLP techniques, converts it to TF-IDF features, and compares three classifiers: Multinomial Naive Bayes, Logistic Regression, and a Support Vector Machine (SVM). The SVM achieved the best overall performance in evaluation and was selected for deployment in the Streamlit web application.

🎥 Project Demo

The application provides a simple interface where users can enter an SMS or email message and instantly receive a prediction indicating whether the message is Spam or Ham.

[Screenshot: Home Page]

🚀 Features

📩 Spam & Ham Message Classification
🧹 NLP-based Text Cleaning and Preprocessing
🔤 TF-IDF Feature Extraction
🤖 Comparison of Multiple Machine Learning Models (Multinomial Naive Bayes, Logistic Regression & Support Vector Machine)
🏆 Automatic Selection of the Best Performing Model
📊 Exploratory Data Analysis (EDA)
☁️ Word Cloud Visualization
📈 Confusion Matrix and Performance Evaluation
🌐 Interactive Streamlit Web Application
💾 Model Serialization using Pickle
⚡ Real-Time Spam Prediction

📂 Project Structure

Spam_Email_Classifier/
│
├── app/
│   └── app.py
│
├── dataset/
│   └── SMSSpamCollection
│
├── model/
│   ├── spam_model.pkl
│   └── vectorizer.pkl
│
├── notebook/
│   └── spam_classifier.ipynb
│
├── output/
│   ├── app_home.png
│   ├── spam_prediction.png
│   ├── ham_prediction.png
│   ├── spam_vs_ham.png
│   ├── character_distribution.png
│   ├── word_distribution.png
│   ├── correlation_heatmap.png
│   ├── spam_wordcloud.png
│   ├── ham_wordcloud.png
│   └── confusion_matrix.png
│
├── README.md
├── requirements.txt
├── .gitignore
└── LICENSE

📊 Dataset

Dataset Name: SMS Spam Collection Dataset
Source: UCI Machine Learning Repository
Total Messages: 5,572
Classes: Spam, Ham
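As a minimal sketch of the data-loading step, the snippet below reads the file with pandas. It assumes the raw UCI layout (one message per line, label and text separated by a tab, no header row). The column names `label` and `message`, the `is_spam` column, and the `load_sms_dataset` helper are illustrative choices, not names taken from the notebook.

```python
import pandas as pd


def load_sms_dataset(source="dataset/SMSSpamCollection"):
    """Read the tab-separated SMS file into a DataFrame.

    `source` may be a path or any file-like object. The column names are
    assumptions made for this sketch.
    """
    df = pd.read_csv(source, sep="\t", header=None, names=["label", "message"])
    # Encode the target as 1 for spam and 0 for ham (an assumed convention).
    df["is_spam"] = (df["label"] == "spam").astype(int)
    return df
```

Calling `load_sms_dataset()["label"].value_counts()` then gives a quick view of the class balance.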

🔄 Machine Learning Workflow

The complete workflow followed in this project is:

1. Dataset Collection
2. Data Loading
3. Data Cleaning
4. Exploratory Data Analysis (EDA)
5. Text Preprocessing
   - Lowercase Conversion
   - Tokenization
   - Removing Special Characters
   - Removing Stopwords
   - Stemming
6. Feature Extraction using TF-IDF
7. Train-Test Split
8. Model Training
9. Model Evaluation
10. Saving the Model
11. Streamlit Deployment
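The text preprocessing steps in the workflow can be sketched with the standard library alone. This is an illustration, not the notebook's code: `STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, and `simple_stem` is a crude suffix stripper standing in for a real stemmer (the README does not say which NLTK stemmer is used).

```python
import re

# Tiny illustrative stopword set; a real run would use a full list such as NLTK's.
STOPWORDS = frozenset(
    {"a", "an", "and", "at", "for", "have", "here", "i", "in", "is",
     "it", "of", "the", "to", "you", "your"}
)

# Matching only letters and digits tokenizes the text and drops
# punctuation and other special characters in one pass.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def simple_stem(word):
    """Strip one common suffix; a rough stand-in for a proper stemmer."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stopwords=STOPWORDS, stem=simple_stem):
    """Lowercase, tokenize, remove stopwords, and stem a single message."""
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(stem(token) for token in tokens if token not in stopwords)
```

With NLTK installed and its stopword corpus downloaded, `nltk.corpus.stopwords.words("english")` (wrapped in a `set`) and `nltk.stem.PorterStemmer().stem` can be passed as the `stopwords` and `stem` arguments.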

🛠 Technologies Used

| Category | Technologies |
| --- | --- |
| Programming Language | Python |
| Data Manipulation | Pandas, NumPy |
| Machine Learning | Scikit-learn |
| NLP | NLTK |
| Visualization | Matplotlib, Seaborn |
| Web Framework | Streamlit |
| Model Serialization | Pickle |

📈 Model Comparison

The performance of three machine learning algorithms was evaluated using Accuracy, Precision, Recall, and F1-Score.

| Model | Accuracy | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Multinomial Naive Bayes | 97% | 100% | 81% | 90% |
| Logistic Regression | 96% | 96% | 74% | 83% |
| Support Vector Machine (SVM) | 98% | 98% | 98% | 98% |
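A comparison like the one in the table could be produced along the following lines with scikit-learn. This is a sketch under stated assumptions: the 80/20 stratified split, the random seed, the linear SVM kernel, the default TF-IDF settings, and the `compare_models` helper are all illustrative, since the README does not give the notebook's actual settings. Spam is assumed to be encoded as 1 and ham as 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC


def compare_models(messages, labels, test_size=0.2, seed=42):
    """Train three classifiers on TF-IDF features and return test-set metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        messages, labels, test_size=test_size, random_state=seed, stratify=labels
    )

    # Fit the vectorizer on the training text only, so that no vocabulary
    # or document-frequency information leaks in from the test set.
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    candidates = {
        "Multinomial Naive Bayes": MultinomialNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Machine (SVM)": SVC(kernel="linear"),  # kernel is an assumption
    }

    results = {}
    for name, model in candidates.items():
        model.fit(X_train_vec, y_train)
        y_pred = model.predict(X_test_vec)
        # Precision, recall, and F1 are reported for the spam class (label 1).
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, y_pred, average="binary", pos_label=1, zero_division=0
        )
        results[name] = {
            "accuracy": float(accuracy_score(y_test, y_pred)),
            "precision": float(precision),
            "recall": float(recall),
            "f1": float(f1),
        }
    return results
```

Reporting metrics for the spam class alone, rather than averaging over both classes, keeps a model that misses many spam messages from looking better than it is on an imbalanced dataset.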

🏆 Best Model

The Support Vector Machine (SVM) achieved the highest overall performance with an accuracy of 98% and was selected as the final model for deployment in the Streamlit application.
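Once a model is selected, it has to be persisted together with the fitted vectorizer so the app can reuse both. The sketch below shows one way to do that with `pickle`, using the file names from the project structure (`model/spam_model.pkl` and `model/vectorizer.pkl`). The helper names `save_artifacts` and `load_artifacts` are hypothetical.

```python
import pickle
from pathlib import Path


def save_artifacts(model, vectorizer, model_dir="model"):
    """Write the trained model and fitted vectorizer as two pickle files."""
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    with open(model_dir / "spam_model.pkl", "wb") as f:
        pickle.dump(model, f)
    with open(model_dir / "vectorizer.pkl", "wb") as f:
        pickle.dump(vectorizer, f)


def load_artifacts(model_dir="model"):
    """Read both artifacts back and return them as (model, vectorizer)."""
    model_dir = Path(model_dir)
    with open(model_dir / "spam_model.pkl", "rb") as f:
        model = pickle.load(f)
    with open(model_dir / "vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer
```

The vectorizer must be saved alongside the model because the classifier only understands the feature columns that this particular fitted vectorizer produces. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.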

🎯 Results

Successfully classified SMS and email messages into Spam and Ham.
Compared three supervised machine learning algorithms.
Achieved 98% overall accuracy using the Support Vector Machine (SVM).
Developed an interactive Streamlit web application for real-time spam detection.
Implemented a complete NLP pipeline including text preprocessing, TF-IDF vectorization, model training, evaluation, and deployment.

📊 Model Comparison Chart

[Figure: performance comparison of the evaluated machine learning models]

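📷 Project Outputs

The screenshots and plots are stored in the output/ folder.

🏠 Streamlit Home Page: output/app_home.png
🚨 Spam Prediction: output/spam_prediction.png
✅ Ham Prediction: output/ham_prediction.png
📊 Spam vs Ham Distribution: output/spam_vs_ham.png
🔤 Character Distribution: output/character_distribution.png
📝 Word Distribution: output/word_distribution.png
🔥 Correlation Heatmap: output/correlation_heatmap.png
☁️ Spam Word Cloud: output/spam_wordcloud.png
☁️ Ham Word Cloud: output/ham_wordcloud.png
📉 Confusion Matrix: output/confusion_matrix.png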

⚠️ Challenges Faced

Cleaning noisy SMS text data.
Removing stopwords while preserving useful information.
Selecting the best-performing classification model.
Saving trained models for deployment.
Building a responsive Streamlit application.

📚 Learning Outcomes

This project helped me gain practical experience in:

Natural Language Processing (NLP)
Text preprocessing techniques
TF-IDF Vectorization
Machine Learning model training and evaluation
Model comparison and selection
Streamlit web application development
Model serialization using Pickle
Git and GitHub for project version control

⚙️ Installation Guide

Clone the Repository

git clone https://github.com/harshvardhan4096/Spam_Email_Classifier.git

Navigate to the Project Folder

cd Spam_Email_Classifier

Install Dependencies

pip install -r requirements.txt

Run the Streamlit Application

streamlit run app/app.py
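For orientation, `app/app.py` might look roughly like the sketch below. It is not the repository's actual file: the function names, widget labels, and the assumption that the model emits 1 for spam are illustrative, and the relative `model/` path assumes the command is run from the repository root. Any text cleaning applied during training would also need to be applied to the message before vectorizing.

```python
import pickle
from pathlib import Path

MODEL_DIR = Path("model")  # relative to the directory the app is launched from


def classify_message(message, vectorizer, model, spam_label=1):
    """Return "Spam" or "Ham" for one message.

    `spam_label` is the value the model emits for spam; 1 is an assumption.
    """
    features = vectorizer.transform([message])
    return "Spam" if model.predict(features)[0] == spam_label else "Ham"


def render_app():
    """Draw the page; call this at the bottom of the script Streamlit runs."""
    import streamlit as st  # imported here so classify_message has no UI dependency

    with open(MODEL_DIR / "vectorizer.pkl", "rb") as f:
        vectorizer = pickle.load(f)
    with open(MODEL_DIR / "spam_model.pkl", "rb") as f:
        model = pickle.load(f)

    st.title("Spam Email Classifier")
    message = st.text_area("Enter an SMS or email message")
    if st.button("Predict"):
        if not message.strip():
            st.warning("Please enter a message first.")
        elif classify_message(message, vectorizer, model) == "Spam":
            st.error("🚨 Spam")
        else:
            st.success("✅ Ham")
```

Keeping `classify_message` separate from the UI code makes the prediction logic easy to test without starting Streamlit.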

💻 Example Usage

Input

Congratulations!

You have won ₹5,00,000.

Click here to claim your prize immediately.

Prediction

🚨 Spam

Input

Hi Harsh,

Let's meet tomorrow at 10 AM in the library.

Prediction

✅ Ham

🚀 Future Scope

Deploy the application on Streamlit Community Cloud
Add multilingual spam detection
Implement Deep Learning models (LSTM/Transformer)
Integrate email APIs for real-time email filtering
Containerize the application using Docker
Build a REST API using FastAPI or Flask
Add explainable AI techniques (SHAP/LIME) to interpret predictions

🤝 Contributing

Contributions, issues, and feature requests are welcome.

If you'd like to improve this project:

Fork the repository
Create a new feature branch
Commit your changes
Open a Pull Request

📜 License

This project is licensed under the MIT License.

👨‍💻 Connect with Me

Harsh Vardhan Chaudhary

🎓 B.Tech – Computer Science & Engineering, SRM Institute of Science & Technology

GitHub: https://github.com/harshvardhan4096
LinkedIn: https://www.linkedin.com/in/connect-harsh-vardhan/

⭐ Support

If you found this project useful, please consider giving it a ⭐ on GitHub.

It helps others discover the project and motivates further improvements.
