harshvardhan4096/Spam_Email_Classifier

📧 Spam Email Classifier using Machine Learning

A Machine Learning and Natural Language Processing (NLP) project that automatically classifies SMS and email messages as Spam or Ham (not spam).

The project cleans and preprocesses the message text, converts it to TF-IDF features, and compares three machine learning algorithms: Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine (SVM). Based on model evaluation, the SVM achieved the best overall performance and was selected for deployment in the Streamlit web application.

🎥 Project Demo

The application provides a simple interface where users can enter an SMS or email message and instantly receive a prediction indicating whether the message is Spam or Ham.

Home Page

[Image: Application]

🚀 Features

  • 📩 Spam & Ham Message Classification
  • 🧹 NLP-based Text Cleaning and Preprocessing
  • 🔤 TF-IDF Feature Extraction
  • 🤖 Comparison of Multiple Machine Learning Models (Multinomial Naive Bayes, Logistic Regression & Support Vector Machine)
  • 🏆 Automatic Selection of the Best Performing Model
  • 📊 Exploratory Data Analysis (EDA)
  • ☁️ Word Cloud Visualization
  • 📈 Confusion Matrix and Performance Evaluation
  • 🌐 Interactive Streamlit Web Application
  • 💾 Model Serialization using Pickle
  • ⚡ Real-Time Spam Prediction

📂 Project Structure

Spam_Email_Classifier/
│
├── app/
│   └── app.py
│
├── dataset/
│   └── SMSSpamCollection
│
├── model/
│   ├── spam_model.pkl
│   └── vectorizer.pkl
│
├── notebook/
│   └── spam_classifier.ipynb
│
├── output/
│   ├── app_home.png
│   ├── spam_prediction.png
│   ├── ham_prediction.png
│   ├── spam_vs_ham.png
│   ├── character_distribution.png
│   ├── word_distribution.png
│   ├── correlation_heatmap.png
│   ├── spam_wordcloud.png
│   ├── ham_wordcloud.png
│   └── confusion_matrix.png
│
├── README.md
├── requirements.txt
├── .gitignore
└── LICENSE

📊 Dataset

Dataset Name

SMS Spam Collection Dataset

Source

UCI Machine Learning Repository

Dataset Summary

  • Total Messages: 5,572

  • Classes:

    • Spam
    • Ham
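
As a rough illustration of the data loading step, the sketch below reads a file laid out like the UCI SMS Spam Collection: one message per line, with the label and the text separated by a tab and no header row. The column names `label`, `message`, and `target`, and the function name `load_sms_dataset`, are assumptions made for this example and are not taken from the project notebook.

```python
import csv
import io

import pandas as pd


def load_sms_dataset(source):
    """Load a tab-separated label/message file into a DataFrame.

    `source` may be a file path (for example "dataset/SMSSpamCollection")
    or any file-like object. The column names are our own choice because
    the raw file has no header row.
    """
    df = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["label", "message"],
        quoting=csv.QUOTE_NONE,  # messages may contain stray quote characters
    )
    # Map the text labels to integers: ham -> 0, spam -> 1.
    df["target"] = df["label"].map({"ham": 0, "spam": 1})
    return df


# Tiny in-memory sample in the same layout, so the sketch runs without the file.
sample = io.StringIO(
    "ham\tOk, see you at the library tomorrow\n"
    "spam\tFree entry! Text WIN to claim your prize now\n"
)
df = load_sms_dataset(sample)
print(df[["label", "target"]])
```

Passing `"dataset/SMSSpamCollection"` instead of the in-memory sample would load the real file, assuming it keeps this tab-separated layout.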

🔄 Machine Learning Workflow

The complete workflow followed in this project is:

  1. Dataset Collection

  2. Data Loading

  3. Data Cleaning

  4. Exploratory Data Analysis (EDA)

  5. Text Preprocessing

    • Lowercase Conversion
    • Tokenization
    • Removing Special Characters
    • Removing Stopwords
    • Stemming
  6. Feature Extraction using TF-IDF

  7. Train-Test Split

  8. Model Training

  9. Model Evaluation

  10. Saving the Model

  11. Streamlit Deployment
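
The text preprocessing sub-steps in step 5 can be sketched without any third-party dependency. This is only an approximation of the idea: the project lists NLTK among its technologies, so the tiny `STOPWORDS` set and the `crude_stem` helper below are hypothetical stand-ins for NLTK's English stopword list and a real stemmer such as `PorterStemmer`.

```python
import re

# Tiny illustrative stopword set (an assumption); NLTK's English list is far larger.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "you", "have", "at", "in", "your", "here"}

# Crude suffix stripping used as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ly", "ed", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    text = text.lower()                                  # lowercase conversion
    tokens = re.findall(r"[a-z0-9]+", text)              # tokenization; special characters are dropped
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword removal
    return " ".join(crude_stem(t) for t in tokens)       # stemming


print(preprocess("Click here to claim your prize immediately!"))
# prints: click claim prize immediate
```

With NLTK installed and its stopword corpus downloaded, `nltk.corpus.stopwords.words("english")` and `nltk.stem.PorterStemmer().stem` could replace the two stand-ins; the stems would then differ from the ones shown here.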


🛠 Technologies Used

| Category | Technologies |
|---|---|
| Programming Language | Python |
| Data Manipulation | Pandas, NumPy |
| Machine Learning | Scikit-learn |
| NLP | NLTK |
| Visualization | Matplotlib, Seaborn |
| Web Framework | Streamlit |
| Model Serialization | Pickle |

📈 Model Comparison

The performance of three machine learning algorithms was evaluated using Accuracy, Precision, Recall, and F1-Score.

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Multinomial Naive Bayes | 97% | 100% | 81% | 90% |
| Logistic Regression | 96% | 96% | 74% | 83% |
| Support Vector Machine (SVM) | 98% | 98% | 98% | 98% |
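
A comparison of this kind could be set up roughly as follows with scikit-learn. This is a minimal sketch, not the notebook's code: the function name `compare_models`, the split ratio, the random seed, the linear SVM kernel, and every other hyperparameter are assumptions, and the toy messages exist only to make the call runnable. It will not reproduce the figures in the table above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC


def compare_models(messages, labels, test_size=0.2, random_state=42):
    """Fit three classifiers on TF-IDF features and return their test metrics.

    `labels` uses 1 for spam and 0 for ham. Metrics are for the spam class.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        messages, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    # Fit the vectorizer on the training split only, so no test vocabulary leaks in.
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    models = {
        "Multinomial Naive Bayes": MultinomialNB(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Support Vector Machine (SVM)": SVC(kernel="linear"),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train_vec, y_train)
        predicted = model.predict(X_test_vec)
        results[name] = {
            "accuracy": accuracy_score(y_test, predicted),
            "precision": precision_score(y_test, predicted, zero_division=0),
            "recall": recall_score(y_test, predicted, zero_division=0),
            "f1": f1_score(y_test, predicted, zero_division=0),
        }
    return results


# Toy data only, to show the call; meaningful numbers need the full dataset.
toy_messages = [
    "win a free prize now", "claim your cash reward today",
    "free entry to win tickets", "urgent you won a lottery",
    "cheap loans click now", "congratulations you won cash",
    "free ringtone text now", "winner claim your free gift",
    "see you at the library", "are we meeting tomorrow",
    "call me when you get home", "lunch at noon sounds good",
    "thanks for the notes", "i will be late tonight",
    "happy birthday my friend", "can you send the report",
]
toy_labels = [1] * 8 + [0] * 8

results = compare_models(toy_messages, toy_labels, test_size=0.25)
for name, scores in results.items():
    print(name, {metric: round(value, 2) for metric, value in scores.items()})
```

Fitting the vectorizer on the training split alone is a deliberate choice here; whether the notebook does the same is not stated in this README.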

πŸ† Best Model

The Support Vector Machine (SVM) achieved the highest overall performance with an accuracy of 98% and was selected as the final model for deployment in the Streamlit application.
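
The project structure lists `model/spam_model.pkl` and `model/vectorizer.pkl`. One way such files might be written and read back with Pickle is sketched below. The helper names `save_artifacts` and `load_artifacts` are hypothetical, and the stand-in dictionaries take the place of the fitted classifier and vectorizer only so the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(model, vectorizer, model_dir="model"):
    """Pickle the trained model and its fitted vectorizer side by side."""
    out = Path(model_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "spam_model.pkl", "wb") as fh:
        pickle.dump(model, fh)
    with open(out / "vectorizer.pkl", "wb") as fh:
        pickle.dump(vectorizer, fh)


def load_artifacts(model_dir="model"):
    """Load both artifacts; only unpickle files that come from a trusted source."""
    src = Path(model_dir)
    with open(src / "spam_model.pkl", "rb") as fh:
        model = pickle.load(fh)
    with open(src / "vectorizer.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    return model, vectorizer


# Round trip with stand-in objects inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    save_artifacts({"kind": "model"}, {"kind": "vectorizer"}, model_dir=tmp)
    restored_model, restored_vectorizer = load_artifacts(tmp)

print(restored_model, restored_vectorizer)
```

Saving the vectorizer together with the model matters because new messages must be transformed with the same vocabulary and weights the model was trained on.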

🎯 Results

  • Successfully classified SMS and email messages into Spam and Ham.
  • Compared three supervised machine learning algorithms.
  • Achieved 98% overall accuracy using the Support Vector Machine (SVM).
  • Developed an interactive Streamlit web application for real-time spam detection.
  • Implemented a complete NLP pipeline including text preprocessing, TF-IDF vectorization, model training, evaluation, and deployment.

📊 Model Comparison

The following chart compares the performance of the evaluated machine learning models.

[Image: Model Comparison]

📷 Project Outputs

🏠 Streamlit Home Page

[Image: Application]

🚨 Spam Prediction

[Image: Spam Prediction]

✅ Ham Prediction

[Image: Ham Prediction]

📊 Spam vs Ham Distribution

[Image: Distribution]

🔤 Character Distribution

[Image: Character Distribution]

📝 Word Distribution

[Image: Word Distribution]

🔥 Correlation Heatmap

[Image: Heatmap]

☁️ Spam Word Cloud

[Image: Spam WordCloud]

☁️ Ham Word Cloud

[Image: Ham WordCloud]


📉 Confusion Matrix

[Image: Confusion Matrix]
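
For readers less familiar with how Accuracy, Precision, Recall, and F1-Score follow from a confusion matrix, the sketch below derives them from the four cell counts, treating spam as the positive class. The counts in the example call are hypothetical and chosen only to keep the arithmetic simple; they are not this project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the four metrics from confusion-matrix counts.

    tp: spam predicted as spam     fp: ham predicted as spam
    fn: spam predicted as ham      tn: ham predicted as ham
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not taken from the project's confusion matrix.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print({name: round(value, 3) for name, value in metrics.items()})
```

With these made-up counts, precision is 0.90 and recall is 0.75, which gives an F1-Score of about 0.82. The example shows how a model can score high on accuracy while still missing a noticeable share of spam.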


⚠️ Challenges Faced

  • Cleaning noisy SMS text data.
  • Removing stopwords while preserving useful information.
  • Selecting the best-performing classification model.
  • Saving trained models for deployment.
  • Building a responsive Streamlit application.

📚 Learning Outcomes

This project helped me gain practical experience in:

  • Natural Language Processing (NLP)
  • Text preprocessing techniques
  • TF-IDF Vectorization
  • Machine Learning model training and evaluation
  • Model comparison and selection
  • Streamlit web application development
  • Model serialization using Pickle
  • Git and GitHub for project version control

⚙️ Installation Guide

Clone the Repository

git clone https://github.com/harshvardhan4096/Spam_Email_Classifier.git

Navigate to the Project Folder

cd Spam_Email_Classifier

Install Dependencies

pip install -r requirements.txt

Run the Streamlit Application

streamlit run app/app.py
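
The contents of `app/app.py` are not shown in this README. As a rough idea of what a Streamlit front end for the saved model could look like, here is a minimal sketch. It assumes the two pickle files from the project structure, that class `1` means spam, and that any text preprocessing used in training is applied before vectorizing. The helper names `load_pickle`, `predict_label`, and `main` are hypothetical.

```python
import pickle


def load_pickle(path):
    """Read one pickled object; only load files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def predict_label(text, vectorizer, model):
    """Return "Spam" or "Ham" for one message, assuming class 1 means spam."""
    features = vectorizer.transform([text])
    return "Spam" if model.predict(features)[0] == 1 else "Ham"


def main():
    # Imported here so the helpers above can be used without Streamlit installed.
    import streamlit as st

    # Paths are relative to the project folder, matching the project structure.
    vectorizer = load_pickle("model/vectorizer.pkl")
    model = load_pickle("model/spam_model.pkl")

    st.title("Spam Email Classifier")
    message = st.text_area("Enter an SMS or email message")
    if st.button("Predict"):
        if not message.strip():
            st.warning("Please enter a message first.")
        elif predict_label(message, vectorizer, model) == "Spam":
            st.error("Spam")
        else:
            st.success("Ham")
```

A script like this would need a final `main()` call at the bottom before it could be launched with the `streamlit run` command above.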

💻 Example Usage

Input

Congratulations!

You have won ₹5,00,000.

Click here to claim your prize immediately.

Prediction

🚨 Spam

Input

Hi Harsh,

Let's meet tomorrow at 10 AM in the library.

Prediction

✅ Ham

🚀 Future Scope

  • Deploy the application on Streamlit Community Cloud
  • Add multilingual spam detection
  • Implement Deep Learning models (LSTM/Transformer)
  • Integrate email APIs for real-time email filtering
  • Containerize the application using Docker
  • Build a REST API using FastAPI or Flask
  • Add explainable AI techniques (SHAP/LIME) to interpret predictions

🀝 Contributing

Contributions, issues, and feature requests are welcome.

If you'd like to improve this project:

  1. Fork the repository
  2. Create a new feature branch
  3. Commit your changes
  4. Open a Pull Request

📜 License

This project is licensed under the MIT License.


👨‍💻 Connect with Me

Harsh Vardhan Chaudhary

🎓 B.Tech – Computer Science & Engineering, SRM Institute of Science & Technology


⭐ Support

If you found this project useful, please consider giving it a ⭐ on GitHub.

It helps others discover the project and motivates further improvements.
