A Machine Learning and Natural Language Processing (NLP) project that automatically classifies SMS and Email messages as Spam or Ham (Not Spam).
The project applies NLP text preprocessing and TF-IDF vectorization, then compares three machine learning algorithms: Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine (SVM). The SVM achieved the best overall performance in evaluation and was selected for deployment in the Streamlit web application.
The application provides a simple interface where users can enter an SMS or email message and instantly receive a prediction indicating whether the message is Spam or Ham.
Home Page
- Spam & Ham Message Classification
- NLP-based Text Cleaning and Preprocessing
- TF-IDF Feature Extraction
- Comparison of Multiple Machine Learning Models (Multinomial Naive Bayes, Logistic Regression & Support Vector Machine)
- Automatic Selection of the Best Performing Model
- Exploratory Data Analysis (EDA)
- Word Cloud Visualization
- Confusion Matrix and Performance Evaluation
- Interactive Streamlit Web Application
- Model Serialization using Pickle
- Real-Time Spam Prediction
Spam_Email_Classifier/
│
├── app/
│   └── app.py
│
├── dataset/
│   └── SMSSpamCollection
│
├── model/
│   ├── spam_model.pkl
│   └── vectorizer.pkl
│
├── notebook/
│   └── spam_classifier.ipynb
│
├── output/
│   ├── app_home.png
│   ├── spam_prediction.png
│   ├── ham_prediction.png
│   ├── spam_vs_ham.png
│   ├── character_distribution.png
│   ├── word_distribution.png
│   ├── correlation_heatmap.png
│   ├── spam_wordcloud.png
│   ├── ham_wordcloud.png
│   └── confusion_matrix.png
│
├── README.md
├── requirements.txt
├── .gitignore
└── LICENSE
Dataset Name
SMS Spam Collection Dataset
Source
UCI Machine Learning Repository
Dataset Summary
- Total Messages: 5,572
- Classes:
  - Spam
  - Ham
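As a rough illustration of the data loading step: the UCI distribution of this dataset stores one message per line as a label, a tab, and the message text, with no header row. The sketch below parses that layout with the standard library only. The function name `load_messages` and the three sample lines are made up for the example and are not taken from the project or the dataset.

```python
import csv
import io
from collections import Counter


def load_messages(handle):
    """Parse tab-separated 'label<TAB>message' lines into (label, message) pairs."""
    # QUOTE_NONE keeps quotation marks inside messages as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if len(row) == 2]


# Made-up sample lines in the same layout as the dataset file.
sample = io.StringIO(
    "ham\tSee you at the library at 10\n"
    "spam\tWin a free prize now, click here\n"
    "ham\tRunning late, start without me\n"
)

rows = load_messages(sample)
counts = Counter(label for label, _ in rows)
print(counts)
```

For the real file, the same function can be called on `open("dataset/SMSSpamCollection", encoding="utf-8", newline="")`. With pandas, `pd.read_csv(path, sep="\t", header=None, names=["label", "message"])` reads the same layout into a DataFrame.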
The complete workflow followed in this project is:
- Dataset Collection
- Data Loading
- Data Cleaning
- Exploratory Data Analysis (EDA)
- Text Preprocessing
  - Lowercase Conversion
  - Tokenization
  - Removing Special Characters
  - Removing Stopwords
  - Stemming
- Feature Extraction using TF-IDF
- Train-Test Split
- Model Training
- Model Evaluation
- Saving the Model
- Streamlit Deployment
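The text preprocessing steps listed above could look roughly like the sketch below. It is not the notebook's actual code. The function name `preprocess`, the tiny `DEMO_STOPWORDS` set, and the identity default for `stem` are assumptions chosen so the example runs without downloading NLTK data.

```python
import re

# Tiny illustrative stopword set (assumption); the project uses NLTK instead.
DEMO_STOPWORDS = {"a", "an", "the", "is", "to", "you", "have", "in", "at", "and"}


def preprocess(text, stopwords=DEMO_STOPWORDS, stem=lambda token: token):
    """Lowercase, tokenize, strip special characters, drop stopwords, stem."""
    # 1. Lowercase conversion and whitespace tokenization.
    tokens = text.lower().split()
    # 2. Remove special characters, discarding tokens that become empty.
    tokens = [re.sub(r"[^a-z0-9]", "", token) for token in tokens]
    tokens = [token for token in tokens if token]
    # 3. Remove stopwords.
    tokens = [token for token in tokens if token not in stopwords]
    # 4. Stem each remaining token (identity by default in this sketch).
    return " ".join(stem(token) for token in tokens)


cleaned = preprocess("Congratulations! You have won a prize.")
print(cleaned)
```

To use NLTK as the tech stack suggests, pass `stem=nltk.stem.PorterStemmer().stem` and `stopwords=set(nltk.corpus.stopwords.words("english"))`. The stopword list requires a one-time `nltk.download("stopwords")`.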
| Category | Technologies |
|---|---|
| Programming Language | Python |
| Data Manipulation | Pandas, NumPy |
| Machine Learning | Scikit-learn |
| NLP | NLTK |
| Visualization | Matplotlib, Seaborn |
| Web Framework | Streamlit |
| Model Serialization | Pickle |
The performance of three machine learning algorithms was evaluated using Accuracy, Precision, Recall, and F1-Score.
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Multinomial Naive Bayes | 97% | 100% | 81% | 90% |
| Logistic Regression | 96% | 96% | 74% | 83% |
| Support Vector Machine (SVM) | 98% | 98% | 98% | 98% |
The Support Vector Machine (SVM) achieved the highest overall performance with an accuracy of 98% and was selected as the final model for deployment in the Streamlit application.
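As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values reported above. The helper name `f1_from` is hypothetical. Small differences from the table are expected because the inputs are already rounded, and averaging over classes (if that was used) would also shift the numbers slightly.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall values copied from the table above.
reported = {
    "Multinomial Naive Bayes": (1.00, 0.81),
    "Logistic Regression": (0.96, 0.74),
    "Support Vector Machine (SVM)": (0.98, 0.98),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_from(precision, recall):.3f}")
```

The Naive Bayes row shows why accuracy alone can mislead here: perfect precision with 81% recall means no ham was flagged as spam, but roughly one in five spam messages got through.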
- Successfully classified SMS and email messages into Spam and Ham.
- Compared three supervised machine learning algorithms.
- Achieved 98% overall accuracy using the Support Vector Machine (SVM).
- Developed an interactive Streamlit web application for real-time spam detection.
- Implemented a complete NLP pipeline including text preprocessing, TF-IDF vectorization, model training, evaluation, and deployment.
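A minimal sketch of the comparison step, assuming scikit-learn is installed. The toy messages, the 75/25 split, the linear SVM kernel, the default hyperparameters, and picking the winner by F1 on the spam class are all assumptions for illustration. The notebook's actual settings may differ, and scores on 16 toy messages say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Toy stand-in for the preprocessed messages (not real dataset rows).
spam = [
    "win free prize now", "claim your cash reward today",
    "free entry win cash", "urgent claim prize click link",
    "congratulations you won free tickets", "cheap loan offer click now",
    "win money free offer", "exclusive prize claim now",
]
ham = [
    "see you at the library tomorrow", "can we meet at ten",
    "running late start without me", "lunch at noon today",
    "call me when you get home", "thanks for the notes",
    "are you coming to class", "let us meet after the lecture",
]
texts = spam + ham
labels = ["spam"] * len(spam) + ["ham"] * len(ham)

# Hold out a stratified test set so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Fit TF-IDF on the training text only, then reuse it for the test text.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
    "SVM (linear kernel)": SVC(kernel="linear"),
}

results = {}
for name, model in models.items():
    model.fit(X_train_vec, y_train)
    pred = model.predict(X_test_vec)
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, pos_label="spam", zero_division=0),
        "recall": recall_score(y_test, pred, pos_label="spam", zero_division=0),
        "f1": f1_score(y_test, pred, pos_label="spam", zero_division=0),
    }

# Assumed selection rule: highest F1 on the spam class.
best_name = max(results, key=lambda n: results[n]["f1"])
print(best_name, results[best_name])
```

Fitting the vectorizer on the training split only keeps test vocabulary and document frequencies from leaking into training.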
The following chart compares the performance of the evaluated machine learning models.
- Cleaning noisy SMS text data.
- Removing stopwords while preserving useful information.
- Selecting the best-performing classification model.
- Saving trained models for deployment.
- Building a responsive Streamlit application.
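For the model-saving challenge, one point worth illustrating is that the fitted vectorizer has to be saved alongside the model, because new messages must be transformed with the same vocabulary. The sketch below assumes scikit-learn and uses the file names from the project tree (`spam_model.pkl`, `vectorizer.pkl`). The toy training data, the temporary directory, and the helper `load_and_predict` are hypothetical and not taken from `app.py`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy training data (assumption), standing in for the preprocessed dataset.
train_texts = [
    "win free prize now", "claim cash reward today", "free entry win cash",
    "see you at the library", "lunch at noon today", "call me when home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vectorizer = TfidfVectorizer()
model = SVC(kernel="linear")
model.fit(vectorizer.fit_transform(train_texts), train_labels)

# Save both artifacts; a temporary directory stands in for model/.
model_dir = Path(tempfile.mkdtemp())
with open(model_dir / "vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open(model_dir / "spam_model.pkl", "wb") as f:
    pickle.dump(model, f)


def load_and_predict(message: str, directory: Path) -> str:
    """Load the saved vectorizer and model, then classify one message."""
    with open(directory / "vectorizer.pkl", "rb") as f:
        loaded_vectorizer = pickle.load(f)
    with open(directory / "spam_model.pkl", "rb") as f:
        loaded_model = pickle.load(f)
    return str(loaded_model.predict(loaded_vectorizer.transform([message]))[0])


message = "claim your free prize"
print(load_and_predict(message, model_dir))
```

Pickle files should only be loaded from trusted sources, and they are best loaded with the same scikit-learn version that produced them.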
This project helped me gain practical experience in:
- Natural Language Processing (NLP)
- Text preprocessing techniques
- TF-IDF Vectorization
- Machine Learning model training and evaluation
- Model comparison and selection
- Streamlit web application development
- Model serialization using Pickle
- Git and GitHub for project version control
git clone https://github.com/harshvardhan4096/Spam_Email_Classifier.git
cd Spam_Email_Classifier
pip install -r requirements.txt
streamlit run app/app.py
Input
Congratulations!
You have won ₹5,00,000.
Click here to claim your prize immediately.
Prediction
Spam
Input
Hi Harsh,
Let's meet tomorrow at 10 AM in the library.
Prediction
Ham
- Deploy the application on Streamlit Community Cloud
- Add multilingual spam detection
- Implement Deep Learning models (LSTM/Transformer)
- Integrate email APIs for real-time email filtering
- Containerize the application using Docker
- Build a REST API using FastAPI or Flask
- Add explainable AI techniques (SHAP/LIME) to interpret predictions
Contributions, issues, and feature requests are welcome.
If you'd like to improve this project:
- Fork the repository
- Create a new feature branch
- Commit your changes
- Open a Pull Request
This project is licensed under the MIT License.
Harsh Vardhan Chaudhary
B.Tech, Computer Science & Engineering, SRM Institute of Science & Technology
- GitHub: https://github.com/harshvardhan4096
- LinkedIn: https://www.linkedin.com/in/connect-harsh-vardhan/
If you found this project useful, please consider giving it a star on GitHub.
It helps others discover the project and motivates further improvements.










