A simple sentiment analysis project using the IMDB movie review dataset. It classifies text as Positive or Negative using TF‑IDF + Logistic Regression.
- Python (recommended: 3.10+)
- Git (optional; only needed if you want to clone the repository with git clone)

    git clone https://github.com/JoeWat2005/nlp-text-classifier.git
    cd nlp-text-classifier

Create a virtual environment:

    python -m venv env

Windows (PowerShell):

    .\env\Scripts\Activate.ps1

If PowerShell blocks activation, run this once (in the same PowerShell window) and try again:

    Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass

Windows (CMD):

    env\Scripts\activate

macOS / Linux:

    source env/bin/activate

Install the dependencies:

    python -m pip install --upgrade pip
    python -m pip install -r requirements.txt

The dependency list is pinned in requirements.txt (includes scikit-learn, datasets, pandas, matplotlib, seaborn, etc.).

Downloads the IMDB dataset from Hugging Face and writes it to data/data.csv.

    python download_data.py

Trains a TF‑IDF vectorizer + Logistic Regression model, prints evaluation metrics, and saves artifacts under models/.

    python train.py

Loads models/model.pkl + models/vectorizer.pkl and lets you type sentences to classify.

    python predict.py

Type quit to exit.
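
For orientation, the train-then-predict flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in train.py or predict.py. The toy sentences, the 1 = Positive / 0 = Negative label encoding, the use of pickle for the .pkl files, the temporary output directory, and the classify helper are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for data/data.csv; the real project trains on IMDB reviews.
texts = [
    "a wonderful, moving film with great acting",
    "I loved every minute of this movie",
    "brilliant story and a fantastic cast",
    "a dull, boring waste of time",
    "terrible plot and awful acting",
    "I hated this movie, it was bad",
]
labels = [1, 1, 1, 0, 0, 0]  # assumed encoding: 1 = Positive, 0 = Negative

# Training step: fit the TF-IDF vocabulary, then fit the classifier on it.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
model = LogisticRegression()
model.fit(features, labels)

# Save both artifacts; a temporary directory stands in for models/ here.
out_dir = Path(tempfile.mkdtemp())
with open(out_dir / "model.pkl", "wb") as f:
    pickle.dump(model, f)
with open(out_dir / "vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

# Prediction step: load the artifacts back, as a separate script would.
with open(out_dir / "model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
with open(out_dir / "vectorizer.pkl", "rb") as f:
    loaded_vectorizer = pickle.load(f)


def classify(sentence, model, vectorizer):
    """Hypothetical helper: label one sentence as Positive or Negative."""
    # transform (not fit_transform) reuses the vocabulary learned in training.
    sentence_features = vectorizer.transform([sentence])
    return "Positive" if model.predict(sentence_features)[0] == 1 else "Negative"


print(classify("what a great movie", loaded_model, loaded_vectorizer))
```

The vectorizer is saved alongside the model because new sentences must be transformed with the same vocabulary and weights that were fitted during training. An interactive version would call a helper like classify inside a loop that reads input until the user types quit.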
- data/: downloaded dataset (data.csv)
- models/: saved model/vectorizer + plots
- utils/: helper modules
If you see an error about saving plots into models/, create the directory and rerun:

    mkdir models
    python train.py

download_data.py requires an internet connection and the Hugging Face datasets package.
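
Both issues above can also be checked from Python before running the scripts. The following is a small sketch using only the standard library. The helper names ensure_dir and has_package are hypothetical and are not part of this repository.

```python
import importlib.util
from pathlib import Path


def ensure_dir(path):
    """Create the directory (and any parents) if missing; do nothing if it exists."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def has_package(name):
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(name) is not None


# Equivalent to "mkdir models", but safe to run repeatedly.
models_dir = ensure_dir("models")

# download_data.py needs the Hugging Face datasets package.
if not has_package("datasets"):
    print("Missing 'datasets'; run: python -m pip install -r requirements.txt")
```

Calling something like ensure_dir at the point where plots are saved would make the manual mkdir step unnecessary.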
MIT License