JoeWat2005/nlp-text-classifier

NLP Text Classifier

A simple sentiment analysis project using the IMDB movie review dataset. It classifies text as Positive or Negative using TF‑IDF + Logistic Regression.

Prerequisites

  • Python (recommended: 3.10+)
  • Git (optional; only needed to clone the repository with git clone)

Setup (clone + virtual environment)

1) Clone the repository

git clone https://github.com/JoeWat2005/nlp-text-classifier.git
cd nlp-text-classifier

2) Create a virtual environment called env

python -m venv env

3) Activate the environment

Windows (PowerShell):

.\env\Scripts\Activate.ps1

If PowerShell blocks activation, run this once (in the same PowerShell window) and try again:

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass

Windows (CMD):

env\Scripts\activate

macOS / Linux:

source env/bin/activate
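
To confirm that activation worked, a short Python check can help. This is a minimal sketch that relies only on the standard library; the helper name in_virtualenv is made up for illustration and is not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

If this prints False after activation, the shell is probably still using the system interpreter.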

4) Install dependencies

python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Dependencies are pinned in requirements.txt, which includes scikit-learn, datasets, pandas, matplotlib, and seaborn, among others.

Run the project

1) Download the dataset

Downloads the IMDB dataset from Hugging Face and writes it to data/data.csv.

python download_data.py
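
For orientation, the download step could look roughly like the sketch below. It is not the actual contents of download_data.py. The function names, the "imdb" dataset identifier, the choice to merge the train and test splits, and the text/label column layout are all assumptions.

```python
import csv
from pathlib import Path


def write_rows(rows, path):
    """Write an iterable of {"text", "label"} dicts to a CSV file; return the row count."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing
    count = 0
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["text", "label"])
        writer.writeheader()
        for row in rows:
            writer.writerow({"text": row["text"], "label": row["label"]})
            count += 1
    return count


def download_imdb(path="data/data.csv"):
    """Fetch IMDB from Hugging Face and write it to a CSV (needs network access)."""
    # Imported lazily so write_rows stays usable without the datasets package.
    from datasets import load_dataset

    dataset = load_dataset("imdb")  # assumed dataset identifier
    # Assumption: the labelled train and test splits are merged into one file.
    rows = (row for split in ("train", "test") for row in dataset[split])
    return write_rows(rows, path)
```

Calling download_imdb() would then produce data/data.csv with one review per row.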

2) Train the model

Trains a TF‑IDF vectorizer + Logistic Regression model, prints evaluation metrics, and saves artifacts under models/.

python train.py
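
The training step can be pictured with the following minimal sketch. It assumes scikit-learn defaults and plain pickle files; the real train.py may use different hyperparameters, a train/test split, and plotting code that is not shown here. The function names train_and_save and evaluate are hypothetical.

```python
import pickle
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def train_and_save(texts, labels, out_dir="models"):
    """Fit TF-IDF + Logistic Regression and pickle both objects into out_dir."""
    vectorizer = TfidfVectorizer()  # default settings; an assumption
    features = vectorizer.fit_transform(texts)

    model = LogisticRegression(max_iter=1000)
    model.fit(features, labels)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # File names follow the ones mentioned in this README.
    with (out_dir / "vectorizer.pkl").open("wb") as handle:
        pickle.dump(vectorizer, handle)
    with (out_dir / "model.pkl").open("wb") as handle:
        pickle.dump(model, handle)
    return model, vectorizer


def evaluate(model, vectorizer, texts, labels):
    """Return accuracy on held-out texts, reusing the already fitted vectorizer."""
    predictions = model.predict(vectorizer.transform(texts))
    return accuracy_score(labels, predictions)
```

The vectorizer is fitted on the training texts only and then reused for evaluation, so the held-out reviews do not influence the vocabulary or the IDF weights.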

3) Make predictions (interactive)

Loads models/model.pkl + models/vectorizer.pkl and lets you type sentences to classify.

python predict.py

Type quit to exit.
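
A rough sketch of such an interactive loop is shown below. It is not the actual predict.py. It assumes the artifacts were written with pickle and that labels follow the IMDB convention (0 = negative, 1 = positive); the names load_artifacts, classify, and run_repl are hypothetical.

```python
import pickle
from pathlib import Path

# Assumption: the model was trained on IMDB labels, where 0 = negative, 1 = positive.
LABELS = {0: "Negative", 1: "Positive"}


def load_artifacts(model_dir="models"):
    """Load the pickled model and vectorizer from model_dir."""
    model_dir = Path(model_dir)
    with (model_dir / "model.pkl").open("rb") as handle:
        model = pickle.load(handle)
    with (model_dir / "vectorizer.pkl").open("rb") as handle:
        vectorizer = pickle.load(handle)
    return model, vectorizer


def classify(text, model, vectorizer):
    """Vectorize one sentence and map the predicted class to a readable label."""
    features = vectorizer.transform([text])
    return LABELS[int(model.predict(features)[0])]


def run_repl(model, vectorizer, read=input, write=print):
    """Classify sentences until the user types quit; blank lines are ignored."""
    while True:
        text = read("Enter a sentence (or 'quit'): ").strip()
        if text.lower() == "quit":
            break
        if text:
            write(classify(text, model, vectorizer))
```

Running run_repl(*load_artifacts()) starts the loop. Because unpickling can execute arbitrary code, only load model files you created yourself or otherwise trust.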

Folder structure

  • data/ — downloaded dataset (data.csv)
  • models/ — saved model/vectorizer + plots
  • utils/ — helper modules

Troubleshooting

FileNotFoundError: ... models/confusion_matrix.png

If you see an error about saving plots into models/, create the directory and rerun:

mkdir models
python train.py
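
If you prefer to fix this in code, the output directory can be created just before the plot is saved. This is a minimal sketch using the standard library; ensure_parent_dir is a hypothetical helper, and where train.py actually saves its plots is an assumption.

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return it as a Path."""
    path = Path(file_path)
    # exist_ok=True makes this safe to call on every run.
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Possible use before saving a matplotlib figure (not executed here):
# plt.savefig(ensure_parent_dir("models/confusion_matrix.png"))
```

This avoids the manual mkdir step on a fresh clone.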

Dataset download issues

download_data.py requires an internet connection and the Hugging Face datasets package.

License

MIT License
