This repository contains the training and inference code for a Persian/Dari emotion classification project built on top of `HooshvareLab/bert-base-parsbert-uncased`. The model classifies social media text into eight emotion categories: Hope, Happy, Neutral, Surprise, Disgust, Sad, Anger, and Fear.
The codebase has been simplified to use a single training script, `scripts/train.py`, which covers all supported experiment settings through command-line presets.
Key files:

- `scripts/train.py`: main training entrypoint
- `scripts/predict.py`: inference script for loading a trained model and running predictions
- `config/paths.py`: path configuration for datasets, models, and output directories
- `augmentations/fear_augmenter.py`: utilities related to fear-class augmentation
- `utils/dataset_utils.py`: helper functions for dataset preparation
The project requires Python 3.8 or later and the packages listed in `requirements.txt`.
Install dependencies with:
```bash
pip install -r requirements.txt
```

Datasets are not included in the repository. Path resolution is handled in `config/paths.py`, with support for the following environment variables:
- `SENTIMENT_STORAGE_ROOT`: location for saved models, checkpoints, and experiment outputs
- `SENTIMENT_DATA_ROOT`: location of the project datasets
- `SENTIMENT_BASE_PATH`: backward-compatible fallback used by older setups
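As an illustration, resolution of these variables might look like the following minimal sketch. It is not the actual contents of `config/paths.py`, and all keys except `fine_tuned_model` (referenced by `scripts/predict.py` below) are placeholders:

```python
# Illustrative sketch of environment-based path resolution
# (not the actual contents of config/paths.py).
import os
from pathlib import Path

# SENTIMENT_BASE_PATH acts as the backward-compatible fallback for both
# roots; the "." default here is an assumption for the sketch.
_base = os.environ.get("SENTIMENT_BASE_PATH", ".")
STORAGE_ROOT = Path(os.environ.get("SENTIMENT_STORAGE_ROOT", _base))
DATA_ROOT = Path(os.environ.get("SENTIMENT_DATA_ROOT", _base))

PATHS = {
    # Key referenced by scripts/predict.py; the directory name is a placeholder.
    "fine_tuned_model": STORAGE_ROOT / "Models" / "fine_tuned_model",
    # Illustrative dataset key.
    "labeled_4k": DATA_ROOT / "Data" / "processed" / "Labeled_4K.csv",
}
```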
The training script expects the following processed files when using the default configuration:
- `Data/processed/Labeled_4K.csv`
- `Data/processed/Combined_Labeled_Dataset.csv`
- `Data/processed/Combined_Labeled_Dataset_with_fearAug.csv`
All training runs are handled through scripts/train.py.
Basic example:
```bash
python scripts/train.py --mode baseline_4k
```

Supported modes:
- `baseline_4k`: 8-label training on the 4K labeled dataset
- `full_8label`: training on the full labeled dataset with all eight classes
- `full_7label`: training on the full labeled dataset after removing the Fear class
- `full_8label_aug`: training on the augmented full dataset
Example commands:
```bash
python scripts/train.py --mode full_8label
python scripts/train.py --mode full_7label
python scripts/train.py --mode full_8label_aug
python scripts/train.py --mode full_8label_aug --batch-size 16 --max-length 256 --use-dynamic-padding
python scripts/train.py --mode full_8label_aug --batch-size 8 --num-train-epochs 2
```

Padding strategy used in this project:
- `baseline_4k` is run with static padding
- `full_8label`, `full_7label`, and `full_8label_aug` are run with dynamic padding
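For reference, the two strategies map onto the Hugging Face `transformers` API roughly as follows. This is a sketch, not an excerpt from `scripts/train.py`:

```python
# Static vs. dynamic padding with Hugging Face transformers (illustrative).
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

# Static padding: every example is padded to max_length at tokenization time.
static = tokenizer("متن نمونه", padding="max_length", truncation=True, max_length=256)

# Dynamic padding: examples are only truncated here; a collator then pads
# each batch to the length of its longest member at training time.
dynamic = tokenizer("متن نمونه", truncation=True, max_length=256)
collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pass as the Trainer's data_collator
```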
Common optional arguments:
- `--dataset-path`
- `--base-model`
- `--output-dir`
- `--final-model-dir`
- `--batch-size`
- `--num-train-epochs`
- `--learning-rate`
- `--max-length`
- `--use-dynamic-padding`
- `--fp16`
- `--no-fp16`
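These flags correspond to a conventional `argparse` interface. The sketch below shows how such a CLI is typically wired; the defaults are illustrative assumptions, not the script's actual values:

```python
# Sketch of an argparse CLI matching the flags above (defaults are assumptions).
import argparse

parser = argparse.ArgumentParser(description="Train the emotion classifier")
parser.add_argument("--mode", required=True,
                    choices=["baseline_4k", "full_8label", "full_7label", "full_8label_aug"])
parser.add_argument("--dataset-path")
parser.add_argument("--base-model", default="HooshvareLab/bert-base-parsbert-uncased")
parser.add_argument("--output-dir")
parser.add_argument("--final-model-dir")
parser.add_argument("--batch-size", type=int, default=16)
parser.add_argument("--num-train-epochs", type=int, default=3)
parser.add_argument("--learning-rate", type=float, default=2e-5)
parser.add_argument("--max-length", type=int, default=256)
parser.add_argument("--use-dynamic-padding", action="store_true")
parser.add_argument("--fp16", dest="fp16", action="store_true")
parser.add_argument("--no-fp16", dest="fp16", action="store_false")
parser.set_defaults(fp16=False)  # assumed default
args = parser.parse_args()
```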
Recommended commands:
Baseline run with static padding:
```bash
python scripts/train.py --mode baseline_4k
```

Full-dataset run with dynamic padding:

```bash
python scripts/train.py --mode full_8label_aug --batch-size 16 --max-length 256 --use-dynamic-padding
```

Trained models are stored under `Models/`, while experiment-specific runs are written to `outputs/`. Each run saves metadata and evaluation results to make comparisons between experiments easier.
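Since each run directory records metadata and evaluation results, runs can be compared with a short script like the sketch below. The file name `metrics.json` is a hypothetical example; check a run directory for the actual file names it contains:

```python
# Sketch: compare evaluation results across run directories under outputs/.
# "metrics.json" is a hypothetical file name, not a documented one.
import json
from pathlib import Path

for run_dir in sorted(Path("outputs").iterdir()):
    metrics_file = run_dir / "metrics.json"
    if run_dir.is_dir() and metrics_file.is_file():
        metrics = json.loads(metrics_file.read_text())
        print(run_dir.name, metrics)
```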
To run prediction with a trained model:
```bash
python scripts/predict.py
```

By default, `scripts/predict.py` loads the model from `PATHS["fine_tuned_model"]`.
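A trained checkpoint can also be loaded directly with the `transformers` API. The sketch below uses a placeholder model directory standing in for wherever `PATHS["fine_tuned_model"]` points:

```python
# Sketch of direct inference with transformers; the model directory is a
# placeholder for the path stored in PATHS["fine_tuned_model"].
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_dir = "Models/fine_tuned_model"  # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

inputs = tokenizer("متن نمونه برای پیش‌بینی", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_id = int(logits.argmax(dim=-1))
# id2label comes from the saved config if label names were set during training.
print(model.config.id2label.get(predicted_id, predicted_id))
```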
Academic use only.