- Project Description
- Technical Stack
- Setup Instructions
- Data
- Training
- Model Export & Production Preparation
- Inference
This project detects whether a given text was generated by an AI model or written by a human. It uses a BERT-like model with a classification head, trained on a competition dataset of human-written and AI-generated texts.
- Python 3.12
- PyTorch & PyTorch Lightning: For model development and training.
- Transformers: For leveraging pre-trained models like BERT.
- Hydra: For configuration management.
- DVC (Data Version Control): For versioning data and models, with Google Drive as remote storage.
- UV: For Python package management and virtual environments.
- MLflow: For experiment tracking and model serving.
- ONNX: For model serialization and interoperability.
- pre-commit, ruff, prettier: For code quality.
- Git
- Python 3.12
- UV (Python package installer and resolver)
- Access to a Google Drive account (for DVC remote)
- Clone the repository:
git clone git@github.com:Wayfarer123/GeneratedTextDetection-MLOps.git
cd GeneratedTextDetection-MLOps
- Create and activate a virtual environment and install dependencies using UV:
uv venv
source .venv/bin/activate # On Linux/macOS
# .venv\Scripts\activate # On Windows
uv sync
uv pip install -e .
This project uses DVC to manage large data files and models, storing them in Google Drive.
- Prerequisites for DVC:
- You have cloned this repository. DVC is already initialized within it.
- To view the Google Drive folder, follow this link: https://drive.google.com/drive/folders/1hJPpDzv3QcmCxwOobvMko7yEcNW_W4xq?usp=sharing
- Setup steps:
- Obtain the Google Cloud Service Account JSON key file (generatedtextdetection-a38d1a979b00.json) and place it in the repository root. To get the file and access to the data, message https://t.me/Nikita_Okhotnikov
- To download the data and any DVC-tracked models, run the following command from the project root directory:
dvc pull
- Alternatively:
- Create your own Google Cloud account.
- Create a new project and a service account in it.
- Download your own JSON key file and set gdrive_service_account_json_file_path in .dvc/config to its path.
Install pre-commit hooks to ensure code quality before committing:
pre-commit install
Now, ruff and prettier will run automatically on staged files when you git commit.
The raw data is a sample of texts with corresponding labels. The files are expected in data/raw/ after dvc pull and consist of:
- subtaskA_train_monolingual.jsonl: Training data.
- subtaskA_dev_monolingual.jsonl: Validation/development data.
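The .jsonl extension indicates JSON Lines: one JSON object per line. As a minimal sketch of reading such a file, the snippet below writes a tiny stand-in file and loads it back. The field names `text` and `label` are assumptions for illustration; inspect the real files in data/raw/ to confirm their schema.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Stand-in rows; the "text" and "label" field names are assumptions,
# not the confirmed schema of the competition files.
sample_rows = [
    {"text": "A sentence written by a person.", "label": 0},
    {"text": "A sentence produced by a model.", "label": 1},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.jsonl"
    sample_path.write_text(
        "\n".join(json.dumps(row) for row in sample_rows) + "\n",
        encoding="utf-8",
    )
    records = read_jsonl(sample_path)

print(len(records), [row["label"] for row in records])
```

To read the real data, the same helper could be pointed at data/raw/subtaskA_train_monolingual.jsonl once dvc pull has completed.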
To train the model:
- Ensure the MLflow Tracking Server is running: The configuration points to http://127.0.0.1:8080. You can start a local MLflow server:
mlflow server --host 127.0.0.1 --port 8080 --backend-store-uri file:./mlruns --default-artifact-root file:./mlruns
MLflow will log to a local mlruns directory.
- Run the training script: The script uses Hydra for configuration. You can override parameters from the command line.
python -m generation_detection.train
Example overrides:
python -m generation_detection.train training.epochs=10 training.learning_rate=1e-5 data.batch_size=64
Check configs/ for all configurable parameters. Trained model checkpoints will be saved in outputs/ by default (managed by PyTorch Lightning) and logged to MLflow.
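To make the dotted override syntax concrete, here is a plain-Python illustration of how strings such as `training.epochs=10` map onto a nested configuration. This is not Hydra's implementation, only a sketch of the idea; the default values in `defaults` are hypothetical and do not come from this project's configs.

```python
import ast
import copy


def apply_overrides(config, overrides):
    """Apply "a.b=value" strings to a nested dict and return a new dict."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        try:
            # Interpret numbers and other Python literals (10, 1e-5, ...).
            node[leaf] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Anything else is kept as a plain string.
            node[leaf] = raw_value
    return result


# Hypothetical defaults, for illustration only.
defaults = {
    "training": {"epochs": 3, "learning_rate": 2e-5},
    "data": {"batch_size": 32},
}

config = apply_overrides(
    defaults,
    ["training.epochs=10", "training.learning_rate=1e-5", "data.batch_size=64"],
)
print(config)
```

Each override replaces exactly one leaf value and leaves the rest of the configuration untouched, which is why a run only needs to name the parameters it changes.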
After training, export the best model checkpoint to ONNX format. The training script will log the path to the best checkpoint.
- Find the path to the latest MLflow run (e.g., mlruns/289040139417110437/b0e08865e0b44d64a3a33ba7d5dd2029).
- Copy the run ID (b0e08865e0b44d64a3a33ba7d5dd2029) to configs/export/default.mlflow_run_id_for_onnx_log.
- Run the export script:
python -m generation_detection.export_onnx
This saves the .onnx checkpoint to models/onnx/model_name.onnx. Additionally, the code wraps the model together with its tokenizer as an MLflow pyfunc model for serving, saves it in mlruns/Experiment ID/Run ID/artifacts/onnx_with_tokenizer_for_serving, and registers it in MLflow as ${model.model_name}/MODEL_VERSION. By default, MODEL_VERSION is an index number: 1, 2, 3, ...
Use the predict.py script to make predictions using the ONNX model. Check configs/inference/default.yaml for config.
python -m generation_detection.predict
The script will output the predicted class (0 for human, 1 for AI-generated) and the probabilities.
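As a rough sketch of how a predicted class and probabilities can be derived from a classifier's raw outputs, the snippet below applies a softmax to two logits. It assumes the model emits one logit per class (index 0 for human, index 1 for AI-generated); the actual predict.py may compute its outputs differently, and the logit values are made up.

```python
import math

LABELS = {0: "human", 1: "AI-generated"}


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify(logits):
    """Return (predicted class index, class probabilities)."""
    probabilities = softmax(logits)
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    return predicted, probabilities


# Made-up logits for illustration; real values come from the ONNX model.
predicted, probabilities = classify([-1.2, 2.3])
print(predicted, LABELS[predicted], [round(p, 3) for p in probabilities])
```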
- Run the serving locally:
mlflow models serve -m "models:/MODEL_NAME/MODEL_VERSION" -p 5001 --env-manager local
Once started, the server accepts requests containing the texts to classify, in the format described below.
- Send requests to the model: The served model expects POST requests at http://127.0.0.1:5001/invocations with headers
{ "Content-Type": "application/json" }
and payload:
{
  "dataframe_split": {
    "columns": ["text"],
    "data": [["First text sample"], ["Second text sample"]]
  }
}
- Client code example: Check scripts/demo_client_request.py for an example request, or run it with default arguments to check the behaviour:
python scripts/demo_client_request.py
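For readers who want to see the request shape in code, here is a minimal standard-library sketch that builds the payload described above and posts it to the served model. It is independent of scripts/demo_client_request.py and may differ from it; the helper names `build_payload` and `classify_texts` are hypothetical, and the response structure is not documented here, so the sketch simply returns the decoded JSON.

```python
import json
import urllib.request

# Endpoint from the serving instructions (port 5001).
INVOCATIONS_URL = "http://127.0.0.1:5001/invocations"


def build_payload(texts):
    """Wrap texts in the dataframe_split format with a single "text" column."""
    return {
        "dataframe_split": {
            "columns": ["text"],
            "data": [[text] for text in texts],
        }
    }


def classify_texts(texts, url=INVOCATIONS_URL, timeout=30):
    """POST the texts to the served model and return the decoded JSON reply."""
    body = json.dumps(build_payload(texts)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload(["First text sample", "Second text sample"])
print(json.dumps(payload, indent=2))
# With the server running, call for example:
# print(classify_texts(["First text sample", "Second text sample"]))
```

Building the payload is kept separate from sending it, so the request body can be inspected or tested without a running server.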