Generated Text Detection Project

Project Description

This project aims to detect whether a given text was generated by an AI model or written by a human. It uses a BERT-like model with a classification head, trained on a dataset of human-written and AI-generated texts from a competition.

Technical Stack

Python 3.12
PyTorch & PyTorch Lightning: For model development and training.
Transformers: For leveraging pre-trained models like BERT.
Hydra: For configuration management.
DVC (Data Version Control): For versioning data and models, with Google Drive as remote storage.
UV: For Python package management and virtual environments.
MLflow: For experiment tracking and model serving.
ONNX: For model serialization and interoperability.
pre-commit, ruff, prettier: For code quality.

Setup Instructions

Prerequisites

Git
Python 3.12
UV (Python package installer and resolver)
Access to a Google Drive account (for DVC remote)

Installation

Clone the repository:

git clone git@github.com:Wayfarer123/GeneratedTextDetection-MLOps.git
cd GeneratedTextDetection-MLOps

Create and activate a virtual environment and install dependencies using UV:

uv venv
source .venv/bin/activate  # On Linux/macOS
# .venv\Scripts\activate   # On Windows
uv sync

uv pip install -e .

DVC Setup (Google Drive Remote)

This project uses DVC to manage large data files and models, storing them in Google Drive.

Prerequisites for DVC:

You have cloned this repository. DVC is already initialized within it.
To view the remote Google Drive folder, follow this link: https://drive.google.com/drive/folders/1hJPpDzv3QcmCxwOobvMko7yEcNW_W4xq?usp=sharing

Setup steps:

Obtain the Google Cloud Service Account JSON key file (generatedtextdetection-a38d1a979b00.json) and place it in the repository root. To get the file and access to the data, message https://t.me/Nikita_Okhotnikov
To download the data and any DVC-tracked models, run the following command from the project root directory:
```
dvc pull
```

Alternatively, use your own Google Cloud credentials:

Create your own Google Cloud account.
Create a new project and a service account in it.
Download your own JSON key file and set gdrive_service_account_json_file_path in .dvc/config to its path.

Pre-commit Hooks

Install pre-commit hooks to ensure code quality before committing:

pre-commit install

Now, ruff and prettier will run automatically on staged files when you git commit.

Data

The raw data is a sample of texts with corresponding labels. After dvc pull, the files are expected in data/raw/ and consist of:

subtaskA_train_monolingual.jsonl: Training data.
subtaskA_dev_monolingual.jsonl: Validation/development data.
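
As a minimal sketch of how these files can be inspected, the snippet below reads a JSON Lines file using only the standard library. The field names `text` and `label` are assumptions about the record schema that this README does not confirm, the helper names are hypothetical, and the project's own data loading code may differ.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


def load_jsonl(path: Path) -> list[dict]:
    """Read a whole .jsonl file into memory."""
    with Path(path).open(encoding="utf-8") as handle:
        return list(parse_jsonl(handle))


def label_counts(records: Iterable[dict], label_key: str = "label") -> Counter:
    """Count records per label; `label_key` is an assumed field name."""
    return Counter(record[label_key] for record in records)
```

After dvc pull, `label_counts(load_jsonl("data/raw/subtaskA_dev_monolingual.jsonl"))` would report the class balance of the development set, provided the assumed `label` field exists.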

Training

To train the model:

Ensure MLflow Tracking Server is running: The configuration points to http://127.0.0.1:8080. You can start a local MLflow server:
```
mlflow server --host 127.0.0.1 --port 8080 --backend-store-uri file:./mlruns --default-artifact-root file:./mlruns
```
MLflow will log to a local mlruns directory.
Run the training script: The script uses Hydra for configuration. You can override parameters from the command line.
```
python -m generation_detector.train
```
Example overrides:
```
python -m generation_detector.train training.epochs=10 training.learning_rate=1e-5 data.batch_size=64
```
Check configs/ for all configurable parameters. Trained model checkpoints will be saved in outputs/ by default (managed by PyTorch Lightning) and logged to MLflow.
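
To make the override syntax concrete, here is a toy illustration of how dotted `key=value` arguments map onto a nested configuration. It is not Hydra's implementation (Hydra handles parsing, composition, and validation itself), the helper `apply_overrides` is hypothetical, and the default values in `config` are placeholders rather than the project's real defaults.

```python
import ast
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `a.b=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node[name]  # KeyError if the config group does not exist
        try:
            node[leaf] = ast.literal_eval(raw_value)  # numbers, lists, ...
        except (ValueError, SyntaxError):
            node[leaf] = raw_value  # fall back to a plain string
    return result


# Hypothetical defaults, for illustration only.
config = {
    "training": {"epochs": 3, "learning_rate": 2e-5},
    "data": {"batch_size": 32},
}
updated = apply_overrides(
    config,
    ["training.epochs=10", "training.learning_rate=1e-5", "data.batch_size=64"],
)
```

The original `config` is left untouched, which mirrors the idea that command-line overrides affect a single run rather than the files in configs/.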

Model Export & Production Preparation

Export to ONNX

After training, export the best model checkpoint to ONNX format. The training script will log the path to the best checkpoint.

Find the path to the latest MLflow run (e.g., mlruns/289040139417110437/b0e08865e0b44d64a3a33ba7d5dd2029).
Copy the run ID (b0e08865e0b44d64a3a33ba7d5dd2029) into the mlflow_run_id_for_onnx_log parameter of the configs/export/default config.
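
The lookup in the steps above can also be scripted. The sketch below assumes the file-store layout shown in the example path (mlruns/EXPERIMENT_ID/RUN_ID) and treats the most recently modified run directory as the latest run. The helper name `latest_run_id` and the modification-time heuristic are assumptions, not part of the project.

```python
import re
from pathlib import Path
from typing import Optional

# Run directories in the example path are named with 32 hex characters.
RUN_ID_PATTERN = re.compile(r"^[0-9a-f]{32}$")


def latest_run_id(mlruns_dir: Path) -> Optional[str]:
    """Return the ID of the most recently modified run directory, if any."""
    run_dirs = [
        path
        for path in Path(mlruns_dir).glob("*/*")
        if path.is_dir()
        and RUN_ID_PATTERN.match(path.name)
        and not path.parent.name.startswith(".")  # skip hidden dirs such as .trash
    ]
    if not run_dirs:
        return None
    newest = max(run_dirs, key=lambda path: path.stat().st_mtime)
    return newest.name
```

Calling `latest_run_id(Path("mlruns"))` from the project root would return a candidate ID to paste into the export config. It is worth confirming the result in the MLflow UI, since directory modification times are only a heuristic.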

Run the export script:

python -m generation_detector.export_onnx

This saves the .onnx checkpoint to models/onnx/model_name.onnx. Additionally, the code wraps the model together with its tokenizer as a pyfunc model for serving, saves it to mlruns/Experiment ID/Run ID/artifacts/onnx_with_tokenizer_for_serving, and registers it in MLflow as ${model.model_name}/MODEL_VERSION. By default, MODEL_VERSION is a sequential index (1, 2, 3, ...).

Inference

Local Inference with ONNX

Use the predict.py script to make predictions using the ONNX model. Check configs/inference/default.yaml for config.

python -m generation_detector.predict

The script will output the predicted class (0 for human, 1 for AI-generated) and the probabilities.
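
For readers unfamiliar with how a class and its probabilities are derived from a classifier's raw scores, here is a small standard-library sketch. It assumes the model emits two logits ordered as (human, AI-generated). Whether predict.py computes its output exactly this way is not stated in this README, and the logit values are made up.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_label(logits: list[float]) -> tuple[int, list[float]]:
    """Return (predicted class index, class probabilities)."""
    probabilities = softmax(logits)
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    return predicted, probabilities


# Made-up logits in the assumed order (human, AI-generated).
label, probs = predict_label([-1.2, 2.3])
print(label, [round(p, 3) for p in probs])
```

Here the second logit is larger, so the predicted class is 1 (AI-generated) under the assumed ordering.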

Serving with MLflow

Start the model server locally:
```
    mlflow models serve -m "models:/MODEL_NAME/MODEL_VERSION" -p 5001 --env-manager local
```
Once running, the server accepts requests that contain only the texts to classify, in the format described below.

Send requests to the model: The served model expects POST requests at http://127.0.0.1:5001/invocations with the headers

{
  "Content-Type": "application/json"
}

and payload:

{
  "dataframe_split": {
    "columns": ["text"],
    "data": [["First text sample"], ["Second text sample"]]
  }
}
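
A request in this format can be assembled with the standard library alone. The sketch below is not the repository's scripts/demo_client_request.py. The helper names `build_payload`, `build_request`, and `classify` are hypothetical, and the shape of the server's response is not documented here, so it is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:5001/invocations"


def build_payload(texts: list[str]) -> dict:
    """Wrap texts in the `dataframe_split` format shown above."""
    return {
        "dataframe_split": {
            "columns": ["text"],
            "data": [[text] for text in texts],  # one row per text
        }
    }


def build_request(texts: list[str], url: str = DEFAULT_URL) -> urllib.request.Request:
    """Create a POST request with a JSON body and Content-Type header."""
    body = json.dumps(build_payload(texts)).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(texts: list[str], url: str = DEFAULT_URL, timeout: float = 30.0):
    """Send the request to a running server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_request(texts, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from the previous step running, `classify(["First text sample", "Second text sample"])` sends both texts in a single request.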

Client code example: Check scripts/demo_client_request.py for an example request, or run it with default arguments as
```
    python scripts/demo_client_request.py
```
to check the behaviour.
