Wayfarer123/GeneratedTextDetection-MLOps

Generated Text Detection Project

Table of Contents

  1. Project Description
  2. Technical Stack
  3. Setup Instructions
  4. Data
  5. Training
  6. Model Export & Production Preparation
  7. Inference

Project Description

This project detects whether a given text was generated by an AI model or written by a human. It uses a BERT-like model with a classification head, trained on a dataset of human-written and AI-generated texts taken from a competition.

Technical Stack

  • Python 3.12
  • PyTorch & PyTorch Lightning: For model development and training.
  • Transformers: For leveraging pre-trained models like BERT.
  • Hydra: For configuration management.
  • DVC (Data Version Control): For versioning data and models, with Google Drive as remote storage.
  • UV: For Python package management and virtual environments.
  • MLflow: For experiment tracking and model serving.
  • ONNX: For model serialization and interoperability.
  • pre-commit, ruff, prettier: For code quality.

Setup Instructions

Prerequisites

  • Git
  • Python 3.12
  • UV (Python package installer and resolver)
  • Access to a Google Drive account (for DVC remote)

Installation

  1. Clone the repository:

    git clone git@github.com:Wayfarer123/GeneratedTextDetection-MLOps.git
    cd GeneratedTextDetection-MLOps
  2. Create and activate a virtual environment and install dependencies using UV:

    uv venv
    source .venv/bin/activate  # On Linux/macOS
    # .venv\Scripts\activate   # On Windows
    uv sync
    
    uv pip install -e .

DVC Setup (Google Drive Remote)

This project uses DVC to manage large data files and models, storing them in Google Drive.

  1. Prerequisites for DVC: access to a Google Drive account (see Prerequisites above).
  2. Setup steps:
  • Obtain the Google Cloud service account JSON key file (generatedtextdetection-a38d1a979b00.json) and place it in the repository root. To request the file and access to the data, message https://t.me/Nikita_Okhotnikov
  • To download the data and any DVC-tracked models, run the following command from the project root directory:
    dvc pull
  3. Alternatively, use your own Google Cloud project:
  • Create your own Google Cloud account.
  • Create a new project and a service account in it.
  • Download your own JSON key file and set gdrive_service_account_json_file_path in .dvc/config to its path.
Pre-commit Hooks

Install pre-commit hooks to ensure code quality before committing:

pre-commit install

Now, ruff and prettier will run automatically on staged files when you git commit.

Data

The raw data is a sample of texts with corresponding labels. After dvc pull, the files are expected in data/raw/ and consist of:

  • subtaskA_train_monolingual.jsonl: Training data.
  • subtaskA_dev_monolingual.jsonl: Validation/development data.
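
A quick way to sanity-check the pulled files is to read them line by line and count the labels. The sketch below is only an illustration: it assumes each line is a JSON object and that the label is stored under a field named `label`, which this README does not confirm. Both helper names are hypothetical and not part of the package.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def label_counts(records, label_key="label"):
    """Count records per label value.

    `label_key` is an assumed field name; adjust it to the real schema.
    """
    counts = {}
    for record in records:
        label = record.get(label_key)
        counts[label] = counts.get(label, 0) + 1
    return counts
```

For example, `label_counts(load_jsonl("data/raw/subtaskA_train_monolingual.jsonl"))` would show how balanced the training split is, provided the field name matches.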

Training

To train the model:

  1. Ensure MLflow Tracking Server is running: The configuration points to http://127.0.0.1:8080. You can start a local MLflow server:

    mlflow server --host 127.0.0.1 --port 8080 --backend-store-uri file:./mlruns --default-artifact-root file:./mlruns

    MLflow will log to a local mlruns directory.

  2. Run the training script: The script uses Hydra for configuration. You can override parameters from the command line.

    python -m generation_detection.train

    Example overrides:

    python -m generation_detection.train training.epochs=10 training.learning_rate=1e-5 data.batch_size=64

    Check configs/ for all configurable parameters. Trained model checkpoints will be saved in outputs/ by default (managed by PyTorch Lightning) and logged to MLflow.
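
To make the override syntax concrete, the following sketch shows how dotted `key=value` arguments map onto a nested configuration. It is not Hydra's implementation, only a standard-library illustration of the idea. The default values are hypothetical placeholders; the real ones live in configs/.

```python
import copy


def parse_value(raw):
    """Convert an override value to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_overrides(config, overrides):
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            # Walk down the nested dictionaries, creating levels if missing.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return result


# Hypothetical defaults, for illustration only.
defaults = {
    "training": {"epochs": 3, "learning_rate": 2e-5},
    "data": {"batch_size": 32},
}
merged = apply_overrides(
    defaults,
    ["training.epochs=10", "training.learning_rate=1e-5", "data.batch_size=64"],
)
print(merged)
```

Each override replaces only the leaf it names, so settings that are not mentioned on the command line keep their configured values.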

Model Export & Production Preparation

Export to ONNX

After training, export the best model checkpoint to ONNX format. The training script will log the path to the best checkpoint.

  1. Find the path to the latest MLflow run (e.g., mlruns/289040139417110437/b0e08865e0b44d64a3a33ba7d5dd2029).
  2. Copy the run ID (b0e08865e0b44d64a3a33ba7d5dd2029) into the mlflow_run_id_for_onnx_log key of the configs/export/default config.
  3. Run the export script:
    python -m generation_detection.export_onnx

This saves the .onnx checkpoint to models/onnx/model_name.onnx. In addition, the script wraps the model together with its tokenizer as an MLflow pyfunc model for serving, saves it in mlruns/EXPERIMENT_ID/RUN_ID/artifacts/onnx_with_tokenizer_for_serving, and registers it in MLflow as ${model.model_name}/MODEL_VERSION. By default, MODEL_VERSION is a sequential index (1, 2, 3, ...).
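
Step 1 above can be automated with a small helper. This is a sketch under stated assumptions: it relies on the local file-store layout mlruns/EXPERIMENT_ID/RUN_ID shown in the example path, treats any 32-character hexadecimal directory name as a run ID, and uses the directory modification time as a rough proxy for "latest". The function name is hypothetical.

```python
import re
from pathlib import Path

# Assumption: run IDs are 32 lowercase hex characters, as in the example path.
RUN_ID_PATTERN = re.compile(r"[0-9a-f]{32}")


def latest_run_id(mlruns_dir="mlruns"):
    """Return the ID of the most recently modified run directory, or None."""
    run_dirs = [
        path
        for path in Path(mlruns_dir).glob("*/*")
        if path.is_dir() and RUN_ID_PATTERN.fullmatch(path.name)
    ]
    if not run_dirs:
        return None
    # Modification time only approximates recency; confirm in the MLflow UI.
    newest = max(run_dirs, key=lambda path: path.stat().st_mtime)
    return newest.name
```

Calling `latest_run_id()` from the project root returns a candidate ID to paste into the export config. Checking it against the MLflow UI is still advisable when several runs were started close together.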

Inference

Local Inference with ONNX

Use the predict.py script to make predictions using the ONNX model. Check configs/inference/default.yaml for config.

python -m generation_detection.predict

The script will output the predicted class (0 for human, 1 for AI-generated) and the probabilities.
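
The relationship between the reported probabilities and the predicted class can be illustrated as follows. This sketch assumes the model produces two raw logits per text and that probabilities are obtained with a softmax; the actual post-processing in predict.py may differ. The function names are hypothetical.

```python
import math

# Class mapping as stated above: 0 for human, 1 for AI-generated.
LABELS = {0: "human", 1: "AI-generated"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def interpret(logits):
    """Return (class index, label, probabilities) for one pair of logits."""
    probabilities = softmax(logits)
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    return predicted, LABELS[predicted], probabilities
```

The predicted class is simply the index of the larger probability, so a text is labelled AI-generated when the second probability exceeds the first.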

Serving with MLflow

  1. Run the server locally:

        mlflow models serve -m "models:/MODEL_NAME/MODEL_VERSION" -p 5001 --env-manager local

    The served model then accepts requests containing the texts to classify, in the format described below.

  2. Send requests to the model: The served model expects POST requests at http://127.0.0.1:5001/invocations with headers

    {
      "Content-Type": "application/json"
    }

    and payload:

    {
      "dataframe_split": {
        "columns": ["text"],
        "data": [["First text sample"], ["Second text sample"]]
      }
    }
  3. Client code example: See scripts/demo_client_request.py for an example request, or run it with the default arguments to check the behaviour:

        python scripts/demo_client_request.py
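
For readers who prefer to see the request spelled out, here is a minimal standard-library sketch of a client. It is not the contents of scripts/demo_client_request.py; the function names are hypothetical, and no particular response structure is assumed beyond it being JSON.

```python
import json
import urllib.request

INVOCATIONS_URL = "http://127.0.0.1:5001/invocations"


def build_payload(texts):
    """Wrap texts in the dataframe_split format described above."""
    return {
        "dataframe_split": {
            "columns": ["text"],
            "data": [[text] for text in texts],
        }
    }


def classify(texts, url=INVOCATIONS_URL, timeout=30):
    """POST the texts to the served model and return the decoded JSON response."""
    body = json.dumps(build_payload(texts)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from step 1 running, `classify(["First text sample", "Second text sample"])` sends both texts in a single request.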
