roeex5/train0

Computer Use Agent Training

Created: 2026-01-01

Train a computer use agent using behavioral cloning (supervised learning) by fine-tuning Qwen2.5-VL-7B on recorded human interactions.

Overview

This project trains a vision-language model to predict user actions (mouse clicks, keyboard input) given a screenshot and action history. The model learns from recorded sessions where humans interact with computer interfaces.

  • Approach: Behavioral cloning with LoRA fine-tuning
  • Model: Qwen2.5-VL-7B (7B parameter vision-language model)
  • Hardware: DGX Spark machines with GB10 GPUs (128GB shared memory)

Dataset

Primary Dataset: Homemade recording from /data/Yaniv1_session_2025-12-31T10-26-33

  • 620 paired JSON + JPG files (event-driven recording)
  • Screenshots: 3840x2160 (4K resolution)
  • Event types: Mouse clicks (x, y, button) + Keyboard (keystrokes)
  • Rich metadata: Window context (app, URL, title), display bounds, timestamps
  • Session duration: ~2.75 hours
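
A minimal sketch of how the paired files could be collected, assuming each event JSON and its screenshot share the same file stem (for example `0001.json` and `0001.jpg`). The naming scheme and the helper name `find_pairs` are assumptions for illustration, not taken from the recorder.

```python
from pathlib import Path


def find_pairs(session_dir):
    """Return sorted (json_path, jpg_path) tuples that share a file stem."""
    session = Path(session_dir)
    # Index screenshots by stem so each event file is matched with one lookup.
    screenshots = {p.stem: p for p in session.glob("*.jpg")}
    pairs = []
    for event_path in sorted(session.glob("*.json")):
        shot = screenshots.get(event_path.stem)
        if shot is not None:  # skip events that have no matching screenshot
            pairs.append((event_path, shot))
    return pairs
```

Comparing `len(find_pairs(...))` against the expected 620 pairs is a quick sanity check before preprocessing.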

Secondary Dataset: Hugging Face PSAI dataset at /data/computer-use-data-psai

  • 3,167 tasks in parquet format (~15.5GB)
  • Will be integrated in Phase 4

Project Structure

.
├── README.md
├── setup.py                   # Package configuration and dependencies
├── environment.yaml           # Conda environment definition
├── config/                    # Training configurations
├── src/
│   ├── data/                 # Data parsing and preprocessing
│   ├── models/               # Model wrappers
│   ├── training/             # Training loop and loss functions
│   ├── evaluation/           # Metrics and evaluation
│   └── utils/                # Utility functions
├── scripts/                  # Executable scripts
│   ├── prepare_data.py      # Data preprocessing
│   ├── train.py             # Training script
│   ├── evaluate.py          # Evaluation script
│   └── visualize_predictions.py
├── notebooks/                # Jupyter notebooks for exploration
├── tests/                    # Unit tests
├── checkpoints/              # Model checkpoints (gitignored)
├── logs/                     # Training logs (gitignored)
└── outputs/                  # Predictions and visualizations

Setup

1. Environment Setup

# Create environment from file (listed as environment.yaml under Project Structure)
conda env create -f environment.yaml

# Activate environment
conda activate computer-use

# Install package in editable mode
# Option 1: Core dependencies only (minimal)
pip install -e .

# Option 2: Full training environment (recommended)
pip install -e ".[all]"

# Option 3: Specific feature sets
# pip install -e ".[training,visualization,dev]"

Note: The environment setup may take 5-10 minutes and requires ~8-10GB of disk space.

2. Verify Installation

# Activate environment
conda activate computer-use

# Test imports
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"

# Run coordinate tests
pytest tests/test_coordinates.py -v

# Test parser
python scripts/test_parser.py

# Test dataset
python scripts/test_dataset.py

# Verify screenshot timing (ensures no data leakage)
python tests/verify_screenshot_timing.py

# Test complete pipeline with minimal model (M1 Mac compatible)
python scripts/test_minimal_model.py

3. Data Access

The data directories are symlinked:

  • /data → /Users/roee/data (read-only)
  • /recorder → /Users/roee/src/auto/events-agent-antigraviti (read-only)

Quick Start

Phase 1: Data Exploration

# Prepare and cache the dataset
python scripts/prepare_data.py --data_dir data/Yaniv1_session_2025-12-31T10-26-33

# Explore the data
jupyter notebook notebooks/01_data_exploration.ipynb

Phase 2: Training

# Train on DGX
python scripts/train.py --config config/dgx_config.yaml

Phase 3: Evaluation

# Evaluate trained model
python scripts/evaluate.py --checkpoint checkpoints/best_model.pth

# Visualize predictions
python scripts/visualize_predictions.py --checkpoint checkpoints/best_model.pth

Action Space

The model predicts actions in a hierarchical format:

Action Types (5 classes):

  1. MOUSE_CLICK - Single click at coordinates
  2. MOUSE_DOUBLE_CLICK - Double click
  3. KEYBOARD_TYPE - Text input
  4. KEYBOARD_SHORTCUT - Key combinations (Cmd-Tab, etc.)
  5. NO_ACTION - Emitted when the model is uncertain which action to take

Output Format (JSON, with x and y normalized to [0, 1]):

{
  "action_type": "MOUSE_CLICK",
  "parameters": {
    "x": 0.278,
    "y": 0.509,
    "button": "left"
  }
}
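
The format above could be validated and mapped back to screen pixels roughly as follows. This is a sketch under stated assumptions: the helper names `parse_action` and `to_pixels` are hypothetical, and it assumes coordinates are normalized by dividing by the 3840x2160 screenshot size.

```python
import json

ACTION_TYPES = {
    "MOUSE_CLICK",
    "MOUSE_DOUBLE_CLICK",
    "KEYBOARD_TYPE",
    "KEYBOARD_SHORTCUT",
    "NO_ACTION",
}


def parse_action(raw):
    """Parse a JSON action string and check it against the action space."""
    action = json.loads(raw)
    if action.get("action_type") not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action.get('action_type')!r}")
    params = action.get("parameters", {})
    for axis in ("x", "y"):
        if axis in params and not 0.0 <= params[axis] <= 1.0:
            raise ValueError(f"{axis} must be in [0, 1], got {params[axis]}")
    return action


def to_pixels(x_norm, y_norm, width=3840, height=2160):
    """Map normalized coordinates to integer pixel coordinates."""
    # Clamp to the last valid pixel index so a value of 1.0 stays on screen.
    px = min(int(x_norm * width), width - 1)
    py = min(int(y_norm * height), height - 1)
    return px, py
```

Under these assumptions, the example click at (0.278, 0.509) lands at pixel (1067, 1099) on a 4K screenshot.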

Training Configuration

Model: Qwen2.5-VL-7B with LoRA fine-tuning

  • LoRA rank: 32, alpha: 64
  • Target modules: q_proj, k_proj, v_proj, o_proj
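
For reference, these settings map onto the keyword names used by `peft.LoraConfig` (`r`, `lora_alpha`, `target_modules`). The sketch below keeps them in a plain dictionary so it runs without PEFT installed; `lora_param_count` is a hypothetical helper added for illustration, and the project's actual config file may be organized differently.

```python
# Keyword names follow peft.LoraConfig; the values come from the list above.
lora_kwargs = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

# With standard LoRA scaling, the low-rank update is multiplied by lora_alpha / r.
lora_scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]


def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer of shape d_in x d_out."""
    # Matrix A has shape (r, d_in) and matrix B has shape (d_out, r).
    return r * (d_in + d_out)
```

With PEFT available, the dictionary can be passed as `LoraConfig(**lora_kwargs)`. At rank 32 and alpha 64 the update is scaled by 2.0.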

Training:

  • Batch size: 4 (per device), gradient accumulation: 8 (effective: 32)
  • Learning rate: 2e-5 with cosine schedule
  • Mixed precision: bfloat16 (bf16)
  • Epochs: 10 (~50 minutes on DGX)
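
The arithmetic behind these numbers can be sketched as below. It assumes a single device, no warmup, a final learning rate of zero, and that all 620 recorded samples are used for training; none of these are stated in the configuration, so treat the step counts as rough estimates.

```python
import math


def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Number of samples contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def cosine_lr(step, total_steps, peak_lr=2e-5, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


batch = effective_batch_size(4, 8)        # 4 * 8 = 32
steps_per_epoch = math.ceil(620 / batch)  # 20 optimizer steps per epoch
total_steps = 10 * steps_per_epoch        # 200 optimizer steps over 10 epochs
```

Under these assumptions the whole run is only about 200 optimizer steps, so the learning rate is already at half its peak (1e-5) by step 100.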

Loss Function:

  • Action type: Cross-entropy classification
  • Coordinates: L1 loss (MAE) on normalized [0,1] range
  • Text: Token-level cross-entropy
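
One way the three terms could be combined is a weighted sum, sketched here in dependency-free Python for a single example. The weights `w_type`, `w_coord`, and `w_text` and the function names are hypothetical; the source does not say how the terms are weighted, and a real training loop would compute this on batched PyTorch tensors.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one example, using a numerically stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


def l1_loss(pred_xy, target_xy):
    """Mean absolute error over the normalized x and y coordinates."""
    return sum(abs(p - t) for p, t in zip(pred_xy, target_xy)) / len(target_xy)


def combined_loss(type_logits, type_target, pred_xy=None, target_xy=None,
                  text_logits=None, text_targets=None,
                  w_type=1.0, w_coord=1.0, w_text=1.0):
    """Weighted sum of the three terms; terms that do not apply add nothing."""
    loss = w_type * cross_entropy(type_logits, type_target)
    if target_xy is not None:  # mouse actions only
        loss += w_coord * l1_loss(pred_xy, target_xy)
    if text_targets:  # keyboard actions only
        token_losses = [cross_entropy(l, t) for l, t in zip(text_logits, text_targets)]
        loss += w_text * sum(token_losses) / len(token_losses)
    return loss
```

Skipping the coordinate term for keyboard actions and the text term for mouse actions keeps the loss from penalizing fields that have no target.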

Expected Performance

Baseline (Phase 2):

  • Action type accuracy: 65-75%
  • Mouse coordinate MAE: 150-250 pixels (denormalized)
  • Keyboard text accuracy: 40-60%
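
The first two metrics could be computed along these lines. This sketch assumes the pixel MAE is averaged over both axes after scaling normalized errors by the 3840x2160 screenshot size; the function names are hypothetical and the project's evaluation code may define the metric differently.

```python
def action_type_accuracy(predicted, expected):
    """Fraction of samples whose predicted action type matches the label."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def coordinate_mae_pixels(pred_xy, true_xy, width=3840, height=2160):
    """Mean absolute pixel error, averaged over both axes and all samples."""
    total = 0.0
    for (px, py), (tx, ty) in zip(pred_xy, true_xy):
        # Scale each normalized error back to pixels before accumulating.
        total += abs(px - tx) * width + abs(py - ty) * height
    return total / (2 * len(true_xy))
```

For scale, an error of 150-250 pixels is roughly 4 to 7 percent of the 3840-pixel screen width.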

DGX Migration

Transfer Data to DGX

# From local machine
rsync -avz --progress /Users/roee/data/Yaniv1_session_2025-12-31T10-26-33/ \
  dgx:/data/computer-use/yaniv_session/

Deploy Code to DGX

rsync -avz --progress --exclude 'checkpoints/*' --exclude 'logs/*' \
  /Users/roee/src/train/my/ dgx:/workspace/computer-use-training/

Launch Training on DGX

ssh dgx
cd /workspace/computer-use-training
conda activate computer-use
python scripts/train.py --config config/dgx_config.yaml

Development Phases

  • Phase 1 (Week 1): Data pipeline validation
  • Phase 2 (Weeks 2-3): Baseline model training
  • Phase 3 (Week 4): Evaluation and error analysis
  • Phase 4 (Weeks 5-6): Scale to Hugging Face dataset
  • Phase 5 (Week 7+): Advanced techniques (RL, multi-resolution)

Citation

This project uses:

  • Qwen2.5-VL - Vision-language model
  • PEFT - Parameter-efficient fine-tuning
  • Recording data from a custom Electron app

License

TBD

About

An initial attempt at training a model for use as a local computer use agent.
