Computer Use Agent Training

Created: 2026-01-01

Train a computer use agent using behavioral cloning (supervised learning) by fine-tuning Qwen2.5-VL-7B on recorded human interactions.

Overview

This project trains a vision-language model to predict user actions (mouse clicks, keyboard input) given a screenshot and action history. The model learns from recorded sessions where humans interact with computer interfaces.
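
As a rough illustration of the behavioral-cloning setup described above, the sketch below turns an ordered list of recorded events into (screenshot, action history, target action) examples. The function name build_examples, the history_len parameter, and the "screenshot" and "action" keys are hypothetical and not taken from the project code; it also assumes each screenshot was captured before its action.

```python
def build_examples(events, history_len=5):
    """Turn an ordered event list into (screenshot, history, target) examples.

    Assumes (hypothetically) that each event is a dict with a "screenshot"
    path captured BEFORE the action and an "action" dict describing it.
    """
    examples = []
    for i, event in enumerate(events):
        # Only actions that happened strictly before event i go into the history.
        start = max(0, i - history_len)
        history = [e["action"] for e in events[start:i]]
        examples.append({
            "screenshot": event["screenshot"],
            "history": history,
            "target": event["action"],
        })
    return examples
```

Keeping the target action out of its own history is what makes this a next-action prediction problem rather than a copy task.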

Approach: Behavioral cloning with LoRA fine-tuning
Model: Qwen2.5-VL-7B (7B parameter vision-language model)
Hardware: DGX Spark machines with GB10 GPUs (128GB shared memory)

Dataset

Primary Dataset: Homemade recording from /data/Yaniv1_session_2025-12-31T10-26-33

620 paired JSON + JPG files (event-driven recording)
Screenshots: 3840x2160 (4K resolution)
Event types: Mouse clicks (x, y, button) + Keyboard (keystrokes)
Rich metadata: Window context (app, URL, title), display bounds, timestamps
Session duration: ~2.75 hours
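
The paired layout listed above could be loaded along these lines. This is a minimal sketch that assumes each event's JSON file and its screenshot share a file stem (for example 0001.json and 0001.jpg); the actual naming scheme is not documented here, and pair_recordings is a hypothetical helper name.

```python
from pathlib import Path


def pair_recordings(session_dir):
    """Return sorted (json_path, jpg_path) pairs that share a file stem.

    Assumption: an event file and its screenshot differ only by extension.
    """
    session = Path(session_dir)
    # Index screenshots by stem so each event file is matched with one lookup.
    screenshots = {p.stem: p for p in session.glob("*.jpg")}
    pairs = []
    for event_path in sorted(session.glob("*.json")):
        shot = screenshots.get(event_path.stem)
        if shot is not None:  # skip events that have no matching screenshot
            pairs.append((event_path, shot))
    return pairs
```

Comparing len(pair_recordings(...)) with the expected file count is a cheap sanity check before preprocessing.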

Secondary Dataset: Hugging Face PSAI dataset at /data/computer-use-data-psai

3,167 tasks in parquet format (~15.5GB)
Will be integrated in Phase 4

Project Structure

.
├── README.md
├── setup.py                   # Package configuration and dependencies
├── environment.yml            # Conda environment definition
├── config/                    # Training configurations
├── src/
│   ├── data/                 # Data parsing and preprocessing
│   ├── models/               # Model wrappers
│   ├── training/             # Training loop and loss functions
│   ├── evaluation/           # Metrics and evaluation
│   └── utils/                # Utility functions
├── scripts/                  # Executable scripts
│   ├── prepare_data.py      # Data preprocessing
│   ├── train.py             # Training script
│   ├── evaluate.py          # Evaluation script
│   └── visualize_predictions.py
├── notebooks/                # Jupyter notebooks for exploration
├── tests/                    # Unit tests
├── checkpoints/              # Model checkpoints (gitignored)
├── logs/                     # Training logs (gitignored)
└── outputs/                  # Predictions and visualizations

Setup

1. Environment Setup

# Create environment from file
conda env create -f environment.yml

# Activate environment
conda activate computer-use

# Install package in editable mode
# Option 1: Core dependencies only (minimal)
pip install -e .

# Option 2: Full training environment (recommended)
pip install -e ".[all]"

# Option 3: Specific feature sets
# pip install -e ".[training,visualization,dev]"

Note: The environment setup may take 5-10 minutes and requires ~8-10GB of disk space.

2. Verify Installation

# Activate environment
conda activate computer-use

# Test imports
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"

# Run coordinate tests
pytest tests/test_coordinates.py -v

# Test parser
python scripts/test_parser.py

# Test dataset
python scripts/test_dataset.py

# Verify screenshot timing (ensures no data leakage)
python tests/verify_screenshot_timing.py

# Test complete pipeline with minimal model (M1 Mac compatible)
python scripts/test_minimal_model.py

3. Data Access

The data directories are symlinked:

/data → /Users/roee/data (read-only)
/recorder → /Users/roee/src/auto/events-agent-antigraviti (read-only)

Quick Start

Phase 1: Data Exploration

# Prepare and cache the dataset
python scripts/prepare_data.py --data_dir data/Yaniv1_session_2025-12-31T10-26-33

# Explore the data
jupyter notebook notebooks/01_data_exploration.ipynb

Phase 2: Training

# Train on DGX
python scripts/train.py --config config/dgx_config.yaml

Phase 3: Evaluation

# Evaluate trained model
python scripts/evaluate.py --checkpoint checkpoints/best_model.pth

# Visualize predictions
python scripts/visualize_predictions.py --checkpoint checkpoints/best_model.pth

Action Space

The model predicts actions in a hierarchical format:

Action Types (5 classes):

MOUSE_CLICK - Single click at coordinates
MOUSE_DOUBLE_CLICK - Double click
KEYBOARD_TYPE - Text input
KEYBOARD_SHORTCUT - Key combinations (Cmd-Tab, etc.)
NO_ACTION - No action predicted (used when the model is uncertain)

Output Format (JSON):

{
  "action_type": "MOUSE_CLICK",
  "parameters": {
    "x": 0.278,
    "y": 0.509,
    "button": "left"
  }
}

Here x and y are normalized to [0, 1].
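
A prediction in this format might be validated and mapped back to screen pixels as sketched below. The helper parse_action and the x_px / y_px keys are hypothetical; the default width and height come from the 3840x2160 screenshots in the primary dataset, and clamping 1.0 to the last pixel is an assumption about how edge values should be handled.

```python
import json

ACTION_TYPES = {
    "MOUSE_CLICK",
    "MOUSE_DOUBLE_CLICK",
    "KEYBOARD_TYPE",
    "KEYBOARD_SHORTCUT",
    "NO_ACTION",
}


def parse_action(raw, width=3840, height=2160):
    """Validate a predicted action and add pixel coordinates for mouse actions."""
    action = json.loads(raw)
    action_type = action.get("action_type")
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    params = dict(action.get("parameters", {}))
    if "x" in params and "y" in params:
        if not (0.0 <= params["x"] <= 1.0 and 0.0 <= params["y"] <= 1.0):
            raise ValueError("coordinates must be normalized to [0, 1]")
        # Scale to pixels and clamp so that 1.0 maps to the last valid pixel.
        params["x_px"] = min(int(params["x"] * width), width - 1)
        params["y_px"] = min(int(params["y"] * height), height - 1)
    return {"action_type": action_type, "parameters": params}
```

Rejecting malformed output early keeps evaluation metrics from silently counting unparseable predictions as valid actions.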

Training Configuration

Model: Qwen2.5-VL-7B with LoRA fine-tuning

LoRA rank: 32, alpha: 64
Target modules: q_proj, k_proj, v_proj, o_proj

Training:

Batch size: 4 (per device), gradient accumulation: 8 (effective: 32)
Learning rate: 2e-5 with cosine schedule
Mixed precision: bfloat16 (bf16)
Epochs: 10 (~50 minutes on DGX)
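
The arithmetic behind these settings can be checked with a small sketch. It assumes a single device, no warmup, and decay to zero, none of which is stated above; the constant and function names are illustrative rather than taken from the training config. The LoRA scaling shown is the standard alpha / rank factor.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
BASE_LR = 2e-5
LORA_RANK, LORA_ALPHA = 32, 64


def effective_batch_size(num_devices=1):
    """Samples contributing to each optimizer step."""
    return PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * num_devices


def cosine_lr(step, total_steps, base_lr=BASE_LR, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Standard LoRA scaling applied to the low-rank update: alpha / rank.
lora_scaling = LORA_ALPHA / LORA_RANK
```

With rank 32 and alpha 64 the low-rank update is scaled by 2, and halfway through training the cosine schedule sits at half the base learning rate.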

Loss Function:

Action type: Cross-entropy classification
Coordinates: L1 loss (MAE) on normalized [0,1] range
Text: Token-level cross-entropy
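
One way the three terms could be combined is sketched below in plain Python for readability; an actual training loop would use tensor operations. The weights w_type, w_coord, and w_text, and the rule that a term is skipped when its inputs are absent, are assumptions and not documented behavior.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy for one example from raw logits (stable log-softmax)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_index]


def l1_loss(pred_xy, target_xy):
    """Mean absolute error over the normalized (x, y) pair."""
    return sum(abs(p - t) for p, t in zip(pred_xy, target_xy)) / len(target_xy)


def combined_loss(type_logits, type_target, pred_xy=None, target_xy=None,
                  text_logits=None, text_targets=None,
                  w_type=1.0, w_coord=1.0, w_text=1.0):
    """Weighted sum of the action-type, coordinate, and text terms."""
    loss = w_type * cross_entropy(type_logits, type_target)
    if pred_xy is not None and target_xy is not None:
        # Coordinate term only applies to mouse actions.
        loss += w_coord * l1_loss(pred_xy, target_xy)
    if text_logits and text_targets:
        # Token-level cross-entropy, averaged over the target tokens.
        token_losses = [cross_entropy(l, t) for l, t in zip(text_logits, text_targets)]
        loss += w_text * sum(token_losses) / len(token_losses)
    return loss
```

Because the coordinate term lives on a [0, 1] scale while cross-entropy does not, the relative weights are likely worth tuning.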

Expected Performance

Baseline (Phase 2):

Action type accuracy: 65-75%
Mouse coordinate MAE: 150-250 pixels (denormalized)
Keyboard text accuracy: 40-60%
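
The first two metrics could be computed roughly as follows. This sketch assumes the pixel MAE is averaged over both axes and all samples after scaling normalized errors by the 3840x2160 screenshot size; the exact averaging used by the evaluation script is not specified above, and the function names are illustrative.

```python
def action_type_accuracy(predicted, expected):
    """Fraction of samples whose predicted action type matches the label."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


def pixel_mae(pred_xy, true_xy, width=3840, height=2160):
    """Mean absolute error in pixels, averaged over both axes and all samples."""
    total = 0.0
    for (px, py), (tx, ty) in zip(pred_xy, true_xy):
        # Denormalize each axis error by its own screen dimension.
        total += abs(px - tx) * width + abs(py - ty) * height
    return total / (2 * len(true_xy))
```

On a 4K screenshot, a 150-250 pixel error corresponds to roughly 4-7% of the screen width, which gives a sense of how coarse the baseline clicks are expected to be.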

DGX Migration

Transfer Data to DGX

# From local machine
rsync -avz --progress /Users/roee/data/Yaniv1_session_2025-12-31T10-26-33/ \
  dgx:/data/computer-use/yaniv_session/

Deploy Code to DGX

rsync -avz --progress --exclude 'checkpoints/*' --exclude 'logs/*' \
  /Users/roee/src/train/my/ dgx:/workspace/computer-use-training/

Launch Training on DGX

ssh dgx
cd /workspace/computer-use-training
conda activate computer-use
python scripts/train.py --config config/dgx_config.yaml

Development Phases

Phase 1 (Week 1): Data pipeline validation
Phase 2 (Weeks 2-3): Baseline model training
Phase 3 (Week 4): Evaluation and error analysis
Phase 4 (Weeks 5-6): Scale to Hugging Face dataset
Phase 5 (Week 7+): Advanced techniques (RL, multi-resolution)

Citation

This project uses:

Qwen2.5-VL - Vision-language model
PEFT - Parameter-efficient fine-tuning
Recording data from custom Electron app

License

TBD
