Created: 2026-01-01
Train a computer use agent using behavioral cloning (supervised learning) by fine-tuning Qwen2.5-VL-7B on recorded human interactions.
This project trains a vision-language model to predict user actions (mouse clicks, keyboard input) given a screenshot and action history. The model learns from recorded sessions where humans interact with computer interfaces.
- Approach: Behavioral cloning with LoRA fine-tuning
- Model: Qwen2.5-VL-7B (7B parameter vision-language model)
- Hardware: DGX Spark machines with GB10 GPUs (128GB shared memory)
Primary Dataset: Homemade recording from /data/Yaniv1_session_2025-12-31T10-26-33
- 620 paired JSON + JPG files (event-driven recording)
- Screenshots: 3840x2160 (4K resolution)
- Event types: Mouse clicks (x, y, button) + Keyboard (keystrokes)
- Rich metadata: Window context (app, URL, title), display bounds, timestamps
- Session duration: ~2.75 hours
Secondary Dataset: Hugging Face PSAI dataset at /data/computer-use-data-psai
- 3,167 tasks in parquet format (~15.5GB)
- Will be integrated in Phase 4
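A minimal sketch of how the paired files in the primary dataset could be matched up. It assumes that each event's JSON file and its screenshot share the same file stem (for example `0001.json` and `0001.jpg`); that naming scheme, and the helper names `pair_session_files` and `load_event`, are illustrative assumptions rather than the recorder's documented layout.

```python
import json
from pathlib import Path


def pair_session_files(session_dir):
    """Return (json_path, jpg_path) pairs that share a file stem, sorted by name.

    Assumption: an event and its screenshot differ only in extension.
    """
    session = Path(session_dir)
    screenshots = {p.stem: p for p in session.glob("*.jpg")}
    pairs = []
    for json_path in sorted(session.glob("*.json")):
        jpg_path = screenshots.get(json_path.stem)
        if jpg_path is not None:  # skip events that have no screenshot
            pairs.append((json_path, jpg_path))
    return pairs


def load_event(json_path):
    """Load one recorded event as a plain dict."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)
```

For a complete session, `len(pair_session_files(...))` should match the number of recorded events; a smaller count would point to unpaired files.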
.
├── README.md
├── setup.py # Package configuration and dependencies
├── environment.yaml # Conda environment definition
├── config/ # Training configurations
├── src/
│ ├── data/ # Data parsing and preprocessing
│ ├── models/ # Model wrappers
│ ├── training/ # Training loop and loss functions
│ ├── evaluation/ # Metrics and evaluation
│ └── utils/ # Utility functions
├── scripts/ # Executable scripts
│ ├── prepare_data.py # Data preprocessing
│ ├── train.py # Training script
│ ├── evaluate.py # Evaluation script
│ └── visualize_predictions.py
├── notebooks/ # Jupyter notebooks for exploration
├── tests/ # Unit tests
├── checkpoints/ # Model checkpoints (gitignored)
├── logs/ # Training logs (gitignored)
└── outputs/ # Predictions and visualizations
# Create environment from file
conda env create -f environment.yaml
# Activate environment
conda activate computer-use
# Install package in editable mode
# Option 1: Core dependencies only (minimal)
pip install -e .
# Option 2: Full training environment (recommended)
pip install -e ".[all]"
# Option 3: Specific feature sets
# pip install -e ".[training,visualization,dev]"
Note: The environment setup may take 5-10 minutes and requires ~8-10GB of disk space.
# Activate environment
conda activate computer-use
# Test imports
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import transformers; print(f'Transformers: {transformers.__version__}')"
# Run coordinate tests
pytest tests/test_coordinates.py -v
# Test parser
python scripts/test_parser.py
# Test dataset
python scripts/test_dataset.py
# Verify screenshot timing (ensures no data leakage)
python tests/verify_screenshot_timing.py
# Test complete pipeline with minimal model (M1 Mac compatible)
python scripts/test_minimal_model.py
The data directories are symlinked:
- /data → /Users/roee/data (read-only)
- /recorder → /Users/roee/src/auto/events-agent-antigraviti (read-only)
# Prepare and cache the dataset
python scripts/prepare_data.py --data_dir data/Yaniv1_session_2025-12-31T10-26-33
# Explore the data
jupyter notebook notebooks/01_data_exploration.ipynb
# Train on DGX
python scripts/train.py --config config/dgx_config.yaml
# Evaluate trained model
python scripts/evaluate.py --checkpoint checkpoints/best_model.pth
# Visualize predictions
python scripts/visualize_predictions.py --checkpoint checkpoints/best_model.pth
The model predicts actions in a hierarchical format:
Action Types (5 classes):
- MOUSE_CLICK - Single click at coordinates
- MOUSE_DOUBLE_CLICK - Double click
- KEYBOARD_TYPE - Text input
- KEYBOARD_SHORTCUT - Key combinations (Cmd-Tab, etc.)
- NO_ACTION - Model uncertainty
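The five action types map naturally onto an enum. The sketch below assumes the model emits a JSON object with an `action_type` field and optional `parameters`, and it treats any output that cannot be parsed as `NO_ACTION`. That fallback, and the names `ActionType` and `parse_action`, are assumptions made for illustration, not confirmed behavior of this project's code.

```python
import json
from enum import Enum


class ActionType(str, Enum):
    MOUSE_CLICK = "MOUSE_CLICK"
    MOUSE_DOUBLE_CLICK = "MOUSE_DOUBLE_CLICK"
    KEYBOARD_TYPE = "KEYBOARD_TYPE"
    KEYBOARD_SHORTCUT = "KEYBOARD_SHORTCUT"
    NO_ACTION = "NO_ACTION"


def parse_action(raw_text):
    """Parse model output into (ActionType, parameters).

    Assumption: malformed JSON or an unknown action type falls back to NO_ACTION.
    """
    try:
        payload = json.loads(raw_text)
        action_type = ActionType(payload["action_type"])
    except (json.JSONDecodeError, KeyError, ValueError, TypeError):
        return ActionType.NO_ACTION, {}
    return action_type, payload.get("parameters") or {}
```

Falling back to `NO_ACTION` keeps evaluation code from crashing on free-form generations, at the cost of hiding formatting errors; counting those fallbacks separately is one way to keep them visible.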
Output Format (JSON; x and y are normalized to [0, 1]):
{
  "action_type": "MOUSE_CLICK",
  "parameters": {
    "x": 0.278,
    "y": 0.509,
    "button": "left"
  }
}
Model: Qwen2.5-VL-7B with LoRA fine-tuning
- LoRA rank: 32, alpha: 64
- Target modules: q_proj, k_proj, v_proj, o_proj
Training:
- Batch size: 4 (per device), gradient accumulation: 8 (effective: 32)
- Learning rate: 2e-5 with cosine schedule
- Mixed precision: bfloat16 (bf16)
- Epochs: 10 (~50 minutes on DGX)
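The batch-size arithmetic and learning-rate schedule above can be sketched in plain Python. This assumes a single device and a cosine decay to zero with no warmup; the warmup setting and the minimum learning rate are not stated here, so `min_lr` and the function names are illustrative assumptions.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
BASE_LR = 2e-5


def effective_batch_size(num_devices=1):
    """Samples contributing to each optimizer step."""
    return PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * num_devices


def cosine_lr(step, total_steps, base_lr=BASE_LR, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With one device, `effective_batch_size()` gives 32, matching the figure above; halfway through training the schedule sits at half the base rate.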
Loss Function:
- Action type: Cross-entropy classification
- Coordinates: L1 loss (MAE) on normalized [0,1] range
- Text: Token-level cross-entropy
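One way the three terms might be combined for a single example is sketched below without any framework. The equal weights `coord_weight` and `text_weight`, and the choice to skip terms that do not apply to an action, are assumptions; the actual training code presumably uses tensor operations and may weight the terms differently.

```python
import math


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target class under softmax(logits)."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[target_index]


def l1_loss(pred_xy, target_xy):
    """Mean absolute error over normalized (x, y) coordinates."""
    return sum(abs(p - t) for p, t in zip(pred_xy, target_xy)) / len(target_xy)


def combined_loss(type_logits, type_target, pred_xy=None, target_xy=None,
                  text_token_losses=None, coord_weight=1.0, text_weight=1.0):
    """Action-type loss plus whichever parameter losses apply (weights are hypothetical)."""
    loss = cross_entropy(type_logits, type_target)
    if pred_xy is not None and target_xy is not None:
        loss += coord_weight * l1_loss(pred_xy, target_xy)
    if text_token_losses:  # per-token cross-entropy values, averaged
        loss += text_weight * sum(text_token_losses) / len(text_token_losses)
    return loss
```

Applying the coordinate term only to mouse actions and the text term only to keyboard actions avoids penalizing parameters that have no ground truth for a given event.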
Baseline (Phase 2):
- Action type accuracy: 65-75%
- Mouse coordinate MAE: 150-250 pixels (denormalized)
- Keyboard text accuracy: 40-60%
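Because the model predicts normalized coordinates, the pixel MAE requires scaling back to the 3840x2160 screenshot size. The sketch below assumes the MAE is averaged per axis over x and y; a Euclidean-distance definition would give different numbers, and the function names are illustrative.

```python
SCREEN_WIDTH, SCREEN_HEIGHT = 3840, 2160


def to_pixels(x_norm, y_norm, width=SCREEN_WIDTH, height=SCREEN_HEIGHT):
    """Convert normalized [0, 1] coordinates to integer pixel coordinates."""
    return round(x_norm * width), round(y_norm * height)


def pixel_mae(predictions, targets, width=SCREEN_WIDTH, height=SCREEN_HEIGHT):
    """Mean absolute error in pixels, averaged over both axes and all pairs.

    Assumption: per-axis averaging rather than Euclidean distance.
    """
    total = 0.0
    for (px, py), (tx, ty) in zip(predictions, targets):
        total += abs(px - tx) * width + abs(py - ty) * height
    return total / (2 * len(targets))
```

For the example action shown earlier, `to_pixels(0.278, 0.509)` returns `(1068, 1099)`.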
# From local machine
rsync -avz --progress /Users/roee/data/Yaniv1_session_2025-12-31T10-26-33/ \
  dgx:/data/computer-use/yaniv_session/
rsync -avz --progress --exclude 'checkpoints/*' --exclude 'logs/*' \
  /Users/roee/src/train/my/ dgx:/workspace/computer-use-training/
ssh dgx
cd /workspace/computer-use-training
conda activate computer-use
python scripts/train.py --config config/dgx_config.yaml
- Phase 1 (Week 1): Data pipeline validation
- Phase 2 (Weeks 2-3): Baseline model training
- Phase 3 (Week 4): Evaluation and error analysis
- Phase 4 (Weeks 5-6): Scale to Hugging Face dataset
- Phase 5 (Week 7+): Advanced techniques (RL, multi-resolution)
This project uses:
- Qwen2.5-VL - Vision-language model
- PEFT - Parameter-efficient fine-tuning
- Recording data from custom Electron app
TBD