Last updated: 2026-01-01
Successfully implemented a complete data pipeline for training the computer use agent. The pipeline handles parsing, preprocessing, and batching of the homemade recording dataset.
src/
├── data/
│ ├── parser.py # Event JSON parsing
│ ├── image_preprocessing.py # Screenshot resizing
│ ├── sequence_builder.py # Action history construction
│ └── dataset.py # PyTorch Dataset
├── utils/
│ └── coordinate_utils.py # Coordinate normalization
tests/
└── test_coordinates.py # Unit tests
scripts/
├── test_parser.py # Parser validation
├── test_dataset.py # Dataset testing
└── prepare_data.py # Data preparation
Features:
- normalize_coordinates() - Converts pixel coordinates to [0, 1] range
- denormalize_coordinates() - Converts back to pixels
- calculate_pixel_distance() - Computes error in pixels
- Multi-monitor support - Handles negative coordinates for secondary displays
Tested: ✓ All tests passing in tests/test_coordinates.py
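The three helpers can be sketched roughly as follows. This is a minimal illustration, not the code in src/utils/coordinate_utils.py: the signatures, the `Display` tuple layout, and the clamping step are assumptions. The only fixed requirement is that a display's origin may be negative, as it is for a secondary monitor placed left of the primary.

```python
import math
from typing import Tuple

# Hypothetical layout: (origin_x, origin_y, width, height) in global pixels.
Display = Tuple[int, int, int, int]


def normalize_coordinates(x: float, y: float, display: Display) -> Tuple[float, float]:
    """Map global pixel coordinates to [0, 1] relative to one display."""
    ox, oy, w, h = display
    nx = (x - ox) / w
    ny = (y - oy) / h
    # Clamp so slightly out-of-bounds points still land in [0, 1] (assumption).
    return min(max(nx, 0.0), 1.0), min(max(ny, 0.0), 1.0)


def denormalize_coordinates(nx: float, ny: float, display: Display) -> Tuple[float, float]:
    """Map normalized coordinates back to global pixel coordinates."""
    ox, oy, w, h = display
    return ox + nx * w, oy + ny * h


def calculate_pixel_distance(pred: Tuple[float, float],
                             target: Tuple[float, float],
                             display: Display) -> float:
    """Euclidean distance in pixels between two normalized points."""
    px, py = denormalize_coordinates(pred[0], pred[1], display)
    tx, ty = denormalize_coordinates(target[0], target[1], display)
    return math.hypot(px - tx, py - ty)
```

For a 1920x1080 display whose origin is (-1920, 24), the global point (-960, 564) is the display center, so `normalize_coordinates(-960, 564, (-1920, 24, 1920, 1080))` returns `(0.5, 0.5)`.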
Features:
- Parses 620 JSON event files from the Yaniv session
- Classifies actions into types:
  - MOUSE_CLICK - with normalized (x, y) and button
  - KEYBOARD_TYPE - reconstructed text from keystrokes
  - KEYBOARD_SHORTCUT - Cmd-Tab, special keys, etc.
- Extracts window context (app name, URL, title)
- Provides statistics and filtering
- Validates coordinate normalization
Tested: ✓ Successfully parsed all 620 events via scripts/test_parser.py
Key Findings:
- Session duration: ~2.75 hours
- Average event spacing: ~16 seconds
- Multi-monitor handling works correctly (negative coords validated)
Features:
- Resize from 4K (3840x2160) to model input size (896x896)
- High-quality LANCZOS downsampling
- Optional aspect ratio preservation with padding
- Conservative augmentation (brightness/contrast only, no geometric transforms)
- Batch processing support
Design Choice: No geometric transforms to avoid invalidating coordinate labels
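A plain stretch to 896x896 leaves normalized coordinates unchanged, but the optional aspect-preserving mode does not: padding shifts where a screen point lands in the model input. The sketch below shows only the geometry, assuming centered padding. The function names are hypothetical and the actual resize is left to the image library.

```python
from typing import Tuple


def letterbox_geometry(src_w: int, src_h: int, target: int = 896) -> Tuple[int, int, int, int]:
    """Return (new_w, new_h, pad_left, pad_top) for a centered, aspect-preserving fit."""
    scale = min(target / src_w, target / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


def screen_to_input(nx: float, ny: float, src_w: int, src_h: int,
                    target: int = 896) -> Tuple[float, float]:
    """Map a normalized screen coordinate into the padded square input."""
    new_w, new_h, pad_left, pad_top = letterbox_geometry(src_w, src_h, target)
    return (pad_left + nx * new_w) / target, (pad_top + ny * new_h) / target
```

For a 3840x2160 screenshot this yields an 896x504 image with 196 pixels of padding above and below, so the top-left screen corner maps to y = 196/896 in the input. If padding is enabled, labels would need the same mapping applied (and inverted at execution time).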
Features:
- Builds training samples with action history context
- Configurable history length (default: 5 previous actions)
- Generates text prompts with:
  - Window context (app name, URL)
  - Time deltas for each historical action
  - Clear action formatting
- Produces target JSON strings for supervised learning
- Train/val/test temporal splitting (80/10/10)
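A temporal split can be sketched as below. It assumes each sample exposes a sortable timestamp through a caller-supplied `key` function. The function name and percentage parameters are illustrative, not the actual sequence_builder.py API. Integer percentages are used to avoid floating-point rounding at the split boundaries.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(samples: Sequence[T],
                   key: Callable[[T], float],
                   train_pct: int = 80,
                   val_pct: int = 10) -> Tuple[List[T], List[T], List[T]]:
    """Split samples chronologically: earliest to train, latest to test."""
    ordered = sorted(samples, key=key)  # no shuffling, order is by time only
    n = len(ordered)
    n_train = n * train_pct // 100
    n_val = n * val_pct // 100
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]  # remainder goes to test
    return train, val, test
```

With 620 events and the default 80/10/10 percentages, this produces 496, 62, and 62 samples, and every training sample precedes every validation and test sample in time.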
Prompt Format Example:
You are a computer use agent. Based on the screenshot and recent actions, predict the next action.
Current application: Google Chrome
URL: https://www.youtube.com/watch?v=xz0-brt56L8
Recent actions:
1. [16s ago] KEYBOARD_SHORTCUT: Cmd-Key.tab
2. [5s ago] MOUSE_CLICK at (0.278, 0.509)
Current screenshot: [IMAGE]
What action should be taken next? Respond in JSON format.
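The prompt above could be assembled along these lines. This is a sketch only: `build_prompt`, `build_click_target`, and the JSON field names in the target are hypothetical, since the exact target schema is not shown here.

```python
import json
from typing import List, Optional, Tuple


def build_prompt(app_name: str, url: Optional[str],
                 history: List[Tuple[int, str]]) -> str:
    """Render window context and (seconds_ago, description) history as a prompt."""
    lines = [
        "You are a computer use agent. Based on the screenshot and recent "
        "actions, predict the next action.",
        f"Current application: {app_name}",
    ]
    if url:  # URL is only present for browser windows
        lines.append(f"URL: {url}")
    lines.append("Recent actions:")
    for i, (seconds_ago, description) in enumerate(history, start=1):
        lines.append(f"{i}. [{seconds_ago}s ago] {description}")
    lines.append("Current screenshot: [IMAGE]")
    lines.append("What action should be taken next? Respond in JSON format.")
    return "\n".join(lines)


def build_click_target(x: float, y: float, button: str = "left") -> str:
    """Serialize a click as a JSON target string (field names are assumptions)."""
    return json.dumps({"action_type": "MOUSE_CLICK", "x": x, "y": y, "button": button})
```

Calling `build_prompt("Google Chrome", "https://www.youtube.com/watch?v=xz0-brt56L8", [(16, "KEYBOARD_SHORTCUT: Cmd-Key.tab"), (5, "MOUSE_CLICK at (0.278, 0.509)")])` reproduces the example prompt line for line.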
Features:
- ComputerUseDataset - Main dataset class
  - Integrates parsing, preprocessing, and sequence building
  - Supports custom image processors (e.g., from transformers)
  - Built-in augmentation support
- create_dataloaders() - One-line dataloader creation
  - Temporal train/val/test splits
  - Efficient batching with custom collate function
  - Optional caching for faster loading
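A custom collate function is needed because samples hold PIL images and strings, which the default PyTorch collation cannot stack into tensors. A minimal sketch follows, assuming each sample is a dict with the hypothetical keys `image`, `prompt`, `target`, and `metadata`. The batch keys match the ones the dataloaders return.

```python
from typing import Any, Dict, List


def collate_samples(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Group per-sample fields into parallel lists instead of stacking tensors."""
    return {
        "images": [s["image"] for s in samples],
        "prompts": [s["prompt"] for s in samples],
        "targets": [s["target"] for s in samples],
        "metadata": [s["metadata"] for s in samples],
    }
```

A function like this would be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument, leaving tensor conversion to the model's image processor.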
Usage:
train_loader, val_loader, test_loader = create_dataloaders(
    data_dir='data/Yaniv1_session_2025-12-31T10-26-33',
    batch_size=4,
    history_length=5
)

Scripts:
- scripts/test_parser.py - Validates parser on 620 events
- scripts/test_dataset.py - Comprehensive dataset tests
- scripts/prepare_data.py - Data preparation with statistics
- tests/test_coordinates.py - Unit tests for coordinate utilities
All tests passing: ✓
- Size: 620 events, 231 MB
- Duration: 2.75 hours
- Event spacing: ~16 seconds average
Splits (80/10/10):
- Train: 496 events
- Val: 62 events
- Test: 62 events
Action Distribution:
- Mouse clicks: ~35% (mostly left button)
- Keyboard typing: ~45%
- Keyboard shortcuts: ~20%
Multi-Monitor:
- Display 116 (primary): 1920x1080 at (0, 0)
- Display 117 (secondary): 1920x1080 at (-1920, 24)
- Decision: Use [0, 1] normalized coordinates
- Rationale: Resolution-agnostic, easier to learn, handles multi-monitor
- Trade-off: Requires denormalization for execution
- Decision: Include 5 previous actions in context
- Rationale: ~80 seconds of history, balances context vs input length
- Configurable: Can be changed per experiment
- Decision: Split by time (train=early, test=later)
- Rationale: More realistic than random split for sequential data
- Validates: Generalization to future time steps
- Decision: Only brightness/contrast augmentation
- Rationale: Geometric transforms would invalidate coordinate labels
- Conservative: Prevents data corruption
- Implement model wrapper (src/models/qwen_vlm.py)
  - Load Qwen2.5-VL-7B
  - Configure LoRA adapters
  - Define forward pass
- Implement training loop (src/training/trainer.py)
  - Hybrid loss function (classification + regression)
  - Optimizer setup (AdamW)
  - Checkpoint saving
  - Logging to W&B
- Create training script (scripts/train.py)
  - Load dataloaders
  - Initialize model
  - Run training
  - Evaluate on val set
- DGX Setup
  - Transfer data to DGX
  - Create conda environment
  - Test GPU availability
  - Run initial training
- ✓ Data pipeline fully validated
- ✓ Parser handles all 620 events correctly
- ✓ Coordinate normalization tested on multi-monitor setup
- ✓ Dataset can be loaded in batches
- ✓ Prompts and targets generated correctly
- All data processing code complete
- Unit tests passing
- Documentation updated
- Ready to integrate with Qwen2.5-VL model
# Activate environment
conda activate computer-use
# Test parser
python scripts/test_parser.py
# Test dataset
python scripts/test_dataset.py
# Prepare data with statistics
python scripts/prepare_data.py --data_dir data/Yaniv1_session_2025-12-31T10-26-33

# Create dataloaders (Python)
from src.data.dataset import create_dataloaders

train_loader, val_loader, test_loader = create_dataloaders(
    data_dir='data/Yaniv1_session_2025-12-31T10-26-33',
    batch_size=4,
    history_length=5,
    num_workers=0
)
# Iterate
for batch in train_loader:
    images = batch['images']      # List of PIL Images
    prompts = batch['prompts']    # List of text prompts
    targets = batch['targets']    # List of JSON action strings
    metadata = batch['metadata']  # List of metadata dicts

| Component | Status | Notes |
|---|---|---|
| Coordinate normalization | ✅ | Multi-monitor tested |
| Event parsing | ✅ | All 620 events parsed |
| Image preprocessing | ✅ | 4K → 896x896 resize |
| Sequence building | ✅ | History context works |
| PyTorch Dataset | ✅ | Batching validated |
| Train/val/test splits | ✅ | Temporal split 80/10/10 |
| Unit tests | ✅ | All passing |
Data Loading:
- First epoch: ~30s (image loading + preprocessing)
- Cached epochs: ~10s (if using CachedDataset)
- Batch size 4: ~2 samples/sec on CPU
Memory:
- Single 4K image: ~25 MB
- Resized to 896x896: ~2.4 MB
- Batch of 4: ~10 MB
- Full dataset in memory: ~1.5 GB (manageable)
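The memory figures above follow from simple arithmetic, assuming uncompressed 8-bit RGB (3 bytes per pixel) and decimal megabytes. The helper name is illustrative.

```python
def rgb_bytes(width: int, height: int, channels: int = 3) -> int:
    """Bytes for one uncompressed 8-bit image of the given size."""
    return width * height * channels


full_4k = rgb_bytes(3840, 2160)   # 24,883,200 bytes, about 25 MB
resized = rgb_bytes(896, 896)     # 2,408,448 bytes, about 2.4 MB
batch_of_4 = 4 * resized          # 9,633,792 bytes, about 10 MB
full_dataset = 620 * resized      # 1,493,237,760 bytes, about 1.5 GB
```

Float tensors would take more space than these uint8 estimates (4x for float32), so the numbers apply to cached PIL images or uint8 arrays only.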
Recommendations:
- Use num_workers=4-8 on DGX for faster loading
- Consider CachedDataset for repeated epochs
- Batch size 4-8 optimal for GB10 GPU
- Single session: Currently only the Yaniv session (620 events)
  - Future: Integrate Hugging Face dataset (3,167 tasks) in Phase 4
- No double-click detection: All mouse events classified as MOUSE_CLICK
  - Future: Implement temporal analysis to detect double-clicks
- Limited augmentation: Only brightness/contrast
  - Acceptable: Geometric transforms would break coordinate labels
- CPU-based preprocessing: Images loaded on-the-fly
  - Future: Pre-cache resized images for faster training
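The double-click item could be approached with a simple temporal check: two consecutive clicks count as a double-click when they are close in both time and position. The sketch below is one possible starting point, not an implemented feature. The click tuple layout and both thresholds are assumptions that would need tuning against the recorded events.

```python
import math
from typing import List, Tuple

# Hypothetical layout: (timestamp_seconds, x_normalized, y_normalized).
Click = Tuple[float, float, float]


def find_double_clicks(clicks: List[Click],
                       max_gap_s: float = 0.5,
                       max_dist: float = 0.01) -> List[Tuple[int, int]]:
    """Return index pairs of consecutive clicks that form a double-click."""
    pairs = []
    i = 0
    while i < len(clicks) - 1:
        t0, x0, y0 = clicks[i]
        t1, x1, y1 = clicks[i + 1]
        close_in_time = (t1 - t0) <= max_gap_s
        close_in_space = math.hypot(x1 - x0, y1 - y0) <= max_dist
        if close_in_time and close_in_space:
            pairs.append((i, i + 1))
            i += 2  # consume both clicks so a triple-click is not counted twice
        else:
            i += 1
    return pairs
```

The input is expected to be sorted by timestamp and to contain clicks from a single button.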
✅ Phase 1 Complete: Data pipeline is production-ready and fully tested.
🎯 Ready for Phase 2: Model implementation and training on DGX.