Project Progress

Last updated: 2026-01-01

✅ Completed: Phase 1 - Data Pipeline Implementation

Overview

Implemented a complete data pipeline for training the computer use agent. The pipeline parses, preprocesses, and batches the self-recorded session dataset (currently one session of 620 events).

Implemented Components

1. Project Structure ✓

src/
├── data/
│   ├── parser.py              # Event JSON parsing
│   ├── image_preprocessing.py # Screenshot resizing
│   ├── sequence_builder.py    # Action history construction
│   └── dataset.py             # PyTorch Dataset
├── utils/
│   └── coordinate_utils.py    # Coordinate normalization
tests/
└── test_coordinates.py        # Unit tests
scripts/
├── test_parser.py             # Parser validation
├── test_dataset.py            # Dataset testing
└── prepare_data.py            # Data preparation

2. Coordinate Utilities (`src/utils/coordinate_utils.py`) ✓

Features:

normalize_coordinates() - Converts pixel coordinates to [0, 1] range
denormalize_coordinates() - Converts back to pixels
calculate_pixel_distance() - Computes error in pixels
Multi-monitor support - Handles negative coordinates for secondary displays

Tested: ✓ All tests passing in tests/test_coordinates.py
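As an illustration of the features above, here is a minimal sketch of per-display normalization. It is not the code in src/utils/coordinate_utils.py: the function names come from the list above, but the signatures, the `origin` argument, and the sample point are assumptions.

```python
import math


def normalize_coordinates(x, y, width, height, origin=(0, 0)):
    """Map a global pixel position to [0, 1] relative to one display.

    `origin` (assumed argument) is the display's top-left corner in global
    pixel space. A display left of the primary has a negative x origin, so
    negative inputs still land in [0, 1].
    """
    ox, oy = origin
    return (x - ox) / width, (y - oy) / height


def denormalize_coordinates(nx, ny, width, height, origin=(0, 0)):
    """Inverse mapping: normalized position back to global pixels."""
    ox, oy = origin
    return round(nx * width) + ox, round(ny * height) + oy


def calculate_pixel_distance(pred, target, width, height):
    """Euclidean error in pixels between two normalized points."""
    dx = (pred[0] - target[0]) * width
    dy = (pred[1] - target[1]) * height
    return math.hypot(dx, dy)


# Secondary display geometry as listed under Dataset Statistics.
secondary = {"width": 1920, "height": 1080, "origin": (-1920, 24)}
nx, ny = normalize_coordinates(-960, 564, **secondary)
print(nx, ny)  # 0.5 0.5
```

Normalizing against the display that contains the event, rather than against the whole virtual desktop, is one way to keep every label inside [0, 1]; whether the project does exactly this is not stated here.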

3. Event Parser (`src/data/parser.py`) ✓

Features:

Parses 620 JSON event files from the Yaniv session
Classifies actions into types:
- MOUSE_CLICK - with normalized (x, y) and button
- KEYBOARD_TYPE - reconstructed text from keystrokes
- KEYBOARD_SHORTCUT - Cmd-Tab, special keys, etc.
Extracts window context (app name, URL, title)
Provides statistics and filtering
Validates coordinate normalization

Tested: ✓ Successfully parsed all 620 events via scripts/test_parser.py

Key Findings:

Session duration: ~2.75 hours
Average event spacing: ~16 seconds
Multi-monitor handling works correctly (negative coords validated)

4. Image Preprocessing (`src/data/image_preprocessing.py`) ✓

Features:

Resize from 4K (3840x2160) to model input size (896x896)
High-quality LANCZOS downsampling
Optional aspect ratio preservation with padding
Conservative augmentation (brightness/contrast only, no geometric transforms)
Batch processing support

Design Choice: No geometric transforms to avoid invalidating coordinate labels
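The optional padding mode deserves the same care as geometric augmentation. A plain stretch to 896x896 leaves normalized labels unchanged, but padding a 16:9 frame into a square shifts where a normalized point falls in the output image. The sketch below works through that geometry. It assumes centered padding, and both helper names are hypothetical rather than taken from image_preprocessing.py.

```python
def letterbox_geometry(src_w, src_h, dst=896):
    """Return (new_w, new_h, pad_x, pad_y) for a centered, aspect-preserving fit."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (dst - new_w) // 2, (dst - new_h) // 2
    return new_w, new_h, pad_x, pad_y


def remap_normalized_point(nx, ny, src_w, src_h, dst=896):
    """Convert a label normalized on the source frame to the padded square."""
    new_w, new_h, pad_x, pad_y = letterbox_geometry(src_w, src_h, dst)
    return (pad_x + nx * new_w) / dst, (pad_y + ny * new_h) / dst


print(letterbox_geometry(3840, 2160))                  # (896, 504, 0, 196)
print(remap_normalized_point(0.0, 0.0, 3840, 2160))    # (0.0, 0.21875)
```

With a 3840x2160 source, the top-left corner of the screen moves to y = 0.21875 in the padded image, so labels would need this remapping whenever padding is enabled.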

5. Sequence Builder (`src/data/sequence_builder.py`) ✓

Features:

Builds training samples with action history context
Configurable history length (default: 5 previous actions)
Generates text prompts with:
- Window context (app name, URL)
- Time deltas for each historical action
- Clear action formatting
Produces target JSON strings for supervised learning
Train/val/test temporal splitting (80/10/10)

Prompt Format Example:

You are a computer use agent. Based on the screenshot and recent actions, predict the next action.

Current application: Google Chrome
URL: https://www.youtube.com/watch?v=xz0-brt56L8

Recent actions:
1. [16s ago] KEYBOARD_SHORTCUT: Cmd-Key.tab
2. [5s ago] MOUSE_CLICK at (0.278, 0.509)

Current screenshot: [IMAGE]

What action should be taken next? Respond in JSON format.
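A prompt in this format could be assembled roughly as follows. This is a sketch, not the code in sequence_builder.py: `build_prompt` and its `(seconds_ago, description)` history tuples are assumed for illustration.

```python
def build_prompt(app_name, url, history):
    """Render the text prompt.

    history: list of (seconds_ago, description) tuples, oldest first.
    """
    lines = [
        "You are a computer use agent. Based on the screenshot and recent "
        "actions, predict the next action.",
        "",
        f"Current application: {app_name}",
    ]
    if url:  # Only browser windows carry a URL.
        lines.append(f"URL: {url}")
    lines += ["", "Recent actions:"]
    for i, (seconds_ago, description) in enumerate(history, start=1):
        lines.append(f"{i}. [{seconds_ago:.0f}s ago] {description}")
    lines += [
        "",
        "Current screenshot: [IMAGE]",
        "",
        "What action should be taken next? Respond in JSON format.",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    "Google Chrome",
    "https://www.youtube.com/watch?v=xz0-brt56L8",
    [(16, "KEYBOARD_SHORTCUT: Cmd-Key.tab"), (5, "MOUSE_CLICK at (0.278, 0.509)")],
)
print(prompt)
```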

6. PyTorch Dataset (`src/data/dataset.py`) ✓

Features:

ComputerUseDataset - Main dataset class
Integrates parsing, preprocessing, and sequence building
Supports custom image processors (e.g., from transformers)
Built-in augmentation support
create_dataloaders() - One-line dataloader creation
Temporal train/val/test splits
Efficient batching with custom collate function
Optional caching for faster loading

Usage:

train_loader, val_loader, test_loader = create_dataloaders(
    data_dir='data/Yaniv1_session_2025-12-31T10-26-33',
    batch_size=4,
    history_length=5
)
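PyTorch's default collate function cannot stack PIL images, which is why a custom one is needed. A minimal sketch of such a function is shown below. The batch keys match those used in the iteration example under Usage Examples; the per-sample keys (`image`, `prompt`, `target`, `metadata`) and the name `collate_samples` are assumptions.

```python
def collate_samples(samples):
    """Group per-sample dicts into one dict of lists, without stacking tensors."""
    return {
        "images": [s["image"] for s in samples],
        "prompts": [s["prompt"] for s in samples],
        "targets": [s["target"] for s in samples],
        "metadata": [s["metadata"] for s in samples],
    }


# Placeholder strings stand in for PIL images in this demo.
batch = collate_samples([
    {"image": "img0", "prompt": "p0", "target": "{}", "metadata": {"idx": 0}},
    {"image": "img1", "prompt": "p1", "target": "{}", "metadata": {"idx": 1}},
])
print(batch["prompts"])  # ['p0', 'p1']
```

It would be passed to the loader as `torch.utils.data.DataLoader(dataset, batch_size=4, collate_fn=collate_samples)`, leaving tensor conversion to the model's image processor.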

7. Testing & Validation ✓

Scripts:

scripts/test_parser.py - Validates parser on 620 events
scripts/test_dataset.py - Comprehensive dataset tests
scripts/prepare_data.py - Data preparation with statistics
tests/test_coordinates.py - Unit tests for coordinate utilities

All tests passing: ✓

Dataset Statistics (Yaniv Session)

Size: 620 events, 231 MB
Duration: 2.75 hours
Event spacing: ~16 seconds average

Splits (80/10/10):

Train: 496 events
Val: 62 events
Test: 62 events
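The split sizes follow from slicing the time-ordered events at the 80% and 90% marks. A minimal sketch, assuming each event carries a `timestamp` field (the field name and the helper `temporal_split` are hypothetical):

```python
def temporal_split(events, train_pct=80, val_pct=10):
    """Split time-ordered events into contiguous train/val/test slices."""
    ordered = sorted(events, key=lambda e: e["timestamp"])
    n = len(ordered)
    train_end = n * train_pct // 100
    val_end = train_end + n * val_pct // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Synthetic stand-in: 620 events spaced 16 seconds apart.
events = [{"timestamp": 16.0 * i} for i in range(620)]
train, val, test = temporal_split(events)
print(len(train), len(val), len(test))  # 496 62 62
```

Because the slices are contiguous, every validation event is later than every training event, and every test event is later still.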

Action Distribution:

Mouse clicks: ~35% (mostly left button)
Keyboard typing: ~45%
Keyboard shortcuts: ~20%

Multi-Monitor:

Display 116 (primary): 1920x1080 at (0, 0)
Display 117 (secondary): 1920x1080 at (-1920, 24)

Key Design Decisions

1. Normalized Coordinates

Decision: Use [0, 1] normalized coordinates
Rationale: Resolution-agnostic, easier to learn, handles multi-monitor
Trade-off: Requires denormalization for execution

2. Action History Length = 5

Decision: Include 5 previous actions in context
Rationale: ~80 seconds of history, balances context vs input length
Configurable: Can be changed per experiment

3. Temporal Splitting

Decision: Split by time (train=early, test=later)
Rationale: More realistic than random split for sequential data
Validates: Generalization to future time steps

4. No Geometric Augmentation

Decision: Only brightness/contrast augmentation
Rationale: Geometric transforms would invalidate coordinate labels
Conservative: Prevents data corruption

Next Steps

Immediate (Phase 2: Model Training)

Implement model wrapper (src/models/qwen_vlm.py)
- Load Qwen2.5-VL-7B
- Configure LoRA adapters
- Define forward pass
Implement training loop (src/training/trainer.py)
- Hybrid loss function (classification + regression)
- Optimizer setup (AdamW)
- Checkpoint saving
- Logging to W&B
Create training script (scripts/train.py)
- Load dataloaders
- Initialize model
- Run training
- Evaluate on val set
DGX Setup
- Transfer data to DGX
- Create conda environment
- Test GPU availability
- Run initial training

Testing Ready

✓ Data pipeline fully validated
✓ Parser handles all 620 events correctly
✓ Coordinate normalization tested on multi-monitor setup
✓ Dataset can be loaded in batches
✓ Prompts and targets generated correctly

Files Ready for Training

All data processing code complete
Unit tests passing
Documentation updated
Ready to integrate with Qwen2.5-VL model

Usage Examples

Quick Test

# Activate environment
conda activate computer-use

# Test parser
python scripts/test_parser.py

# Test dataset
python scripts/test_dataset.py

# Prepare data with statistics
python scripts/prepare_data.py --data_dir data/Yaniv1_session_2025-12-31T10-26-33

Create DataLoaders

from src.data.dataset import create_dataloaders

train_loader, val_loader, test_loader = create_dataloaders(
    data_dir='data/Yaniv1_session_2025-12-31T10-26-33',
    batch_size=4,
    history_length=5,
    num_workers=0
)

# Iterate
for batch in train_loader:
    images = batch['images']      # List of PIL Images
    prompts = batch['prompts']    # List of text prompts
    targets = batch['targets']    # List of JSON action strings
    metadata = batch['metadata']  # List of metadata dicts

Validation Status

| Component | Status | Notes |
|---|---|---|
| Coordinate normalization | ✅ | Multi-monitor tested |
| Event parsing | ✅ | All 620 events parsed |
| Image preprocessing | ✅ | 4K → 896x896 resize |
| Sequence building | ✅ | History context works |
| PyTorch Dataset | ✅ | Batching validated |
| Train/val/test splits | ✅ | Temporal split 80/10/10 |
| Unit tests | ✅ | All passing |

Performance Notes

Data Loading:

First epoch: ~30s (image loading + preprocessing)
Cached epochs: ~10s (if using CachedDataset)
Batch size 4: ~2 samples/sec on CPU

Memory:

Single 4K image: ~25 MB
Resized to 896x896: ~2.4 MB
Batch of 4: ~10 MB
Full dataset in memory: ~1.5 GB (manageable)
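These memory figures can be reproduced with simple arithmetic, assuming uncompressed 8-bit RGB (3 bytes per pixel) and decimal megabytes:

```python
BYTES_PER_PIXEL = 3  # Assumption: uncompressed 8-bit RGB.


def image_megabytes(width, height):
    """In-memory size of one decoded image, in MB (10^6 bytes)."""
    return width * height * BYTES_PER_PIXEL / 1e6


full_4k = image_megabytes(3840, 2160)   # about 24.9 MB
resized = image_megabytes(896, 896)     # about 2.41 MB
batch_of_4 = 4 * resized                # about 9.6 MB
dataset_gb = 620 * resized / 1e3        # about 1.49 GB for all 620 resized images

print(f"{full_4k:.1f} MB, {resized:.2f} MB, {batch_of_4:.1f} MB, {dataset_gb:.2f} GB")
```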

Recommendations:

Use num_workers=4-8 on DGX for faster loading
Consider CachedDataset for repeated epochs
Batch size 4-8 optimal for GB10 GPU

Known Limitations

Single session: Currently only the Yaniv session (620 events)
- Future: Integrate Hugging Face dataset (3,167 tasks) in Phase 4
No double-click detection: All mouse events classified as MOUSE_CLICK
- Future: Implement temporal analysis to detect double-clicks
Limited augmentation: Only brightness/contrast
- Acceptable: Geometric transforms would break coordinate labels
CPU-based preprocessing: Images loaded on-the-fly
- Future: Pre-cache resized images for faster training
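For the double-click limitation, the planned temporal analysis could look something like the sketch below. Nothing here is implemented in the project: the `MOUSE_DOUBLE_CLICK` label, the function name, and both thresholds are hypothetical and would need tuning against real recordings.

```python
import math


def merge_double_clicks(clicks, max_gap_s=0.4, max_dist_px=5.0):
    """Merge pairs of nearby, rapid clicks into one double-click action.

    clicks: time-ordered list of (t_seconds, x_px, y_px) tuples.
    Returns a list of (label, t, x, y) tuples.
    """
    merged = []
    i = 0
    while i < len(clicks):
        t, x, y = clicks[i]
        if i + 1 < len(clicks):
            t2, x2, y2 = clicks[i + 1]
            close_in_time = (t2 - t) <= max_gap_s
            close_in_space = math.hypot(x2 - x, y2 - y) <= max_dist_px
            if close_in_time and close_in_space:
                merged.append(("MOUSE_DOUBLE_CLICK", t, x, y))
                i += 2  # Consume both clicks of the pair.
                continue
        merged.append(("MOUSE_CLICK", t, x, y))
        i += 1
    return merged


actions = merge_double_clicks([(0.0, 100, 100), (0.2, 101, 100), (5.0, 300, 300)])
print(actions)
```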

Conclusion

✅ Phase 1 Complete: The data pipeline is implemented and validated end to end on the 620-event Yaniv session, with the limitations listed above.

🎯 Ready for Phase 2: Model implementation and training on DGX.
