106 changes: 98 additions & 8 deletions README.md
@@ -8,14 +8,29 @@


# 🩺 GUI-DR: GUI Domain-Randomization for generating diagnostic GUI grounding evaluation data
<p align="center">
<span style="display: inline-flex; align-items: center; justify-content: center; flex-wrap: wrap; gap: 6px;">
<a href="https://blog.fig.inc/gui-perturbed-a-domain-randomization-dataset-for-gui-grounding"><img src="media/technical-report-badge.svg" alt="Technical Report | Fig" height="20" style="display: block;"></a>
<a href="https://huggingface.co/datasets/figai/GUI-Perturbed"><img src="https://img.shields.io/badge/Data-HuggingFace-yellow?style=flat-square&logo=huggingface" alt="Hugging Face Data" style="display: block;"></a>
<a href="https://github.com/ManifoldRG/GUI-DR"><img src="https://img.shields.io/badge/GUI--DR-GitHub-blueviolet?style=flat-square&logo=github" alt="GUI-DR GitHub" style="display: block;"></a>
<a href="https://discord.gg/jxb5fXWf"><img src="https://img.shields.io/badge/Contribute-Discord-7289DA?style=flat-square&logo=discord" alt="Contribute on Discord" style="display: block;"></a>
</span>
<details align="center">
<summary>
<img src="https://img.shields.io/badge/Technical_Reports-Fig-007ec6?style=flat-square" alt="Technical Reports | Fig" height="20">
<a href="https://huggingface.co/datasets/figai/GUI-Perturbed"><img src="https://img.shields.io/badge/Data-HuggingFace-yellow?style=flat-square&logo=huggingface" alt="Hugging Face Data"></a>
<a href="https://github.com/ManifoldRG/GUI-DR"><img src="https://img.shields.io/badge/GUI--DR-GitHub-blueviolet?style=flat-square&logo=github" alt="GUI-DR GitHub"></a>
<a href="https://discord.gg/jxb5fXWf"><img src="https://img.shields.io/badge/Contribute-Discord-7289DA?style=flat-square&logo=discord" alt="Contribute on Discord"></a>
</summary>
<p>
<a href="https://blog.fig.inc/gui-perturbed-a-domain-randomization-dataset-for-gui-grounding"><img src="https://img.shields.io/badge/Dataset_&_Methodology-555?style=flat-square" alt="Dataset & Methodology"></a>
<a href="https://blog.fig.inc/measuring-brittleness-in-gui-grounding-models-using-gui-perturbed"><img src="https://img.shields.io/badge/Model_Robustness_Evaluation-555?style=flat-square" alt="Model Robustness Evaluation"></a>
<a href="https://blog.fig.inc/training-on-gui-perturbed-why-more-data-isnt-enough"><img src="https://img.shields.io/badge/Finetuning_Experiments-555?style=flat-square" alt="Finetuning Experiments"></a>
</p>
</details>

<details align="center">
<summary>
<img src="https://img.shields.io/badge/Result_Viewers-HuggingFace-yellow?style=flat-square&logo=huggingface" alt="Result Viewers" height="20">
</summary>
<p>
<a href="https://huggingface.co/spaces/figai/GUI-Perturbed-Baseline-Result-Viewer"><img src="https://img.shields.io/badge/Baseline_Result_Viewer-555?style=flat-square" alt="Baseline Result Viewer"></a>
<a href="https://huggingface.co/spaces/figai/GUI-Perturbed-Finetuned-Result-Viewer"><img src="https://img.shields.io/badge/Finetuned_Result_Viewer-555?style=flat-square" alt="Finetuned Result Viewer"></a>
</p>
</details>

### _GUI-DR is a part of a collaborative effort on Software Control Agents between Manifold Research and Fig_

@@ -246,7 +261,60 @@ Use **this repo** to reproduce or extend the data; use the **Hugging Face datase

## Evaluation

Download the [GUI-Perturbed](https://huggingface.co/datasets/figai/GUI-Perturbed) dataset to evaluate your models. An evaluation script will be released soon.
The evaluation script loads data directly from [GUI-Perturbed](https://huggingface.co/datasets/figai/GUI-Perturbed) on Hugging Face and runs inference against a model served through an OpenAI-compatible API (e.g., [vLLM](https://docs.vllm.ai/)).

### Prerequisites

**Serve your model** with vLLM (or any OpenAI-compatible endpoint):

```bash
# Example: serve a local model checkpoint with vLLM
vllm serve "/mnt/disks/eval-data/exp_2_checkpoint_1_epoch/" \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9
```
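
Before starting a long evaluation run, it can help to confirm that the endpoint is reachable and to see which model name it reports. The sketch below is not part of this repo: it assumes the server exposes the standard `/models` route of an OpenAI-compatible API (vLLM does), and the helper names `models_url` and `list_served_models` are hypothetical. It uses only the Python standard library.

```python
import json
import urllib.request


def models_url(api_url: str) -> str:
    """Return the model-listing URL for an OpenAI-compatible base URL."""
    return api_url.rstrip("/") + "/models"


def list_served_models(api_url: str, api_key: str = "EMPTY", timeout: float = 5.0) -> list[str]:
    """Ask the server which model IDs it serves (hypothetical helper, not in this repo)."""
    request = urllib.request.Request(
        models_url(api_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # OpenAI-compatible servers answer with {"object": "list", "data": [{"id": ...}, ...]}.
    return [entry["id"] for entry in payload.get("data", [])]
```

With the server from the example above running, `list_served_models("http://localhost:8000/v1")` should return the name to pass as `--model_name`; for a locally served checkpoint this is typically the path given to `vllm serve`.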

### Running evaluation

```bash
uv run scripts/gui_perturbed_evaluator.py \
  --output_dir data/predictions \
  --config_id uitars15_no_reasoning_direct_query \
  --dataset_variant original \
  --model_name /mnt/disks/eval-data/exp_2_checkpoint_1_epoch/
```

### Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--output_dir` | _(required)_ | Directory for prediction output files. |
| `--config_id` | _(required)_ | Preset configuration ID. Use `--list_presets` to see all options. |
| `--dataset_variant` | `None` (all) | Filter by variant: `original`, `style`, `precision`, `text_zoom`. |
| `--model_name` | _(from preset)_ | Override the model identifier (Hugging Face ID or local path) sent to the API; it should match the name the server reports. |
| `--api_url` | `http://localhost:8000/v1` | API endpoint (or set `VLLM_API_URL` env var). |
| `--api_key` | `EMPTY` | API key (or set `VLLM_API_KEY` env var). |
| `--temperature` | `0.0` | Sampling temperature. |
| `--max_tokens` | _(model-specific)_ | Max tokens for generation. |
| `--seed` | `None` | Random seed for reproducibility. |
| `--save_interval` | `10` | Save predictions to disk every N steps. |
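
Omitting `--dataset_variant` evaluates all variants in one run. If you would rather keep each variant's predictions apart, one option is to launch the evaluator once per variant. The sketch below only builds and prints the commands, using the flags from the table above. The per-variant output directory layout and the `build_command` helper are assumptions for illustration, not something the script requires.

```python
import shlex

EVALUATOR = "scripts/gui_perturbed_evaluator.py"
VARIANTS = ["original", "style", "precision", "text_zoom"]


def build_command(config_id: str, variant: str, output_root: str = "data/predictions") -> list[str]:
    """Build one evaluator invocation; the per-variant subdirectory is a hypothetical layout."""
    return [
        "uv", "run", EVALUATOR,
        "--output_dir", f"{output_root}/{variant}",
        "--config_id", config_id,
        "--dataset_variant", variant,
    ]


commands = [build_command("uitars15_no_reasoning_direct_query", v) for v in VARIANTS]
for command in commands:
    # Printed for review only; nothing is executed here.
    print(shlex.join(command))
```

To actually run them, each list can be passed to `subprocess.run(command, check=True)`.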

### Available presets

Presets are generated for all combinations of **model** × **reasoning** × **instruction type**:

- **Models:** `gta1` (GTA1-7B), `qwen25vl` (Qwen2.5-VL-7B), `uitars15` (UI-TARS-1.5-7B)
- **Reasoning:** `no_reasoning`, `reasoning`
- **Instruction type:** `direct_query`, `relational_query`

Example preset IDs: `gta1_no_reasoning_direct_query`, `qwen25vl_reasoning_relational_query`, `uitars15_no_reasoning_direct_query`.

List all presets:

```bash
uv run scripts/gui_perturbed_evaluator.py --list_presets
```
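
The naming scheme described above can be sketched as a Cartesian product of the three lists. This is an illustration of the pattern, assuming IDs are joined with underscores as in the example IDs; the output of `--list_presets` remains the authoritative list.

```python
from itertools import product

MODELS = ["gta1", "qwen25vl", "uitars15"]
REASONING = ["no_reasoning", "reasoning"]
INSTRUCTION_TYPES = ["direct_query", "relational_query"]

# One ID per (model, reasoning, instruction type) combination: 3 x 2 x 2 = 12.
preset_ids = ["_".join(parts) for parts in product(MODELS, REASONING, INSTRUCTION_TYPES)]

for preset_id in preset_ids:
    print(preset_id)
```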

---

@@ -272,6 +340,12 @@ We welcome contributions: new perturbation types, bug reports, and improvements.

---

## Acknowledgments

Our finetuning experiments were built on [Qwen-VL-Series-Finetune](https://github.com/2U1/Qwen-VL-Series-Finetune). We thank the authors for their open-source contributions.

---

## 📄 Citation

If you find GUI-Perturbed or this pipeline useful, please consider citing the dataset and technical report series.
@@ -300,4 +374,20 @@ If you find GUI-Perturbed or this pipeline useful, please consider citing the da
url = {https://blog.fig.inc/gui-perturbed-a-domain-randomization-dataset-for-gui-grounding},
note = {Part 1: Dataset \& methodology}
}

@online{measuring_gui_models_robustness_technical_report_2026,
title = {Measuring Brittleness in GUI Grounding Models using GUI-Perturbed},
author = {Wang, Yangyue and Mathur, Yash and Zhou, Tony and Nyachhyon, Jinu and Guruprasad, Pranav and Sikka, Harsh},
year = {2026},
url = {https://blog.fig.inc/measuring-brittleness-in-gui-grounding-models-using-gui-perturbed},
note = {Part 2: Baseline evaluation}
}

@online{training_on_gui_perturbed_technical_report_2026,
title = {Training on GUI-Perturbed: Why More Data Isn’t Enough},
author = {Wang, Yangyue and Mathur, Yash and Zhou, Tony and Nyachhyon, Jinu and Guruprasad, Pranav and Sikka, Harsh},
year = {2026},
url = {https://blog.fig.inc/training-on-gui-perturbed-why-more-data-isnt-enough},
note = {Part 3: Finetuning experiments}
}
```
Binary file removed media/gui-dr.png
3 changes: 3 additions & 0 deletions pyproject.toml
@@ -9,11 +9,14 @@ dependencies = [
"fastparquet>=2024.11.0",
"loguru>=0.7.2",
"matplotlib>=3.10.7",
"anthropic>=0.52.0",
"openai>=2.26.0",
"pandas>=2.3.3",
"pd>=0.0.4",
"pillow>=12.0.0",
"playwright>=1.55.0",
"streamlit>=1.54.0",
"python-dotenv>=1.2.2",
]

[project.urls]