We evaluate the Gelato-30B-A3B model on the ScreenSpot-Pro and OS-World-G benchmarks. The benchmark data is stored at data. Download the corresponding images from the official ScreenSpot-Pro and OS-World-G dataset pages.
Note on OS-World-G evaluation: The benchmark includes 54 refusal samples that can be included via --include-refusal. Gelato-30B-A3B supports refusal behavior through the gelato-refusal prompt mode. Official accuracy is calculated over all 564 samples (510 standard + 54 refusal).
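To make the sample accounting concrete, here is a minimal sketch of how accuracy over the combined set could be computed. The function name and the idea that the two subsets are counted separately are assumptions for illustration; the evaluation script may aggregate differently.

```python
def overall_accuracy(standard_correct: int, refusal_correct: int,
                     n_standard: int = 510, n_refusal: int = 54) -> float:
    """Accuracy in percent over standard + refusal samples (hypothetical helper)."""
    if not 0 <= standard_correct <= n_standard:
        raise ValueError("standard_correct out of range")
    if not 0 <= refusal_correct <= n_refusal:
        raise ValueError("refusal_correct out of range")
    total = n_standard + n_refusal  # 564 for OS-World-G
    return 100.0 * (standard_correct + refusal_correct) / total


# Upper bound when every refusal sample is counted as wrong:
print(round(overall_accuracy(510, 0), 2))  # 90.43
```

Under this accounting, a run that gets no refusal samples right cannot exceed 510/564 of the official score, which is why the refusal prompt mode matters for the official number.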
The following Python packages are required to run the evaluation code:
transformers==4.57.0
torch==2.8.0
pillow==11.3.0
qwen-vl-utils==0.0.14
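Before launching a long run, it may help to confirm that the installed versions match the pins above. The following is a small sketch using the standard library's `importlib.metadata`; the `check_versions` helper is hypothetical and not part of the evaluation code.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the requirements listed above.
REQUIRED = {
    "transformers": "4.57.0",
    "torch": "2.8.0",
    "pillow": "11.3.0",
    "qwen-vl-utils": "0.0.14",
}


def check_versions(required=REQUIRED, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in required.items():
        try:
            # Drop local build tags such as "+cu128" that torch wheels may carry.
            installed = get_version(name).split("+")[0]
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions().items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means all four packages match the pinned versions.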
Weights: mlfoundations/Gelato-30B-A3B
Eval Logs: ScreenSpot-Pro, OS-World-G and OS-World-G (Refined)
We run our evals on four 40 GB A100 GPUs with frames capped at 10 megapixels. The model is sharded across the 4 GPUs using Hugging Face's model parallelism, and we use a batch size of 1.
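The pixel cap can be pictured as an aspect-preserving downscale applied to any frame that exceeds the budget. The sketch below only illustrates that idea: `fit_to_pixel_budget` is a hypothetical helper, not the evaluation code, and the actual image processor may additionally snap dimensions to patch-size multiples.

```python
import math


def fit_to_pixel_budget(width: int, height: int,
                        max_pixels: int = 10_000_000) -> tuple[int, int]:
    """Return (new_width, new_height) with new_width * new_height <= max_pixels.

    Frames already within the budget are returned unchanged; larger frames
    are scaled down uniformly so the aspect ratio is approximately preserved.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    # Flooring keeps the product at or below the budget.
    new_w = max(1, math.floor(width * scale))
    new_h = max(1, math.floor(height * scale))
    return new_w, new_h


print(fit_to_pixel_budget(1920, 1080))  # (1920, 1080), already under the cap
print(fit_to_pixel_budget(7680, 4320))  # (4216, 2371)
```

If a frame is resized this way, any predicted click coordinates must be interpreted in the same coordinate space as the ground-truth boxes.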
Run the following command to evaluate the Gelato-30B-A3B model on the ScreenSpot-Pro benchmark.
python evaluation/run_grounding_eval.py \
--json-file eval-grounding-data/screenspot-pro-eval.json \
--images-dir path/to/screenspot-pro-images \
--model path/to/Gelato-30B-A3B \
--model-type qwen3 \
--distributed-mode sharded \
--batch-size 1 \
--max-pixels 10000000 \
--prompt-mode gelato \
--output-dir repro-results
Run one of the following commands to evaluate the Gelato-30B-A3B model on the OS-World-G benchmark, without or with the refusal samples.
# Without refusal
python evaluation/run_grounding_eval.py \
--json-file eval-grounding-data/osworld-g-eval.json \
--images-dir path/to/osworld-g-images \
--model path/to/Gelato-30B-A3B \
--model-type qwen3 \
--distributed-mode sharded \
--batch-size 1 \
--max-pixels 10000000 \
--prompt-mode gelato \
--output-dir repro-results
# With refusal
python evaluation/run_grounding_eval.py \
--json-file eval-grounding-data/osworld-g-eval.json \
--images-dir path/to/osworld-g-images \
--model path/to/Gelato-30B-A3B \
--model-type qwen3 \
--distributed-mode sharded \
--batch-size 1 \
--max-pixels 10000000 \
--prompt-mode gelato-refusal \
--include-refusal \
--output-dir repro-results
Weights: mlfoundations/Gelato-UI-TARS-1.5-7B
Eval Logs: ScreenSpot-Pro, OS-World-G and OS-World-G (Refined)
We run our evals on four 40 GB A100 GPUs with frames capped at 4 megapixels. We run the evaluation data-parallel across the 4 GPUs with a batch size of 1 per GPU, for a global batch size of 4.
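As a rough illustration of how a global batch size of 4 arises from 4 GPUs, the sketch below multiplies the GPU count by a per-GPU batch size and splits sample indices round-robin across ranks. Both helpers are hypothetical, and the round-robin split is an assumption; the evaluation script may shard samples differently.

```python
def global_batch_size(num_gpus: int, per_gpu_batch_size: int) -> int:
    """Samples processed per step across all data-parallel replicas."""
    return num_gpus * per_gpu_batch_size


def shard_indices(num_samples: int, num_gpus: int) -> list[list[int]]:
    """Assign sample indices to ranks round-robin (assumed strategy)."""
    return [list(range(rank, num_samples, num_gpus)) for rank in range(num_gpus)]


print(global_batch_size(4, 1))  # 4
print(shard_indices(10, 4))     # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Each replica holds a full copy of the 7B model, which is what distinguishes this setup from the sharded mode used for the 30B model.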
Run the following command to evaluate the Gelato-UI-TARS-1.5-7B model on the ScreenSpot-Pro benchmark.
python evaluation/run_grounding_eval.py \
--json-file eval-grounding-data/screenspot-pro-eval.json \
--images-dir path/to/screenspot-pro-images \
--model path/to/Gelato-UI-TARS-1.5-7B \
--model-type qwen2.5 \
--distributed-mode dp \
--num-gpus 4 \
--batch-size 1 \
--max-pixels 4000000 \
--pixel-space-output \
--prompt-mode gelato \
--output-dir repro-results
Run the following command to evaluate the Gelato-UI-TARS-1.5-7B model on the OS-World-G benchmark.
python evaluation/run_grounding_eval.py \
--json-file eval-grounding-data/osworld-g-eval.json \
--images-dir path/to/osworld-g-images \
--model path/to/Gelato-UI-TARS-1.5-7B \
--model-type qwen2.5 \
--distributed-mode dp \
--num-gpus 4 \
--batch-size 1 \
--max-pixels 4000000 \
--pixel-space-output \
--prompt-mode gelato \
--output-dir repro-results
We run our evaluation code on GTA1-7B-2507 as a baseline model to verify the correctness of our evaluation pipeline.
Run the following command to evaluate the GTA1-7B-2507 model on the ScreenSpot-Pro benchmark.
python evaluation/run_grounding_eval.py \
--json-file eval-grounding-data/screenspot-pro-eval.json \
--images-dir path/to/screenspot-pro-images \
--model path/to/GTA1-7B-2507 \
--model-type qwen2.5 \
--distributed-mode dp \
--num-gpus 4 \
--batch-size 1 \
--max-pixels 4000000 \
--pixel-space-output \
--prompt-mode gta1 \
--output-dir repro-results
Run the following command to evaluate the GTA1-7B-2507 model on the OS-World-G benchmark.
python evaluation/run_grounding_eval.py \
--json-file eval-grounding-data/osworld-g-eval.json \
--images-dir path/to/osworld-g-images \
--model path/to/GTA1-7B-2507 \
--model-type qwen2.5 \
--distributed-mode dp \
--num-gpus 4 \
--batch-size 1 \
--max-pixels 4000000 \
--pixel-space-output \
--prompt-mode gta1 \
--output-dir repro-results
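Since the commands above differ only in a handful of flags, they can be assembled programmatically. The following is a sketch that rebuilds the same argument lists from two small tables; `build_command` and the table names are hypothetical, every flag is copied from the commands above, and the `path/to/...` placeholders are left as-is.

```python
import shlex

SCRIPT = "evaluation/run_grounding_eval.py"

# benchmark -> (annotation JSON, image directory placeholder)
BENCHMARKS = {
    "screenspot-pro": ("eval-grounding-data/screenspot-pro-eval.json",
                       "path/to/screenspot-pro-images"),
    "osworld-g": ("eval-grounding-data/osworld-g-eval.json",
                  "path/to/osworld-g-images"),
}

# 7B models share the data-parallel settings; only the prompt mode differs.
_DP_7B = ["--model-type", "qwen2.5", "--distributed-mode", "dp",
          "--num-gpus", "4", "--batch-size", "1",
          "--max-pixels", "4000000", "--pixel-space-output"]

MODELS = {
    "Gelato-30B-A3B": ["--model-type", "qwen3", "--distributed-mode", "sharded",
                       "--batch-size", "1", "--max-pixels", "10000000",
                       "--prompt-mode", "gelato"],
    "Gelato-UI-TARS-1.5-7B": _DP_7B + ["--prompt-mode", "gelato"],
    "GTA1-7B-2507": _DP_7B + ["--prompt-mode", "gta1"],
}


def build_command(model: str, benchmark: str, include_refusal: bool = False,
                  output_dir: str = "repro-results") -> list[str]:
    """Return the argv list for one model/benchmark pair."""
    json_file, images_dir = BENCHMARKS[benchmark]
    flags = list(MODELS[model])
    if include_refusal:
        # Only the Gelato-30B-A3B OS-World-G run is shown with refusal above.
        if model != "Gelato-30B-A3B" or benchmark != "osworld-g":
            raise ValueError("refusal run is only defined for Gelato-30B-A3B on osworld-g")
        flags[flags.index("--prompt-mode") + 1] = "gelato-refusal"
        flags.append("--include-refusal")
    return ["python", SCRIPT, "--json-file", json_file, "--images-dir", images_dir,
            "--model", f"path/to/{model}", *flags, "--output-dir", output_dir]


if __name__ == "__main__":
    for model in MODELS:
        for benchmark in BENCHMARKS:
            print(shlex.join(build_command(model, benchmark)))
```

The printed lines can be pasted into a shell or passed to `subprocess.run` once the placeholder paths are replaced.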
ScreenSpot-Pro:

| Model | Accuracy | Logs |
|---|---|---|
| GTA1-7B-2507 | 50.1% | Official (reported) |
| GTA1-7B-2507 (reproduced) | 49.72% | View logs |

OS-World-G:

| Model | Accuracy | Logs |
|---|---|---|
| GTA1-7B-2507 | 55.1% | Official (reported) |
| GTA1-7B-2507 (reproduced) | 56.02% | View logs |

OS-World-G (Refined):

| Model | Accuracy | Logs |
|---|---|---|
| GTA1-7B-2507 | 67.7% | Official (reported) |
| GTA1-7B-2507 (reproduced) | 66.31% | View logs |
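To read the three tables at a glance, the gap between reproduced and reported accuracy can be computed directly. The sketch below hard-codes the numbers from the tables above in order; the variable and function names are illustrative only.

```python
# (reported %, reproduced %) for GTA1-7B-2507, one pair per table, in order.
RESULTS = [
    (50.1, 49.72),
    (55.1, 56.02),
    (67.7, 66.31),
]


def reproduction_gaps(results):
    """Reproduced minus reported accuracy, in percentage points."""
    return [round(reproduced - reported, 2) for reported, reproduced in results]


gaps = reproduction_gaps(RESULTS)
print(gaps)                       # [-0.38, 0.92, -1.39]
print(max(abs(g) for g in gaps))  # 1.39
```

All three reproduced numbers fall within about 1.4 percentage points of the reported values, with the sign of the gap varying across tables.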