GUI-Actor-7B with Qwen2-VL-7B as backbone VLM
This model addresses GUI grounding for agents and is built on Qwen2-VL-7B. It provides efficient and accurate coordinate-free visual grounding in GUI scenarios, which improves how GUI agents interact with on-screen elements.
Quick Start
The code example below shows how to get started with the GUI-Actor-7B model.
Usage Examples
Basic Usage
import torch
from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference

# Load the processor, tokenizer, and model
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

# Take one example from the ScreenSpot test split
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")

# Build the conversation: system prompt, then screenshot and instruction
conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]

# Run inference and read the highest-ranked of the top-3 predicted points
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")
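The example above prints a predicted point and a ground-truth region but does not compare them. The sketch below shows one way to do that, assuming that both `topk_points` and `bbox` are normalized to the [0, 1] range (the rounding in the print statements suggests this, but it is not stated here). The helper names `to_pixels`, `point_in_bbox`, and `first_hit_rank` are hypothetical and are not part of the `gui_actor` package.

```python
from typing import Iterable, Optional, Sequence, Tuple

Point = Tuple[float, float]
BBox = Sequence[float]  # (x1, y1, x2, y2), assumed normalized to [0, 1]


def to_pixels(point: Point, width: int, height: int) -> Tuple[int, int]:
    """Scale a normalized (x, y) point to integer pixel coordinates."""
    x, y = point
    # Clamp so that a coordinate of exactly 1.0 still maps to the last pixel.
    px = min(max(int(x * width), 0), width - 1)
    py = min(max(int(y * height), 0), height - 1)
    return px, py


def point_in_bbox(point: Point, bbox: BBox) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def first_hit_rank(points: Iterable[Point], bbox: BBox) -> Optional[int]:
    """Return the 1-based rank of the first candidate inside bbox, or None."""
    for rank, point in enumerate(points, start=1):
        if point_in_bbox(point, bbox):
            return rank
    return None


if __name__ == "__main__":
    # Made-up candidates and box, standing in for pred["topk_points"]
    # and example["bbox"] from the example above.
    candidates = [(0.10, 0.10), (0.52, 0.31), (0.90, 0.90)]
    bbox = (0.45, 0.25, 0.60, 0.40)
    print(first_hit_rank(candidates, bbox))
    print(to_pixels(candidates[1], width=1920, height=1080))
```

With the real outputs, `first_hit_rank(pred["topk_points"], example["bbox"])` would report whether any of the top-k candidates falls inside the target region, and `to_pixels` would give a coordinate usable with a pixel-based click API.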
Features
- Backbone Model: This model is developed based on Qwen2-VL-7B-Instruct.
- Enhanced Architecture: The backbone is augmented with an attention-based action head and fine-tuned for GUI grounding.
- Multiple Model Variants: Several model sizes are available to meet different needs, including 2B and 7B variants with a Qwen2-VL backbone and 3B and 7B variants with a Qwen2.5-VL backbone.
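To give a rough intuition for an attention-based action head, the toy sketch below scores every image patch against a single query vector and reads the click location from the best-scoring patches, rather than generating coordinate text. It is an illustration of the general idea only, not GUI-Actor's actual head: the single-query form, the plain scaled dot-product scoring, and all function names are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patch_attention(query: Sequence[float],
                    patches: Sequence[Sequence[float]]) -> List[float]:
    """Scaled dot-product attention weights of one query over all patches."""
    scale = math.sqrt(len(query))
    scores = [sum(q * p for q, p in zip(query, patch)) / scale
              for patch in patches]
    return softmax(scores)


def topk_patch_centers(weights: Sequence[float], grid_w: int, grid_h: int,
                       k: int = 1) -> List[Tuple[float, float]]:
    """Return normalized (x, y) centers of the k highest-weighted patches.

    Patches are assumed to be laid out row by row on a grid_w x grid_h grid.
    """
    if len(weights) != grid_w * grid_h:
        raise ValueError("weights must contain grid_w * grid_h entries")
    ranked = sorted(range(len(weights)),
                    key=lambda i: weights[i], reverse=True)[:k]
    centers = []
    for idx in ranked:
        row, col = divmod(idx, grid_w)
        centers.append(((col + 0.5) / grid_w, (row + 0.5) / grid_h))
    return centers


if __name__ == "__main__":
    # A 2 x 2 grid of made-up patch features; the last patch matches the query.
    query = [1.0, 0.0]
    patches = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [3.0, 0.0]]
    weights = patch_attention(query, patches)
    print(topk_patch_centers(weights, grid_w=2, grid_h=2, k=1))
```

Because the output is a distribution over patches rather than a single number pair, several candidate locations can be read from one forward pass, which is consistent with the `topk` argument in the usage example.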
Documentation
This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents (arXiv:2506.03143). See the paper for details on model design and evaluation.
Model Links
Performance Comparison on GUI Grounding Benchmarks
Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|---|---|---|---|---|
| 72B models: | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| 7B models: | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| 2B models: | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |
Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|---|---|---|---|
| 7B models: | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| 3B models: | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
License
The model is released under the MIT license.
Model Information
| Property | Details |
|---|---|
| Model Type | GUI-Actor-7B with Qwen2-VL-7B as backbone VLM |
| Training Data | Not provided in the original document |
Citation
@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={2506.03143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://www.arxiv.org/pdf/2506.03143},
}