🚀 GUI-Actor-7B, with Qwen2-VL-7B as the Backbone VLM
GUI-Actor-7B is a model for graphical user interface (GUI) agents. Built on Qwen2-VL-7B-Instruct, it adds an attention-based action head and is fine-tuned to perform GUI grounding tasks, offering a coordinate-free solution for visual grounding in GUI agents. The model was introduced in the paper referenced below.
Model Information
| Attribute | Details |
|---|---|
| Base model | Qwen/Qwen2-VL-7B-Instruct |
| License | MIT |
| Library name | transformers |
| Task type | image-text-to-text |
Model Introduction
This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. It is built on Qwen2-VL-7B-Instruct, extended with an attention-based action head, and fine-tuned on the dataset here (coming soon) to perform GUI grounding tasks.
For more details on model design and evaluation, please check out: 🏠 Project Page | 💻 GitHub Repository | 📑 Paper.
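To make the attention-based action head concrete, the sketch below is a toy illustration of the underlying idea only, not GUI-Actor's actual implementation: an action token attends over visual patch tokens, and the attention distribution is read out as a normalized click point. All names and shapes here are assumptions for exposition.

```python
# Toy illustration of an attention-based action head (assumed names/shapes,
# not GUI-Actor's actual code): score image patches against an action query
# and read the attention distribution out as a normalized click point.
import torch
import torch.nn.functional as F

def attention_pointer(query: torch.Tensor,    # (hidden,) action-token state
                      patches: torch.Tensor,  # (num_patches, hidden) patch states
                      grid_w: int,
                      grid_h: int) -> tuple[float, float]:
    """Return a normalized (x, y) click point from patch attention weights."""
    scores = patches @ query                              # (num_patches,)
    attn = F.softmax(scores / query.shape[-1] ** 0.5, dim=0)

    # Attention-weighted average of patch centers, normalized to [0, 1].
    idx = torch.arange(patches.shape[0])
    xs = (idx % grid_w).float() + 0.5
    ys = (idx // grid_w).float() + 0.5
    return ((attn * xs).sum() / grid_w).item(), ((attn * ys).sum() / grid_h).item()

# Example with random states on a 28x28 patch grid:
x, y = attention_pointer(torch.randn(1024), torch.randn(28 * 28, 1024), 28, 28)
```

Because the prediction lives in attention space rather than in generated coordinate text, the same weights also yield multiple ranked candidate points, which is what the top-k output in the usage example below exposes.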
Model Links
Performance Comparison
GUI grounding benchmark results with Qwen2-VL as the backbone
Table 1 shows the main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores from our own evaluation of the official model on Hugging Face.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|---|---|---|---|---|
| *72B models:* | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| *7B models:* | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| *2B models:* | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |
GUI grounding benchmark results with Qwen2.5-VL as the backbone
Table 2 shows the main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|---|---|---|---|
| *7B models:* | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| *3B models:* | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
💻 Usage Example
Basic Usage
```python
import torch
from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference

# Load the processor, tokenizer, and model.
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
).eval()

# Load an example from the ScreenSpot benchmark.
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")

# Build the conversation: a system prompt plus the screenshot and instruction.
conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],
            },
            {
                "type": "text",
                "text": example["instruction"],
            },
        ],
    },
]

# Run inference; the model returns its top-k candidate click points.
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")
```
Citation
If you use this model, please cite the following paper:
```bibtex
@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={2506.03143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://www.arxiv.org/pdf/2506.03143},
}
```