🚀 GUI-Actor-7B, with Qwen2-VL-7B as the Backbone VLM
GUI-Actor-7B is a model for graphical user interface (GUI) agents. Built on Qwen2-VL-7B-Instruct, it adds an attention-based action head and is fine-tuned to perform GUI grounding tasks, offering a coordinate-free solution for visual grounding in GUI agents. The model was introduced in the paper referenced below.
Model Information
| Attribute | Details |
|---|---|
| Base model | Qwen/Qwen2-VL-7B-Instruct |
| License | MIT |
| Library name | transformers |
| Task type | image-text-to-text |
Model Introduction
This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. It is built on Qwen2-VL-7B-Instruct, extended with an attention-based action head, and fine-tuned on the dataset here (coming soon) to perform GUI grounding tasks.
For more details on model design and evaluation, please check out: 🏠 Project Page | 💻 GitHub Repository | 📑 Paper.
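To make the attention-based action head concrete, the sketch below is a toy illustration of the underlying idea only, not GUI-Actor's actual implementation: an action token attends over visual patch tokens, and the attention distribution is read out as a normalized click point. All names and shapes here are assumptions for exposition.

```python
# Toy illustration of an attention-based action head (assumed names/shapes,
# not GUI-Actor's actual code): score image patches against an action query
# and read the attention distribution out as a normalized click point.
import torch
import torch.nn.functional as F

def attention_pointer(query: torch.Tensor,    # (hidden,) action-token state
                      patches: torch.Tensor,  # (num_patches, hidden) patch states
                      grid_w: int,
                      grid_h: int) -> tuple[float, float]:
    """Return a normalized (x, y) click point from patch attention weights."""
    scores = patches @ query                              # (num_patches,)
    attn = F.softmax(scores / query.shape[-1] ** 0.5, dim=0)

    # Attention-weighted average of patch centers, normalized to [0, 1].
    idx = torch.arange(patches.shape[0])
    xs = (idx % grid_w).float() + 0.5
    ys = (idx // grid_w).float() + 0.5
    return ((attn * xs).sum() / grid_w).item(), ((attn * ys).sum() / grid_h).item()

# Example with random states on a 28x28 patch grid:
x, y = attention_pointer(torch.randn(1024), torch.randn(28 * 28, 1024), 28, 28)
```

Because the prediction lives in attention space rather than in generated coordinate text, the same weights also yield multiple ranked candidate points, which is what the top-k output in the usage example below exposes.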
Model Links
Performance Comparison
GUI grounding benchmark results with Qwen2-VL as the backbone
Table 1 shows the main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores from our own evaluation of the official model on Hugging Face.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|---|---|---|---|---|
| *72B models:* | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| *7B models:* | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| *2B models:* | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |
GUI grounding benchmark results with Qwen2.5-VL as the backbone
Table 2 shows the main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|---|---|---|---|
| *7B models:* | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| *3B models:* | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
💻 Usage Example
Basic Usage
```python
import torch
from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference

# Load the processor, tokenizer, and model.
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
).eval()

# Load an example from the ScreenSpot benchmark.
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")

# Build the conversation: a system prompt plus the screenshot and instruction.
conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],
            },
            {
                "type": "text",
                "text": example["instruction"],
            },
        ],
    },
]

# Run inference; the model returns its top-k candidate click points.
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")
```
Citation
If you use this model, please cite the following paper:
```bibtex
@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={2506.03143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://www.arxiv.org/pdf/2506.03143},
}
```