GUI-Actor-7B-Qwen2-VL: open-source vision-language model for coordinate-free visual grounding in GUI agents

GUI Actor 7B Qwen2 VL

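Developed by Microsoft

GUI-Actor-7B is a vision-language model built on Qwen2-VL-7B-Instruct. It targets graphical user interface (GUI) agent tasks and provides a coordinate-free approach to visual grounding.

Multimodal Fusion

Transformers

License: MIT #GUI visual grounding #Coordinate-free interaction #Multimodal agent

Downloads: 207

Release date: 6/1/2025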

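Model Overview

The model adds an attention-based action head to the backbone and is fine-tuned for GUI grounding, which makes it suitable for automating GUI operations.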

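Model Features

Coordinate-free visual grounding

Uses a coordinate-free approach that predicts the location of a GUI action directly, simplifying the interaction flow.

Attention-based action head

A dedicated attention-based action head improves the model's ability to locate GUI elements.

Multiple model sizes

Available in versions from 2B to 7B parameters to fit different compute budgets.

Verifier enhancement

An optional dedicated verifier model can be added to further improve action accuracy.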
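
As a rough illustration of how an optional verifier could be combined with the grounding model, the sketch below reranks candidate click points with a scoring function. The names `select_point`, `score_fn`, and `toy_score`, as well as the threshold value, are hypothetical and not part of the GUI-Actor codebase. The real verifier is a separate model; here it is replaced by a toy scoring function so the example can run on its own.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # normalized (x, y) in [0, 1]


def select_point(
    candidates: List[Point],
    score_fn: Callable[[Point], float],
    threshold: float = 0.9,
) -> Point:
    """Pick a click point from ranked candidates using a verifier score.

    Candidates are assumed to be ordered by the grounding model's own
    confidence. The first candidate whose verifier score reaches
    `threshold` is accepted; otherwise the best-scoring one is returned.
    """
    if not candidates:
        raise ValueError("no candidate points given")
    best = candidates[0]
    best_score = float("-inf")
    for point in candidates:
        score = score_fn(point)
        if score >= threshold:
            return point
        if score > best_score:
            best, best_score = point, score
    return best


# Toy stand-in for a verifier (assumption): prefers points near a known target.
def toy_score(point: Point, target: Point = (0.97, 0.16)) -> float:
    dx, dy = point[0] - target[0], point[1] - target[1]
    return 1.0 - (dx * dx + dy * dy) ** 0.5


chosen = select_point([(0.50, 0.50), (0.97, 0.15), (0.10, 0.90)], toy_score)
print(chosen)
```

Accepting the first candidate that clears the threshold keeps the grounding model's ranking when the verifier agrees with it, and falls back to the verifier's best guess only when no candidate is convincing.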

Model Capabilities

GUI element recognition

On-screen action localization

Multimodal understanding (image + text)

Automated task execution

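Use Cases

Automated software testing

Automated UI testing

Automatically identifies and operates interface elements to run functional tests.

Reaches 40.7% accuracy on the ScreenSpot-Pro benchmark.

RPA (robotic process automation)

Business process automation

Uses visual understanding to carry out repetitive GUI tasks automatically.

Reaches 89.5% accuracy on the ScreenSpot-v2 benchmark.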

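🚀 GUI-Actor-7B with Qwen2-VL-7B as the backbone VLM

GUI-Actor-7B is a model for graphical user interface (GUI) agents. It is built on Qwen2-VL-7B-Instruct, extended with an attention-based action head, and fine-tuned for GUI grounding. The model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents and offers a coordinate-free approach to visual grounding for GUI agents.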
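
To make the coordinate-free idea more concrete, the sketch below shows one way attention scores over image patches could be turned into click points: normalize the scores with a softmax, then take the centers of the highest-weighted patches. This is only an illustration under simplifying assumptions (one score per patch, a regular grid). The function name `points_from_patch_scores` and the toy scores are hypothetical and do not reproduce the model's actual implementation.

```python
import math
from typing import List, Tuple


def points_from_patch_scores(
    scores: List[List[float]], topk: int = 1
) -> List[Tuple[float, float, float]]:
    """Return (x, y, weight) for the top-k patches of a score grid.

    `scores[r][c]` is the raw attention score of the patch in row r,
    column c. x and y are patch centers normalized to [0, 1].
    """
    rows, cols = len(scores), len(scores[0])
    flat = [s for row in scores for s in row]
    peak = max(flat)
    exps = [math.exp(s - peak) for s in flat]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]

    # Rank patch indices by softmax weight, highest first.
    ranked = sorted(range(len(flat)), key=lambda i: weights[i], reverse=True)
    points = []
    for i in ranked[:topk]:
        r, c = divmod(i, cols)
        points.append(((c + 0.5) / cols, (r + 0.5) / rows, weights[i]))
    return points


# Toy 2x4 grid (assumption): the patch in row 0, column 3 scores highest.
grid = [
    [0.1, 0.2, 0.3, 4.0],
    [0.0, 0.1, 0.2, 1.0],
]
top = points_from_patch_scores(grid, topk=2)
for x, y, w in top:
    print(f"x={x:.3f}, y={y:.3f}, weight={w:.3f}")
```

Because the output is a weight per patch rather than a generated coordinate string, several candidate regions can be read from a single pass by raising `topk`.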

Model Information

Property	Details
Base model	Qwen/Qwen2-VL-7B-Instruct
License	MIT
Library	transformers
Task type	Image-Text-to-Text

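Model Introduction

This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. It is built on Qwen2-VL-7B-Instruct by adding an attention-based action head and fine-tuning on the dataset here (coming soon) to perform GUI grounding.

For more details on model design and evaluation, see: 🏠 Project Page | 💻 GitHub Repo | 📑 Paper.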

Model Links

Model name	Hugging Face link
GUI-Actor-7B-Qwen2-VL	🤗 Hugging Face
GUI-Actor-2B-Qwen2-VL	🤗 Hugging Face
GUI-Actor-7B-Qwen2.5-VL	🤗 Hugging Face
GUI-Actor-3B-Qwen2.5-VL	🤗 Hugging Face
GUI-Actor-Verifier-2B	🤗 Hugging Face

Performance Comparison

GUI grounding benchmark results with Qwen2-VL as the backbone

Table 1 shows the main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.

Method	Backbone VLM	ScreenSpot-Pro	ScreenSpot	ScreenSpot-v2
*72B models:*
AGUVIS-72B	Qwen2-VL	-	89.2	-
UGround-V1-72B	Qwen2-VL	34.5	89.4	-
UI-TARS-72B	Qwen2-VL	38.1	88.4	90.3
*7B models:*
OS-Atlas-7B	Qwen2-VL	18.9	82.5	84.1
AGUVIS-7B	Qwen2-VL	22.9	84.4	86.0†
UGround-V1-7B	Qwen2-VL	31.1	86.3	87.6†
UI-TARS-7B	Qwen2-VL	35.7	89.5	91.6
GUI-Actor-7B	Qwen2-VL	40.7	88.3	89.5
GUI-Actor-7B + Verifier	Qwen2-VL	44.2	89.7	90.9
*2B models:*
UGround-V1-2B	Qwen2-VL	26.6	77.1	-
UI-TARS-2B	Qwen2-VL	27.7	82.3	84.7
GUI-Actor-2B	Qwen2-VL	36.7	86.5	88.6
GUI-Actor-2B + Verifier	Qwen2-VL	41.8	86.9	89.3

GUI grounding benchmark results with Qwen2.5-VL as the backbone

Table 2 shows the main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.

Method	Backbone VLM	ScreenSpot-Pro	ScreenSpot-v2
*7B models:*
Qwen2.5-VL-7B	Qwen2.5-VL	27.6	88.8
Jedi-7B	Qwen2.5-VL	39.5	91.7
GUI-Actor-7B	Qwen2.5-VL	44.6	92.1
GUI-Actor-7B + Verifier	Qwen2.5-VL	47.7	92.5
*3B models:*
Qwen2.5-VL-3B	Qwen2.5-VL	25.9	80.9
Jedi-3B	Qwen2.5-VL	36.1	88.6
GUI-Actor-3B	Qwen2.5-VL	42.2	91.0
GUI-Actor-3B + Verifier	Qwen2.5-VL	45.9	92.4
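
The effect of the verifier can be read directly off the two tables. The short script below recomputes the ScreenSpot-Pro gain for each GUI-Actor variant from the numbers reported above; the variable names are illustrative only.

```python
# ScreenSpot-Pro scores copied from the two tables above:
# (without verifier, with verifier)
screenspot_pro = {
    "GUI-Actor-7B (Qwen2-VL)": (40.7, 44.2),
    "GUI-Actor-2B (Qwen2-VL)": (36.7, 41.8),
    "GUI-Actor-7B (Qwen2.5-VL)": (44.6, 47.7),
    "GUI-Actor-3B (Qwen2.5-VL)": (42.2, 45.9),
}

# Round to one decimal place to avoid floating-point noise in the differences.
gains = {
    name: round(with_verifier - base, 1)
    for name, (base, with_verifier) in screenspot_pro.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.1f} points with the verifier")
```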

💻 Usage Example

Basic usage

import torch

from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference


# Load the model and its processor
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

# Prepare an example from the ScreenSpot test split
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 4) for i in example['bbox']]}")

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"], # PIL.Image.Image or a path string
                # "image_url": "https://xxxxx.png" or "https://xxxxx.jpg" or "file://xxxxx.png" or "data:image/png;base64,xxxxxxxx", which will be split on "base64,"
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]

# Run inference and take the top-ranked predicted point
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")

# >> Model response
# Instruction: close this window
# Ground-truth action region (x1, y1, x2, y2): [0.9479, 0.1444, 0.9938, 0.2074]
# Predicted click point: [0.9709, 0.1548]
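
A natural follow-up is to check whether the predicted point lands inside the ground-truth region and to convert it into pixel coordinates for an automation tool. The helpers below are a minimal sketch: `point_in_bbox` and `to_pixels` are hypothetical names, coordinates are assumed to be normalized to [0, 1] as in the output above, and the 1920x1080 screen size is an arbitrary example.

```python
from typing import Sequence, Tuple


def point_in_bbox(x: float, y: float, bbox: Sequence[float]) -> bool:
    """True if (x, y) lies inside bbox = (x1, y1, x2, y2), edges included."""
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def to_pixels(x: float, y: float, width: int, height: int) -> Tuple[int, int]:
    """Map a normalized point to integer pixel coordinates, clamped to the screen."""
    px = min(max(int(round(x * width)), 0), width - 1)
    py = min(max(int(round(y * height)), 0), height - 1)
    return px, py


# Values taken from the example output above.
bbox = [0.9479, 0.1444, 0.9938, 0.2074]
pred_x, pred_y = 0.9709, 0.1548

hit = point_in_bbox(pred_x, pred_y, bbox)
pixel = to_pixels(pred_x, pred_y, width=1920, height=1080)  # assumed screen size
print(f"hit: {hit}, pixel: {pixel}")
```

Clamping to `width - 1` and `height - 1` keeps a prediction of exactly 1.0 on the last valid pixel instead of one past the screen edge.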

Citation

If you use this model, please cite the following paper:

@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents}, 
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={2506.03143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://www.arxiv.org/pdf/2506.03143},
}