🚀 GUI-Actor-7B with Qwen2-VL-7B as backbone VLM
This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. It is built on Qwen2-VL-7B-Instruct, extended with an attention-based action head, and fine-tuned for GUI grounding on this dataset (coming soon).
For details on the model design and evaluation, please check: 🏠 Project page | 💻 GitHub repository | 📑 Paper
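For intuition, here is a minimal, hypothetical sketch of what coordinate-free grounding with an attention-based action head means: rather than decoding coordinates as text tokens, a dedicated pointer token attends over the image-patch features, and the click point is read off the attention distribution. All names below are illustrative, not the actual implementation; see the GitHub repository for the real code.

```python
import torch

def attention_pointing(pointer_state, patch_features, patch_centers):
    """Illustrative sketch: score image patches against a pointer-token
    state and return the attention-weighted 2D click point.

    pointer_state:  (d,)   hidden state of a dedicated pointer token
    patch_features: (n, d) visual features of the n image patches
    patch_centers:  (n, 2) normalized (x, y) center of each patch
    """
    # Attend over patches instead of generating coordinate tokens.
    scores = patch_features @ pointer_state / pointer_state.shape[0] ** 0.5
    weights = torch.softmax(scores, dim=0)  # (n,)
    point = weights @ patch_centers         # expected (x, y) in [0, 1]
    return point, weights

# Toy usage with random tensors.
point, _ = attention_pointing(torch.randn(16), torch.randn(9, 16), torch.rand(9, 2))
print(point)  # a normalized click location
```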
| Property | Details |
|----------|---------|
| Model Name | GUI-Actor-7B with Qwen2-VL-7B as backbone VLM |
| base_model | Qwen/Qwen2-VL-7B-Instruct |
| license | mit |
| library_name | transformers |
| pipeline_tag | image-text-to-text |
🔍 Performance Comparison on GUI Grounding Benchmarks
Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone VLM. † indicates scores obtained from our own evaluation of the official model released on Hugging Face.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|--------|--------------|----------------|------------|---------------|
| **72B models:** | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| **7B models:** | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| **2B models:** | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |
Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone VLM.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|--------|--------------|----------------|---------------|
| **7B models:** | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| **3B models:** | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
💻 Usage Example
Basic Usage
```python
import torch
from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference

# Load the processor, tokenizer, and model with the pointer (action) head.
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
).eval()

# Load a grounding example from the ScreenSpot benchmark.
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")

# Build the conversation: a system prompt plus the screenshot and instruction.
conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],
            },
            {
                "type": "text",
                "text": example["instruction"],
            },
        ],
    },
]

# Run inference; the model returns the top-k candidate click points.
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")
```
📚 Citation
```bibtex
@article{wu2025guiactor,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
  year={2025},
  eprint={2506.03143},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://www.arxiv.org/pdf/2506.03143},
}
```