🚀 GUI-Actor-7B with Qwen2-VL-7B as backbone VLM
This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. It is built on Qwen2-VL-7B-Instruct, extended with an attention-based action head, and fine-tuned for GUI grounding on this dataset (coming soon).
For details on the model design and evaluation, please check: 🏠 Project page | 💻 GitHub repository | 📑 Paper
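For intuition, here is a minimal, hypothetical sketch of what coordinate-free grounding with an attention-based action head means: rather than decoding coordinates as text tokens, a dedicated pointer token attends over the image-patch features, and the click point is read off the attention distribution. All names below are illustrative, not the actual implementation; see the GitHub repository for the real code.

```python
import torch

def attention_pointing(pointer_state, patch_features, patch_centers):
    """Illustrative sketch: score image patches against a pointer-token
    state and return the attention-weighted 2D click point.

    pointer_state:  (d,)   hidden state of a dedicated pointer token
    patch_features: (n, d) visual features of the n image patches
    patch_centers:  (n, 2) normalized (x, y) center of each patch
    """
    # Attend over patches instead of generating coordinate tokens.
    scores = patch_features @ pointer_state / pointer_state.shape[0] ** 0.5
    weights = torch.softmax(scores, dim=0)  # (n,)
    point = weights @ patch_centers         # expected (x, y) in [0, 1]
    return point, weights

# Toy usage with random tensors.
point, _ = attention_pointing(torch.randn(16), torch.randn(9, 16), torch.rand(9, 2))
print(point)  # a normalized click location
```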
| Property | Details |
|----------|---------|
| Model Name | GUI-Actor-7B with Qwen2-VL-7B as backbone VLM |
| base_model | Qwen/Qwen2-VL-7B-Instruct |
| license | mit |
| library_name | transformers |
| pipeline_tag | image-text-to-text |
🔍 Performance Comparison on GUI Grounding Benchmarks
Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone VLM. † indicates scores obtained from our own evaluation of the official model released on Hugging Face.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|--------|--------------|----------------|------------|---------------|
| **72B models:** | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| **7B models:** | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| **2B models:** | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |
Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone VLM.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|--------|--------------|----------------|---------------|
| **7B models:** | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| **3B models:** | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
💻 Usage Example
Basic Usage
```python
import torch
from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference

# Load the processor, tokenizer, and model with the pointer (action) head.
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
).eval()

# Load a grounding example from the ScreenSpot benchmark.
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")

# Build the conversation: a system prompt plus the screenshot and instruction.
conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],
            },
            {
                "type": "text",
                "text": example["instruction"],
            },
        ],
    },
]

# Run inference; the model returns the top-k candidate click points.
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")
```
📚 Citation
```bibtex
@article{wu2025guiactor,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
  year={2025},
  eprint={2506.03143},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://www.arxiv.org/pdf/2506.03143},
}
```