The open-source GUI-Actor-2B-Qwen2-VL model performs graphical user interface (GUI) grounding with high accuracy.

GUI Actor 2B Qwen2 VL

Developed by Microsoft

GUI-Actor-2B is a vision-language model based on Qwen2-VL-2B, designed for graphical user interface (GUI) grounding. Fine-tuned with an added attention-based action head, it achieves strong results on several GUI grounding benchmarks.

Image-Text-to-Text

Transformers

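Open-source license: MIT #GUI grounding #vision-language model #attention action head

Downloads: 163

Release date: 6/1/2025

Model overview

This model is mainly used for GUI grounding: given a screenshot and an instruction, it predicts the position on screen where the action should be performed.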

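Model features

Built on the Qwen2-VL backbone

Based on the Qwen2-VL-2B vision-language model, from which it inherits its visual understanding capability.

Dedicated action head

Adds an attention-based action head optimized specifically for GUI grounding.

Strong results on multiple benchmarks

Outperforms the other 2B models listed in Table 1 on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 (36.7, 86.5, and 88.6 respectively).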

Model capabilities

GUI element grounding

Vision-language understanding

On-screen instruction understanding

Action point prediction

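Use cases

Automated testing

GUI element grounding

Automatically locates a specific element on screen from an instruction.

Reaches 36.7% accuracy on ScreenSpot-Pro.

Assistive tools

Operation assistance for people with disabilities

Helps visually impaired users operate graphical interfaces through voice commands.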

🚀 GUI-Actor-2B (with Qwen2-VL-2B as the backbone VLM)

This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. It is built on Qwen2-VL-2B-Instruct, extended with an attention-based action head, and fine-tuned for GUI grounding on this dataset (coming soon).

For details on the model design and evaluation, see the 🏠 project page | 💻 GitHub repository | 📑 paper.

Property	Details
Base model	Qwen/Qwen2-VL-2B-Instruct
License	MIT
Library name	transformers
Pipeline tag	image-text-to-text

Model name	Hugging Face link
GUI-Actor-7B-Qwen2-VL	🤗 Hugging Face
GUI-Actor-2B-Qwen2-VL	🤗 Hugging Face
GUI-Actor-7B-Qwen2.5-VL	🤗 Hugging Face
GUI-Actor-3B-Qwen2.5-VL	🤗 Hugging Face
GUI-Actor-Verifier-2B	🤗 Hugging Face

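✨ Key features

This model performs coordinate-free visual grounding for GUI agents. It uses Qwen2-VL as the backbone and adds an attention-based action head, which specializes it for GUI grounding tasks.

📚 Documentation

📊 Performance comparison on GUI grounding benchmarks

Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.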
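
To make the idea of an attention-based action head more concrete, here is a minimal sketch, not the model's actual implementation. It assumes a single query vector standing in for an action token and one embedding per image patch, scores them with scaled dot-product attention, and returns the center of the highest-weighted patch as a normalized point. The function name `attention_pointer` and all of its parameters are hypothetical.

```python
import math


def attention_pointer(action_query, patch_embeddings, grid_w, grid_h):
    """Pick the image patch that a query vector attends to most.

    Hypothetical illustration: returns the center of the winning patch as
    (x, y) in [0, 1], plus the attention weights over all patches.
    """
    if len(patch_embeddings) != grid_w * grid_h:
        raise ValueError("expected one embedding per patch in the grid")

    dim = len(action_query)
    # Scaled dot-product score between the query and each patch embedding.
    scores = [
        sum(q * k for q, k in zip(action_query, patch)) / math.sqrt(dim)
        for patch in patch_embeddings
    ]

    # Softmax over patches (subtract the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Patches are assumed to be stored row by row (row-major order).
    best = max(range(len(weights)), key=weights.__getitem__)
    row, col = divmod(best, grid_w)
    point = ((col + 0.5) / grid_w, (row + 0.5) / grid_h)
    return point, weights
```

Because the output is read off the attention distribution over patches, no coordinate string has to be generated as text, which is one way to read "coordinate-free" here. Ranking the patches by weight would also yield several candidate points rather than one.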

Method	Backbone VLM	ScreenSpot-Pro	ScreenSpot	ScreenSpot-v2
*72B models:*
AGUVIS-72B	Qwen2-VL	-	89.2	-
UGround-V1-72B	Qwen2-VL	34.5	89.4	-
UI-TARS-72B	Qwen2-VL	38.1	88.4	90.3
*7B models:*
OS-Atlas-7B	Qwen2-VL	18.9	82.5	84.1
AGUVIS-7B	Qwen2-VL	22.9	84.4	86.0†
UGround-V1-7B	Qwen2-VL	31.1	86.3	87.6†
UI-TARS-7B	Qwen2-VL	35.7	89.5	91.6
GUI-Actor-7B	Qwen2-VL	40.7	88.3	89.5
GUI-Actor-7B + Verifier	Qwen2-VL	44.2	89.7	90.9
*2B models:*
UGround-V1-2B	Qwen2-VL	26.6	77.1	-
UI-TARS-2B	Qwen2-VL	27.7	82.3	84.7
GUI-Actor-2B	Qwen2-VL	36.7	86.5	88.6
GUI-Actor-2B + Verifier	Qwen2-VL	41.8	86.9	89.3

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.

Method	Backbone VLM	ScreenSpot-Pro	ScreenSpot-v2
*7B models:*
Qwen2.5-VL-7B	Qwen2.5-VL	27.6	88.8
Jedi-7B	Qwen2.5-VL	39.5	91.7
GUI-Actor-7B	Qwen2.5-VL	44.6	92.1
GUI-Actor-7B + Verifier	Qwen2.5-VL	47.7	92.5
*3B models:*
Qwen2.5-VL-3B	Qwen2.5-VL	25.9	80.9
Jedi-3B	Qwen2.5-VL	36.1	88.6
GUI-Actor-3B	Qwen2.5-VL	42.2	91.0
GUI-Actor-3B + Verifier	Qwen2.5-VL	45.9	92.4
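
The source does not spell out how the benchmark scores are computed. A common convention for ScreenSpot-style grounding, assumed here, is that a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box, and the score is the percentage of correct predictions. The sketch below implements that convention; the helper names are hypothetical.

```python
def point_in_box(point, bbox):
    """Return True if (x, y) lies inside the box (x1, y1, x2, y2), edges included."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(points, bboxes):
    """Percentage of predicted points that land inside their ground-truth box."""
    if len(points) != len(bboxes):
        raise ValueError("points and bboxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, bboxes))
    return 100.0 * hits / len(points)


# The first pair is the sample from this page's usage example;
# the second pair is invented purely to show a miss.
predictions = [(0.9709, 0.1548), (0.5000, 0.5000)]
ground_truth = [(0.9479, 0.1444, 0.9938, 0.2074), (0.1000, 0.1000, 0.2000, 0.2000)]
print(grounding_accuracy(predictions, ground_truth))
```

Under this convention the point and the box only need to share a coordinate system, normalized or pixel.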

💻 Usage example

Basic usage

import torch

from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference


# Load the model
model_name_or_path = "microsoft/GUI-Actor-2B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

# Prepare an example
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 4) for i in example['bbox']]}")

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],  # a PIL.Image.Image or a path string
                # "image_url": "https://xxxxx.png" or "https://xxxxx.jpg" or "file://xxxxx.png" or "data:image/png;base64,xxxxxxxx" (split on "base64,")
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]

# Run inference
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")

# >> Model response
# Instruction: close this window
# Ground-truth action region (x1, y1, x2, y2): [0.9479, 0.1444, 0.9938, 0.2074]
# Predicted click point: [0.9709, 0.1548]
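
Judging by the sample output, the predicted click point appears to be normalized to the range [0, 1] relative to the screenshot. To act on it, the point has to be mapped to pixel coordinates. The sketch below shows one way to do that; `to_pixels` is a hypothetical helper, and the 1920x1080 screenshot size is an assumption chosen for illustration.

```python
def to_pixels(point, width, height):
    """Map a normalized (x, y) point in [0, 1] to integer pixel coordinates."""
    x, y = point
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("expected normalized coordinates in [0, 1]")
    # Clamp so that a coordinate of exactly 1.0 still falls on the last pixel.
    return min(int(x * width), width - 1), min(int(y * height), height - 1)


# Sample prediction from the output above, on an assumed 1920x1080 screenshot.
click_x, click_y = to_pixels((0.9709, 0.1548), 1920, 1080)
print(click_x, click_y)
```

If the screenshot covers the whole screen at native resolution, the resulting pair could be passed to something like `pyautogui.click(click_x, click_y)`. On scaled or multi-monitor setups an extra offset or scale factor would likely be needed.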

📄 License

This model is released under the MIT license.

📝 Citation

@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents}, 
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={2506.03143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://www.arxiv.org/pdf/2506.03143},
}