🚀 GUI-Actor-2B with Qwen2-VL-2B as backbone VLM
This model addresses the challenge of GUI grounding for GUI agents. It builds on Qwen2-VL-2B and adds an attention-based action head, providing a more effective solution for GUI-related tasks.
🚀 Quick Start
This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents.
It is developed based on Qwen2-VL-2B-Instruct, augmented with an attention-based action head, and fine-tuned to perform GUI grounding using the dataset here (coming soon).
For more details on model design and evaluation, please check: 🏠 Project Page | 💻 GitHub Repo | 📑 Paper.
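The key idea behind the attention-based action head is that the model grounds actions without emitting coordinate tokens: the attention distribution of a dedicated action representation over the image-patch tokens is read out directly as a set of candidate click points. The sketch below illustrates only this readout step; the function name, tensor shapes, and patch-grid layout are illustrative assumptions, not the official implementation.

```python
# Illustrative sketch of attention-based point readout (NOT the official
# implementation). Given an attention distribution over a grid of image
# patches, the top-k patches are mapped to normalized click points at
# their patch centers.
import torch

def attention_to_points(attn_over_patches: torch.Tensor,
                        grid_h: int, grid_w: int, topk: int = 3):
    """attn_over_patches: (grid_h * grid_w,) attention weights of the
    action representation over image patches."""
    scores, flat_idx = attn_over_patches.topk(topk)
    rows = flat_idx // grid_w
    cols = flat_idx % grid_w
    # Map each patch index to its center in normalized [0, 1] coordinates.
    points = [((c.item() + 0.5) / grid_w, (r.item() + 0.5) / grid_h)
              for r, c in zip(rows, cols)]
    return points, scores

# Toy example: a 4x6 patch grid with all attention mass on one patch.
attn = torch.zeros(4 * 6)
attn[9] = 1.0  # row 1, col 3
print(attention_to_points(attn, grid_h=4, grid_w=6))  # top point ~ (0.583, 0.375)
```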
✨ Features
Model Information
| Property | Details |
|----------|---------|
| Base Model | Qwen/Qwen2-VL-2B-Instruct |
| License | MIT |
| Library Name | transformers |
| Pipeline Tag | image-text-to-text |
Model Variants
GUI-Actor is released at multiple scales: 2B and 7B models with Qwen2-VL backbones, and 3B and 7B models with Qwen2.5-VL backbones, each optionally paired with a verifier (see the benchmark tables below). This card covers GUI-Actor-2B.
Performance Comparison on GUI Grounding Benchmarks
Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Huggingface.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|--------|--------------|----------------|------------|---------------|
| *72B models:* | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| *7B models:* | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| *2B models:* | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |
Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.
| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|--------|--------------|----------------|---------------|
| *7B models:* | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| *3B models:* | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
💻 Usage Examples
Basic Usage
```python
import torch
from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference

# Load the processor, tokenizer, and model.
model_name_or_path = "microsoft/GUI-Actor-2B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
).eval()

# Load a grounding example from ScreenSpot.
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"Ground-truth action region (x1, y1, x2, y2): {[round(i, 2) for i in example['bbox']]}")

# Build the conversation: a system prompt plus the screenshot and instruction.
conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"],
            },
            {
                "type": "text",
                "text": example["instruction"],
            },
        ],
    },
]

# Run inference; the model returns the top-k candidate click points.
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")
```
📄 License
The model is released under the MIT license.
📚 Citation
```bibtex
@article{wu2025guiactor,
  title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents},
  author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
  year={2025},
  eprint={2506.03143},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://www.arxiv.org/pdf/2506.03143},
}
```