GUI-Actor-7B-Qwen2-VL: Open-source Vision-Language Model for Coordinate-Free Visual Grounding in GUI Agents


Developed by Microsoft

GUI-Actor-7B is a vision-language model developed based on Qwen2-VL-7B-Instruct, focusing on graphical user interface (GUI) agent tasks and providing a coordinate-free visual grounding solution.

Multimodal Fusion

Transformers

Open Source License: MIT #GUI Visual Grounding #Coordinate-free Interaction #Multimodal Agent

Downloads 207

Release Time: 6/1/2025

Model Overview

The model adds an attention-based action head to the Qwen2-VL-7B-Instruct backbone and is fine-tuned for GUI grounding. It reaches 40.7% accuracy on ScreenSpot-Pro and 89.5% on ScreenSpot-v2, which makes it suitable for automated GUI operation.

Model Features

Coordinate-free Visual Grounding

Predicts the target location of a GUI action directly, without generating coordinates as text, which simplifies the interaction pipeline.

Attention-based Action Head

A dedicated attention-based action head improves the model's ability to locate GUI elements.

Multiple Model Sizes to Choose From

Model versions range from 2B to 7B parameters to suit different compute budgets.

Verifier Enhancement

Can optionally be paired with a dedicated verifier model (GUI-Actor-Verifier-2B) to further improve grounding accuracy.
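
To make the idea of attention-based, coordinate-free grounding more concrete, the following is a minimal sketch of how attention weights over a grid of image patches could be turned into candidate click points. It is an illustration only, not the model's actual action head: the function name `topk_patch_centers`, the row-major weight layout, and the toy grid are all assumptions made for this example.

```python
def topk_patch_centers(weights, grid_w, grid_h, k=3):
    """Return the k highest-weighted patch centers as normalized (x, y) points.

    `weights` is a flat, row-major list with grid_w * grid_h entries.
    This layout is an assumption for illustration, not the model's format.
    """
    if len(weights) != grid_w * grid_h:
        raise ValueError("weights must have grid_w * grid_h entries")

    # Rank patch indices from highest to lowest attention weight.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)

    points = []
    for idx in ranked[:k]:
        row, col = divmod(idx, grid_w)
        # The center of patch (row, col), normalized to [0, 1].
        points.append(((col + 0.5) / grid_w, (row + 0.5) / grid_h))
    return points


# Toy 4x2 grid: most attention falls on the top-right patch,
# e.g. where a window's close button might be.
toy_weights = [0.01, 0.02, 0.05, 0.60,
               0.02, 0.03, 0.07, 0.20]
print(topk_patch_centers(toy_weights, grid_w=4, grid_h=2, k=2))
# [(0.875, 0.25), (0.875, 0.75)]
```

Because the output is a ranked list of regions rather than a single generated coordinate string, several candidates are available at once, which is what allows a separate verifier to choose among them.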

Model Capabilities

GUI Element Recognition

On-Screen Action Grounding

Multimodal Understanding (Image + Text)

Automated Task Execution

Use Cases

Automated Software Testing

Automated UI Testing

Automatically identifies and operates software interface elements for functional testing.

Reaches 40.7% accuracy on the ScreenSpot-Pro benchmark.

Robotic Process Automation (RPA)

Business Process Automation

Completes repetitive GUI tasks automatically through visual understanding.

Reaches 89.5% accuracy on the ScreenSpot-v2 benchmark.

🚀 GUI-Actor-7B with Qwen2-VL-7B as backbone VLM

This model addresses GUI grounding for agents, using Qwen2-VL-7B as its backbone. It provides coordinate-free visual grounding in GUI scenarios, so that a GUI agent can locate the on-screen target of an instruction more accurately.

🚀 Quick Start

The following code example shows how to load the GUI-Actor-7B model and run inference on one example.

💻 Usage Examples

Basic Usage

import torch

from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference


# load model
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

# prepare example
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"ground-truth action region (x1, y1, x2, y2): {[round(i, 4) for i in example['bbox']]}")

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"], # PIL.Image.Image or str to path
                # "image_url": "https://xxxxx.png" or "https://xxxxx.jpg" or "file://xxxxx.png" or "data:image/png;base64,xxxxxxxx", will be split by "base64,"
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]

# inference
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")

# >> Model Response
# Instruction: close this window
# ground-truth action region (x1, y1, x2, y2): [0.9479, 0.1444, 0.9938, 0.2074]
# Predicted click point: [0.9709, 0.1548]
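
The predicted point and the ground-truth box above are both normalized to the range [0, 1]. As a small sketch of how such a prediction might be consumed, the snippet below checks whether the point falls inside the box (the usual correctness criterion on ScreenSpot-style grounding benchmarks) and scales it to pixel coordinates. The helper names `is_hit` and `to_pixels` and the 1920x1080 screen size are assumptions for illustration; they are not part of the `gui_actor` package.

```python
def is_hit(point, bbox):
    """Return True if a normalized (x, y) point lies inside (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def to_pixels(point, width, height):
    """Scale a normalized (x, y) point to integer pixel coordinates.

    Results are clamped so that x == 1.0 or y == 1.0 stays on screen.
    """
    x, y = point
    px = min(round(x * width), width - 1)
    py = min(round(y * height), height - 1)
    return px, py


# Values copied from the example output above.
pred_point = (0.9709, 0.1548)
gt_bbox = (0.9479, 0.1444, 0.9938, 0.2074)

print(is_hit(pred_point, gt_bbox))        # True
print(to_pixels(pred_point, 1920, 1080))  # (1864, 167), assuming a 1920x1080 screen
```

The pixel coordinates could then be passed to an automation tool such as pyautogui, using the actual size of the screenshot rather than the assumed one.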

✨ Features

Backbone Model: This model is developed based on Qwen2-VL-7B-Instruct.
Enhanced Architecture: The backbone is augmented with an attention-based action head and fine-tuned for GUI grounding.
Multiple Model Variants: Models are available in 2B, 3B, and 7B sizes to meet different needs.

📚 Documentation

This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. For more details on model design and evaluation, see the paper listed in the Citation section.

Model Links

Model Name	Hugging Face Link
GUI-Actor-7B-Qwen2-VL	Hugging Face
GUI-Actor-2B-Qwen2-VL	Hugging Face
GUI-Actor-7B-Qwen2.5-VL	Hugging Face
GUI-Actor-3B-Qwen2.5-VL	Hugging Face
GUI-Actor-Verifier-2B	Hugging Face

Performance Comparison on GUI Grounding Benchmarks

Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.

Method	Backbone VLM	ScreenSpot-Pro	ScreenSpot	ScreenSpot-v2
*72B models:*
AGUVIS-72B	Qwen2-VL	-	89.2	-
UGround-V1-72B	Qwen2-VL	34.5	89.4	-
UI-TARS-72B	Qwen2-VL	38.1	88.4	90.3
*7B models:*
OS-Atlas-7B	Qwen2-VL	18.9	82.5	84.1
AGUVIS-7B	Qwen2-VL	22.9	84.4	86.0†
UGround-V1-7B	Qwen2-VL	31.1	86.3	87.6†
UI-TARS-7B	Qwen2-VL	35.7	89.5	91.6
GUI-Actor-7B	Qwen2-VL	40.7	88.3	89.5
GUI-Actor-7B + Verifier	Qwen2-VL	44.2	89.7	90.9
*2B models:*
UGround-V1-2B	Qwen2-VL	26.6	77.1	-
UI-TARS-2B	Qwen2-VL	27.7	82.3	84.7
GUI-Actor-2B	Qwen2-VL	36.7	86.5	88.6
GUI-Actor-2B + Verifier	Qwen2-VL	41.8	86.9	89.3

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.

Method	Backbone VLM	ScreenSpot-Pro	ScreenSpot-v2
*7B models:*
Qwen2.5-VL-7B	Qwen2.5-VL	27.6	88.8
Jedi-7B	Qwen2.5-VL	39.5	91.7
GUI-Actor-7B	Qwen2.5-VL	44.6	92.1
GUI-Actor-7B + Verifier	Qwen2.5-VL	47.7	92.5
*3B models:*
Qwen2.5-VL-3B	Qwen2.5-VL	25.9	80.9
Jedi-3B	Qwen2.5-VL	36.1	88.6
GUI-Actor-3B	Qwen2.5-VL	42.2	91.0
GUI-Actor-3B + Verifier	Qwen2.5-VL	45.9	92.4
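
To summarize what the "+ Verifier" rows add, the short script below computes the accuracy gain from the verifier for each GUI-Actor model, using only the ScreenSpot-Pro and ScreenSpot-v2 numbers from Tables 1 and 2. The dictionary layout and the function name `verifier_gains` are choices made for this sketch.

```python
# (without verifier, with verifier) accuracy, copied from Tables 1 and 2.
SCORES = {
    ("Qwen2-VL", "7B"): {"ScreenSpot-Pro": (40.7, 44.2), "ScreenSpot-v2": (89.5, 90.9)},
    ("Qwen2-VL", "2B"): {"ScreenSpot-Pro": (36.7, 41.8), "ScreenSpot-v2": (88.6, 89.3)},
    ("Qwen2.5-VL", "7B"): {"ScreenSpot-Pro": (44.6, 47.7), "ScreenSpot-v2": (92.1, 92.5)},
    ("Qwen2.5-VL", "3B"): {"ScreenSpot-Pro": (42.2, 45.9), "ScreenSpot-v2": (91.0, 92.4)},
}


def verifier_gains(scores):
    """Return the gain in accuracy points from adding the verifier."""
    return {
        model: {
            bench: round(with_v - without_v, 1)
            for bench, (without_v, with_v) in benches.items()
        }
        for model, benches in scores.items()
    }


for (backbone, size), gains in verifier_gains(SCORES).items():
    print(f"GUI-Actor-{size} ({backbone}): "
          f"+{gains['ScreenSpot-Pro']} on ScreenSpot-Pro, "
          f"+{gains['ScreenSpot-v2']} on ScreenSpot-v2")
```

In these tables the gains on ScreenSpot-Pro (3.1 to 5.1 points) are larger than those on ScreenSpot-v2 (0.4 to 1.4 points), where the base models already score around 90%.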

📄 License

The model is released under the MIT license.

📦 Model Information

Property	Details
Model Type	GUI-Actor-7B with Qwen2-VL-7B as backbone VLM
Training Data	Not specified

📖 Citation

@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents}, 
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={2506.03143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://www.arxiv.org/pdf/2506.03143},
}
