GTA1-72B: an open-source GUI grounding model that locates interface elements precisely, without lengthy reasoning

GTA1 72B

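Developed by HelloKKMe

GTA1 is a state-of-the-art GUI grounding model trained with reinforcement learning (GRPO). It locates interface elements precisely by directly rewarding actionable responses rather than lengthy reasoning.

Image-to-Text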

Transformers

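#RL-based GUI grounding #high-precision UI element recognition #multi-size screen support

Downloads: 163

Released: 6/9/2025

Model Overview

This model focuses on precisely locating graphical user interface (GUI) elements. It is optimized with reinforcement learning and performs well on several benchmarks.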

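Model Features

Reinforcement learning driven

Trained with reinforcement learning algorithms such as GRPO, which directly reward actionable responses rather than lengthy reasoning.

Goal alignment

Achieves precise grounding by rewarding successful clicks rather than relying on textual reasoning chains.

Multiple model sizes

Available in three parameter scales: 7B, 32B, and 72B.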

Model Capabilities

GUI element grounding

Vision-language understanding

Coordinate prediction

Multi-resolution support

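Use Cases

Automated testing

Automated clicking of UI elements

Locates interface elements precisely so that automated tests can simulate user actions on them.

Reaches 94.8% accuracy on the ScreenSpot-V2 dataset.

Assistive technology

Navigation aid for visually impaired users

Helps visually impaired users locate interface elements to interact with.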
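
The 94.8% figure quoted for automated testing is a grounding accuracy. This page does not define the metric. A common convention for ScreenSpot-style benchmarks, assumed here, is that a prediction counts as correct when the predicted point falls inside the target element's bounding box. The sketch below illustrates that convention only; the function names, the box layout, and the sample data are made up for illustration.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return hits / len(points)


# Illustrative data only: the first two points hit their boxes, the third misses.
sample_points = [(50, 20), (300, 410), (5, 5)]
sample_boxes = [(40, 10, 120, 30), (280, 400, 360, 430), (100, 100, 150, 150)]
print(grounding_accuracy(sample_points, sample_boxes))
```

The same hit test can serve as a pass/fail check in a UI test: if the predicted point is outside the element under test, the click would land on the wrong target.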

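🚀 A Reinforcement Learning Driven GUI Grounding Model

This project uses reinforcement learning (e.g., GRPO) to avoid the reliance of conventional approaches on lengthy textual reasoning, and instead directly rewards actionable, grounded responses. We share state-of-the-art GUI grounding models trained with GRPO that achieve strong results on several datasets.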

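🚀 Quick Start

To get started quickly, run inference with the code example in the Usage Examples section below. For more details, see our code repository.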

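✨ Key Features

Goal alignment: reinforcement learning (e.g., GRPO) is inherently aligned with the objective. It rewards successful clicks rather than encouraging long textual chain-of-thought (CoT) reasoning, which leads to better grounding.
Direct incentive: unlike approaches that rely heavily on detailed CoT reasoning, GRPO directly rewards actionable, grounded responses.
State-of-the-art models: building on the findings from our blog post, we share state-of-the-art GUI grounding models trained with GRPO.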
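
To make the contrast with chain-of-thought training concrete, here is a minimal sketch of a click-based reward and the group-relative advantage that GRPO derives from it. It is an illustration under assumptions, not the authors' training code: the binary hit-or-miss reward, the box layout, and the names `click_reward` and `group_relative_advantages` are all hypothetical. The part that follows GRPO itself is normalizing each sampled response's reward by the mean and standard deviation of its group.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def click_reward(point: Tuple[float, float], target: Box) -> float:
    """Assumed reward: 1.0 if the predicted click lands in the target box, else 0.0."""
    x, y = point
    left, top, right, bottom = target
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled click predictions (illustrative values).
target_box = (100, 200, 180, 240)
samples = [(120, 210), (400, 50), (175, 239), (90, 220)]
rewards = [click_reward(p, target_box) for p in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Under this reward, responses that hit the target receive a positive advantage and misses a negative one, so the update favors correct clicks regardless of how much text accompanies them.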

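📈 Performance

Following the standard evaluation protocols, we benchmark our models on three challenging datasets. Our method consistently achieves the best results among all open-source model families. The comparison is shown below: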

Model	Size	Open source	ScreenSpot-V2	ScreenSpot-Pro	OSWorld-G
OpenAI CUA	—	❌	87.9	23.4	—
Claude 3.7	—	❌	87.6	27.7	—
JEDI-7B	7B	✅	91.7	39.5	54.1
SE-GUI	7B	✅	90.3	47.0	—
UI-TARS	7B	✅	91.6	35.7	47.5
UI-TARS-1.5*	7B	✅	89.7*	42.0*	64.2*
UGround-v1-7B	7B	✅	—	31.1	36.4
Qwen2.5-VL-32B-Instruct	32B	✅	91.9*	48.0	59.6*
UGround-v1-72B	72B	✅	—	34.5	—
Qwen2.5-VL-72B-Instruct	72B	✅	94.00*	53.3	62.2*
UI-TARS	72B	✅	90.3	38.1	—
GTA1 (ours)	7B	✅	92.4 (∆ +2.7)	50.1 (∆ +8.1)	67.7 (∆ +3.5)
GTA1 (ours)	32B	✅	93.2 (∆ +1.3)	53.6 (∆ +5.6)	61.9 (∆ +2.3)
GTA1 (ours)	72B	✅	94.8 (∆ +0.8)	58.4 (∆ +5.1)	66.7 (∆ +4.5)

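⚠️ Notes

Model size is the number of parameters, given in billions (B).

A dash (—) marks a result that is currently unavailable.

An asterisk (*) marks a result from our own evaluation.

UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct serve as our baseline models.

∆ is the improvement of our model over its same-size baseline.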
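
As a worked check of the ∆ values, the snippet below recomputes each GTA1 gain as the GTA1 score minus the score of its same-size baseline, using the numbers from the table. The dictionary layout and variable names are illustrative.

```python
# Scores copied from the table, in the order
# (ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G).
baselines = {
    "7B": (89.7, 42.0, 64.2),   # UI-TARS-1.5 7B
    "32B": (91.9, 48.0, 59.6),  # Qwen2.5-VL-32B-Instruct
    "72B": (94.0, 53.3, 62.2),  # Qwen2.5-VL-72B-Instruct
}
gta1 = {
    "7B": (92.4, 50.1, 67.7),
    "32B": (93.2, 53.6, 61.9),
    "72B": (94.8, 58.4, 66.7),
}

# Round to one decimal to match the table and hide floating-point noise.
deltas = {
    size: tuple(round(ours - base, 1) for ours, base in zip(gta1[size], baselines[size]))
    for size in gta1
}
for size, delta in deltas.items():
    print(size, delta)
```

The recomputed gains agree with the ∆ values reported for each GTA1 row.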

💻 Usage Examples

Basic usage

The following snippet shows how to run inference with the trained model:

from PIL import Image
from qwen_vl_utils import process_vision_info, smart_resize
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
import re

SYSTEM_PROMPT = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.

Output the coordinate pair exactly:
(x,y)
'''
SYSTEM_PROMPT = SYSTEM_PROMPT.strip()

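# Extract the first "(x,y)" pair from the model output.
# Falls back to (0, 0) when no parseable pair is found.
def extract_coordinates(raw_string):
    try:
        matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
        # Parse as float: the pattern accepts decimals, which int() would reject
        return [tuple(map(float, match)) for match in matches][0]
    except (IndexError, ValueError, TypeError):
        return 0, 0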

# Load model and processor
model_path = "HelloKKMe/GTA1-72B"
max_new_tokens = 32

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=3136,
    max_pixels=4096 * 2160
)

# Load and resize image
image = Image.open("file path")  # Placeholder: path to your screenshot
instruction = "description"  # Placeholder: description of the element to locate
width, height = image.width, image.height

resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=processor.image_processor.patch_size * processor.image_processor.merge_size,
    min_pixels=processor.image_processor.min_pixels,
    max_pixels=processor.image_processor.max_pixels,
)
resized_image = image.resize((resized_width, resized_height))
scale_x, scale_y = width / resized_width, height / resized_height

# Prepare system and user messages
system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT.format(height=resized_height, width=resized_width)
}

user_message = {
    "role": "user",
    "content": [
        {"type": "image", "image": resized_image},
        {"type": "text", "text": instruction}
    ]
}

# Tokenize and prepare inputs
image_inputs, video_inputs = process_vision_info([system_message, user_message])
text = processor.apply_chat_template([system_message, user_message], tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)

# Generate prediction
output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, temperature=1.0, use_cache=True)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

# Extract and rescale coordinates
pred_x, pred_y = extract_coordinates(output_text)
pred_x *= scale_x
pred_y *= scale_y
print(pred_x, pred_y)
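
The last step of the example maps the predicted point from the resized image back to the original screenshot. The sketch below isolates that arithmetic so it can be checked without loading the model. The helper name `to_original_pixels`, the rounding, and the clamping to image bounds are additions for illustration and are not part of the original snippet; the sizes in the example call are hypothetical.

```python
from typing import Tuple


def to_original_pixels(
    pred: Tuple[float, float],
    resized_size: Tuple[int, int],
    original_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map a point from resized-image coordinates to original-image pixels.

    Sizes are (width, height). The result is rounded and clamped so that it
    always refers to a valid pixel of the original image.
    """
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    scale_x = original_w / resized_w
    scale_y = original_h / resized_h
    x = round(pred[0] * scale_x)
    y = round(pred[1] * scale_y)
    x = min(max(x, 0), original_w - 1)
    y = min(max(y, 0), original_h - 1)
    return x, y


# Hypothetical sizes: a 3840x2160 screenshot shown to the model at 1932x1092.
print(to_original_pixels((966.0, 546.0), (1932, 1092), (3840, 2160)))
```

Clamping is a defensive choice for downstream click automation: a prediction slightly outside the frame is pulled back to the nearest edge pixel instead of producing an invalid coordinate.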