# GTA1: State-of-the-Art GUI Grounding Models
This project releases state-of-the-art GUI grounding models trained with GRPO, a reinforcement learning method that directly incentivizes actionable, grounded responses.
## Quick Start
Reinforcement learning (RL) with GRPO aids grounding through its inherent objective alignment: it rewards successful clicks rather than promoting long textual Chain-of-Thought (CoT) reasoning. Unlike methods that rely heavily on verbose CoT, GRPO directly encourages actionable and grounded responses. Based on the findings from our blog, we present state-of-the-art GUI grounding models trained with GRPO.
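As a rough illustration of this objective alignment, the sketch below shows a hypothetical click reward of the kind GRPO can optimize: a rollout is rewarded only if its predicted point lands inside the ground-truth bounding box of the target element. The exact reward used in training is described in our blog; the function here is an illustrative assumption, not the released training code.

```python
from typing import Tuple

def click_reward(pred: Tuple[float, float], gt_box: Tuple[float, float, float, float]) -> float:
    """Hypothetical GRPO-style reward: 1.0 if the predicted point falls inside
    the ground-truth element box (x1, y1, x2, y2), else 0.0."""
    x, y = pred
    x1, y1, x2, y2 = gt_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

# A rollout that clicks inside the target element's box is rewarded; one that misses is not.
print(click_reward((540, 310), (500, 290, 600, 330)))  # 1.0
print(click_reward((120, 45), (500, 290, 600, 330)))   # 0.0
```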
## Features
- Objective Alignment: GRPO rewards successful clicks, directly incentivizing grounded responses.
- State-of-the-Art Performance: Achieves the best results among all open-source model families on challenging datasets.
## Usage Examples

### Basic Usage
```python
from PIL import Image
from qwen_vl_utils import process_vision_info, smart_resize
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
import re

SYSTEM_PROMPT = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.
Output the coordinate pair exactly:
(x,y)
'''
SYSTEM_PROMPT = SYSTEM_PROMPT.strip()

def extract_coordinates(raw_string):
    """Parse the first (x, y) pair from the model output; fall back to (0, 0)."""
    try:
        matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
        return [tuple(map(int, map(float, match))) for match in matches][0]
    except (ValueError, IndexError):
        return 0, 0

model_path = "HelloKKMe/GTA1-32B"
max_new_tokens = 32

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=3136,
    max_pixels=4096 * 2160,
)

image = Image.open("file path")  # path to your screenshot
instruction = "description"      # natural-language description of the target element
width, height = image.width, image.height

# Resize the screenshot to the resolution expected by the vision encoder.
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=processor.image_processor.patch_size * processor.image_processor.merge_size,
    min_pixels=processor.image_processor.min_pixels,
    max_pixels=processor.image_processor.max_pixels,
)
resized_image = image.resize((resized_width, resized_height))
scale_x, scale_y = width / resized_width, height / resized_height

system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT.format(height=resized_height, width=resized_width),
}
user_message = {
    "role": "user",
    "content": [
        {"type": "image", "image": resized_image},
        {"type": "text", "text": instruction},
    ],
}

# Build the model inputs and run greedy decoding.
image_inputs, video_inputs = process_vision_info([system_message, user_message])
text = processor.apply_chat_template([system_message, user_message], tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, use_cache=True)
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

# Map the predicted point back to the original image resolution.
pred_x, pred_y = extract_coordinates(output_text)
pred_x *= scale_x
pred_y *= scale_y
print(pred_x, pred_y)
```
Refer to our code for more details.
## Documentation

### Performance
We adhere to the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently outperforms all other open-source model families. The comparative results are listed below; a sketch of the grounding-accuracy metric these benchmarks use follows the notes.
| Model | Size | Open Source | ScreenSpot-V2 | ScreenSpotPro | OSWORLD-G |
|---|---|---|---|---|---|
| OpenAI CUA | – | ❌ | 87.9 | 23.4 | – |
| Claude 3.7 | – | ❌ | 87.6 | 27.7 | – |
| JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 |
| SE-GUI | 7B | ✅ | 90.3 | 47.0 | – |
| UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 |
| UI-TARS-1.5* | 7B | ✅ | 89.7* | 42.0* | 64.2* |
| UGround-v1-7B | 7B | ✅ | – | 31.1 | 36.4 |
| Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9* | 48.0 | 59.6* |
| UGround-v1-72B | 72B | ✅ | – | 34.5 | – |
| Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.00* | 53.3 | 62.2* |
| UI-TARS | 72B | ✅ | 90.3 | 38.1 | – |
| GTA1 (Ours) | 7B | ✅ | 92.4 (△ +2.7) | 50.1 (△ +8.1) | 67.7 (△ +3.5) |
| GTA1 (Ours) | 32B | ✅ | 93.2 (△ +1.3) | 53.6 (△ +5.6) | 61.9 (△ +2.3) |
| GTA1 (Ours) | 72B | ✅ | 94.8 (△ +0.8) | 58.4 (△ +5.1) | 66.7 (△ +4.5) |
## Important Note

- Model size is indicated in billions (B) of parameters.
- A dash (–) denotes results that are currently unavailable.
- An asterisk (*) denotes our evaluated result.
- UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct are used as our baseline models.
- △ indicates the performance improvement of our model over its baseline.
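For reference, the grounding benchmarks above score a prediction as correct when the predicted point falls inside the annotated bounding box of the target element, and accuracy is the fraction of correct predictions. The snippet below is a minimal sketch of that metric under this assumption (variable names are hypothetical), not the official evaluation code of any benchmark.

```python
def grounding_accuracy(predictions, annotations):
    """Fraction of examples whose predicted (x, y) point lies inside the
    annotated target box (x1, y1, x2, y2). Names are hypothetical."""
    correct = 0
    for (px, py), (x1, y1, x2, y2) in zip(predictions, annotations):
        if x1 <= px <= x2 and y1 <= py <= y2:
            correct += 1
    return correct / max(len(predictions), 1)

# Example with two predictions: one inside its box, one outside.
preds = [(540, 310), (10, 10)]
boxes = [(500, 290, 600, 330), (100, 100, 200, 200)]
print(grounding_accuracy(preds, boxes))  # 0.5
```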