# GTA1: State-of-the-Art GUI Grounding Models
This project releases state-of-the-art GUI grounding models trained with GRPO, a reinforcement learning method that directly incentivizes actionable, grounded responses.
## Quick Start
Reinforcement learning (RL) with GRPO aids grounding through its inherent objective alignment: it rewards successful clicks rather than promoting long textual Chain-of-Thought (CoT) reasoning. Unlike methods that rely heavily on verbose CoT, GRPO directly encourages actionable and grounded responses. Based on the findings from our blog, we present state-of-the-art GUI grounding models trained with GRPO.
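As a rough illustration of this objective alignment, the sketch below shows a hypothetical click reward of the kind GRPO can optimize: a rollout is rewarded only if its predicted point lands inside the ground-truth bounding box of the target element. The exact reward used in training is described in our blog; the function here is an illustrative assumption, not the released training code.

```python
from typing import Tuple

def click_reward(pred: Tuple[float, float], gt_box: Tuple[float, float, float, float]) -> float:
    """Hypothetical GRPO-style reward: 1.0 if the predicted point falls inside
    the ground-truth element box (x1, y1, x2, y2), else 0.0."""
    x, y = pred
    x1, y1, x2, y2 = gt_box
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0

# A rollout that clicks inside the target element's box is rewarded; one that misses is not.
print(click_reward((540, 310), (500, 290, 600, 330)))  # 1.0
print(click_reward((120, 45), (500, 290, 600, 330)))   # 0.0
```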
## Features
- Objective Alignment: GRPO rewards successful clicks, directly incentivizing grounded responses.
- State-of-the-Art Performance: Achieves the best results among all open-source model families on challenging datasets.
## Usage Examples

### Basic Usage
```python
from PIL import Image
from qwen_vl_utils import process_vision_info, smart_resize
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
import re

SYSTEM_PROMPT = '''
You are an expert UI element locator. Given a GUI image and a user's element description, provide the coordinates of the specified element as a single (x,y) point. The image resolution is height {height} and width {width}. For elements with area, return the center point.
Output the coordinate pair exactly:
(x,y)
'''
SYSTEM_PROMPT = SYSTEM_PROMPT.strip()

def extract_coordinates(raw_string):
    """Parse the first (x, y) pair from the model output; fall back to (0, 0)."""
    try:
        matches = re.findall(r"\((-?\d*\.?\d+),\s*(-?\d*\.?\d+)\)", raw_string)
        return [tuple(map(int, map(float, match))) for match in matches][0]
    except (ValueError, IndexError):
        return 0, 0

model_path = "HelloKKMe/GTA1-32B"
max_new_tokens = 32

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=3136,
    max_pixels=4096 * 2160,
)

image = Image.open("file path")  # path to your screenshot
instruction = "description"      # natural-language description of the target element
width, height = image.width, image.height

# Resize the screenshot to the resolution expected by the vision encoder.
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=processor.image_processor.patch_size * processor.image_processor.merge_size,
    min_pixels=processor.image_processor.min_pixels,
    max_pixels=processor.image_processor.max_pixels,
)
resized_image = image.resize((resized_width, resized_height))
scale_x, scale_y = width / resized_width, height / resized_height

system_message = {
    "role": "system",
    "content": SYSTEM_PROMPT.format(height=resized_height, width=resized_width),
}
user_message = {
    "role": "user",
    "content": [
        {"type": "image", "image": resized_image},
        {"type": "text", "text": instruction},
    ],
}

# Build the model inputs and run greedy decoding.
image_inputs, video_inputs = process_vision_info([system_message, user_message])
text = processor.apply_chat_template([system_message, user_message], tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, use_cache=True)
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)[0]

# Map the predicted point back to the original image resolution.
pred_x, pred_y = extract_coordinates(output_text)
pred_x *= scale_x
pred_y *= scale_y
print(pred_x, pred_y)
```
Refer to our code for more details.
## Documentation

### Performance
We adhere to the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently outperforms all other open-source model families. The comparative results are listed below; a sketch of the grounding-accuracy metric these benchmarks use follows the notes.
| Model | Size | Open Source | ScreenSpot-V2 | ScreenSpotPro | OSWORLD-G |
|---|---|---|---|---|---|
| OpenAI CUA | – | ❌ | 87.9 | 23.4 | – |
| Claude 3.7 | – | ❌ | 87.6 | 27.7 | – |
| JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 |
| SE-GUI | 7B | ✅ | 90.3 | 47.0 | – |
| UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 |
| UI-TARS-1.5* | 7B | ✅ | 89.7* | 42.0* | 64.2* |
| UGround-v1-7B | 7B | ✅ | – | 31.1 | 36.4 |
| Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9* | 48.0 | 59.6* |
| UGround-v1-72B | 72B | ✅ | – | 34.5 | – |
| Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.00* | 53.3 | 62.2* |
| UI-TARS | 72B | ✅ | 90.3 | 38.1 | – |
| GTA1 (Ours) | 7B | ✅ | 92.4 (△ +2.7) | 50.1 (△ +8.1) | 67.7 (△ +3.5) |
| GTA1 (Ours) | 32B | ✅ | 93.2 (△ +1.3) | 53.6 (△ +5.6) | 61.9 (△ +2.3) |
| GTA1 (Ours) | 72B | ✅ | 94.8 (△ +0.8) | 58.4 (△ +5.1) | 66.7 (△ +4.5) |
## Important Note

- Model size is indicated in billions (B) of parameters.
- A dash (–) denotes results that are currently unavailable.
- An asterisk (*) denotes our evaluated result.
- UI-TARS-1.5 7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct are used as our baseline models.
- △ indicates the performance improvement of our model over its baseline.
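For reference, the grounding benchmarks above score a prediction as correct when the predicted point falls inside the annotated bounding box of the target element, and accuracy is the fraction of correct predictions. The snippet below is a minimal sketch of that metric under this assumption (variable names are hypothetical), not the official evaluation code of any benchmark.

```python
def grounding_accuracy(predictions, annotations):
    """Fraction of examples whose predicted (x, y) point lies inside the
    annotated target box (x1, y1, x2, y2). Names are hypothetical."""
    correct = 0
    for (px, py), (x1, y1, x2, y2) in zip(predictions, annotations):
        if x1 <= px <= x2 and y1 <= py <= y2:
            correct += 1
    return correct / max(len(predictions), 1)

# Example with two predictions: one inside its box, one outside.
preds = [(540, 310), (10, 10)]
boxes = [(500, 290, 600, 330), (100, 100, 200, 200)]
print(grounding_accuracy(preds, boxes))  # 0.5
```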