Qwen2.5-VL-3B-UI-R1-E open-source model: free for visual question answering, precisely locates actionable UI elements


Qwen2.5 VL 3B UI R1 E

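Developed by LZXzju

UI-R1-E-3B is an efficient GUI grounding model fine-tuned from Qwen2.5-VL-3B-Instruct. It targets visual question answering and is particularly good at locating and identifying actionable elements in user-interface screenshots.

Image-to-Text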

Safetensors

Language: English · License: MIT · #GUI-grounding #no-thinking-inference #high-precision-coordinate-prediction

Downloads: 75

Released: 5/14/2025

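Model Overview

The model uses reinforcement learning to improve action prediction for GUI agents. Given a screenshot and a command, it identifies the relevant UI element and predicts the action needed to carry out the command (for example, a click) together with its coordinates.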

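Model Features

Efficient GUI grounding

Precisely locates actionable elements in UI screenshots and predicts click coordinates.

Inference without a thinking process

Produces shorter outputs and higher accuracy than the thinking variants: on ScreenSpotV2 it averages 89.5 with an average output length of 28, versus 85.8 and 60 for UI-R1-3B (v2).

Multi-platform support

Performs well on Mobile, Desktop, and Web interfaces.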

Model Capabilities

GUI element recognition

Instruction understanding

Coordinate prediction

Cross-platform interface analysis

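Use Cases

Automated testing

UI test automation

Automatically identifies interface elements and performs test actions.

Reaches 89.5% average accuracy on the ScreenSpotV2 benchmark.

Accessibility

Assistance for visually impaired users

Helps visually impaired users understand where interface elements are located.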

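🚀 UI-R1-E-3B: an efficient GUI grounding model

This repository contains the efficient GUI grounding model UI-R1-E-3B, introduced in the paper UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning. The model can be used for visual question answering, is fine-tuned from the base model Qwen/Qwen2.5-VL-3B-Instruct, and is released under the MIT license.

Project page: https://github.com/lll6gg/UI-R1

Previous version: UI-R1-3B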

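🚀 Quick Start

The sections below cover UI-R1-E-3B's benchmark results and the code used for generation and coordinate post-processing.

✨ Key Features

Efficient GUI grounding: UI-R1-E-3B scores well on several benchmarks and gives GUI agents a strong basis for action prediction.
Broad applicability: it performs well across device types (Mobile, Desktop, Web) and is compared against both thinking and no-thinking models.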

📚 Detailed Documentation

Benchmark 1: ScreenSpotV2

Columns ending in -T and -I report accuracy on text and icon targets respectively; Len is the average output length.

ScreenSpotV2	Inference mode	Mobile-T	Mobile-I	Desktop-T	Desktop-I	Web-T	Web-I	Avg↑ / Len↓
OS-ATLAS-7B	w/o thinking	95.2	75.8	90.7	63.6	90.6	77.3	84.1 / -
UI-TARS-7B	w/o thinking	95.2	79.1	90.7	68.6	90.6	78.3	84.7 / -
UI-R1-3B (v1)	w/ thinking	96.2	84.3	92.3	63.6	89.2	75.4	85.4 / 67
GUI-R1-3B	w/ thinking	97.6	78.2	94.3	64.3	91.0	72.4	85.0 / 80
UI-R1-3B (v2)	w/ thinking	97.6	79.6	92.3	67.9	88.9	77.8	85.8 / 60
UI-R1-E-3B	w/o thinking	98.2	83.9	94.8	75.0	93.2	83.7	89.5 / 28
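The table reports accuracy without defining it. ScreenSpot-style grounding benchmarks usually count a prediction as correct when the predicted point falls inside the target element's bounding box. The sketch below assumes that definition; the function names and the `(x_min, y_min, x_max, y_max)` box layout are illustrative choices, not taken from the UI-R1 evaluation code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box (edges included)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def click_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predicted points that land inside their target box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(points, boxes))
    return 100.0 * hits / len(points)


if __name__ == "__main__":
    preds = [(10, 10), (50, 50), (5, 5), (19, 1)]
    targets = [(0, 0, 20, 20)] * 4
    print(click_accuracy(preds, targets))  # three of four points are inside the box
```

The per-column numbers in the table would then correspond to running such a metric on each subset (for example, Mobile text targets) separately.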

Benchmark 2: ScreenSpot-Pro

ScreenSpot-Pro	Inference mode	Avg length↓	Avg accuracy↑
UGround-7B	w/o thinking	-	16.5
OS-ATLAS-7B	w/o thinking	-	18.9
UI-R1-3B (v1)	w/ thinking	102	17.8
GUI-R1-3B	w/ thinking	114	26.6
UI-R1-3B (v2)	w/ thinking	129	29.8
UI-R1-E-3B	w/o thinking	28	33.5

Leaderboard: UI-I2E-Bench

Model	ScreenSpot	UI-I2E-Bench Avg	ScreenSpot-Pro	Avg
UI-TARS-1.5-7B	88.1	73.2	42.2	67.8
Uground-V1-72B	89.7	76.3	34.3	66.8
UI-TARS-72B	88.4	73.7	38.1	66.7
UI-R1-E-3B	89.2	69.1	33.5	63.9
Uground-V1-7B	87.1	70.3	31.1	62.8
InfiGUI-R1	87.5	69.7	29.6	62.3
UI-TARS-7B	89.5	61.4	35.7	62.2
Qwen2.5-VL-72B	87.1	51.4	43.6	60.7
UI-I2E-VLM-7B	82.5	69.5	23.6	58.5
UI-TARS-2B	82.3	62	27.7	57.3
Qwen2.5-VL-7B	84.7	53.8	29	55.8
OmniParser-V2	72	54.8	39.6	55.5
Uground-V1-2B	78.8	57.4	26.6	54.3
OS-Atlas-7B	82.5	58.6	18.9	53.3
UI-R1-3B	83.3	58.5	17.8	53.2
UGround-7B	74.1	54.2	16.5	48.3
UI-I2E-VLM-4B	70.4	53.4	12.2	45.3
OmniParser	73.9	53.1	8.3	45.1
ShowUI-2B	76.8	41.5	7.7	42
Qwen2.5-VL-3B	55.5	41.7	23.9	41.3
Aguvis-7B	84.4	53.2	22.9	40.4
OS-Atlas-4B	70.1	44.3	3.7	39.4
Qwen2-VL-7B	42.6	48.7	1.6	31
Seeclick	55.8	26.4	1.1	27.8
InternVL2-4B	4.2	0.9	0.3	1.8
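For the rows spot-checked here, the final Avg column matches the unweighted mean of the three benchmark columns, rounded to one decimal. The helper below reproduces that arithmetic. Its name is hypothetical, and it is an assumption that the leaderboard computes every row this way.

```python
def leaderboard_average(screenspot: float, ui_i2e: float, screenspot_pro: float) -> float:
    """Unweighted mean of the three benchmark scores, rounded to one decimal."""
    return round((screenspot + ui_i2e + screenspot_pro) / 3, 1)


if __name__ == "__main__":
    # Scores copied from the leaderboard rows above.
    print(leaderboard_average(88.1, 73.2, 42.2))  # UI-TARS-1.5-7B, listed as 67.8
    print(leaderboard_average(89.2, 69.1, 33.5))  # UI-R1-E-3B, listed as 63.9
```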

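💻 Usage Examples

Basic usage

Generation code for UI-R1-E-3B

import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# args.model_path, ori_processor_path, rank, task_prompt and image_path are
# expected to be defined by the calling script; extract_coord is a parsing
# helper that is not shown in this snippet.

# Load the weights on CPU first, then move them to the GPU for this rank.
# attn_implementation="flash_attention_2" requires the flash-attn package.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cpu",
)
model = model.to(torch.device(rank))
model = model.eval()
processor = AutoProcessor.from_pretrained(ori_processor_path)

# Prompt text is kept verbatim, including its spacing.
question_template = (
    f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
    "Please provide the action to perform (enumerate in ['click'])"
    "and the coordinate where the cursor is moved to(integer) if click is performed.\n"
    "Output the final answer in <answer> </answer> tags directly."
    "The output answer format should be as follows:\n"
    "<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>\n"
    "Please strictly follow the format."
)
query = '<image>\n' + question_template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path}
        ] + [{"type": "text", "text": query}],
    }
]

# Build the chat-formatted prompt and the matching vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# The input tensors must live on the same device as the model.
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Drop the prompt tokens so that only the newly generated answer is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
response = response[0]
pred_coord, _ = extract_coord(response)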
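The generation code calls `extract_coord` without showing it. A minimal sketch of such a parser follows, assuming the model answers in the `<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>` format requested by the prompt. The return shape (coordinate plus a success flag) and the `[0, 0]` fallback are assumptions made for illustration; the helper in the UI-R1 repository may behave differently.

```python
import ast
import re
from typing import List, Tuple

# Non-greedy match so only the first <answer> ... </answer> pair is captured.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_coord(response: str) -> Tuple[List[int], bool]:
    """Hypothetical parser: return ([x, y], True) on success, ([0, 0], False) otherwise."""
    match = ANSWER_RE.search(response)
    if match is None:
        return [0, 0], False
    try:
        # The answer body is a Python-style literal: a list holding one action dict.
        actions = ast.literal_eval(match.group(1).strip())
        x, y = actions[0]["coordinate"]
        return [int(x), int(y)], True
    except (ValueError, SyntaxError, KeyError, IndexError, TypeError):
        return [0, 0], False


if __name__ == "__main__":
    sample = "<answer>[{'action': 'click', 'coordinate': [120, 45]}]</answer>"
    print(extract_coord(sample))
```

`ast.literal_eval` is used rather than `json.loads` because the requested format uses single quotes, which are not valid JSON.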

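Rescaling the predicted coordinates to the original image

The predicted coordinates refer to the resized image that the model actually sees, so they are scaled back to the original resolution.

from PIL import Image

image = Image.open(image_path)
origin_width, origin_height = image.size
# smart_resize takes and returns (height, width), in that order.
resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
scale_x = origin_width / resized_width
scale_y = origin_height / resized_height
pred_coord[0] = int(pred_coord[0] * scale_x)
pred_coord[1] = int(pred_coord[1] * scale_y)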
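As a worked example of this rescaling, consider a 1920x1080 screenshot. When the image stays within the pixel budget, `smart_resize` simply rounds each side to the nearest multiple of 28, so the sketch below uses that rounding directly instead of calling the full function. `round_to_factor` and `rescale_point` are hypothetical helper names introduced only for this illustration.

```python
from typing import Tuple


def round_to_factor(value: int, factor: int = 28) -> int:
    """Round a side length to the nearest multiple of factor."""
    return round(value / factor) * factor


def rescale_point(
    pred: Tuple[int, int],
    original_size: Tuple[int, int],
    resized_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map an (x, y) point from resized-image space back to the original image.

    Both sizes are given as (width, height).
    """
    origin_width, origin_height = original_size
    resized_width, resized_height = resized_size
    scale_x = origin_width / resized_width
    scale_y = origin_height / resized_height
    return int(pred[0] * scale_x), int(pred[1] * scale_y)


if __name__ == "__main__":
    original = (1920, 1080)
    resized = (round_to_factor(1920), round_to_factor(1080))
    print(resized)                                      # (1932, 1092)
    print(rescale_point((500, 300), original, resized))  # (496, 296)
```

The correction is small here because 1920x1080 is already close to a multiple of 28 on each side; it grows when the image is large enough to be shrunk to fit `max_pixels`.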

The smart_resize function

import math
def smart_resize(
    height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
):
    """Rescales the image so that the following conditions are met:

    1. Both dimensions (height and width) are divisible by 'factor'.
    2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
    3. The aspect ratio of the image is maintained as closely as possible.
    """
    if height < factor or width < factor:
        raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(
            f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
        )
    # Start by rounding each side to the nearest multiple of factor.
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar