Qwen2.5-VL-3B-UI-R1-E Open-Source Model - Free for Visual Question Answering, Precise Localization of UI Action Elements

Qwen2.5 VL 3B UI R1 E

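Developed by LZXzju

UI-R1-E-3B is an efficient GUI grounding model fine-tuned from Qwen2.5-VL-3B-Instruct. It targets visual question answering and is particularly good at locating and identifying actionable elements in user-interface screenshots.

Image-to-Text

Safetensors

Language: English | License: MIT | #GUI grounding #Inference without thinking #High-precision coordinate prediction

Downloads: 75

Released: 5/14/2025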

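Model Overview

The model uses reinforcement learning to strengthen the action-prediction ability of GUI agents. It identifies actionable elements in a user interface and predicts the action needed to carry out a command (such as a click) together with its coordinates.

Model Features

Efficient GUI grounding

Precisely locates actionable elements in user-interface screenshots and predicts click coordinates

Inference without thinking

Faster inference and higher accuracy than the versions that generate a thinking process

Multi-platform support

Strong results on mobile, desktop, and web interfaces

Model Capabilities

GUI element recognition

Instruction understanding

Coordinate prediction

Cross-platform interface analysis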

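Use Cases

Automated Testing

UI automated testing

Automatically identifies interface elements and performs test actions

Reaches 89.5% average accuracy on the ScreenSpotV2 benchmark

Accessibility

Assistance for visually impaired users

Helps visually impaired users understand where interface elements are located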

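🚀 Efficient GUI Grounding Model UI-R1-E-3B

This repository contains the efficient GUI grounding model UI-R1-E-3B, presented in the paper UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning. The model can be used for visual question answering, is built on the base model Qwen/Qwen2.5-VL-3B-Instruct, and is released under the MIT license.

Project page: https://github.com/lll6gg/UI-R1

Previous version: UI-R1-3B

🚀 Quick Start

This project provides the efficient GUI grounding model UI-R1-E-3B. The sections below cover its benchmark results and the evaluation code.

✨ Key Features

Efficient GUI grounding model: UI-R1-E-3B performs strongly across several benchmarks (89.5 average on ScreenSpotV2, 33.5 on ScreenSpot-Pro) and supports action prediction for GUI agents.
Multi-scenario applicability: it performs well across device types (mobile, desktop, web) and inference modes.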

📚 Detailed Documentation

Benchmark 1: ScreenSpotV2

ScreenSpotV2	Inference mode	Mobile-T	Mobile-I	Desktop-T	Desktop-I	Web-T	Web-I	Avg↑ / Len↓
OS-ATLAS-7B	w/o thinking	95.2	75.8	90.7	63.6	90.6	77.3	84.1 / -
UI-TARS-7B	w/o thinking	95.2	79.1	90.7	68.6	90.6	78.3	84.7 / -
UI-R1-3B (v1)	w/ thinking	96.2	84.3	92.3	63.6	89.2	75.4	85.4 / 67
GUI-R1-3B	w/ thinking	97.6	78.2	94.3	64.3	91.0	72.4	85.0 / 80
UI-R1-3B (v2)	w/ thinking	97.6	79.6	92.3	67.9	88.9	77.8	85.8 / 60
UI-R1-E-3B	w/o thinking	98.2	83.9	94.8	75.0	93.2	83.7	89.5 / 28
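
The accuracy columns in grounding benchmarks of this kind are usually computed with a point-in-box rule: a prediction counts as correct when the predicted click falls inside the target element's bounding box. The sketch below illustrates that rule under this assumption. The function names and the (x_min, y_min, x_max, y_max) box layout are hypothetical and are not taken from the UI-R1 evaluation code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def click_hits_box(point: Point, box: Box) -> bool:
    """Return True when the click lies inside the box (edges included)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(points: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Percentage of predicted clicks that land inside their target box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not points:
        return 0.0
    hits = sum(click_hits_box(p, b) for p, b in zip(points, boxes))
    return 100.0 * hits / len(points)


# Made-up data: the first and third clicks hit their boxes, the second misses.
preds = [(50, 60), (300, 40), (10, 10)]
targets = [(40, 50, 120, 90), (200, 100, 260, 140), (0, 0, 20, 20)]
accuracy = grounding_accuracy(preds, targets)
```

With the made-up data above, two of three clicks are hits, so the accuracy is about 66.7.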

Benchmark 2: ScreenSpot-Pro

ScreenSpot-Pro	Inference mode	Avg length↓	Avg accuracy↑
UGround-7B	w/o thinking	-	16.5
OS-ATLAS-7B	w/o thinking	-	18.9
UI-R1-3B (v1)	w/ thinking	102	17.8
GUI-R1-3B	w/ thinking	114	26.6
UI-R1-3B (v2)	w/ thinking	129	29.8
UI-R1-E-3B	w/o thinking	28	33.5

Leaderboard: UI-I2E-Bench

Model	ScreenSpot	UI-I2E-Bench Avg	ScreenSpot-Pro	Avg
UI-TARS-1.5-7B	88.1	73.2	42.2	67.8
Uground-V1-72B	89.7	76.3	34.3	66.8
UI-TARS-72B	88.4	73.7	38.1	66.7
UI-R1-E-3B	89.2	69.1	33.5	63.9
Uground-V1-7B	87.1	70.3	31.1	62.8
InfiGUI-R1	87.5	69.7	29.6	62.3
UI-TARS-7B	89.5	61.4	35.7	62.2
Qwen2.5-VL-72B	87.1	51.4	43.6	60.7
UI-I2E-VLM-7B	82.5	69.5	23.6	58.5
UI-TARS-2B	82.3	62	27.7	57.3
Qwen2.5-VL-7B	84.7	53.8	29	55.8
OmniParser-V2	72	54.8	39.6	55.5
Uground-V1-2B	78.8	57.4	26.6	54.3
OS-Atlas-7B	82.5	58.6	18.9	53.3
UI-R1-3B	83.3	58.5	17.8	53.2
UGround-7B	74.1	54.2	16.5	48.3
UI-I2E-VLM-4B	70.4	53.4	12.2	45.3
OmniParser	73.9	53.1	8.3	45.1
ShowUI-2B	76.8	41.5	7.7	42
Qwen2.5-VL-3B	55.5	41.7	23.9	41.3
Aguvis-7B	84.4	53.2	22.9	40.4
OS-Atlas-4B	70.1	44.3	3.7	39.4
Qwen2-VL-7B	42.6	48.7	1.6	31
Seeclick	55.8	26.4	1.1	27.8
InternVL2-4B	4.2	0.9	0.3	1.8

💻 Usage Examples

Basic Usage

Generation code for UI-R1-E-3B

import torch
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# args.model_path, ori_processor_path, rank, task_prompt, image_path and
# extract_coord are supplied by the surrounding evaluation script.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    args.model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="cpu",
)
# Load on CPU first, then move the model to this process's device.
model = model.to(torch.device(rank))
model = model.eval()
processor = AutoProcessor.from_pretrained(ori_processor_path)
question_template = (
    f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
    "Please provide the action to perform (enumerate in ['click'])"
    "and the coordinate where the cursor is moved to(integer) if click is performed.\n"
    "Output the final answer in <answer> </answer> tags directly."
    "The output answer format should be as follows:\n"
    "<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>\n"
    "Please strictly follow the format."
)
query = '<image>\n' + question_template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": query},
        ],
    }
]
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# The inputs must be on the same device as the model before generating.
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Drop the prompt tokens so that only the newly generated answer is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
response = response[0]
pred_coord, _ = extract_coord(response)
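
The snippet above calls extract_coord without showing it. As a rough illustration of what such a parser could look like, here is a minimal sketch that reads the `<answer>[{'action': 'click', 'coordinate': [x, y]}]</answer>` format requested in the prompt. The name parse_click_answer and its (coordinate, success flag) return shape are assumptions made for this example; the real extract_coord in the project may behave differently.

```python
import ast
import re
from typing import List, Optional, Tuple

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_click_answer(response: str) -> Tuple[Optional[List[int]], bool]:
    """Hypothetical stand-in for extract_coord.

    Returns ([x, y], True) on success and (None, False) otherwise.
    """
    match = ANSWER_RE.search(response)
    if match is None:
        return None, False
    try:
        # The answer uses Python-style single quotes, so literal_eval is used
        # instead of json.loads.
        actions = ast.literal_eval(match.group(1).strip())
    except (ValueError, SyntaxError, TypeError):
        return None, False
    if not isinstance(actions, list):
        return None, False
    for action in actions:
        if not isinstance(action, dict) or action.get("action") != "click":
            continue
        coord = action.get("coordinate")
        if isinstance(coord, (list, tuple)) and len(coord) == 2:
            try:
                return [int(coord[0]), int(coord[1])], True
            except (TypeError, ValueError):
                return None, False
    return None, False


sample = "<answer>[{'action': 'click', 'coordinate': [512, 284]}]</answer>"
coord, ok = parse_click_answer(sample)
```

Returning a success flag alongside the coordinate lets an evaluation loop count malformed responses as misses instead of raising an exception.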

Rescale the predicted coordinate to the original image

from PIL import Image

# The prediction refers to the resized image, so map it back to the original size.
image = Image.open(image_path)
origin_width, origin_height = image.size
resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
scale_x = origin_width / resized_width
scale_y = origin_height / resized_height
pred_coord[0] = int(pred_coord[0] * scale_x)
pred_coord[1] = int(pred_coord[1] * scale_y)
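
To make the coordinate mapping concrete, the sketch below wraps the same scaling in a small helper and runs it on a worked example. The helper name rescale_point, the use of integer arithmetic, and the clamping to the image bounds are choices made for this illustration, not part of the original code.

```python
from typing import Tuple


def rescale_point(
    point: Tuple[int, int],
    original_size: Tuple[int, int],
    resized_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Map (x, y) from the resized image back to the original image.

    Both sizes are (width, height). Integer floor division matches int()
    truncation for non-negative values and avoids float rounding surprises.
    The result is clamped to the original image bounds.
    """
    x, y = point
    orig_w, orig_h = original_size
    res_w, res_h = resized_size
    if res_w <= 0 or res_h <= 0:
        raise ValueError("resized_size must be positive")
    new_x = x * orig_w // res_w
    new_y = y * orig_h // res_h
    return min(max(new_x, 0), orig_w - 1), min(max(new_y, 0), orig_h - 1)


# Worked example: a 1920x1080 screenshot whose sides are rounded to multiples
# of 28 becomes 1932x1092. A prediction of (1000, 500) in the resized frame
# maps back to (993, 494) in the original frame.
mapped = rescale_point((1000, 500), (1920, 1080), (1932, 1092))
```

The correction is small here because the resize only rounds each side to the nearest multiple of 28. It grows when the image exceeds max_pixels and is scaled down.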

The smart_resize function

import math


def smart_resize(
    height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
):
    """Rescales the image so that the following conditions are met:

    1. Both dimensions (height and width) are divisible by 'factor'.
    2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
    3. The aspect ratio of the image is maintained as closely as possible.

    Returns (h_bar, w_bar), i.e. the resized (height, width).
    """
    if height < factor or width < factor:
        raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
    elif max(height, width) / min(height, width) > 200:
        raise ValueError(
            f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
        )
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
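
The three conditions in the docstring can be checked mechanically for any resized shape. The checker below is a hypothetical helper written for this illustration; it takes the resized shape as input, so it runs on its own without smart_resize. The example values were worked out by hand from the function above: with the default max_pixels (1,003,520), a 1080x1920 input exceeds the pixel budget and comes out as (728, 1316).

```python
def check_resize_invariants(
    height: int,
    width: int,
    h_bar: int,
    w_bar: int,
    factor: int = 28,
    min_pixels: int = 56 * 56,
    max_pixels: int = 14 * 14 * 4 * 1280,
) -> dict:
    """Report how a resized shape fares against the docstring conditions."""
    original_ratio = width / height
    resized_ratio = w_bar / h_bar
    return {
        # Condition 1: both sides are multiples of factor.
        "divisible": h_bar % factor == 0 and w_bar % factor == 0,
        # Condition 2: the pixel count stays inside [min_pixels, max_pixels].
        "within_pixel_budget": min_pixels <= h_bar * w_bar <= max_pixels,
        # Condition 3: relative change of the width/height ratio.
        "aspect_ratio_drift": abs(resized_ratio - original_ratio) / original_ratio,
    }


# Hand-computed case: 1080x1920 (height x width) resized to 728x1316.
report = check_resize_invariants(1080, 1920, 728, 1316)
```

For this case both sides are multiples of 28, the pixel count (958,048) is within the budget, and the aspect ratio drifts by under 2 percent because each side is floored to a multiple of 28 independently.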