UGround-V1-2B Open-Source GUI Visual Grounding Model - Accurate Visual Grounding with Simple Training


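UGround-V1-2B

Developed by osunlp

UGround is a strong GUI visual grounding model trained with a simple recipe, built by OSUNLP in collaboration with Orby AI.

Multimodal fusion

Transformers

Language: English · License: Apache-2.0 · #GUI visual grounding #Multimodal interaction #High-precision coordinate prediction

Downloads: 975

Released: 1/3/2025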

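Model overview

UGround is a model dedicated to GUI visual grounding: it pinpoints a specific element or object on the screen, which makes it suitable for a wide range of GUI interaction scenarios.

Model features

Strong GUI visual grounding

Accurately locates specific elements or objects on the screen and recognizes the various components of a GUI.

Simple training recipe

Reaches high grounding performance with a simple, effective training strategy.

Multi-size image handling

Handles images of various resolutions and aspect ratios, so it adapts to different GUI layouts.

Multilingual support

In addition to English and Chinese, it can read text in multiple other languages within images.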

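Model capabilities

GUI element grounding

Visual question answering

Multimodal understanding

Cross-lingual text recognition

Complex reasoning and decision-making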

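Use cases

Automated testing

Automatic GUI element recognition

Automatically identifies and locates buttons, text boxes, and other elements in an application interface

Improves the accuracy and efficiency of automated testing

Assistive technology

Visual assistance tools

Helps visually impaired users understand and operate GUIs

Improves accessibility

Robot control

Vision-based robot operation

Controls robots through a GUI to carry out tasks

Enables more natural robot interaction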

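🚀 UGround-V1-2B (Qwen2-VL-Based)

UGround is a strong GUI visual grounding model trained with a simple recipe. See our homepage and paper for more details. This work is a collaboration between OSUNLP and Orby AI.

[Figure: radar chart]

Homepage: https://osu-nlp-group.github.io/UGround/
Repository: https://github.com/OSU-NLP-Group/UGround
Paper (ICLR'25 Oral): https://arxiv.org/abs/2410.05243
Demo: https://huggingface.co/spaces/orby-osu/UGround
Contact: Boyu Gou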

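✨ Key Features

UGround is a strong GUI visual grounding model trained with a simple recipe.
Developed by OSUNLP in collaboration with Orby AI.
Related resources are provided: model weights, homepage, repository, paper, and demo.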

💻 Usage Examples

Inference

vLLM server

vllm serve osunlp/UGround-V1-2B  --api-key token-abc123 --dtype float16

or

python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-2B --model osunlp/UGround-V1-2B --dtype float16

You can find more instructions on training and inference in the official Qwen2-VL repository.
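
To call a server started this way from Python, the client needs an OpenAI-compatible endpoint and the screenshot as a base64 string. The sketch below shows one way to prepare both. The helper names, the `http://localhost:8000/v1` URL (vLLM's default port), and the reuse of `token-abc123` as the key are assumptions for illustration, not part of the UGround release.

```python
import base64


def encode_image_bytes(image_bytes: bytes) -> str:
    """Return the base64 text form of raw image bytes (e.g. a JPEG screenshot)."""
    return base64.b64encode(image_bytes).decode("utf-8")


def encode_image_file(path: str) -> str:
    """Read an image file from disk and base64-encode its contents."""
    with open(path, "rb") as f:
        return encode_image_bytes(f.read())


def make_client(base_url: str = "http://localhost:8000/v1",
                api_key: str = "token-abc123"):
    """Create an async client for the OpenAI-compatible vLLM endpoint.

    Assumes the `openai` package is installed. It is imported lazily so the
    encoding helpers above can be used without it.
    """
    from openai import AsyncOpenAI

    return AsyncOpenAI(base_url=base_url, api_key=api_key)
```

The string returned by `encode_image_file` is the kind of value the grounding prompt helper in the next subsection takes as `base64_image`. That helper labels the payload as `image/jpeg`, so a JPEG screenshot is the safest input.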

Visual grounding prompt

def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
  Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

  - Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
  - If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
  - Your answer should be a single string (x, y) corresponding to the point of the interest.

  Description: {description}

  Answer:"""
                },
            ],
        },
    ]

# `client` is assumed to be an openai.AsyncOpenAI instance pointed at the vLLM server;
# the `await` below must run inside an async function.
messages = format_openai_template(description, base64_image)

completion = await client.chat.completions.create(
    model=args.model_path,
    messages=messages,
    temperature=0,  # REMEMBER to set temperature to ZERO!
)

# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)
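
The rescaling described in the two comments above can be written out as a small helper. This is a minimal sketch that assumes the model replies with a string containing a single `(x, y)` pair, as the prompt requests; `parse_point` and `to_pixels` are hypothetical names, not functions from the UGround repository.

```python
import re
from typing import Tuple

# Matches "(x, y)" with optional whitespace and optional decimal parts.
_POINT_RE = re.compile(r"\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)")


def parse_point(response: str) -> Tuple[float, float]:
    """Extract the first "(x, y)" pair from the model's text response."""
    match = _POINT_RE.search(response)
    if match is None:
        raise ValueError(f"no (x, y) point found in: {response!r}")
    return float(match.group(1)), float(match.group(2))


def to_pixels(point: Tuple[float, float], width: int, height: int,
              scale: float = 1000.0) -> Tuple[float, float]:
    """Map a point from the [0, scale) output range to pixel coordinates."""
    x, y = point
    return x / scale * width, y / scale * height


# Example: a response of "(500, 250)" on a 1920x1080 screenshot.
print(to_pixels(parse_point("(500, 250)"), width=1920, height=1080))
```

With the OpenAI client, the response text would typically come from `completion.choices[0].message.content`, and `width` and `height` are the dimensions of the original screenshot.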

Qwen2-VL-2B-Instruct usage example

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


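# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation (mirrors the generation step of the previous example)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)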

📚 Documentation

Model information

Model-V1:

Release plan

Model weights:
- [x] Initial version (the one used in the paper)
- [x] Qwen2-VL-based V1 (2B, 7B, 72B)
Code:
- [x] Inference code for UGround (initial and Qwen2-VL-based)
- [x] Offline experiments (code, results, and useful resources)
  - [x] ScreenSpot
  - [x] Multimodal-Mind2Web
  - [x] OmniAct
  - [x] Android Control
- [x] Online experiments
  - [x] Mind2Web-Live-SeeAct-V
  - [x] AndroidWorld-SeeAct-V
- [ ] Data synthesis pipeline (coming soon)
Training data (V1): https://huggingface.co/datasets/osunlp/UGround-V1-Data
Online demo (HF Spaces)

Main results

GUI visual grounding: ScreenSpot (standard setting)

ScreenSpot (Standard)	Arch	SFT data	Mobile-Text	Mobile-Icon	Desktop-Text	Desktop-Icon	Web-Text	Web-Icon	Avg
InternVL-2-4B	InternVL-2		9.2	4.8	4.6	4.3	0.9	0.1	4.0
Groma	Groma		10.3	2.6	4.6	4.3	5.7	3.4	5.2
Qwen-VL	Qwen-VL		9.5	4.8	5.7	5.0	3.5	2.4	5.2
MiniGPT-v2	MiniGPT-v2		8.4	6.6	6.2	2.9	6.5	3.4	5.7
GPT-4			22.6	24.5	20.2	11.8	9.2	8.8	16.2
GPT-4o			20.2	24.9	21.1	23.6	12.2	7.8	18.3
Fuyu	Fuyu		41.0	1.3	33.0	3.6	33.9	4.4	19.5
Qwen-GUI	Qwen-VL	GUICourse	52.4	10.9	45.9	5.7	43.0	13.6	28.6
Ferret-UI-Llama8b	Ferret-UI		64.5	32.3	45.9	11.4	28.3	11.7	32.3
Qwen2-VL	Qwen2-VL		61.3	39.3	52.0	45.0	33.0	21.8	42.1
CogAgent	CogAgent		67.0	24.0	74.2	20.0	70.4	28.6	47.4
SeeClick	Qwen-VL	SeeClick	78.0	52.0	72.2	30.0	55.7	32.5	53.4
OS-Atlas-Base-4B	InternVL-2	OS-Atlas	85.7	58.5	72.2	45.7	82.6	63.1	68.0
OmniParser			93.9	57.0	91.3	63.6	81.3	51.0	73.0
UGround	LLaVA-UGround-V1	UGround-V1	82.8	60.3	82.5	63.6	80.4	70.4	73.3
Iris	Iris	SeeClick	85.3	64.2	86.7	57.5	82.6	71.2	74.6
ShowUI-G	ShowUI	ShowUI	91.6	69.0	81.8	59.0	83.0	65.5	75.0
ShowUI	ShowUI	ShowUI	92.3	75.5	76.3	61.1	81.7	63.6	75.1
Molmo-7B-D			85.4	69.0	79.4	70.7	81.3	65.5	75.2
UGround-V1-2B (Qwen2-VL)	Qwen2-VL	UGround-V1	89.4	72.0	88.7	65.7	81.3	68.9	77.7
Molmo-72B			92.7	79.5	86.1	64.3	83.0	66.0	78.6
Aguvis-G-7B	Qwen2-VL	Aguvis-Stage-1	88.3	78.2	88.1	70.7	85.7	74.8	81.0
OS-Atlas-Base-7B	Qwen2-VL	OS-Atlas	93.0	72.9	91.8	62.9	90.9	74.3	81.0
Aria-UI	Aria	Aria-UI	92.3	73.8	93.3	64.3	86.5	76.2	81.1
Claude (Computer-Use)			98.2	85.6	79.9	57.1	92.2	84.5	82.9
Aguvis-7B	Qwen2-VL	Aguvis-Stage-1&2	95.6	77.7	93.8	67.1	88.3	75.2	83.0
Project Mariner									84.0
UGround-V1-7B (Qwen2-VL)	Qwen2-VL	UGround-V1	93.0	79.9	93.8	76.4	90.9	84.0	86.3
AGUVIS-72B	Qwen2-VL	Aguvis-Stage-1&2	94.5	85.2	95.4	77.9	91.3	85.9	88.4
UGround-V1-72B (Qwen2-VL)	Qwen2-VL	UGround-V1	94.1	83.4	94.9	85.7	90.4	87.9	89.4
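
The table does not state how the Avg column is computed. For the three Qwen2-VL-based UGround-V1 rows it is consistent with an unweighted mean of the six split scores, which the sketch below checks. Treating that as the definition for every row is an assumption, and `macro_average` is a hypothetical helper name.

```python
from statistics import mean

# Per-split accuracies copied from the table rows above, in column order:
# Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
ROWS = {
    "UGround-V1-2B (Qwen2-VL)": [89.4, 72.0, 88.7, 65.7, 81.3, 68.9],
    "UGround-V1-7B (Qwen2-VL)": [93.0, 79.9, 93.8, 76.4, 90.9, 84.0],
    "UGround-V1-72B (Qwen2-VL)": [94.1, 83.4, 94.9, 85.7, 90.4, 87.9],
}


def macro_average(scores):
    """Unweighted mean over the splits, rounded to one decimal like the table."""
    return round(mean(scores), 1)


for name, scores in ROWS.items():
    print(f"{name}: {macro_average(scores)}")
```

Because the mean is unweighted, a split with few examples counts as much as a large one, so a gain on a hard split such as Desktop-Icon moves the average as much as the same gain on Mobile-Text.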

GUI visual grounding: ScreenSpot (agent setting)

Planner	Agent-ScreenSpot	Arch	SFT data	Mobile-Text	Mobile-Icon	Desktop-Text	Desktop-Icon	Web-Text	Web-Icon	Avg
GPT-4o	Qwen-VL	Qwen-VL		21.3	21.4	18.6	10.7	9.1	5.8	14.5
GPT-4o	Qwen-GUI	Qwen-VL	GUICourse	67.8	24.5	53.1	16.4	50.4	18.5	38.5
GPT-4o	SeeClick	Qwen-VL	SeeClick	81.0	59.8	69.6	33.6	43.9	26.2	52.4
GPT-4o	OS-Atlas-Base-4B	InternVL-2	OS-Atlas	94.1	73.8	77.8	47.1	86.5	65.3	74.1
GPT-4o	OS-Atlas-Base-7B	Qwen2-VL	OS-Atlas	93.8	79.9	90.2	66.4	92.6	79.1	83.7
GPT-4o	UGround-V1	LLaVA-UGround-V1	UGround-V1	93.4	76.9	92.8	67.9	88.7	68.9	81.4
GPT-4o	UGround-V1-2B (Qwen2-VL)	Qwen2-VL	UGround-V1	94.1	77.7	92.8	63.6	90.0	70.9	81.5
GPT-4o	UGround-V1-7B (Qwen2-VL)	Qwen2-VL	UGround-V1	94.1	79.9	93.3	73.6	89.6	73.3	84.0

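About Qwen2-VL-2B-Instruct

What's new

SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos of 20+ minutes: Qwen2-VL can understand videos longer than 20 minutes for high-quality video-based question answering, dialogue, content creation, and more.
Agent that can operate mobile devices, robots, etc.: With complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as phones and robots and operate them automatically based on the visual environment and text instructions.
Multilingual support: To serve global users, besides English and Chinese, Qwen2-VL now supports understanding text in different languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

Model architecture updates

Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, which gives a more human-like visual processing experience.

Multimodal Rotary Position Embedding (M-ROPE): Decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing multimodal processing capability.

Qwen2-VL comes in three sizes, with 2, 7, and 72 billion parameters. The Qwen2-VL-2B-Instruct repository contains the instruction-tuned 2B model. For more information, see the Qwen2-VL blog and GitHub.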
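
To make the dynamic-resolution idea concrete, the sketch below estimates how many visual tokens an image of a given size would produce. It assumes each token covers a 28×28-pixel area (the factor used in the `min_pixels = 256*28*28` comment in the usage example) and uses the default 4 to 16384 token range quoted there. It approximates the resizing behaviour and is not the library's actual implementation; `resize_for_tokens` and `estimate_visual_tokens` are hypothetical names.

```python
import math
from typing import Tuple

PATCH = 28  # assumed side length, in pixels, of the area covered by one visual token


def resize_for_tokens(height: int, width: int,
                      min_pixels: int = 4 * PATCH * PATCH,
                      max_pixels: int = 16384 * PATCH * PATCH) -> Tuple[int, int]:
    """Snap an image size to multiples of PATCH within the pixel budget."""
    # Round each side to the nearest multiple of PATCH (at least one patch).
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w


def estimate_visual_tokens(height: int, width: int, **budget: int) -> int:
    """Approximate visual token count: one token per PATCH x PATCH area."""
    h, w = resize_for_tokens(height, width, **budget)
    return (h // PATCH) * (w // PATCH)


print(estimate_visual_tokens(1080, 1920))
print(estimate_visual_tokens(1080, 1920,
                             min_pixels=256 * PATCH * PATCH,
                             max_pixels=1280 * PATCH * PATCH))
```

Under these assumptions a 1920×1080 screenshot costs 39 × 69 = 2691 tokens with the default budget, while capping `max_pixels` at `1280*28*28` brings it under 1280 tokens at the price of a downscaled image.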

Evaluation

Image benchmarks

Benchmark	InternVL2-2B	MiniCPM-V 2.0	Qwen2-VL-2B
MMMU_val	36.3	38.2	41.1
DocVQA_test	86.9		90.1
InfoVQA_test	58.9		65.5
ChartQA_test	76.2		73.5
TextVQA_val	73.4		79.7
OCRBench	781	605	794
MTVQA			20.0
VCR_en_easy			81.45
VCR_zh_easy			46.16
RealWorldQA	57.3	55.8	62.9
MME_sum	1876.8	1808.6	1872.0
MMBench-EN_test	73.2	69.1	74.9
MMBench-CN_test	70.9	66.5	73.5
MMBench-V1.1_test	69.6	65.8	72.2
MMT-Bench_test			54.5
MMStar	49.8	39.1	48.0
MMVet_GPT-4-Turbo	39.7	41.0	49.5
HallBench_avg	38.0	36.1	41.7
MathVista_testmini	46.0	39.8	43.0
MathVision			12.4

Video benchmarks

Benchmark	Qwen2-VL-2B
MVBench	63.2
PerceptionTest_test	53.9
EgoSchema_test	54.9
Video-MME (without/with subtitles)	55.6/60.4

Requirements

Qwen2-VL support is included in the latest Hugging Face transformers. We recommend building from source with:

pip install git+https://github.com/huggingface/transformers

Otherwise, you may encounter the following error:

KeyError: 'qwen2_vl'
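
One way to catch this before loading the model is to check the installed transformers version up front. The sketch below assumes that `qwen2_vl` was first registered in transformers 4.45; that threshold and the `supports_qwen2_vl` helper are assumptions for illustration and should be checked against the transformers release notes.

```python
import re

# Assumption: first transformers release series that registers "qwen2_vl".
MIN_VERSION = (4, 45)


def supports_qwen2_vl(version: str) -> bool:
    """Return True if a transformers version string is at least MIN_VERSION."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= MIN_VERSION


# Typical use, once transformers is installed:
#   import transformers
#   if not supports_qwen2_vl(transformers.__version__):
#       raise RuntimeError("transformers is too old for Qwen2-VL; upgrade it")
print(supports_qwen2_vl("4.44.2"), supports_qwen2_vl("4.46.0.dev0"))
```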

Quickstart

The qwen-vl-utils toolkit makes it easier to handle various kinds of visual input, including base64, URLs, and interleaved images and videos. Install it with:

pip install qwen-vl-utils

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you find this work useful, please consider citing our papers:

@article{gou2024uground,
        title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
        author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2410.05243},
        year={2024},
        url={https://arxiv.org/abs/2410.05243},
      }

@article{zheng2023seeact,
        title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
        author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2401.01614},
        year={2024},
      }