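UGround-V1-2B Open-Source GUI Visual Grounding Model - Accurate Visual Grounding with Simple Training

UGround V1 2B

Developed by osunlp

UGround is a strong GUI visual grounding model trained with a simple recipe. It is a collaboration between OSUNLP and Orby AI.

Multimodal fusion

Transformers

Language: English
License: Apache-2.0
Tags: #GUI visual grounding #multimodal interaction #high-precision coordinate prediction

Downloads: 975

Released: 1/3/2025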

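Model overview

UGround is a model dedicated to GUI visual grounding: given a description, it locates the matching element or object on a screen, which makes it suitable for a range of GUI interaction scenarios.

Model features

Strong GUI visual grounding

Locates specific elements or objects on screen and recognizes the various components of a GUI.

Simple training method

Reaches high visual grounding performance with a simple, effective training strategy.

Multi-size image handling

Handles images of varied resolutions and aspect ratios, so it adapts to different GUI layouts.

Multilingual support

Beyond English and Chinese, understands text in several other languages inside images.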

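Model capabilities

GUI element grounding

Visual question answering

Multimodal understanding

Cross-lingual text recognition

Complex reasoning and decision-making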

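Use cases

Automated testing

Automatic GUI element recognition

Automatically identifies and locates buttons, text boxes, and other elements in an application's interface

Improves the accuracy and efficiency of automated testing

Assistive technology

Visual assistance tools

Helps visually impaired users understand and operate GUIs

Improves the accessibility experience

Robot control

Vision-based robot operation

Controls robots through a GUI to carry out tasks

Enables more natural robot interaction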

🚀 UGround-V1-2B (based on Qwen2-VL)

UGround is a strong GUI visual grounding model trained with a simple recipe. See the homepage and paper for more details. This work is a collaboration between OSUNLP and Orby AI.

Homepage: https://osu-nlp-group.github.io/UGround/
Repository: https://github.com/OSU-NLP-Group/UGround
Paper (ICLR'25 Oral): https://arxiv.org/abs/2410.05243
Demo: https://huggingface.co/spaces/orby-osu/UGround
Contact: Boyu Gou

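✨ Key features

UGround is a strong GUI visual grounding model trained with a simple method.
It is a collaboration between OSUNLP and Orby AI.
The model, homepage, repository, paper, and demo are all publicly available.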

💻 Usage examples

Inference example

vLLM server

vllm serve osunlp/UGround-V1-7B  --api-key token-abc123 --dtype float16

or

python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16

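The commands above name the 7B checkpoint; to serve the model described on this page, pass osunlp/UGround-V1-2B as the model name instead. More instructions on training and inference are available in the official Qwen2-VL repository.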
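
Once the server is running, a client needs an OpenAI-compatible endpoint and a base64-encoded screenshot. The following is a minimal sketch, assuming the server listens on vLLM's default address (http://localhost:8000/v1) and was started with the `--api-key token-abc123` flag shown above. `encode_image` and `make_client` are illustrative helper names, not part of UGround or vLLM.

```python
import base64
from pathlib import Path


def encode_image(path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def make_client(base_url: str = "http://localhost:8000/v1",
                api_key: str = "token-abc123"):
    """Create an async OpenAI-compatible client (requires `pip install openai`).

    base_url is an assumption: vLLM's default host and port.
    """
    # Imported lazily so encode_image works even without the openai package.
    from openai import AsyncOpenAI

    return AsyncOpenAI(base_url=base_url, api_key=api_key)
```

The grounding prompt on this page embeds the image as `data:image/jpeg;base64,...`, so the screenshot passed to `encode_image` should be a JPEG file.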

Visual grounding prompt

def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
  Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

  - Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
  - If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
  - Your answer should be a single string (x, y) corresponding to the point of the interest.

  Description: {description}

  Answer:"""
                },
            ],
        },
    ]

messages = format_openai_template(description, base64_image)

# `client` is an openai.AsyncOpenAI instance pointed at the vLLM server;
# the call below must run inside an async function.
completion = await client.chat.completions.create(
    model=args.model_path,
    messages=messages,
    temperature=0,  # Remember to set temperature to zero!
)

# The output coordinates are in the range [0, 1000), which is compatible with the original Qwen2-VL,
# so the actual pixel coordinates are (x / 1000 * width, y / 1000 * height).
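
The rescaling described in the comments can be sketched as follows. This assumes the reply text (for example `completion.choices[0].message.content`) contains a literal `(x, y)` pair, as the prompt requests. `parse_point` and `to_pixels` are hypothetical helper names introduced here for illustration.

```python
import re


def parse_point(answer: str) -> tuple[float, float]:
    """Extract the first "(x, y)" pair from the model's text answer."""
    match = re.search(
        r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", answer
    )
    if match is None:
        raise ValueError(f"no (x, y) point found in: {answer!r}")
    return float(match.group(1)), float(match.group(2))


def to_pixels(point, width, height):
    """Map a point from the [0, 1000) range to pixel coordinates."""
    x, y = point
    return round(x / 1000 * width), round(y / 1000 * height)


# A reply of "(500, 250)" on a 1920x1080 screenshot maps to pixel (960, 270).
print(to_pixels(parse_point("(500, 250)"), width=1920, height=1080))
```

Raising an error on an unparseable reply, rather than returning a default point, keeps a failed grounding call from turning into a click at an arbitrary location.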

Qwen2-VL-2B-Instruct usage example

import torch  # used only by the optional flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)  # follow the device chosen by device_map="auto"

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
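
The `min_pixels` and `max_pixels` values in the example are token counts multiplied by 28*28, which implies roughly one visual token per 28x28-pixel patch. The sketch below estimates how many visual tokens an image would produce under a given pixel budget. It is a simplified approximation of the resizing rule, not the library's exact implementation, and `estimate_visual_tokens` is a hypothetical name.

```python
import math


def estimate_visual_tokens(height, width, min_pixels=256 * 28 * 28,
                           max_pixels=1280 * 28 * 28, factor=28):
    """Approximate the number of visual tokens for an image under a pixel budget."""
    # Snap each side to a multiple of `factor` (one token per 28x28 patch).
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return (h // factor) * (w // factor)


print(estimate_visual_tokens(560, 560))    # 20 x 20 patches -> 400
print(estimate_visual_tokens(1080, 1920))  # downscaled to fit the 1280-token cap
```

A tighter `max_pixels` lowers memory use and latency, at the cost of downscaling large screenshots, which may hurt grounding of small icons.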

Without qwen_vl_utils

from PIL import Image
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
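inputs = inputs.to(model.device)

# Inference: generate the output, then strip the prompt tokens before decoding
# (same steps as in the qwen_vl_utils example above)
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)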

📚 Documentation

Model information

Model-V1:

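Release plan

Model weights:
- [x] Initial version (the one used in the paper)
- [x] Qwen2-VL-based V1 (2B, 7B, 72B)
Code:
- [x] Inference code for UGround (initial and Qwen2-VL-based versions)
- [x] Offline experiments (code, results, and useful resources)
  - [x] ScreenSpot
  - [x] Multimodal-Mind2Web
  - [x] OmniAct
  - [x] Android Control
- [x] Online experiments
  - [x] Mind2Web-Live-SeeAct-V
  - [x] AndroidWorld-SeeAct-V
- [ ] Data synthesis pipeline (coming soon)
Training data (V1): https://huggingface.co/datasets/osunlp/UGround-V1-Data
Online demo (HF Spaces): https://huggingface.co/spaces/orby-osu/UGround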

Main results

GUI visual grounding: ScreenSpot (standard setting)

ScreenSpot (Standard)	Architecture	SFT data	Mobile-Text	Mobile-Icon	Desktop-Text	Desktop-Icon	Web-Text	Web-Icon	Average
InternVL-2-4B	InternVL-2		9.2	4.8	4.6	4.3	0.9	0.1	4.0
Groma	Groma		10.3	2.6	4.6	4.3	5.7	3.4	5.2
Qwen-VL	Qwen-VL		9.5	4.8	5.7	5.0	3.5	2.4	5.2
MiniGPT-v2	MiniGPT-v2		8.4	6.6	6.2	2.9	6.5	3.4	5.7
GPT-4			22.6	24.5	20.2	11.8	9.2	8.8	16.2
GPT-4o			20.2	24.9	21.1	23.6	12.2	7.8	18.3
Fuyu	Fuyu		41.0	1.3	33.0	3.6	33.9	4.4	19.5
Qwen-GUI	Qwen-VL	GUICourse	52.4	10.9	45.9	5.7	43.0	13.6	28.6
Ferret-UI-Llama8b	Ferret-UI		64.5	32.3	45.9	11.4	28.3	11.7	32.3
Qwen2-VL	Qwen2-VL		61.3	39.3	52.0	45.0	33.0	21.8	42.1
CogAgent	CogAgent		67.0	24.0	74.2	20.0	70.4	28.6	47.4
SeeClick	Qwen-VL	SeeClick	78.0	52.0	72.2	30.0	55.7	32.5	53.4
OS-Atlas-Base-4B	InternVL-2	OS-Atlas	85.7	58.5	72.2	45.7	82.6	63.1	68.0
OmniParser			93.9	57.0	91.3	63.6	81.3	51.0	73.0
UGround	LLaVA-UGround-V1	UGround-V1	82.8	60.3	82.5	63.6	80.4	70.4	73.3
Iris	Iris	SeeClick	85.3	64.2	86.7	57.5	82.6	71.2	74.6
ShowUI-G	ShowUI	ShowUI	91.6	69.0	81.8	59.0	83.0	65.5	75.0
ShowUI	ShowUI	ShowUI	92.3	75.5	76.3	61.1	81.7	63.6	75.1
Molmo-7B-D			85.4	69.0	79.4	70.7	81.3	65.5	75.2
UGround-V1-2B (Qwen2-VL)	Qwen2-VL	UGround-V1	89.4	72.0	88.7	65.7	81.3	68.9	77.7
Molmo-72B			92.7	79.5	86.1	64.3	83.0	66.0	78.6
Aguvis-G-7B	Qwen2-VL	Aguvis-Stage-1	88.3	78.2	88.1	70.7	85.7	74.8	81.0
OS-Atlas-Base-7B	Qwen2-VL	OS-Atlas	93.0	72.9	91.8	62.9	90.9	74.3	81.0
Aria-UI	Aria	Aria-UI	92.3	73.8	93.3	64.3	86.5	76.2	81.1
Claude (Computer Use)			98.2	85.6	79.9	57.1	92.2	84.5	82.9
Aguvis-7B	Qwen2-VL	Aguvis-Stage-1&2	95.6	77.7	93.8	67.1	88.3	75.2	83.0
Project Mariner									84.0
UGround-V1-7B (Qwen2-VL)	Qwen2-VL	UGround-V1	93.0	79.9	93.8	76.4	90.9	84.0	86.3
AGUVIS-72B	Qwen2-VL	Aguvis-Stage-1&2	94.5	85.2	95.4	77.9	91.3	85.9	88.4
UGround-V1-72B (Qwen2-VL)	Qwen2-VL	UGround-V1	94.1	83.4	94.9	85.7	90.4	87.9	89.4
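
The Average column appears to be the unweighted mean of the six per-split accuracies, rounded to one decimal. The short check below reproduces it for the three Qwen2-VL-based UGround rows, with values copied from the table. `screenspot_average` is an illustrative name, and the unweighted-mean reading is an inference from the numbers rather than something the page states.

```python
def screenspot_average(scores):
    """Unweighted mean of the per-split accuracies, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


# Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon
rows = {
    "UGround-V1-2B": [89.4, 72.0, 88.7, 65.7, 81.3, 68.9],
    "UGround-V1-7B": [93.0, 79.9, 93.8, 76.4, 90.9, 84.0],
    "UGround-V1-72B": [94.1, 83.4, 94.9, 85.7, 90.4, 87.9],
}

for name, scores in rows.items():
    print(name, screenspot_average(scores))
```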

GUI visual grounding: ScreenSpot (agent setting)

Planner	Agent-ScreenSpot	Architecture	SFT data	Mobile-Text	Mobile-Icon	Desktop-Text	Desktop-Icon	Web-Text	Web-Icon	Average
GPT-4o	Qwen-VL	Qwen-VL		21.3	21.4	18.6	10.7	9.1	5.8	14.5
GPT-4o	Qwen-GUI	Qwen-VL	GUICourse	67.8	24.5	53.1	16.4	50.4	18.5	38.5
GPT-4o	SeeClick	Qwen-VL	SeeClick	81.0	59.8	69.6	33.6	43.9	26.2	52.4
GPT-4o	OS-Atlas-Base-4B	InternVL-2	OS-Atlas	94.1	73.8	77.8	47.1	86.5	65.3	74.1
GPT-4o	OS-Atlas-Base-7B	Qwen2-VL	OS-Atlas	93.8	79.9	90.2	66.4	92.6	79.1	83.7
GPT-4o	UGround-V1	LLaVA-UGround-V1	UGround-V1	93.4	76.9	92.8	67.9	88.7	68.9	81.4
GPT-4o	UGround-V1-2B (Qwen2-VL)	Qwen2-VL	UGround-V1	94.1	77.7	92.8	63.6	90.0	70.9	81.5
GPT-4o	UGround-V1-7B (Qwen2-VL)	Qwen2-VL	UGround-V1	94.1	79.9	93.3	73.6	89.6	73.3	84.0

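Introduction to Qwen2-VL-2B-Instruct

What's new

State-of-the-art understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
Understanding videos of 20+ minutes: Qwen2-VL can understand videos longer than 20 minutes for high-quality video-based question answering, dialogue, content creation, and more.
Agent that can operate mobile devices, robots, etc.: With complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as phones and robots and operate them automatically based on the visual environment and text instructions.
Multilingual support: To serve users worldwide, besides English and Chinese, Qwen2-VL now understands text in other languages inside images, including most European languages, Japanese, Korean, Arabic, and Vietnamese.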

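Model architecture updates

Naive Dynamic Resolution: Unlike earlier versions, Qwen2-VL can handle arbitrary image resolutions, mapping them to a dynamic number of visual tokens for a more human-like visual processing experience.

Multimodal Rotary Position Embedding (M-ROPE): Decomposes the positional embedding into parts that capture 1D textual, 2D visual, and 3D video positional information, enhancing the model's multimodal processing capabilities.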
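
To make the M-ROPE decomposition concrete, the sketch below assigns each token a (temporal, height, width) position triple. Text tokens use the same index in all three components, while image tokens share one temporal index and vary by row and column. This is a simplified illustration for still images only, not Qwen2-VL's actual implementation. `mrope_position_ids` and the segment format are hypothetical.

```python
def mrope_position_ids(segments):
    """Assign (t, h, w) position ids to a token sequence.

    segments: list of ("text", n_tokens) or ("image", (grid_h, grid_w)).
    Returns one (t, h, w) triple per token.
    """
    ids = []
    next_pos = 0
    for kind, spec in segments:
        if kind == "text":
            # 1D text: all three components advance together.
            for i in range(spec):
                p = next_pos + i
                ids.append((p, p, p))
            next_pos += spec
        elif kind == "image":
            # 2D image: fixed temporal id, row and column ids vary.
            grid_h, grid_w = spec
            for row in range(grid_h):
                for col in range(grid_w):
                    ids.append((next_pos, next_pos + row, next_pos + col))
            # Later tokens continue after the largest id used so far.
            next_pos += max(grid_h, grid_w)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return ids


# Two text tokens, a 2x3 image grid, then one more text token.
for triple in mrope_position_ids([("text", 2), ("image", (2, 3)), ("text", 1)]):
    print(triple)
```

A video would additionally vary the temporal component from frame to frame, which is the 3D case the description mentions.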

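Qwen2-VL comes in three sizes, with 2, 7, and 72 billion parameters. The Qwen2-VL-2B-Instruct repository contains the instruction-tuned 2B model. For more information, see the Qwen2-VL blog and GitHub.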

Evaluation

Image benchmarks

Benchmark	InternVL2-2B	MiniCPM-V 2.0	Qwen2-VL-2B
MMMU_val	36.3	38.2	41.1
DocVQA_test	86.9		90.1
InfoVQA_test	58.9		65.5
ChartQA_test	76.2		73.5
TextVQA_val	73.4		79.7
OCRBench	781	605	794
MTVQA			20.0
VCR_en_easy			81.45
VCR_zh_easy			46.16
RealWorldQA	57.3	55.8	62.9
MME_sum	1876.8	1808.6	1872.0
MMBench-EN_test	73.2	69.1	74.9
MMBench-CN_test	70.9	66.5	73.5
MMBench-V1.1_test	69.6	65.8	72.2
MMT-Bench_test			54.5
MMStar	49.8	39.1	48.0
MMVet_GPT-4-Turbo	39.7	41.0	49.5
HallBench_avg	38.0	36.1	41.7
MathVista_testmini	46.0	39.8	43.0
MathVision			12.4

Video benchmarks

Benchmark	Qwen2-VL-2B
MVBench	63.2
PerceptionTest_test	53.9
EgoSchema_test	54.9
Video-MME (without/with subtitles)	55.6/60.4

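Requirements

The code for Qwen2-VL is included in the latest Hugging Face transformers. We recommend building from source with the following command:

pip install git+https://github.com/huggingface/transformers

Otherwise, you may encounter the following error: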

KeyError: 'qwen2_vl'
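
One way to catch this before loading the model is to compare the installed transformers version against a minimum. The sketch below is a plain version comparison. The `4.45.0` threshold is an assumption about the first release that ships Qwen2-VL and should be verified against the transformers release notes. `version_tuple` and `has_qwen2_vl` are hypothetical helper names.

```python
import re


def version_tuple(version: str):
    """Parse the leading numeric parts of a version string, e.g. '4.45.0.dev0'."""
    parts = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if parts is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(p or 0) for p in parts.groups())


def has_qwen2_vl(version: str, minimum: str = "4.45.0") -> bool:
    """Return True if `version` is at least `minimum` (assumed threshold)."""
    return version_tuple(version) >= version_tuple(minimum)


print(has_qwen2_vl("4.44.2"))       # False
print(has_qwen2_vl("4.45.0.dev0"))  # True
```

In practice the argument would be `transformers.__version__`; a source build typically reports a `.dev0` version, which this parser accepts.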

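Quick start

We provide a toolkit that makes it easier to handle various types of visual input, including base64 data, URLs, and interleaved images and videos. Install it with: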

pip install qwen-vl-utils
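
As a sketch of the input types mentioned above, the helpers below build a single-turn message whose image entry can be a URL, a `file://` path, or a base64 data URI. `process_vision_info` from qwen_vl_utils is then expected to resolve the image. `image_message` and `as_data_uri` are illustrative names, not part of the toolkit.

```python
import base64


def as_data_uri(raw: bytes) -> str:
    """Wrap raw image bytes as a base64 data URI."""
    return "data:image;base64," + base64.b64encode(raw).decode("ascii")


def image_message(image: str, prompt: str) -> list:
    """Build a single-turn message.

    `image` may be an http(s) URL, a file:// path, or a data URI.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Three ways to reference an image (the local path is a placeholder).
from_url = image_message(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "Describe this image.",
)
from_file = image_message("file:///path/to/your/image.jpg", "Describe this image.")
from_bytes = image_message(as_data_uri(b"abc"), "Describe this image.")
```

Any of these lists can be passed as `messages` to `processor.apply_chat_template` and `process_vision_info`, as in the usage example earlier on this page.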

📄 License

This project is licensed under Apache-2.0.

📚 Citation

If you find this work useful, please consider citing our papers:

@article{gou2024uground,
        title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
        author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2410.05243},
        year={2024},
        url={https://arxiv.org/abs/2410.05243},
      }

@article{zheng2023seeact,
        title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
        author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2401.01614},
        year={2024},
      }