🚀 UGround-V1-7B (based on Qwen2-VL)
UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details. This work is a collaboration between the OSU NLP Group and Orby AI.
- Homepage: https://osu-nlp-group.github.io/UGround/
- Repository: https://github.com/OSU-NLP-Group/UGround
- Paper (ICLR'25 Oral): https://arxiv.org/abs/2410.05243
- Demo: https://huggingface.co/spaces/orby-osu/UGround
- Point of Contact: Boyu Gou
✨ Key Features
- Strong GUI visual grounding: performs excellently on multiple GUI visual grounding benchmarks such as ScreenSpot.
- Multiple model sizes: released in 2B, 7B, and 72B parameter versions.
- Extensive experiment support: covers both offline and online experiments, with inference code and results provided.
- Data synthesis pipeline: a pipeline for generating training data is coming soon.
📦 Installation
Refer to the official Qwen2-VL repository for more instructions on training and inference.
💻 Usage Examples
vLLM server
vllm serve osunlp/UGround-V1-7B --api-key token-abc123 --dtype float16
or
python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16
Visual grounding prompt
def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.
- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
- Your answer should be a single string (x, y) corresponding to the point of the interest.
Description: {description}
Answer:"""
                },
            ],
        },
    ]
# Assumes an AsyncOpenAI client pointed at the vLLM server started above (base_url/api_key are illustrative)
from openai import AsyncOpenAI
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="token-abc123")

messages = format_openai_template(description, base64_image)

completion = await client.chat.completions.create(
    model=args.model_path,  # e.g., "osunlp/UGround-V1-7B"
    messages=messages,
    temperature=0  # REMEMBER to set temperature to ZERO!
    # REMEMBER to set temperature to ZERO!
    # REMEMBER to set temperature to ZERO!
)

# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)
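The model's reply is a single "(x, y)" string in that [0, 1000) space. A minimal post-processing sketch (the parsing regex and function name are illustrative, not part of the official repo):

import re

def parse_and_scale(answer: str, width: int, height: int) -> tuple[float, float]:
    # Parse a "(x, y)" answer in [0, 1000) space and map it to pixel coordinates.
    match = re.search(r"\(?\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)?", answer)
    if match is None:
        raise ValueError(f"Could not parse coordinates from: {answer!r}")
    x, y = float(match.group(1)), float(match.group(2))
    return x / 1000 * width, y / 1000 * height

# Example: a 1920x1080 screenshot and a model answer of "(512, 83)"
print(parse_and_scale("(512, 83)", 1920, 1080))  # -> (983.04, 89.64)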
📚 Documentation
Models
Release Plan
- [x] Model weights
  - [x] Initial version (the one used in the paper)
  - [x] Qwen2-VL-based V1
    - [x] 2B
    - [x] 7B
    - [x] 72B
- [x] Code
  - [x] Inference code of UGround (initial version and Qwen2-VL-based version)
  - [x] Offline experiments (code, results, and useful resources)
    - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal Mind2Web
    - [x] OmniAct
    - [x] Android Control
  - [x] Online experiments
    - [x] Mind2Web-Live-SeeAct-V
    - [x] AndroidWorld-SeeAct-V
  - [ ] Data synthesis pipeline (coming soon)
- [x] Training data (V1)
- [x] Online demo (HF Spaces)
Main Results
GUI Visual Grounding: ScreenSpot (Standard Setting)
ScreenSpot (Standard) | Architecture | SFT Data | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
---|---|---|---|---|---|---|---|---|---|
InternVL-2-4B | InternVL-2 | | 9.2 | 4.8 | 4.6 | 4.3 | 0.9 | 0.1 | 4.0 |
Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
Ferret-UI-Llama8b | Ferret-UI | | 64.5 | 32.3 | 45.9 | 11.4 | 28.3 | 11.7 | 32.3 |
Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
CogAgent | CogAgent | | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
OmniParser | | | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
UGround | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
Molmo-7B-D | | | 85.4 | 69.0 | 79.4 | 70.7 | 81.3 | 65.5 | 75.2 |
UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
Molmo-72B | | | 92.7 | 79.5 | 86.1 | 64.3 | 83.0 | 66.0 | 78.6 |
Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
Claude (Computer Use) | | | 98.2 | 85.6 | 79.9 | 57.1 | 92.2 | 84.5 | 82.9 |
Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
Project Mariner | | | | | | | | | 84.0 |
UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
AGUVIS-72B | Qwen2-VL | Aguvis-Stage-1&2 | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 88.4 |
UGround-V1-72B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 83.4 | 94.9 | 85.7 | 90.4 | 87.9 | 89.4 |
GUI Visual Grounding: ScreenSpot (Agent Setting)
Planner | Agent-ScreenSpot | Architecture | SFT Data | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
GPT-4o | OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
🔧 Technical Details
No additional technical details are provided at this time.
📄 License
This project is released under the Apache-2.0 license.
Citation
If you find this work useful, please consider citing our papers:
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}
@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
Qwen2-VL-7B-Instruct
Introduction
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
What's New in Qwen2-VL?
Key Enhancements:
- SoTA understanding of images of various resolutions and ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, and more.
- Understanding videos of 20 minutes and longer: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialogue, content creation, and more.
- An agent that can operate your phone, robot, and other devices: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots to act automatically based on the visual environment and text instructions.
- Multilingual support: to serve users worldwide, in addition to English and Chinese, Qwen2-VL now supports understanding text in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and more.
Model Architecture Updates:
- Naive Dynamic Resolution: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens for a more human-like visual processing experience (a rough token-count sketch is given below).
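As a rough, illustrative sketch of what "a dynamic number of visual tokens" means (assuming, consistent with the min_pixels/max_pixels convention used later in this card, that one visual token covers roughly a 28x28 pixel patch; the processor's exact resizing logic may differ):

def approx_visual_tokens(width: int, height: int) -> int:
    # Rough estimate only: one visual token per 28x28 pixel patch.
    return (width // 28) * (height // 28)

print(approx_visual_tokens(448, 448))    # 256 tokens for a small square image
print(approx_visual_tokens(1280, 720))   # 1125 tokens for a 720p frame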
We have three models with 2, 7, and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2-VL model. For more information, visit our blog and GitHub.
Evaluation
Image Benchmarks
Benchmark | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini | Qwen2-VL-7B |
---|---|---|---|---|
MMMU_val | 51.8 | 49.8 | 60 | 54.1 |
DocVQA_test | 91.6 | 90.8 | - | 94.5 |
InfoVQA_test | 74.8 | - | - | 76.5 |
ChartQA_test | 83.3 | - | - | 83.0 |
TextVQA_val | 77.4 | 80.1 | - | 84.3 |
OCRBench | 794 | 852 | 785 | 845 |
MTVQA | - | - | - | 26.3 |
VCR_en easy | - | 73.88 | 83.60 | 89.70 |
VCR_zh easy | - | 10.18 | 1.10 | 59.94 |
RealWorldQA | 64.4 | - | - | 70.1 |
MME_sum | 2210.3 | 2348.4 | 2003.4 | 2326.8 |
MMBench-EN_test | 81.7 | - | - | 83.0 |
MMBench-CN_test | 81.2 | - | - | 80.5 |
MMBench-V1.1_test | 79.4 | 78.0 | 76.0 | 80.7 |
MMT-Bench_test | - | - | - | 63.7 |
MMStar | 61.5 | 57.5 | 54.8 | 60.7 |
MMVet_GPT-4-Turbo | 54.2 | 60.0 | 66.9 | 62.0 |
HallBench_avg | 45.2 | 48.1 | 46.1 | 50.6 |
MathVista_testmini | 58.3 | 60.6 | 52.4 | 58.2 |
MathVision | - | - | - | 16.3 |
Video Benchmarks
Benchmark | InternVL2-8B | LLaVA-OneVision-7B | MiniCPM-V 2.6 | Qwen2-VL-7B |
---|---|---|---|---|
MVBench | 66.4 | 56.7 | - | 67.0 |
PerceptionTest_test | - | 57.1 | - | 62.3 |
EgoSchema_test | - | 60.1 | - | 66.7 |
Video-MME (wo/w subs) | 54.0/56.9 | 58.2/- | 60.9/63.6 | 63.3/69.0 |
Requirements
The code for Qwen2-VL has been merged into the latest Hugging Face transformers; we advise you to build from source with the following command:
pip install git+https://github.com/huggingface/transformers
Otherwise, you might encounter the following error:
KeyError: 'qwen2_vl'
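A quick way to confirm that your installed transformers build already includes Qwen2-VL support (this check is just a convenience, not part of the original card):

import transformers

print(transformers.__version__)
try:
    # Importable once the qwen2_vl model type is part of your transformers build
    from transformers import Qwen2VLForConditionalGeneration
    print("Qwen2-VL support detected.")
except ImportError:
    print("Qwen2-VL not found; install transformers from source as shown above.")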
Quickstart
We offer a toolkit to help you handle various types of visual input more conveniently, including base64, URLs, and interleaved images and videos. You can install it with the following command:
pip install qwen-vl-utils
Here is a code snippet showing how to use the chat model with transformers and qwen_vl_utils:
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels
# according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Without qwen_vl_utils
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: generation of the output
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    output_ids[len(input_ids) :]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text)
Multi image inference
# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Video inference
# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Batch inference
# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
More Usage Tips
For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64 encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
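For the base64 case, a minimal sketch of turning a local file into the data URI shown above (the helper name is ours, not part of qwen_vl_utils):

import base64

def to_data_uri(path: str) -> str:
    # Read a local image and wrap it in the "data:image;base64,..." form accepted above.
    with open(path, "rb") as f:
        return "data:image;base64," + base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": to_data_uri("/path/to/your/image.jpg")},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]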
Image Resolution for Performance Boost
The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs, such as a token count range of 256-1280, to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
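For reference, min_pixels = 256 * 28 * 28 = 200,704 pixels corresponds to an image of roughly 448x448, and max_pixels = 1280 * 28 * 28 = 1,003,520 pixels to roughly 1000x1000; images outside this range are rescaled into it by the processor while keeping their aspect ratio.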
Besides, we provide two methods for fine-grained control over the image size fed to the model:
- Define min_pixels and max_pixels: images will be resized to keep their aspect ratio while staying within the min_pixels / max_pixels range.
- Specify exact dimensions: directly set resized_height and resized_width. These values will be rounded to the nearest multiple of 28.
# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
Limitations
While Qwen2-VL is applicable to a wide range of visual tasks, it is equally important to understand its limitations. Here are some known restrictions:
- Lack of audio support: the current model does not comprehend audio information within videos.
- Data timeliness: our image dataset is updated until June 2023, and information after this date may not be covered.
- Constraints in individual and intellectual-property recognition: the model's capacity to recognize specific individuals or IPs is limited, and it may not comprehensively cover all well-known personalities or brands.
- Limited capacity for complex instructions: when faced with intricate multi-step instructions, the model's understanding and execution capabilities still need improvement.
- Insufficient counting accuracy: particularly in complex scenes, object-counting accuracy is not high and requires further improvement.
- Weak spatial reasoning: especially in 3D space, the model's inference of object positional relationships is inadequate, making it difficult to precisely judge the relative positions of objects.
These limitations serve as ongoing directions for model optimization and improvement, and we are committed to continually enhancing the model's performance and scope of application.
Citation
If you find our work helpful, feel free to cite our papers.
@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}