UGround-V1-72B (Qwen2-VL-Based) (w/o LoRA)
UGround is a powerful GUI visual grounding model trained with a straightforward approach. For more details, visit our homepage and check out our paper. This project is a collaborative effort between OSUNLP and Orby AI.
- Homepage: UGround Homepage
- Repository: UGround Repository
- Paper: ArXiv Paper
- Demo: Hugging Face Demo
- Point of Contact: Boyu Gou
Features

Models
- Model-V1 (Initial and Qwen2-VL-based; see the Release Plan below)

Release Plan
- Model Weights
  - Model Weights on Hugging Face
    - Initial Version (used in the paper)
    - Qwen2-VL-Based V1
      - 2B
      - 7B
      - 72B
- Code
  - Inference Code of UGround (Initial & Qwen2-VL-Based)
  - Offline Experiments (Code, Results, and Useful Resources)
    - ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - Multimodal-Mind2Web
    - OmniAct
    - Android Control
  - Online Experiments
    - Mind2Web-Live-SeeAct-V
    - AndroidWorld-SeeAct-V
- Data Synthesis Pipeline (Coming Soon)
- Training Data (V1)
- Online Demo (HF Spaces)
Main Results
GUI Visual Grounding: ScreenSpot (Standard Setting)
| ScreenSpot (Standard) | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|
| InternVL-2-4B | InternVL-2 | - | 9.2 | 4.8 | 4.6 | 4.3 | 0.9 | 0.1 | 4.0 |
| Groma | Groma | - | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Qwen-VL | Qwen-VL | - | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| MiniGPT-v2 | MiniGPT-v2 | - | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| GPT-4 | - | - | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | - | - | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| Fuyu | Fuyu | - | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| Ferret-UI-Llama8b | Ferret-UI | - | 64.5 | 32.3 | 45.9 | 11.4 | 28.3 | 11.7 | 32.3 |
| Qwen2-VL | Qwen2-VL | - | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| CogAgent | CogAgent | - | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OmniParser | - | - | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
| UGround | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Molmo-7B-D | - | - | 85.4 | 69.0 | 79.4 | 70.7 | 81.3 | 65.5 | 75.2 |
| UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| Molmo-72B | - | - | 92.7 | 79.5 | 86.1 | 64.3 | 83.0 | 66.0 | 78.6 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| Claude (Computer-Use) | - | - | 98.2 | 85.6 | 79.9 | 57.1 | 92.2 | 84.5 | 82.9 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| Project Mariner | - | - | - | - | - | - | - | - | 84.0 |
| UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
| AGUVIS-72B | Qwen2-VL | Aguvis-Stage-1&2 | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 88.4 |
| UGround-V1-72B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 83.4 | 94.9 | 85.7 | 90.4 | 87.9 | 89.4 |
GUI Visual Grounding: ScreenSpot (Agent Setting)
| Planner | Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | Qwen-VL | Qwen-VL | - | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
Installation

The Qwen2-VL code is included in the latest Hugging Face `transformers`. We advise you to build from source with:

```bash
pip install git+https://github.com/huggingface/transformers
```

Otherwise, you might encounter the following error:

```
KeyError: 'qwen2_vl'
```

You can also install the toolkit for handling visual input with:

```bash
pip install qwen-vl-utils
```
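As a quick, optional sanity check that both packages are installed correctly (this snippet is illustrative and not part of the official instructions):

```python
# These imports raise ImportError if the installed transformers build predates
# Qwen2-VL support -- the same situation that produces KeyError: 'qwen2_vl'
# when loading the model through the Auto classes.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor  # noqa: F401
from qwen_vl_utils import process_vision_info  # noqa: F401

print("Qwen2-VL classes and qwen-vl-utils are available")
```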
Usage Examples
Inference
vLLM server

```bash
vllm serve osunlp/UGround-V1-7B --api-key token-abc123 --dtype float16
```

or

```bash
python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16
```
You can find more instructions about training and inference in Qwen2-VL's official repository.
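Once the server is running, you can talk to it with any OpenAI-compatible client. A minimal sketch, assuming the server above runs locally on vLLM's default port 8000 (host, port, and key are illustrative):

```python
from openai import AsyncOpenAI

# Point an OpenAI-compatible async client at the local vLLM server started above.
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="token-abc123",               # must match the --api-key passed to vllm serve
)
```

This `client` is the one awaited in the snippet under "Visual Grounding Prompt" below.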
Visual Grounding Prompt
```python
def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
- Your answer should be a single string (x, y) corresponding to the point of interest.

Description: {description}

Answer:""",
                },
            ],
        },
    ]
```
```python
# `description` is the referring expression for the target element and
# `base64_image` is the base64-encoded screenshot; `client` is the
# OpenAI-compatible client pointed at the vLLM server above.
messages = format_openai_template(description, base64_image)

completion = await client.chat.completions.create(
    model=args.model_path,  # the served model name, e.g. "osunlp/UGround-V1-7B"
    messages=messages,
    temperature=0,  # REMEMBER to set temperature to ZERO!
)

# The output coordinates are in the range [0, 1000), which is compatible with the
# original Qwen2-VL, so the actual pixel coordinates are (x/1000*width, y/1000*height).
```
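The model answers with a plain string such as `"(492, 367)"`. Below is a minimal sketch of converting that answer into pixel coordinates; the parsing helper and its regular expression are illustrative, not part of the released inference code:

```python
import re

def to_pixels(answer: str, width: int, height: int) -> tuple[int, int]:
    """Convert a "(x, y)" answer on the [0, 1000) scale into pixel coordinates."""
    match = re.search(r"\(?\s*(\d+)\s*,\s*(\d+)\s*\)?", answer)
    if match is None:
        raise ValueError(f"Could not parse coordinates from: {answer!r}")
    x, y = int(match.group(1)), int(match.group(2))
    return round(x / 1000 * width), round(y / 1000 * height)

# Example: on a 1920x1080 screenshot, an answer of "(492, 367)" maps to (945, 396).
print(to_pixels(completion.choices[0].message.content, 1920, 1080))
```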
Quickstart
```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios.
```
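The Quickstart above stops at loading the weights. Below is a minimal end-to-end sketch of running a grounding query through the plain `transformers` API. The screenshot path and description are placeholders, and the prompt is abbreviated; in practice you would use the full template from the "Visual Grounding Prompt" section above, and you may substitute this repo's UGround checkpoint for the base Qwen2-VL one:

```python
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# Placeholder inputs: substitute a real screenshot path and element description.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/screenshot.png"},
            {"type": "text", "text": "Description: the search button\nAnswer:"},
        ],
    }
]

# Standard Qwen2-VL preprocessing: apply the chat template and extract vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])  # e.g. "(492, 367)"
```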
Documentation
Qwen2-VL-72B-Instruct
Introduction
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
What's New in Qwen2-VL?
Key Enhancements
- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobile phones, robots, etc.: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices like mobile phones and robots for automatic operation based on the visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
Model Architecture Updates
- Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience (see the sketch after this list for bounding the visual-token budget).
- Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
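Because dynamic resolution maps image size directly to the number of visual tokens, the Hugging Face processor exposes `min_pixels` and `max_pixels` to bound that token budget. A minimal sketch; the 256-1280 token range here is illustrative, and each visual token corresponds to a 28x28-pixel patch after merging:

```python
from transformers import AutoProcessor

# Bound the per-image visual-token budget: each token covers a 28x28-pixel patch,
# so these limits keep every image between 256 and 1280 visual tokens.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```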
We have three models with 2, 8 and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2-VL model. For more information, visit our Blog and GitHub.
Evaluation
Image Benchmarks
| Benchmark | Previous SoTA (Open-source LVLM) | Claude-3.5 Sonnet | GPT-4o | Qwen2-VL-72B |
|---|---|---|---|---|
| MMMU (val) | 58.3 | 68.3 | 69.1 | 64.5 |
| DocVQA (test) | 94.1 | 95.2 | 92.8 | 96.5 |
| InfoVQA (test) | 82.0 | - | - | 84.5 |
| ChartQA (test) | 88.4 | 90.8 | 85.7 | 88.3 |
| TextVQA (val) | 84.4 | - | - | 85.5 |
| OCRBench | 852 | 788 | 736 | 877 |
| MTVQA | 17.3 | 25.7 | 27.8 | 30.9 |
| VCR (en, easy) | 84.67 | 63.85 | 91.55 | 91.93 |
| VCR (zh, easy) | 22.09 | 1.0 | 14.87 | 65.37 |
| RealWorldQA | 72.2 | 60.1 | 75.4 | 77.8 |
| MME (sum) | 2414.7 | 1920.0 | 2328.7 | 2482.7 |
| MMBench-EN (test) | 86.5 | 79.7 | 83.4 | 86.5 |
| MMBench-CN (test) | 86.3 | 80.7 | 82.1 | 86.6 |
| MMBench-V1.1 (test) | 85.5 | 78.5 | 82.2 | 85.9 |
| MMT-Bench (test) | 63.4 | - | 65.5 | 71.7 |
| MMStar | 67.1 | 62.2 | 63.9 | 68.3 |
| MMVet (GPT-4-Turbo) | 65.7 | 66.0 | 69.1 | 74.0 |
| HallBench (avg) | 55.2 | 49.9 | 55.0 | 58.1 |
| MathVista (testmini) | 67.5 | 67.7 | 63.8 | 70.5 |
| MathVision | 16.97 | - | 30.4 | 25.9 |
Video Benchmarks
| Benchmark | Previous SoTA (Open-source LVLM) | Gemini 1.5-Pro | GPT-4o | Qwen2-VL-72B |
|---|---|---|---|---|
| MVBench | 69.6 | - | - | 73.6 |
| PerceptionTest (test) | 66.9 | - | - | 68.0 |
| EgoSchema (test) | 62.0 | 63.2 | 72.2 | 77.9 |
| Video-MME (wo/w subs) | 66.3/69.6 | 75.0/81.3 | 71.9/77.2 | 71.2/77.8 |
Agent Benchmarks
| Category | Benchmark | Metric | Previous SoTA | GPT-4o | Qwen2-VL-72B |
|---|---|---|---|---|---|
| General | FnCall[1] | TM | - | 90.2 | 93.1 |
|  |  | EM | - | 50.0 | 53.2 |
| Game | Number Line | SR | 89.4[2] | 91.5 | 100.0 |
|  | BlackJack | SR | 40.2[2] | 34.5 | 42.6 |
|  | EZPoint | SR | 50.0[2] | 85.5 | 100.0 |
|  | Point24 | SR | 2.6[2] | 3.0 | 4.5 |
| Android | AITZ | TM | 83.0[3] | 70.0 | 89.6 |
|  |  | EM | 47.7[3] | 35.3 | 72.1 |
| AI2THOR | ALFRED (valid-unseen) | SR | 67.7[4] | - | 67.8 |
|  |  | GC | 75.3[4] | - | 75.8 |
| VLN | R2R (valid-unseen) | SR | 79.0 | 43.7[5] | 51.7 |
|  | REVERIE (valid-unseen) | SR | 61.0 | 31.6[5] | 31.0 |
SR, GC, TM and EM are short for success rate, goal-condition success, type match and exact match. ALFRED is supported by SAM[6].
1. Self-Curated Function Call Benchmark by the Qwen Team
2. Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
3. Android in the Zoo: Chain-of-Action-Thought for GUI Agents
4. ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
5. MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
6. Segment Anything
Multilingual Benchmarks
Models | AR | DE | FR | IT | JA | KO | RU | TH | VI | AVG |
---|---|---|---|---|---|---|---|---|---|---|
Qwen2-VL-72B | 20.7 | 36.5 | 44.1 | 42.8 | 21.6 | 37.4 | 15.6 | 17.7 | 41.6 | 30.9 |
GPT-4o | 20.2 | 34.2 | 41.2 | 32.7 | 20.0 | 33.9 | 11.5 | 22.5 | 34.2 | 27.8 |
Claude3 Opus | 15.1 | 33.4 | 40.6 | 34.4 | 19.4 | 27.2 | 13.0 | 19.5 | 29.1 | 25.7 |
Gemini Ultra | 14.7 | 32.3 | 40.0 | 31.8 | 12.3 | 17.2 | 11.8 | 20.3 | 28.6 | 23.2 |
Technical Details
UGround is a GUI visual grounding model built on Qwen2-VL. It is trained with a simple recipe and achieves strong performance across GUI grounding benchmarks (see the ScreenSpot results above). The underlying Qwen2-VL architecture brings several key updates, such as Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-ROPE), which enhance its visual processing and multimodal understanding capabilities.
License
This project is licensed under the tongyi-qianwen license.
Citation Information
If you find this work useful, please consider citing our papers:
```bibtex
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
```