UGround-V1-7B
UGround is a strong GUI visual grounding model trained with a simple recipe, developed jointly by the OSU NLP Group and Orby AI.
Downloads 2,053
Release Time: 1/3/2025
Model Overview
UGround is a GUI visual grounding model based on Qwen2-VL. Given a text description, it locates the (x, y) coordinates of a specific area, element, or object on the screen.
Model Features
Multimodal visual grounding
Locates the (x, y) coordinates of specific areas, elements, or objects on the screen.
High performance
Achieves an average score of 86.3 on the ScreenSpot benchmark (standard setting).
Agent integration
Can be integrated with devices like phones/robots to enable automated operations in visual environments.
Model Capabilities
GUI visual grounding
Multimodal understanding
Agent operation
Use Cases
GUI visual grounding
ScreenSpot benchmark
GUI visual grounding evaluated under the standard setting
Average score of 86.3 across the six subtasks (text and icon targets on mobile, desktop, and web)
Agent setup
Used in combination with a GPT-4o planner
Average score of 84.0, with the highest subtask scores on mobile text (94.1) and desktop text (93.3)
UGround-V1-7B (Qwen2-VL-Based)
UGround is a powerful GUI visual grounding model trained using a simple approach. For more detailed information, please visit our homepage and refer to our paper. This project is a collaborative effort between the OSU NLP Group and Orby AI.
- Homepage: https://osu-nlp-group.github.io/UGround/
- Repository: https://github.com/OSU-NLP-Group/UGround
- Paper (ICLR'25 Oral): https://arxiv.org/abs/2410.05243
- Demo: https://huggingface.co/spaces/orby-osu/UGround
- Point of Contact: Boyu Gou
Features
- UGround is a strong GUI visual grounding model trained with a simple recipe.
- It is a collaboration between OSU NLP Group and Orby AI.
Installation
No installation steps are provided in the original README.
Usage Examples
Inference
vLLM server
vllm serve osunlp/UGround-V1-7B --api-key token-abc123 --dtype float16
or
python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16
You can find more instructions on training and inference in Qwen2-VL's official repository.
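Requests to the server carry the screenshot as a base64 data URL, which is the form the prompt template in the next section expects. The following is a minimal sketch of that encoding step using only the Python standard library. The helper names `encode_image_bytes` and `to_data_url` are hypothetical and not part of the UGround code.

```python
import base64


def encode_image_bytes(image_bytes: bytes) -> str:
    """Return the base64 text form of raw image bytes (hypothetical helper)."""
    return base64.b64encode(image_bytes).decode("ascii")


def to_data_url(image_bytes: bytes, mime: str = "image/jpeg") -> str:
    """Build a data URL of the form used in the `image_url` field of the prompt."""
    return f"data:{mime};base64,{encode_image_bytes(image_bytes)}"
```

In practice the bytes would come from reading a screenshot file in binary mode, and the result of `encode_image_bytes` would be passed as `base64_image`.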
Visual Grounding Prompt
def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.
- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
- Your answer should be a single string (x, y) corresponding to the point of the interest.
Description: {description}
Answer:"""
                },
            ],
        },
    ]

# `client`, `args.model_path`, `description`, and `base64_image` are supplied by the caller;
# the `await` below must run inside an async function.
messages = format_openai_template(description, base64_image)
completion = await client.chat.completions.create(
    model=args.model_path,
    messages=messages,
    temperature=0,  # REMEMBER to set temperature to ZERO!
)
# The output coordinates are in the range [0, 1000), which is compatible with the original Qwen2-VL.
# The actual pixel coordinates are therefore (x/1000*width, y/1000*height).
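The rescaling described in the closing comments can be sketched as follows. This is an illustrative example, not the project's own post-processing code: the helper names `parse_point` and `to_pixels` are hypothetical, and the sketch assumes the model's answer contains a plain "(x, y)" pair as the prompt requests.

```python
import re
from typing import Tuple


def parse_point(answer: str) -> Tuple[float, float]:
    """Extract the first "(x, y)" pair from the model's text answer."""
    match = re.search(
        r"\(\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)", answer
    )
    if match is None:
        raise ValueError(f"no (x, y) pair found in {answer!r}")
    return float(match.group(1)), float(match.group(2))


def to_pixels(
    point: Tuple[float, float], width: int, height: int, scale: int = 1000
) -> Tuple[float, float]:
    """Map a point in the [0, scale) range onto a width x height screenshot."""
    x, y = point
    return x / scale * width, y / scale * height


# Example: an answer of "(500, 250)" on a 1920x1080 screenshot.
pixel_x, pixel_y = to_pixels(parse_point("(500, 250)"), width=1920, height=1080)
```

Here `width` and `height` are the dimensions of the screenshot that was sent to the model, so the result is in that image's pixel space.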
Documentation
Models
- Model-V1:
Release Plan
- [x] Model Weights
  - [x] Initial Version (the one used in the paper)
  - [x] Qwen2-VL-Based V1
    - [x] 2B
    - [x] 7B
    - [x] 72B
- [x] Code
  - [x] Inference Code of UGround (Initial & Qwen2-VL-Based)
  - [x] Offline Experiments (Code, Results, and Useful Resources)
    - [x] ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - [x] Multimodal-Mind2Web
    - [x] OmniAct
    - [x] Android Control
  - [x] Online Experiments
    - [x] Mind2Web-Live-SeeAct-V
    - [x] AndroidWorld-SeeAct-V
  - [ ] Data Synthesis Pipeline (Coming Soon)
- [x] Training-Data (V1)
- [x] Online Demo (HF Spaces)
Main Results
GUI Visual Grounding: ScreenSpot (Standard Setting)
ScreenSpot (Standard) | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
---|---|---|---|---|---|---|---|---|---|
InternVL-2-4B | InternVL-2 | | 9.2 | 4.8 | 4.6 | 4.3 | 0.9 | 0.1 | 4.0 |
Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
Ferret-UI-Llama8b | Ferret-UI | | 64.5 | 32.3 | 45.9 | 11.4 | 28.3 | 11.7 | 32.3 |
Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
CogAgent | CogAgent | | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
OmniParser | | | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
UGround | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
Molmo-7B-D | | | 85.4 | 69.0 | 79.4 | 70.7 | 81.3 | 65.5 | 75.2 |
UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
Molmo-72B | | | 92.7 | 79.5 | 86.1 | 64.3 | 83.0 | 66.0 | 78.6 |
Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
Claude (Computer-Use) | | | 98.2 | 85.6 | 79.9 | 57.1 | 92.2 | 84.5 | 82.9 |
Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
Project Mariner | | | | | | | | | 84.0 |
UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
AGUVIS-72B | Qwen2-VL | Aguvis-Stage-1&2 | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 88.4 |
UGround-V1-72B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 83.4 | 94.9 | 85.7 | 90.4 | 87.9 | 89.4 |
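The Avg column appears to be the unweighted mean of the six subtask scores, rounded to one decimal place. The source does not state this, but it is consistent with the rows above. A small sketch that checks the UGround rows under that assumption (the function name `screenspot_average` is hypothetical):

```python
def screenspot_average(scores):
    """Unweighted mean of the six subtask scores, rounded to one decimal.

    Assumption: this is how the Avg column is derived; the source does not say so.
    """
    return round(sum(scores) / len(scores), 1)


# Subtask scores copied from the standard-setting table:
# Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
uground_v1_7b = [93.0, 79.9, 93.8, 76.4, 90.9, 84.0]
uground_v1_72b = [94.1, 83.4, 94.9, 85.7, 90.4, 87.9]

avg_7b = screenspot_average(uground_v1_7b)
avg_72b = screenspot_average(uground_v1_72b)
```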
GUI Visual Grounding: ScreenSpot (Agent Setting)
Planner | Agent-Screenspot | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
GPT-4o | OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
Technical Details
No technical details are provided in the original README.
License
This project is licensed under the Apache-2.0 license.
Citation Information
If you find this work useful, please consider citing our papers:
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}
@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}