UGround-V1-2B Open-source GUI Visual Positioning Model - Achieve Precise Visual Positioning with Simple Training

Uground V1 2B

Developed by osunlp

UGround is a powerful GUI visual positioning model trained using a simple method, jointly developed by OSUNLP and Orby AI.

Multimodal Fusion

Transformers

EnglishOpen Source License:Apache-2.0 #GUI Visual Positioning #Multimodal Interaction #High-precision Coordinate Prediction

Downloads 975

Release Time : 1/3/2025

Model Overview

UGround is a model focused on GUI visual positioning, capable of accurately locating specific elements or objects on the screen, suitable for various GUI interaction scenarios.

Model Features

Powerful GUI Visual Positioning Capability

Capable of accurately locating specific elements or objects on the screen and precisely identifying various components in the GUI.

Simple Training Method

Adopts a simple and effective training strategy to achieve high-performance visual positioning capabilities.

Multi-size Image Processing

Supports processing images of various resolutions and ratios to adapt to different GUI interfaces.

Multilingual Support

In addition to English and Chinese, it also supports understanding text content in multiple languages in images.

Model Capabilities

GUI Element Positioning

Visual Question Answering

Multimodal Understanding

Cross-lingual Text Recognition

Complex Reasoning and Decision-making

Use Cases

Automated Testing

Automatic GUI Element Recognition

Automatically recognize and locate elements such as buttons and text boxes in the application interface

Improve the accuracy and efficiency of automated testing

Assistive Technology

Visual Assistive Tool

Help visually impaired users understand and operate the GUI interface

Enhance the barrier-free access experience

Robot Control

Vision-based Robot Operation

Control the robot to perform tasks through the GUI interface

Achieve a more natural way of robot interaction

🚀 UGround-V1-2B (Qwen2-VL-Based)

UGround is a powerful GUI visual grounding model trained with a simple recipe, offering high performance in visual grounding tasks. This collaborative work between OSUNLP and Orby AI provides a new solution for multimodal interaction.

radar

Homepage: https://osu-nlp-group.github.io/UGround/
Repository: https://github.com/OSU-NLP-Group/UGround
Paper (ICLR'25 Oral): https://arxiv.org/abs/2410.05243
Demo: https://huggingface.co/spaces/orby-osu/UGround
Point of Contact: Boyu Gou

✨ Features

Strong Visual Grounding Ability: UGround is a robust GUI visual grounding model, trained with a straightforward approach to achieve excellent performance in visual grounding tasks.
Multiple Model Versions: Offers various model versions, including different parameter scales, to meet diverse application requirements.
Rich Resources: Provides a homepage, repository, paper, demo, and training data, facilitating users' in - depth understanding and utilization of the model.

📦 Installation

No installation steps were provided in the original document.

💻 Usage Examples

Inference with vLLM server

vllm serve osunlp/UGround-V1-7B  --api-key token-abc123 --dtype float16

python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16

You can find more instruction about training and inference in Qwen2-VL's Official Repo.

Visual Grounding Prompt

def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
  Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

  - Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
  - If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
  - Your answer should be a single string (x, y) corresponding to the point of the interest.

  Description: {description}

  Answer:"""
                },
            ],
        },
    ]

messages = format_openai_template(description, base64_image)

completion = await client.chat.completions.create(
    model=args.model_path,
    messages=messages,
    temperature=0  # REMEMBER to set temperature to ZERO!
# REMEMBER to set temperature to ZERO!
# REMEMBER to set temperature to ZERO!
)

# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)

📚 Documentation

Models

Model-V1:

Release Plan

[x] Model Weights
- [x] Initial Version (the one used in the paper)
- [x] Qwen2-VL-Based V1 (2B, 7B, 72B)
[x] Code
- [x] Inference Code of UGround (Initial & Qwen2-VL-Based)
- [x] Offline Experiments (Code, Results, and Useful Resources)
  - [x] ScreenSpot
  - [x] Multimodal-Mind2Web
  - [x] OmniAct
  - [x] Android Control
- [x] Online Experiments
  - [x] Mind2Web-Live-SeeAct-V
  - [x] AndroidWorld-SeeAct-V
- [ ] Data Synthesis Pipeline (Coming Soon)
[x] Training-Data (V1)
[x] Online Demo (HF Spaces)

Main Results

GUI Visual Grounding: ScreenSpot (Standard Setting)

image/png

ScreenSpot (Standard)	Arch	SFT data	Mobile-Text	Mobile-Icon	Desktop-Text	Desktop-Icon	Web-Text	Web-Icon	Avg
InternVL-2-4B	InternVL-2		9.2	4.8	4.6	4.3	0.9	0.1	4.0
Groma	Groma		10.3	2.6	4.6	4.3	5.7	3.4	5.2
Qwen-VL	Qwen-VL		9.5	4.8	5.7	5.0	3.5	2.4	5.2
MiniGPT-v2	MiniGPT-v2		8.4	6.6	6.2	2.9	6.5	3.4	5.7
GPT-4			22.6	24.5	20.2	11.8	9.2	8.8	16.2
GPT-4o			20.2	24.9	21.1	23.6	12.2	7.8	18.3
Fuyu	Fuyu		41.0	1.3	33.0	3.6	33.9	4.4	19.5
Qwen-GUI	Qwen-VL	GUICourse	52.4	10.9	45.9	5.7	43.0	13.6	28.6
Ferret-UI-Llama8b	Ferret-UI		64.5	32.3	45.9	11.4	28.3	11.7	32.3
Qwen2-VL	Qwen2-VL		61.3	39.3	52.0	45.0	33.0	21.8	42.1
CogAgent	CogAgent		67.0	24.0	74.2	20.0	70.4	28.6	47.4
SeeClick	Qwen-VL	SeeClick	78.0	52.0	72.2	30.0	55.7	32.5	53.4
OS-Atlas-Base-4B	InternVL-2	OS-Atlas	85.7	58.5	72.2	45.7	82.6	63.1	68.0
OmniParser			93.9	57.0	91.3	63.6	81.3	51.0	73.0
UGround	LLaVA-UGround-V1	UGround-V1	82.8	60.3	82.5	63.6	80.4	70.4	73.3
Iris	Iris	SeeClick	85.3	64.2	86.7	57.5	82.6	71.2	74.6
ShowUI-G	ShowUI	ShowUI	91.6	69.0	81.8	59.0	83.0	65.5	75.0
ShowUI	ShowUI	ShowUI	92.3	75.5	76.3	61.1	81.7	63.6	75.1
Molmo-7B-D			85.4	69.0	79.4	70.7	81.3	65.5	75.2
UGround-V1-2B (Qwen2-VL)	Qwen2-VL	UGround-V1	89.4	72.0	88.7	65.7	81.3	68.9	77.7
Molmo-72B			92.7	79.5	86.1	64.3	83.0	66.0	78.6
Aguvis-G-7B	Qwen2-VL	Aguvis-Stage-1	88.3	78.2	88.1	70.7	85.7	74.8	81.0
OS-Atlas-Base-7B	Qwen2-VL	OS-Atlas	93.0	72.9	91.8	62.9	90.9	74.3	81.0
Aria-UI	Aria	Aria-UI	92.3	73.8	93.3	64.3	86.5	76.2	81.1
Claude (Computer-Use)			98.2	85.6	79.9	57.1	92.2	84.5	82.9
Aguvis-7B	Qwen2-VL	Aguvis-Stage-1&2	95.6	77.7	93.8	67.1	88.3	75.2	83.0
Project Mariner									84.0
UGround-V1-7B (Qwen2-VL)	Qwen2-VL	UGround-V1	93.0	79.9	93.8	76.4	90.9	84.0	86.3
AGUVIS-72B	Qwen2-VL	Aguvis-Stage-1&2	94.5	85.2	95.4	77.9	91.3	85.9	88.4
UGround-V1-72B (Qwen2-VL)	Qwen2-VL	UGround-V1	94.1	83.4	94.9	85.7	90.4	87.9	89.4

GUI Visual Grounding: ScreenSpot (Agent Setting)

Planner	Agent-Screenspot	arch	SFT data	Mobile-Text	Mobile-Icon	Desktop-Text	Desktop-Icon	Web-Text	Web-Icon	Avg
GPT-4o	Qwen-VL	Qwen-VL		21.3	21.4	18.6	10.7	9.1	5.8	14.5
GPT-4o	Qwen-GUI	Qwen-VL	GUICourse	67.8	24.5	53.1	16.4	50.4	18.5	38.5
GPT-4o	SeeClick	Qwen-VL	SeeClick	81.0	59.8	69.6	33.6	43.9	26.2	52.4
GPT-4o	OS-Atlas-Base-4B	InternVL-2	OS-Atlas	94.1	73.8	77.8	47.1	86.5	65.3	74.1
GPT-4o	OS-Atlas-Base-7B	Qwen2-VL	OS-Atlas	93.8	79.9	90.2	66.4	92.6	79.1	83.7
GPT-4o	UGround-V1	LLaVA-UGround-V1	UGround-V1	93.4	76.9	92.8	67.9	88.7	68.9	81.4
GPT-4o	UGround-V1-2B (Qwen2-VL)	Qwen2-VL	UGround-V1	94.1	77.7	92.8	63.6	90.0	70.9	81.5
GPT-4o	UGround-V1-7B (Qwen2-VL)	Qwen2-VL	UGround-V1	94.1	79.9	93.3	73.6	89.6	73.3	84.0

🔧 Technical Details

Qwen2-VL-2B-Instruct

Introduction

We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.

What’s New in Qwen2-VL?

Key Enhancements:

SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

Model Architecture Updates:

Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.

- **Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.

We have three models with 2, 7 and 72 billion parameters. This repo contains the instruction-tuned 2B Qwen2-VL model. For more information, visit our Blog and GitHub.

Evaluation

Image Benchmarks

Benchmark	InternVL2-2B	MiniCPM-V 2.0	Qwen2-VL-2B
MMMU_val	36.3	38.2	41.1
DocVQA_test	86.9	-	90.1
InfoVQA_test	58.9	-	65.5
ChartQA_test	76.2	-	73.5
TextVQA_val	73.4	-	79.7
OCRBench	781	605	794
MTVQA	-	-	20.0
VCR_{en easy}	-	-	81.45
VCR_{zh easy}	-	-	46.16
RealWorldQA	57.3	55.8	62.9
MME_sum	1876.8	1808.6	1872.0
MMBench-EN_test	73.2	69.1	74.9
MMBench-CN_test	70.9	66.5	73.5
MMBench-V1.1_test	69.6	65.8	72.2
MMT-Bench_test	-	-	54.5
MMStar	49.8	39.1	48.0
MMVet_GPT-4-Turbo	39.7	41.0	49.5
HallBench_avg	38.0	36.1	41.7
MathVista_testmini	46.0	39.8	43.0
MathVision	-	-	12.4

Video Benchmarks

Benchmark	Qwen2-VL-2B
MVBench	63.2
PerceptionTest_test	53.9
EgoSchema_test	54.9
Video-MME_{wo/w subs}	55.6/60.4

Requirements

The code of Qwen2-VL has been in the latest Hugging face transformers and we advise you to build from source with command pip install git+https://github.com/huggingface/transformers, or you might encounter the following error:

KeyError: 'qwen2_vl'

Quickstart

We offer a toolkit to help you handle various types of visual input more conveniently. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:

pip install qwen-vl-utils

Here we show a code snippet to show you how to use the chat model with transformers and qwen_vl_utils:

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processer
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Without qwen_vl_utils

from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]


# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: Generation

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation Information

If you find this work useful, please consider citing our papers:

@article{gou2024uground,
        title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
        author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2410.05243},
        year={2024},
        url={https://arxiv.org/abs/2410.05243},
      }

@article{zheng2023seeact,
        title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
        author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
        journal={arXiv preprint arXiv:2401.01614},
        year={2024},
      }

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご