UGround-V1-72B (Qwen2-VL-Based) (w/o LoRA)
UGround is a powerful GUI visual grounding model trained with a straightforward approach. For more details, visit our homepage and check out our paper. This project is a collaborative effort between OSUNLP and Orby AI.
- Homepage: UGround Homepage
- Repository: UGround Repository
- Paper: ArXiv Paper
- Demo: Hugging Face Demo
- Point of Contact: Boyu Gou
Features

Models
- Model-V1 (Initial and Qwen2-VL-based; see the Release Plan below)

Release Plan
- Model Weights
  - Model Weights on Hugging Face
    - Initial Version (used in the paper)
    - Qwen2-VL-Based V1
      - 2B
      - 7B
      - 72B
- Code
  - Inference Code of UGround (Initial & Qwen2-VL-Based)
  - Offline Experiments (Code, Results, and Useful Resources)
    - ScreenSpot (along with referring expressions generated by GPT-4/4o)
    - Multimodal-Mind2Web
    - OmniAct
    - Android Control
  - Online Experiments
    - Mind2Web-Live-SeeAct-V
    - AndroidWorld-SeeAct-V
- Data Synthesis Pipeline (Coming Soon)
- Training Data (V1)
- Online Demo (HF Spaces)
Main Results
GUI Visual Grounding: ScreenSpot (Standard Setting)
| ScreenSpot (Standard) | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|
| InternVL-2-4B | InternVL-2 | - | 9.2 | 4.8 | 4.6 | 4.3 | 0.9 | 0.1 | 4.0 |
| Groma | Groma | - | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Qwen-VL | Qwen-VL | - | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| MiniGPT-v2 | MiniGPT-v2 | - | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| GPT-4 | - | - | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | - | - | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| Fuyu | Fuyu | - | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| Ferret-UI-Llama8b | Ferret-UI | - | 64.5 | 32.3 | 45.9 | 11.4 | 28.3 | 11.7 | 32.3 |
| Qwen2-VL | Qwen2-VL | - | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| CogAgent | CogAgent | - | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OmniParser | - | - | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
| UGround | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Molmo-7B-D | - | - | 85.4 | 69.0 | 79.4 | 70.7 | 81.3 | 65.5 | 75.2 |
| UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| Molmo-72B | - | - | 92.7 | 79.5 | 86.1 | 64.3 | 83.0 | 66.0 | 78.6 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| Claude (Computer-Use) | - | - | 98.2 | 85.6 | 79.9 | 57.1 | 92.2 | 84.5 | 82.9 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| Project Mariner | - | - | - | - | - | - | - | - | 84.0 |
| UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |
| AGUVIS-72B | Qwen2-VL | Aguvis-Stage-1&2 | 94.5 | 85.2 | 95.4 | 77.9 | 91.3 | 85.9 | 88.4 |
| UGround-V1-72B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 83.4 | 94.9 | 85.7 | 90.4 | 87.9 | 89.4 |
GUI Visual Grounding: ScreenSpot (Agent Setting)
| Planner | Grounding Model | Arch | SFT data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | Qwen-VL | Qwen-VL | - | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | OS-Atlas-Base-4B | InternVL-2 | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |
Installation

The Qwen2-VL code is included in the latest Hugging Face `transformers`. We advise you to build from source with:

```bash
pip install git+https://github.com/huggingface/transformers
```

Otherwise, you might encounter the following error:

```
KeyError: 'qwen2_vl'
```

You can also install the toolkit for handling visual input with:

```bash
pip install qwen-vl-utils
```
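As a quick, optional sanity check that both packages are installed correctly (this snippet is illustrative and not part of the official instructions):

```python
# These imports raise ImportError if the installed transformers build predates
# Qwen2-VL support -- the same situation that produces KeyError: 'qwen2_vl'
# when loading the model through the Auto classes.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor  # noqa: F401
from qwen_vl_utils import process_vision_info  # noqa: F401

print("Qwen2-VL classes and qwen-vl-utils are available")
```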
Usage Examples
Inference
vLLM server

```bash
vllm serve osunlp/UGround-V1-7B --api-key token-abc123 --dtype float16
```

or

```bash
python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16
```
You can find more instructions about training and inference in Qwen2-VL's official repository.
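Once the server is running, you can talk to it with any OpenAI-compatible client. A minimal sketch, assuming the server above runs locally on vLLM's default port 8000 (host, port, and key are illustrative):

```python
from openai import AsyncOpenAI

# Point an OpenAI-compatible async client at the local vLLM server started above.
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="token-abc123",               # must match the --api-key passed to vllm serve
)
```

This `client` is the one awaited in the snippet under "Visual Grounding Prompt" below.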
Visual Grounding Prompt
```python
def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

- Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
- If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
- Your answer should be a single string (x, y) corresponding to the point of interest.

Description: {description}

Answer:""",
                },
            ],
        },
    ]
```
```python
# `description` is the referring expression for the target element and
# `base64_image` is the base64-encoded screenshot; `client` is the
# OpenAI-compatible client pointed at the vLLM server above.
messages = format_openai_template(description, base64_image)

completion = await client.chat.completions.create(
    model=args.model_path,  # the served model name, e.g. "osunlp/UGround-V1-7B"
    messages=messages,
    temperature=0,  # REMEMBER to set temperature to ZERO!
)

# The output coordinates are in the range [0, 1000), which is compatible with the
# original Qwen2-VL, so the actual pixel coordinates are (x/1000*width, y/1000*height).
```
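The model answers with a plain string such as `"(492, 367)"`. Below is a minimal sketch of converting that answer into pixel coordinates; the parsing helper and its regular expression are illustrative, not part of the released inference code:

```python
import re

def to_pixels(answer: str, width: int, height: int) -> tuple[int, int]:
    """Convert a "(x, y)" answer on the [0, 1000) scale into pixel coordinates."""
    match = re.search(r"\(?\s*(\d+)\s*,\s*(\d+)\s*\)?", answer)
    if match is None:
        raise ValueError(f"Could not parse coordinates from: {answer!r}")
    x, y = int(match.group(1)), int(match.group(2))
    return round(x / 1000 * width), round(y / 1000 * height)

# Example: on a 1920x1080 screenshot, an answer of "(492, 367)" maps to (945, 396).
print(to_pixels(completion.choices[0].message.content, 1920, 1080))
```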
Quickstart
```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios.
```
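The Quickstart above stops at loading the weights. Below is a minimal end-to-end sketch of running a grounding query through the plain `transformers` API. The screenshot path and description are placeholders, and the prompt is abbreviated; in practice you would use the full template from the "Visual Grounding Prompt" section above, and you may substitute this repo's UGround checkpoint for the base Qwen2-VL one:

```python
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-72B-Instruct")

# Placeholder inputs: substitute a real screenshot path and element description.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/screenshot.png"},
            {"type": "text", "text": "Description: the search button\nAnswer:"},
        ],
    }
]

# Standard Qwen2-VL preprocessing: apply the chat template and extract vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])  # e.g. "(492, 367)"
```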
Documentation
Qwen2-VL-72B-Instruct
Introduction
We're excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
What's New in Qwen2-VL?
Key Enhancements
- SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
- Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, etc.
- Agent that can operate your mobile phones, robots, etc.: with its complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices like mobile phones and robots for automatic operation based on the visual environment and text instructions.
- Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
Model Architecture Updates
- Naive Dynamic Resolution: Unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience (see the sketch after this list for bounding the visual-token budget).
- Multimodal Rotary Position Embedding (M-ROPE): Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
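Because dynamic resolution maps image size directly to the number of visual tokens, the Hugging Face processor exposes `min_pixels` and `max_pixels` to bound that token budget. A minimal sketch; the 256-1280 token range here is illustrative, and each visual token corresponds to a 28x28-pixel patch after merging:

```python
from transformers import AutoProcessor

# Bound the per-image visual-token budget: each token covers a 28x28-pixel patch,
# so these limits keep every image between 256 and 1280 visual tokens.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-72B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```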
We have three models with 2, 8 and 72 billion parameters. This repo contains the instruction-tuned 72B Qwen2-VL model. For more information, visit our Blog and GitHub.
Evaluation
Image Benchmarks
| Benchmark | Previous SoTA (Open-source LVLM) | Claude-3.5 Sonnet | GPT-4o | Qwen2-VL-72B |
|---|---|---|---|---|
| MMMU (val) | 58.3 | 68.3 | 69.1 | 64.5 |
| DocVQA (test) | 94.1 | 95.2 | 92.8 | 96.5 |
| InfoVQA (test) | 82.0 | - | - | 84.5 |
| ChartQA (test) | 88.4 | 90.8 | 85.7 | 88.3 |
| TextVQA (val) | 84.4 | - | - | 85.5 |
| OCRBench | 852 | 788 | 736 | 877 |
| MTVQA | 17.3 | 25.7 | 27.8 | 30.9 |
| VCR (en, easy) | 84.67 | 63.85 | 91.55 | 91.93 |
| VCR (zh, easy) | 22.09 | 1.0 | 14.87 | 65.37 |
| RealWorldQA | 72.2 | 60.1 | 75.4 | 77.8 |
| MME (sum) | 2414.7 | 1920.0 | 2328.7 | 2482.7 |
| MMBench-EN (test) | 86.5 | 79.7 | 83.4 | 86.5 |
| MMBench-CN (test) | 86.3 | 80.7 | 82.1 | 86.6 |
| MMBench-V1.1 (test) | 85.5 | 78.5 | 82.2 | 85.9 |
| MMT-Bench (test) | 63.4 | - | 65.5 | 71.7 |
| MMStar | 67.1 | 62.2 | 63.9 | 68.3 |
| MMVet (GPT-4-Turbo) | 65.7 | 66.0 | 69.1 | 74.0 |
| HallBench (avg) | 55.2 | 49.9 | 55.0 | 58.1 |
| MathVista (testmini) | 67.5 | 67.7 | 63.8 | 70.5 |
| MathVision | 16.97 | - | 30.4 | 25.9 |
Video Benchmarks
| Benchmark | Previous SoTA (Open-source LVLM) | Gemini 1.5-Pro | GPT-4o | Qwen2-VL-72B |
|---|---|---|---|---|
| MVBench | 69.6 | - | - | 73.6 |
| PerceptionTest (test) | 66.9 | - | - | 68.0 |
| EgoSchema (test) | 62.0 | 63.2 | 72.2 | 77.9 |
| Video-MME (wo/w subs) | 66.3/69.6 | 75.0/81.3 | 71.9/77.2 | 71.2/77.8 |
Agent Benchmarks
| Category | Benchmark | Metric | Previous SoTA | GPT-4o | Qwen2-VL-72B |
|---|---|---|---|---|---|
| General | FnCall[1] | TM | - | 90.2 | 93.1 |
|  |  | EM | - | 50.0 | 53.2 |
| Game | Number Line | SR | 89.4[2] | 91.5 | 100.0 |
|  | BlackJack | SR | 40.2[2] | 34.5 | 42.6 |
|  | EZPoint | SR | 50.0[2] | 85.5 | 100.0 |
|  | Point24 | SR | 2.6[2] | 3.0 | 4.5 |
| Android | AITZ | TM | 83.0[3] | 70.0 | 89.6 |
|  |  | EM | 47.7[3] | 35.3 | 72.1 |
| AI2THOR | ALFRED (valid-unseen) | SR | 67.7[4] | - | 67.8 |
|  |  | GC | 75.3[4] | - | 75.8 |
| VLN | R2R (valid-unseen) | SR | 79.0 | 43.7[5] | 51.7 |
|  | REVERIE (valid-unseen) | SR | 61.0 | 31.6[5] | 31.0 |
SR, GC, TM and EM are short for success rate, goal-condition success, type match and exact match. ALFRED is supported by SAM[6].
1. Self-Curated Function Call Benchmark by the Qwen Team
2. Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
3. Android in the Zoo: Chain-of-Action-Thought for GUI Agents
4. ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
5. MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
6. Segment Anything
Multilingual Benchmarks
Models | AR | DE | FR | IT | JA | KO | RU | TH | VI | AVG |
---|---|---|---|---|---|---|---|---|---|---|
Qwen2-VL-72B | 20.7 | 36.5 | 44.1 | 42.8 | 21.6 | 37.4 | 15.6 | 17.7 | 41.6 | 30.9 |
GPT-4o | 20.2 | 34.2 | 41.2 | 32.7 | 20.0 | 33.9 | 11.5 | 22.5 | 34.2 | 27.8 |
Claude3 Opus | 15.1 | 33.4 | 40.6 | 34.4 | 19.4 | 27.2 | 13.0 | 19.5 | 29.1 | 25.7 |
Gemini Ultra | 14.7 | 32.3 | 40.0 | 31.8 | 12.3 | 17.2 | 11.8 | 20.3 | 28.6 | 23.2 |
Technical Details
UGround is a GUI visual grounding model built on Qwen2-VL. It is trained with a simple recipe and achieves strong performance across GUI grounding benchmarks (see the ScreenSpot results above). The underlying Qwen2-VL architecture brings several key updates, such as Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-ROPE), which enhance its visual processing and multimodal understanding capabilities.
License
This project is licensed under the tongyi-qianwen license.
Citation Information
If you find this work useful, please consider citing our papers:
```bibtex
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
```