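🚀 UGround (Initial LLaVA-Based Version)
UGround is a strong GUI visual grounding model trained with a simple recipe. It targets both the accuracy and the efficiency of GUI visual grounding, and is intended to support research and applications that depend on it. The project is a collaboration between OSU NLP Group and Orby AI.
⚠️ Important note
We have since trained stronger models on the same data, based on Qwen2-VL. We recommend using those models instead: they perform better and are easier to train, run inference with, and deploy.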

- Homepage: https://osu-nlp-group.github.io/UGround/
- Code repository: https://github.com/OSU-NLP-Group/UGround
- Paper: https://arxiv.org/abs/2410.05243
- Demo: https://huggingface.co/spaces/orby-osu/UGround
- Point of contact: Boyu Gou
📦 Model Information
🚀 Release Plan
- [x] Model weights
  - [x] Initial version (the one used in the paper)
  - [x] Qwen2-VL-based V1 (2B, 7B, 72B)
- [x] Code
- [x] Training data (V1)
- [x] Online demo (HF Spaces)
✨ Main Results
GUI Visual Grounding: ScreenSpot (Standard Setting)

| Grounding Model | Architecture | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4 | | | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
| GPT-4o | | | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | 18.3 |
| MiniGPT-v2 | MiniGPT-v2 | | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
| Groma | Groma | | 10.3 | 2.6 | 4.6 | 4.3 | 5.7 | 3.4 | 5.2 |
| Fuyu | Fuyu | | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| Qwen-VL | Qwen-VL | | 9.5 | 4.8 | 5.7 | 5.0 | 3.5 | 2.4 | 5.2 |
| SeeClick | Qwen-VL | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| Qwen-GUI | Qwen-VL | GUICourse | 52.4 | 10.9 | 45.9 | 5.7 | 43.0 | 13.6 | 28.6 |
| UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| Qwen2-VL | Qwen2-VL | | 61.3 | 39.3 | 52.0 | 45.0 | 33.0 | 21.8 | 42.1 |
| Aguvis-G-7B | Qwen2-VL | Aguvis-Stage-1 | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | 81.0 |
| Aguvis-7B | Qwen2-VL | Aguvis-Stage-1&2 | 95.6 | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | 83.0 |
| OS-Atlas-Base-4B | InternVL | OS-Atlas | 85.7 | 58.5 | 72.2 | 45.7 | 82.6 | 63.1 | 68.0 |
| OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | 81.0 |
| ShowUI-G | ShowUI | ShowUI | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 75.0 |
| ShowUI | ShowUI | ShowUI | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
| Iris | Iris | SeeClick | 85.3 | 64.2 | 86.7 | 57.5 | 82.6 | 71.2 | 74.6 |
| Aria-UI | Aria | Aria-UI | 92.3 | 73.8 | 93.3 | 64.3 | 86.5 | 76.2 | 81.1 |
| UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 89.4 | 72.0 | 88.7 | 65.7 | 81.3 | 68.9 | 77.7 |
| UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 93.0 | 79.9 | 93.8 | 76.4 | 90.9 | 84.0 | 86.3 |

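
The Avg column in the table above appears to be the unweighted mean of the six per-split scores, rounded to one decimal place. This is an inference from the reported numbers, not something the table states. The short sketch below checks that reading for three rows; the helper name `macro_average` is hypothetical and used only for illustration.

```python
from statistics import mean

# Per-split accuracies copied from the table above, in column order:
# Mobile-Text, Mobile-Icon, Desktop-Text, Desktop-Icon, Web-Text, Web-Icon.
rows = {
    "SeeClick": [78.0, 52.0, 72.2, 30.0, 55.7, 32.5],
    "UGround-V1": [82.8, 60.3, 82.5, 63.6, 80.4, 70.4],
    "UGround-V1-7B (Qwen2-VL)": [93.0, 79.9, 93.8, 76.4, 90.9, 84.0],
}


def macro_average(scores):
    """Unweighted mean over the six splits, rounded to one decimal.

    Assumption: every split counts equally, regardless of how many
    test examples it contains.
    """
    return round(mean(scores), 1)


averages = {name: macro_average(scores) for name, scores in rows.items()}

for name, avg in averages.items():
    print(f"{name}: {avg}")
```

For these three rows the computed values agree with the reported Avg entries (53.4, 73.3, and 86.3). Because the mean is unweighted, a gain on a small split such as Desktop-Icon moves the average as much as the same gain on a larger split.
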

GUI Visual Grounding: ScreenSpot (Agent Setting)

| Planner | Grounding Model | Architecture | SFT Data | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | Qwen-VL | Qwen-VL | | 21.3 | 21.4 | 18.6 | 10.7 | 9.1 | 5.8 | 14.5 |
| GPT-4o | SeeClick | Qwen-VL | SeeClick | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | 52.4 |
| GPT-4o | Qwen-GUI | Qwen-VL | GUICourse | 67.8 | 24.5 | 53.1 | 16.4 | 50.4 | 18.5 | 38.5 |
| GPT-4o | UGround-V1 | LLaVA-UGround-V1 | UGround-V1 | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | 81.4 |
| GPT-4o | OS-Atlas-Base-4B | InternVL | OS-Atlas | 94.1 | 73.8 | 77.8 | 47.1 | 86.5 | 65.3 | 74.1 |
| GPT-4o | OS-Atlas-Base-7B | Qwen2-VL | OS-Atlas | 93.8 | 79.9 | 90.2 | 66.4 | 92.6 | 79.1 | 83.7 |
| GPT-4o | UGround-V1-2B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 77.7 | 92.8 | 63.6 | 90.0 | 70.9 | 81.5 |
| GPT-4o | UGround-V1-7B (Qwen2-VL) | Qwen2-VL | UGround-V1 | 94.1 | 79.9 | 93.3 | 73.6 | 89.6 | 73.3 | 84.0 |

📚 Citation
If you find this work useful, please consider citing our papers:
@article{gou2024uground,
title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2410.05243},
year={2024},
url={https://arxiv.org/abs/2410.05243},
}
@article{zheng2023seeact,
title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
journal={arXiv preprint arXiv:2401.01614},
year={2024},
}