Uground V1 2B

Developed by osunlp

UGround is a strong GUI visual grounding model trained with a simple recipe. It was developed jointly by OSUNLP and Orby AI.

Multimodal Fusion

Transformers

English · Open Source License: Apache-2.0 · #GUI visual grounding #Multimodal interaction #High-precision coordinate prediction

Downloads 975

Release Time : 1/3/2025

Model Overview

UGround is a model specialized for GUI visual grounding: it locates specific elements or objects on a screen and applies to a range of GUI interaction scenarios.

Model Features

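Strong GUI visual grounding

Accurately locates specific elements or objects on screen and identifies the various components of a GUI.

Simple training recipe

Uses a simple, effective training strategy to reach high visual grounding performance.

Multi-size image processing

Handles images of varying resolutions and aspect ratios, adapting to different GUIs.

Multilingual support

Understands text in multiple languages inside images, in addition to English and Chinese.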

Model Capabilities

GUI element grounding

Visual question answering

Multimodal understanding

Cross-lingual text recognition

Complex reasoning and decision-making

Use Cases

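Automated testing

Automatic GUI element identification

Automatically identifies and locates elements such as buttons and text boxes in an application interface.

Improves the accuracy and efficiency of automated testing.

Assistive technology

Visual assistance tools

Helps visually impaired users understand and operate GUIs.

Improves accessibility.

Robot control

Vision-based robot operation

Has robots carry out tasks through a GUI.

Enables more natural robot interaction.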

## 🚀 UGround-V1-2B (Qwen2-VL-Based)

UGround is a strong GUI visual grounding model trained with a simple recipe. See the homepage and the paper for details. This work is a collaboration between [OSUNLP](https://x.com/osunlp) and [Orby AI](https://www.orby.ai/).
![radar](https://osu-nlp-group.github.io/UGround/static/images/radar.png)
- **Homepage:** https://osu-nlp-group.github.io/UGround/
- **Repository:** https://github.com/OSU-NLP-Group/UGround
- **Paper (ICLR'25 Oral):** https://arxiv.org/abs/2410.05243
- **Demo:** https://huggingface.co/spaces/orby-osu/UGround
- **Point of Contact:** [Boyu Gou](mailto:gou.43@osu.edu)

## ✨ Main Features

### Models
- Model-V1:
  - [Initial UGround](https://huggingface.co/osunlp/UGround)
  - [UGround-V1-2B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-2B)
  - [UGround-V1-7B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-7B)
  - [UGround-V1-72B (Qwen2-VL)](https://huggingface.co/osunlp/UGround-V1-72B)
  - [Training data](https://huggingface.co/datasets/osunlp/UGround-V1-Data)

### Release Plan
- [x] [Model weights](https://huggingface.co/collections/osunlp/uground-677824fc5823d21267bc9812)
  - [x] Initial version (the one used in the paper)
  - [x] Qwen2-VL-based V1 (2B, 7B, 72B)
- [x] Code
  - [x] [Inference code of UGround (initial and Qwen2-VL-based versions)](https://github.com/boyugou/llava_uground/)
  - [x] Offline experiments (code, results, and useful resources)
    - [x] [ScreenSpot](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/ScreenSpot)
    - [x] [Multimodal-Mind2Web](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/Multimodal-Mind2Web)
    - [x] [OmniAct](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/OmniACT)
    - [x] [Android Control](https://github.com/OSU-NLP-Group/UGround/tree/main/offline_evaluation/AndroidControl)
  - [x] Online experiments
    - [x] [Mind2Web-Live-SeeAct-V](https://github.com/boyugou/Mind2Web_Live_SeeAct_V)
    - [x] [AndroidWorld-SeeAct-V](https://github.com/boyugou/android_world_seeact_v)
  - [ ] Data synthesis pipeline (coming soon)
- [x] [Training data (V1)](https://huggingface.co/datasets/osunlp/UGround-V1-Data)
- [x] Online demo (HF Spaces)

### Main Results
#### GUI Visual Grounding: ScreenSpot (Standard Setting)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6500870f1e14749e84f8f887/hVwF_cOjLiUF0W0VUyxtp.png)

| ScreenSpot (Standard)           | Arch             | SFT data           | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg      |
| ------------------------------- | ---------------- | ------------------ | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| InternVL-2-4B                   | InternVL-2       |                    | 9.2         | 4.8         | 4.6          | 4.3          | 0.9      | 0.1      | 4.0      |
| Groma                           | Groma            |                    | 10.3        | 2.6         | 4.6          | 4.3          | 5.7      | 3.4      | 5.2      |
| Qwen-VL                         | Qwen-VL          |                    | 9.5         | 4.8         | 5.7          | 5.0          | 3.5      | 2.4      | 5.2      |
| MiniGPT-v2                      | MiniGPT-v2       |                    | 8.4         | 6.6         | 6.2          | 2.9          | 6.5      | 3.4      | 5.7      |
| GPT-4                           |                  |                    | 22.6        | 24.5        | 20.2         | 11.8         | 9.2      | 8.8      | 16.2     |
| GPT-4o                          |                  |                    | 20.2        | 24.9        | 21.1         | 23.6         | 12.2     | 7.8      | 18.3     |
| Fuyu                            | Fuyu             |                    | 41.0        | 1.3         | 33.0         | 3.6          | 33.9     | 4.4      | 19.5     |
| Qwen-GUI                        | Qwen-VL          | GUICourse          | 52.4        | 10.9        | 45.9         | 5.7          | 43.0     | 13.6     | 28.6     |
| Ferret-UI-Llama8b               | Ferret-UI        |                    | 64.5        | 32.3        | 45.9         | 11.4         | 28.3     | 11.7     | 32.3     |
| Qwen2-VL                        | Qwen2-VL         |                    | 61.3        | 39.3        | 52.0         | 45.0         | 33.0     | 21.8     | 42.1     |
| CogAgent                        | CogAgent         |                    | 67.0        | 24.0        | 74.2         | 20.0         | 70.4     | 28.6     | 47.4     |
| SeeClick                        | Qwen-VL          | SeeClick           | 78.0        | 52.0        | 72.2         | 30.0         | 55.7     | 32.5     | 53.4     |
| OS-Atlas-Base-4B                | InternVL-2       | OS-Atlas           | 85.7        | 58.5        | 72.2         | 45.7         | 82.6     | 63.1     | 68.0     |
| OmniParser                      |                  |                    | 93.9        | 57.0        | 91.3         | 63.6         | 81.3     | 51.0     | 73.0     |
| **UGround**                     | LLaVA-UGround-V1 | UGround-V1         | 82.8        | **60.3**    | 82.5         | **63.6**     | 80.4     | **70.4** | **73.3** |
| Iris                            | Iris             | SeeClick           | 85.3        | 64.2        | 86.7         | 57.5         | 82.6     | 71.2     | 74.6     |
| ShowUI-G                        | ShowUI           | ShowUI             | 91.6        | 69.0        | 81.8         | 59.0         | 83.0     | 65.5     | 75.0     |
| ShowUI                          | ShowUI           | ShowUI             | 92.3        | 75.5        | 76.3         | 61.1         | 81.7     | 63.6     | 75.1     |
| Molmo-7B-D                      |                  |                    | 85.4        | 69.0        | 79.4         | 70.7         | 81.3     | 65.5     | 75.2     |
| **UGround-V1-2B (Qwen2-VL)**    | Qwen2-VL         | UGround-V1         | 89.4        | 72.0        | 88.7         | 65.7         | 81.3     | 68.9     | 77.7     |
| Molmo-72B                       |                  |                    | 92.7        | 79.5        | 86.1         | 64.3         | 83.0     | 66.0     | 78.6     |
| Aguvis-G-7B                     | Qwen2-VL         | Aguvis-Stage-1     | 88.3        | 78.2        | 88.1         | 70.7         | 85.7     | 74.8     | 81.0     |
| OS-Atlas-Base-7B                | Qwen2-VL         | OS-Atlas           | 93.0        | 72.9        | 91.8         | 62.9         | 90.9     | 74.3     | 81.0     |
| Aria-UI                         | Aria             | Aria-UI            | 92.3        | 73.8        | 93.3         | 64.3         | 86.5     | 76.2     | 81.1     |
| Claude (Computer-Use)           |                  |                    | **98.2**    | **85.6**    | 79.9         | 57.1         | **92.2** | **84.5** | 82.9     |
| Aguvis-7B                       | Qwen2-VL         | Aguvis-Stage-1&2   | 95.6        | 77.7        | **93.8**     | 67.1         | 88.3     | 75.2     | 83.0     |
| Project Mariner                 |                  |                    |             |             |              |              |          |          | 84.0     |
| **UGround-V1-7B (Qwen2-VL)**    | Qwen2-VL         | UGround-V1         | 93.0        | 79.9        | **93.8**     | **76.4**     | 90.9     | 84.0     | **86.3** |
| *AGUVIS-72B*                    | *Qwen2-VL*       | *Aguvis-Stage-1&2* | *94.5*      | *85.2*      | *95.4*       | *77.9*       | *91.3*   | *85.9*   | *88.4*   |
| ***UGround-V1-72B (Qwen2-VL)*** | *Qwen2-VL*       | *UGround-V1*       | *94.1*      | *83.4*      | *94.9*       | *85.7*       | *90.4*   | *87.9*   | *89.4*   |

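As a quick sanity check on how to read the table, the Avg column appears to be the unweighted mean of the six split scores. That is an observation from the numbers rather than something the table states, so treat the sketch below as an assumption; `unweighted_average` is a hypothetical helper name.

```python
# Per-split scores copied from the UGround-V1-2B (Qwen2-VL) row of the table above.
scores = {
    "Mobile-Text": 89.4,
    "Mobile-Icon": 72.0,
    "Desktop-Text": 88.7,
    "Desktop-Icon": 65.7,
    "Web-Text": 81.3,
    "Web-Icon": 68.9,
}


def unweighted_average(values):
    """Plain arithmetic mean; assumes every split carries equal weight."""
    values = list(values)
    return sum(values) / len(values)


avg = unweighted_average(scores.values())
print(round(avg, 1))  # matches the 77.7 reported in the Avg column
```
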
#### GUI Visual Grounding: ScreenSpot (Agent Setting)
| Planner | Agent-ScreenSpot             | Arch             | SFT data         | Mobile-Text | Mobile-Icon | Desktop-Text | Desktop-Icon | Web-Text | Web-Icon | Avg      |
| ------- | ---------------------------- | ---------------- | ---------------- | ----------- | ----------- | ------------ | ------------ | -------- | -------- | -------- |
| GPT-4o  | Qwen-VL                      | Qwen-VL          |                  | 21.3        | 21.4        | 18.6         | 10.7         | 9.1      | 5.8      | 14.5     |
| GPT-4o  | Qwen-GUI                     | Qwen-VL          | GUICourse        | 67.8        | 24.5        | 53.1         | 16.4         | 50.4     | 18.5     | 38.5     |
| GPT-4o  | SeeClick                     | Qwen-VL          | SeeClick         | 81.0        | 59.8        | 69.6         | 33.6         | 43.9     | 26.2     | 52.4     |
| GPT-4o  | OS-Atlas-Base-4B             | InternVL-2       | OS-Atlas         | **94.1**    | 73.8        | 77.8         | 47.1         | 86.5     | 65.3     | 74.1     |
| GPT-4o  | OS-Atlas-Base-7B             | Qwen2-VL         | OS-Atlas         | 93.8        | **79.9**    | 90.2         | 66.4         | **92.6** | **79.1** | 83.7     |
| GPT-4o  | **UGround-V1**               | LLaVA-UGround-V1 | UGround-V1       | 93.4        | 76.9        | 92.8         | 67.9         | 88.7     | 68.9     | 81.4     |
| GPT-4o  | **UGround-V1-2B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | **94.1**    | 77.7        | 92.8         | 63.6         | 90.0     | 70.9     | 81.5     |
| GPT-4o  | **UGround-V1-7B (Qwen2-VL)** | Qwen2-VL         | UGround-V1       | **94.1**    | **79.9**    | **93.3**     | **73.6**     | 89.6     | 73.3     | **84.0** |

## 💻 Usage Examples
### Inference
#### vLLM server
```bash
vllm serve osunlp/UGround-V1-7B  --api-key token-abc123 --dtype float16
```
or
```bash
python -m vllm.entrypoints.openai.api_server --served-model-name osunlp/UGround-V1-7B --model osunlp/UGround-V1-7B --dtype float16
```
For detailed instructions on training and inference, see the official Qwen2-VL repository.

#### Visual grounding prompt
```python
def format_openai_template(description: str, base64_image):
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
                {
                    "type": "text",
                    "text": f"""
  Your task is to help the user identify the precise coordinates (x, y) of a specific area/element/object on the screen based on a description.

  - Your response should aim to point to the center or a representative point within the described area/element/object as accurately as possible.
  - If the description is unclear or ambiguous, infer the most relevant area or element based on its likely context or purpose.
  - Your answer should be a single string (x, y) corresponding to the point of the interest.

  Description: {description}

  Answer:"""
                },
            ],
        },
    ]

# `description`, `base64_image`, `client`, and `args` are supplied by the caller.
# `client` is assumed to be an async OpenAI-compatible client (e.g. openai.AsyncOpenAI)
# pointed at the vLLM server; the `await` below must run inside an async function.
messages = format_openai_template(description, base64_image)

completion = await client.chat.completions.create(
    model=args.model_path,
    messages=messages,
    temperature=0  # REMEMBER to set temperature to ZERO!
)

# The output will be in the range of [0,1000), which is compatible with the original Qwen2-VL
# So the actual coordinates should be (x/1000*width, y/1000*height)
```
## 📚 Documentation

### Citation Information

If you find this work useful, please consider citing the following papers:

```bibtex
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}
```

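## 🚀 Quick Start

### Qwen2-VL-2B-Instruct

#### Introduction

We are excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model and the result of nearly a year of innovation.

#### What's New in Qwen2-VL?

**Key enhancements**

* **State-of-the-art understanding of images of various resolutions and ratios**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA.
* **Understanding videos of 20+ minutes**: Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, and more.
* **Agents that can operate mobile phones, robots, and other devices**: with complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.
* **Multilingual support**: to serve global users, Qwen2-VL supports understanding text in various languages inside images in addition to English and Chinese, including most European languages, Japanese, Korean, Arabic, and Vietnamese.

**Model architecture updates**

* **Naive Dynamic Resolution**: unlike before, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens and offering a more human-like visual processing experience.
* **Multimodal Rotary Position Embedding (M-ROPE)**: decomposes the positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing multimodal processing capabilities.

We have three models with 2, 7, and 72 billion parameters. This repository contains the instruction-tuned 2B Qwen2-VL model. For more information, see the [blog](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub](https://github.com/QwenLM/Qwen2-VL).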

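#### Evaluation

**Image benchmarks**

| Benchmark | InternVL2-2B | MiniCPM-V 2.0 | Qwen2-VL-2B |
| --------- | ------------ | ------------- | ----------- |
| MMMU_val | 36.3 | 38.2 | 41.1 |
| DocVQA_test | 86.9 | - | 90.1 |
| InfoVQA_test | 58.9 | - | 65.5 |
| ChartQA_test | 76.2 | - | 73.5 |
| TextVQA_val | 73.4 | - | 79.7 |
| OCRBench | 781 | 605 | 794 |
| MTVQA | - | - | 20.0 |
| VCR_{en easy} | - | - | 81.45 |
| VCR_{zh easy} | - | - | 46.16 |
| RealWorldQA | 57.3 | 55.8 | 62.9 |
| MME_sum | 1876.8 | 1808.6 | 1872.0 |
| MMBench-EN_test | 73.2 | 69.1 | 74.9 |
| MMBench-CN_test | 70.9 | 66.5 | 73.5 |
| MMBench-V1.1_test | 69.6 | 65.8 | 72.2 |
| MMT-Bench_test | - | - | 54.5 |
| MMStar | 49.8 | 39.1 | 48.0 |
| MMVet_GPT-4-Turbo | 39.7 | 41.0 | 49.5 |
| HallBench_avg | 38.0 | 36.1 | 41.7 |
| MathVista_testmini | 46.0 | 39.8 | 43.0 |
| MathVision | - | - | 12.4 |

**Video benchmarks**

| Benchmark | Qwen2-VL-2B |
| --------- | ----------- |
| MVBench | 63.2 |
| PerceptionTest_test | 53.9 |
| EgoSchema_test | 54.9 |
| Video-MME_{wo/w subs} | 55.6/60.4 |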

#### Requirements

The code for Qwen2-VL is included in the latest Hugging Face transformers. We recommend building from source with the following command:

```bash
pip install git+https://github.com/huggingface/transformers
```

Otherwise, you may encounter the following error:

```
KeyError: 'qwen2_vl'
```
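
Before loading the model, it can help to check whether the installed transformers is recent enough to avoid the `KeyError` above. The sketch below uses only the standard library. The minimum version `4.45.0` is an assumption about when Qwen2-VL support reached a tagged release, and `supports_qwen2_vl` and the other helper names are hypothetical.

```python
from importlib import metadata
from typing import Optional, Tuple

# Assumption: Qwen2-VL support first shipped in a tagged release with transformers 4.45.0.
MIN_VERSION = (4, 45, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '4.45.0' or '4.46.0.dev0' into a 3-part tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad short versions such as '4.45' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def supports_qwen2_vl(version: str) -> bool:
    return parse_version(version) >= MIN_VERSION


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


installed = installed_transformers_version()
if installed is None:
    print("transformers is not installed")
elif supports_qwen2_vl(installed):
    print(f"transformers {installed} should recognise qwen2_vl")
else:
    print(f"transformers {installed} is likely too old; expect KeyError: 'qwen2_vl'")
```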

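#### Quickstart

We provide a toolkit that makes it more convenient to handle various types of visual input, including base64, URLs, and interleaved images and videos. Install it with the following command:

```bash
pip install qwen-vl-utils
```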

The following code snippet shows how to use the chat model with transformers and qwen_vl_utils:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# (This variant also needs `import torch`.)
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
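
The commented-out `min_pixels` and `max_pixels` lines in the snippet above express a visual-token budget as a pixel count using a factor of `28*28`. The sketch below makes that arithmetic explicit. It assumes, following that snippet, that one visual token corresponds to a 28x28 pixel patch; `pixel_budget` and `approx_visual_tokens` are hypothetical helper names, and the estimate ignores any resizing or rounding the processor applies internally.

```python
# Assumption taken from the snippet above: one visual token covers a 28x28 pixel patch.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(num_tokens: int) -> int:
    """Convert a visual-token budget into the pixel count passed as min_pixels/max_pixels."""
    return num_tokens * PIXELS_PER_TOKEN


def approx_visual_tokens(width: int, height: int) -> int:
    """Rough token estimate for an image of the given size (no resizing applied)."""
    return (width * height) // PIXELS_PER_TOKEN


min_pixels = pixel_budget(256)
max_pixels = pixel_budget(1280)
print(min_pixels, max_pixels)

# A 1280x720 screenshot lands inside the 256-1280 token range under this estimate.
tokens = approx_visual_tokens(1280, 720)
print(tokens, 256 <= tokens <= 1280)
```

A smaller `max_pixels` reduces memory use and latency, at the cost of downscaling large screenshots before the model sees them.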

<details>
<summary>Without qwen_vl_utils</summary>

```python
from PIL import Image
import requests
import torch
from torchvision import io
from typing import Dict
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct")

# Image
url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preprocess the inputs
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>\n<|im_start|>assistant\n'

inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)
inputs = inputs.to("cuda")

# Inference: Generation (same steps as in the qwen_vl_utils example above)
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

</details>

## 📄 License
This project is released under the Apache-2.0 license.
