Fintor-GUI-S2 Open-Source Model: Specialized for GUI Multimodal Tasks, with Free Support for Interface Operation

Fintor GUI S2

Developed by Fintor

Fintor-GUI-S2 is a GUI grounding model fine-tuned from UI-TARS-7B-DPO, specialized for multimodal tasks on graphical user interfaces (GUIs).

Image-Text-to-Text

Transformers

Open-source license: Apache-2.0 #GUI multimodal understanding #Screen element localization #Instruction fine-tuning enhancement

Downloads: 190

Release date: 3/12/2025

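Model Overview

This multimodal model is optimized for graphical user interfaces (GUIs). It understands GUI-related text and image content and generates text in response.

Model Features

GUI optimization

Fine-tuned specifically for graphical user interface tasks, where it performs strongly.

Multimodal capability

Processes image and text information together for cross-modal understanding and generation.

Performance improvement

Shows a marked improvement over the base model on the Screenspot benchmark.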

Model Capabilities

GUI image understanding

Cross-modal text generation

GUI element recognition

Multimodal reasoning

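Use Cases

GUI automation

GUI element description generation

Generates descriptive text for interface elements from a GUI screenshot.

Achieves an accuracy of 91.8 on the Screenspot v2 benchmark.

GUI operation guide

Generates a description of operation steps from a GUI image.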
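
The accuracy figure above is reported without a description of how it was computed. GUI grounding benchmarks of this kind are commonly scored by checking whether the predicted click point falls inside the target element's bounding box. The following is a minimal sketch under that assumption; the function names and the `(left, top, right, bottom)` box layout are illustrative and do not come from the model card.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]              # (x, y) predicted click position
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(predictions: Iterable[Point], targets: Iterable[Box]) -> float:
    """Percentage of predicted points that land inside their target box."""
    pairs = list(zip(predictions, targets))
    if not pairs:
        raise ValueError("at least one prediction/target pair is required")
    hits = sum(point_in_box(p, b) for p, b in pairs)
    return 100.0 * hits / len(pairs)


# Hypothetical data: the first click hits its element, the second misses.
preds = [(5.0, 5.0), (50.0, 50.0)]
boxes = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
score = click_accuracy(preds, boxes)
```

Predictions and boxes must share one coordinate space (both in pixels or both normalized) for the comparison to be meaningful.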

🚀 Fintor-GUI-S2

Fintor-GUI-S2 is a GUI grounding model fine-tuned from UI-TARS-7B-DPO. It is specialized for tasks that take an image and text as input and produce text as output.

📄 License

This model is released under the Apache-2.0 license.

Property	Details
Model type	Multimodal GUI grounding model
Training data	OS-Copilot/OS-Atlas-data
Base model	bytedance-research/UI-TARS-7B-DPO
Pipeline tag	image-text-to-text
Library name	transformers
Tags	multimodal, GUI

🚀 Quick Start

import torch  # required for torch.bfloat16 below
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Fintor/Ui-Tars-7B-Instruct-Finetuned-Os-Atlas", 
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
# default processor
processor = AutoProcessor.from_pretrained("Fintor/Ui-Tars-7B-Instruct-Finetuned-Os-Atlas")
# Example input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/image.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device the model was placed on by device_map="auto"
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
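
Two parts of the snippet above often need adjusting in practice: `attn_implementation="flash_attention_2"` only works when the separate `flash_attn` package is installed, and the `messages` list is rebuilt for every screenshot and instruction. The sketch below shows one way to handle both. The helper names `pick_attn_implementation` and `build_messages` are hypothetical and not part of the model card or of `transformers`; `"sdpa"` is the PyTorch scaled-dot-product attention option that `transformers` accepts for `attn_implementation`.

```python
import importlib.util
from typing import Any, Dict, List


def pick_attn_implementation() -> str:
    """Use FlashAttention 2 only if the flash_attn package can be imported."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # fallback that needs no extra package


def build_messages(image_path: str, instruction: str) -> List[Dict[str, Any]]:
    """Build a single-turn chat message in the format used in the Quick Start."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": instruction},
            ],
        }
    ]


# Hypothetical screenshot path and instruction, for illustration only.
attn = pick_attn_implementation()
messages = build_messages("path/to/screenshot.png", "Describe the Save button.")
```

The result of `pick_attn_implementation()` can be passed as `attn_implementation=` to `from_pretrained`, and the returned `messages` can be fed to `processor.apply_chat_template` and `process_vision_info` exactly as in the Quick Start.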