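Sarashina2-Vision-14B: an open-source 14B Japanese vision-language model with strong image encoding and good benchmark results

Sarashina2 Vision 14b

Developed by sbintuitions

Sarashina2-Vision-14B is a Japanese large vision-language model developed by SB Intuitions. It combines Sarashina2-13B with the image encoder of Qwen2-VL-7B and performs well on several benchmarks.

Image-to-Text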

Transformers

Multilingual  Open-source license: MIT  #Japanese visual question answering #Multimodal reasoning #High-accuracy image understanding

Downloads: 192

Released: 3/9/2025

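Model Overview

This is a multimodal vision-language model that understands images and generates text about them, which makes it suited to tasks such as image analysis and visual question answering.

Model Features

High-performance vision-language model

Scores at the highest level on several benchmarks compared with other Japanese vision-language models.

Multimodal support

Processes image and text input together, integrating vision and language.

Multi-stage training

Model performance is optimized through a three-stage training process that tunes the projector, the vision encoder, and the large language model.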

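Model Capabilities

Image analysis

Visual question answering

Multimodal understanding

Text generation

Use Cases

Image understanding

Landmark identification

Identifies famous buildings in a photo and describes where they appear.

For example, it can identify Tokyo Tower and describe where it appears in the image.

Object identification

Identifies specific objects in a photo.

For example, it can identify a crane.

Visual question answering

Answering questions about an image

Answers user questions based on the image content.

Can generate detailed, accurate answers.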

🚀 Sarashina2-Vision-14B

Sarashina2-Vision-14B is a Japanese large vision-language model trained by SB Intuitions. It is built on Sarashina2-13B and the image encoder of Qwen2-VL-7B. As of March 7, 2025, it achieves the highest level of scores on four benchmarks compared with other Japanese vision-language models.

🚀 Quick Start

📦 Installation

Install the dependencies with the following command.

pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate

💻 Usage Example

Basic usage

The following script loads the model and runs inference.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Define model path
model_path = "sbintuitions/sarashina2-vision-14b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

message = [{"role": "user", "content": "この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？"}]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？
### Assistant:"""

sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-14b/resolve/main/sample.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
stopping_criteria = processor.get_stopping_criteria(["\n###"])

# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.0,
    do_sample=False,
    stopping_criteria=stopping_criteria,
)
generated_ids = [
    output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。"""

実際の例

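Prompt	Output
この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？ (What is the most famous building in this photo, and where does it appear?)	この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。 (The most famous building in this photo is Tokyo Tower. It appears on the right side of the Tokyo cityscape.)
真ん中に映っている赤と白の物は何ですか？ (What is the red and white object in the middle?)	赤と白の物はクレーンです。 (The red and white object is a crane.)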
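
The `text_prompt` comment in the script above shows what `processor.apply_chat_template` returns for a single user message. As a minimal sketch, the same string can be rebuilt with plain string formatting, which makes the layout easier to see. `SYSTEM_PROMPT` and `render_single_turn_prompt` are hypothetical names introduced here for illustration; they are not part of the model's API, and the sketch assumes a single-turn conversation only. In real use, call `processor.apply_chat_template` as the script does.

```python
# Hypothetical constant: the system text copied from the rendered prompt
# shown in the script's text_prompt comment.
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)


def render_single_turn_prompt(question: str) -> str:
    """Hypothetical helper: rebuild the single-turn prompt layout by hand."""
    # The leading special tokens are copied verbatim from the rendered prompt;
    # their exact role inside the model is not described in this document.
    header = "<s><|prefix|><|file|><|suffix|>"
    return f"{header}{SYSTEM_PROMPT}\n\n### Human: {question}\n### Assistant:"


prompt = render_single_turn_prompt("真ん中に映っている赤と白の物は何ですか？")
print(prompt)
```

Because every turn in this layout starts with `###`, the script's stop string `"\n###"` presumably ends generation as soon as the model begins a new turn, so only the assistant's answer is returned.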

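📚 Documentation

🔧 Technical Details

Sarashina2-Vision is created through the following three-stage training process.

1. Tune the projector parameters using caption datasets.
2. Tune the vision encoder and projector parameters using caption datasets.
3. Tune the projector and large language model (LLM) parameters using a visual instruction dataset.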
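
The three stages differ only in which components are tuned and on what kind of data. The sketch below records that schedule as plain data and derives which components are left untouched in each stage. It is an illustration, not the authors' training code: the component names, the `Stage` class, and the `frozen_components` helper are hypothetical, and treating every component not listed for a stage as frozen is an assumption.

```python
from dataclasses import dataclass

# Hypothetical component names for the three parts mentioned in the text.
COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str
    trainable: frozenset  # components whose parameters are tuned in this stage


# Schedule transcribed from the three steps listed above.
STAGES = (
    Stage("stage 1", "caption", frozenset({"projector"})),
    Stage("stage 2", "caption", frozenset({"vision_encoder", "projector"})),
    Stage("stage 3", "visual instruction", frozenset({"projector", "llm"})),
)


def frozen_components(stage: Stage) -> list:
    """Components not tuned in the stage (assumed to be kept frozen)."""
    return [c for c in COMPONENTS if c not in stage.trainable]


for stage in STAGES:
    tuned = ", ".join(c for c in COMPONENTS if c in stage.trainable)
    frozen = ", ".join(frozen_components(stage))
    print(f"{stage.name} ({stage.dataset} data): tune [{tuned}], freeze [{frozen}]")
```

Laid out this way, the projector is the only component tuned in all three stages, while the vision encoder and the LLM are each tuned in exactly one.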

Evaluation Results

Model	Model size	JMMMU^*1	Heron-Bench^*2	JDocQA
heron-chat-git-ja-stablelm-base-7b-v1	7B	0.294	0.461	0.069
llava-calm2-siglip	7B	0.07	0.521	0.084
Llama-3-EvoVLM-JP-v2	8B	0.389	0.509	0.103
Asagi-14B	14B	0.302	0.433	0.06
llm-jp-3-vila-14b	14B	0.23	0.665	0.176
EZO-InternVL2-26B	26B	0.389	0.609	0.196
Sarashina2-Vision-8B	8B	0.393	0.648	0.229
Sarashina2-Vision-14B	14B	0.433	0.644	0.245
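
To read the table column by column, the short sketch below copies the scores into a dictionary and picks the top model for each benchmark. The numbers are transcribed from the table above; the `SCORES` and `best_per_benchmark` names are hypothetical, and the sketch assumes that a higher score is better on all three benchmarks.

```python
BENCHMARKS = ("JMMMU", "Heron-Bench", "JDocQA")

# Scores transcribed from the evaluation table, in BENCHMARKS order.
SCORES = {
    "heron-chat-git-ja-stablelm-base-7b-v1": (0.294, 0.461, 0.069),
    "llava-calm2-siglip": (0.07, 0.521, 0.084),
    "Llama-3-EvoVLM-JP-v2": (0.389, 0.509, 0.103),
    "Asagi-14B": (0.302, 0.433, 0.06),
    "llm-jp-3-vila-14b": (0.23, 0.665, 0.176),
    "EZO-InternVL2-26B": (0.389, 0.609, 0.196),
    "Sarashina2-Vision-8B": (0.393, 0.648, 0.229),
    "Sarashina2-Vision-14B": (0.433, 0.644, 0.245),
}


def best_per_benchmark(scores: dict) -> dict:
    """Return {benchmark: (model, score)} for the highest score in each column."""
    best = {}
    for i, bench in enumerate(BENCHMARKS):
        model = max(scores, key=lambda m: scores[m][i])
        best[bench] = (model, scores[model][i])
    return best


for bench, (model, score) in best_per_benchmark(SCORES).items():
    print(f"{bench}: {model} ({score})")
```

On the numbers as listed, Sarashina2-Vision-14B has the highest JMMMU and JDocQA scores, while llm-jp-3-vila-14b has the highest Heron-Bench score.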