Sarashina2-Vision-14B: Open-Source Japanese Vision-Language Model with a Strong Image Encoder and Top Benchmark Results

Sarashina2 Vision 14b

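Developed by sbintuitions

Sarashina2-Vision-14B is a Japanese large vision-language model developed by SB Intuitions. It combines Sarashina2-13B with the image encoder of Qwen2-VL-7B and scores well on several benchmarks.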

Image-to-Text

Transformers

Supports multiple languages | License: MIT | #JapaneseVisualQA #MultimodalReasoning #HighPrecisionImageUnderstanding

Downloads: 192

Released: 3/9/2025

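Model Overview

Sarashina2-Vision-14B is a multimodal vision-language model that understands images and generates text about them. It is suited to tasks such as image analysis and visual question answering.

Model Features

High-performance vision-language model

Achieves top-level scores on several benchmarks compared with other Japanese vision-language models.

Multimodal support

Accepts image and text inputs together, combining vision and language.

Multi-stage training

Trained in three stages that tune the projector, the vision encoder, and the large language model.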

Model Capabilities

Image analysis

Visual question answering

Multimodal understanding

Text generation

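Use Cases

Image understanding

Recognizing famous buildings

Identify a famous building in a photo and describe where it appears.

In the sample image, the model identifies Tokyo Tower and places it on the right side of the cityscape.

Object recognition

Identify specific objects in a photo.

In the sample image, the model identifies the red-and-white object as a crane.

Visual question answering

Answering questions about images

Answer user questions based on the content of an image.

The model produces detailed answers based on what is visible in the image.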

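🚀 Sarashina2-Vision-14B

Sarashina2-Vision-14B is a Japanese large vision-language model trained by SB Intuitions. It is built on Sarashina2-13B and the image encoder of Qwen2-VL-7B. As of March 7, 2025, it achieved the highest scores on four benchmarks compared with other Japanese vision-language models.

🚀 Quick Start

✨ Key Features

Combines the Sarashina2-13B base model with the Qwen2-VL-7B image encoder for vision-language processing.
Achieves top-level scores among Japanese vision-language models on multiple benchmarks.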

📦 Installation

1. Install dependencies

pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate
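
Because the command above pins transformers to 4.47.0, it can help to confirm what is actually installed before loading the model. The following is a minimal sketch using only the Python standard library. The helper names `installed_version` and `check_environment` are illustrative assumptions, not part of the model's tooling.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Packages named in the pip command above; only transformers is pinned there.
REQUIRED = [
    "transformers", "torch", "torchvision", "pillow",
    "protobuf", "sentencepiece", "accelerate",
]
PINNED = {"transformers": "4.47.0"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_environment() -> list[str]:
    """Return human-readable problems; the list is empty if nothing is wrong."""
    problems = []
    for package in REQUIRED:
        found = installed_version(package)
        if found is None:
            problems.append(f"{package} is not installed")
        elif package in PINNED and found != PINNED[package]:
            problems.append(f"{package} is {found}, expected {PINNED[package]}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

An empty result only means the packages are present and the pin matches; it does not verify GPU or CUDA availability.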

💻 Usage Examples

Basic usage

The following script loads the model and runs inference:

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Define model path
model_path = "sbintuitions/sarashina2-vision-14b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Japanese prompt: "What is the most famous building in this photo, and where in the photo is it?"
message = [{"role": "user", "content": "この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？"}]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？
### Assistant:"""

sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-14b/resolve/main/sample.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
stopping_criteria = processor.get_stopping_criteria(["\n###"])

# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.0,
    do_sample=False,
    stopping_criteria=stopping_criteria,
)
generated_ids = [
    output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。"""

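Three steps in the script are easy to miss: the prompt layout, the slicing that removes the echoed prompt from the generated ids, and the `"\n###"` stop sequence. The sketch below reproduces them on plain Python strings and lists so they can be inspected without a GPU. The prompt layout is reconstructed from the `text_prompt` string shown in the script, not taken from the model's actual chat template. `build_prompt`, `strip_prompt_tokens`, and `truncate_at_stop` are hypothetical helper names, and `truncate_at_stop` is a string-level stand-in for the token-level stopping criteria used during generation.

```python
SYSTEM_TEXT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)


def build_prompt(user_text: str) -> str:
    """Rebuild the single-turn layout shown in the text_prompt string (illustrative only)."""
    return (
        f"<s><|prefix|><|file|><|suffix|>{SYSTEM_TEXT}\n\n"
        f"### Human: {user_text}\n### Assistant:"
    )


def strip_prompt_tokens(
    input_ids: list[list[int]], output_ids: list[list[int]]
) -> list[list[int]]:
    """Drop the echoed prompt from each generated sequence by slicing off its length."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


def truncate_at_stop(text: str, stop: str = "\n###") -> str:
    """Cut decoded text at the first occurrence of the stop string, if present."""
    index = text.find(stop)
    return text if index == -1 else text[:index]


# Toy token ids: the output repeats the three prompt ids, then adds two new ones.
new_tokens = strip_prompt_tokens([[1, 2, 3]], [[1, 2, 3, 7, 8]])
print(new_tokens)
print(truncate_at_stop("東京タワーです。\n### Human: next question"))
```

The slicing works because `generate` returns the prompt ids followed by the newly generated ids, so removing the first `len(input_ids)` entries leaves only the answer.
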
Examples

Prompt	Output
この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？	この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。
真ん中に映っている赤と白の物は何ですか？	赤と白の物はクレーンです。

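🔧 Technical Details

Training process

Sarashina2-Vision is created through the following three-stage training process:

1. Tune the projector parameters using caption datasets.
2. Tune the vision encoder and projector parameters using caption datasets.
3. Tune the projector and LLM parameters using visual instruction datasets.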
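
The three stages above differ mainly in which components are updated and which stay frozen. The sketch below writes that schedule down as plain data so the pattern is easy to see. It is an illustration of the description only, not SB Intuitions' training code; the component keys and the `frozen_components` helper are assumed names.

```python
# The three components named in the training description.
COMPONENTS = ("vision_encoder", "projector", "llm")

# One entry per stage: the dataset type and the components whose parameters are tuned.
STAGES = [
    {"name": "stage 1", "data": "caption", "trainable": {"projector"}},
    {"name": "stage 2", "data": "caption", "trainable": {"vision_encoder", "projector"}},
    {"name": "stage 3", "data": "visual instruction", "trainable": {"projector", "llm"}},
]


def frozen_components(stage: dict) -> set[str]:
    """Return the components that are not tuned in the given stage."""
    return set(COMPONENTS) - stage["trainable"]


for stage in STAGES:
    tuned = ", ".join(sorted(stage["trainable"]))
    frozen = ", ".join(sorted(frozen_components(stage)))
    print(f"{stage['name']} ({stage['data']} data): tune [{tuned}], freeze [{frozen}]")
```

The projector is tuned in every stage, while the vision encoder is tuned only in the second stage and the LLM only in the third. In a PyTorch training loop, freezing a component is typically done by calling `requires_grad_(False)` on its module.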

📚 Detailed Documentation

Evaluation results

Model	Model size	JMMMU^*1	Heron-Bench^*2	JDocQA
heron-chat-git-ja-stablelm-base-7b-v1	7B	0.294	0.461	0.069
llava-calm2-siglip	7B	0.07	0.521	0.084
Llama-3-EvoVLM-JP-v2	8B	0.389	0.509	0.103
Asagi-14B	14B	0.302	0.433	0.06
llm-jp-3-vila-14b	14B	0.23	0.665	0.176
EZO-InternVL2-26B	26B	0.389	0.609	0.196
Sarashina2-Vision-8B	8B	0.393	0.648	0.229
Sarashina2-Vision-14B	14B	0.433	0.644	0.245
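
To read the table column by column, the short script below picks the top-scoring model for each of the three benchmarks listed. The scores are copied from the table; the assumption that a higher score is better on every benchmark is mine, since the table does not state it. `best_per_benchmark` is an illustrative helper name.

```python
# Scores copied from the evaluation table: (model, JMMMU, Heron-Bench, JDocQA).
ROWS = [
    ("heron-chat-git-ja-stablelm-base-7b-v1", 0.294, 0.461, 0.069),
    ("llava-calm2-siglip", 0.07, 0.521, 0.084),
    ("Llama-3-EvoVLM-JP-v2", 0.389, 0.509, 0.103),
    ("Asagi-14B", 0.302, 0.433, 0.06),
    ("llm-jp-3-vila-14b", 0.23, 0.665, 0.176),
    ("EZO-InternVL2-26B", 0.389, 0.609, 0.196),
    ("Sarashina2-Vision-8B", 0.393, 0.648, 0.229),
    ("Sarashina2-Vision-14B", 0.433, 0.644, 0.245),
]
BENCHMARKS = ("JMMMU", "Heron-Bench", "JDocQA")


def best_per_benchmark(rows):
    """Map each benchmark name to the (model, score) pair with the highest score."""
    best = {}
    for column, benchmark in enumerate(BENCHMARKS, start=1):
        top = max(rows, key=lambda row: row[column])
        best[benchmark] = (top[0], top[column])
    return best


for benchmark, (model, score) in best_per_benchmark(ROWS).items():
    print(f"{benchmark}: {model} ({score})")
```

On these three columns, Sarashina2-Vision-14B leads on JMMMU and JDocQA, while llm-jp-3-vila-14b has the highest Heron-Bench score.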