Sarashina2-vision-14b: Open-Source Japanese Vision Language Model with a Strong Image Encoder and Strong Benchmark Results

Sarashina2 Vision 14b

Developed by sbintuitions

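Sarashina2-Vision-14B is a large Japanese vision language model developed by SB Intuitions. It combines Sarashina2-13B with the image encoder of Qwen2-VL-7B and performs well on several benchmarks.

Image-to-Text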

Transformers

Supports Multiple Languages | Open Source License: MIT | #Japanese visual question answering #Multimodal reasoning #High-accuracy image understanding

Downloads 192

Release Time : 3/9/2025

Model Overview

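This is a multimodal vision language model that understands images and generates text about them. It is suited to tasks such as image analysis and visual question answering.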

Model Features

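High-performance vision language model

Achieves the highest scores among comparable Japanese vision language models on several benchmarks.

Multimodal support

Processes image and text inputs together, combining vision and language.

Multi-stage training

Trained in three stages that tune the projector, the vision encoder, and the large language model.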

Model Capabilities

Image analysis

Visual question answering

Multimodal understanding

Text generation

Use Cases

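Image understanding

Landmark recognition

Identifies famous buildings in a photo and describes where they appear.

Accurately identifies landmarks such as Tokyo Tower and describes their position in the image.

Object recognition

Identifies specific objects in a photo.

Accurately identifies objects such as cranes.

Visual question answering

Answering questions about images

Answers user questions based on the content of an image.

Generates detailed and accurate answers.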

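🚀 Sarashina2-Vision-14B

Sarashina2-Vision-14B is a large Japanese vision language model trained by SB Intuitions. It is built on Sarashina2-13B and the image encoder of Qwen2-VL-7B. As of March 7, 2025, it achieved the highest score on four benchmarks among Japanese vision language models.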

🚀 Quick Start

✨ Key Features

Built on a strong base language model and image encoder, giving it solid vision-language capabilities.
Performs well on several benchmarks.

📦 Installation

1. Install dependencies

pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate
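
The install command pins only transformers (to 4.47.0) and leaves the other packages unpinned. As a minimal sketch, the snippet below checks that the installed version matches that pin before any model weights are loaded. The helper name check_pins and the PINNED mapping are hypothetical and not part of the model's repository; only the standard-library importlib.metadata module is used.

```python
from importlib import metadata

# The only exact pin in the install command above.
PINNED = {"transformers": "4.47.0"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every pin that does not match.

    `found` is None when the package is not installed at all.
    `get_version` is injectable so the check can be exercised without
    touching the real environment.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means the environment matches the pin. Whether other transformers versions also work with this model is not stated on this page.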

💻 Usage Examples

Basic usage

The following script loads the model and runs inference:

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Define model path
model_path = "sbintuitions/sarashina2-vision-14b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

message = [{"role": "user", "content": "この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？"}]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？
### Assistant:"""

sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-14b/resolve/main/sample.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
stopping_criteria = processor.get_stopping_criteria(["\n###"])

# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.0,
    do_sample=False,
    stopping_criteria=stopping_criteria,
)
generated_ids = [
    output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。"""
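
The script relies on three small text-handling steps: the chat template produces the prompt string shown in its text_prompt docstring, generation stops at "\n###" (which is where a following "### Human:" turn would begin), and the prompt tokens are sliced off each output sequence. The sketch below mirrors those steps in plain Python so they can be inspected without loading the model. It assumes a single-turn conversation and simply reproduces the prompt text shown above; the function names are hypothetical, and the real template and stopping logic live in the model's processor.

```python
# Text copied from the text_prompt docstring in the script above.
IMAGE_PLACEHOLDER = "<s><|prefix|><|file|><|suffix|>"
SYSTEM_TEXT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
)


def build_prompt(question: str) -> str:
    """Rebuild the single-turn prompt layout shown in the docstring (illustrative only)."""
    return f"{IMAGE_PLACEHOLDER}{SYSTEM_TEXT}\n\n### Human: {question}\n### Assistant:"


def truncate_at_stop(text: str, stop: str = "\n###") -> str:
    """Cut generated text at the first stop string, if it occurs."""
    index = text.find(stop)
    return text if index == -1 else text[:index]


def strip_prompt_tokens(input_ids, output_ids):
    """Drop the prompt prefix from each output sequence, as the script's list comprehension does."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


if __name__ == "__main__":
    print(build_prompt("真ん中に映っている赤と白の物は何ですか？"))
    print(truncate_at_stop("赤と白の物はクレーンです。\n### Human: ..."))
    print(strip_prompt_tokens([[1, 2, 3]], [[1, 2, 3, 7, 8]]))
```

For real inference, keep using processor.apply_chat_template and processor.get_stopping_criteria as in the script; this sketch only illustrates what those calls are shaping.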

Examples

Prompt	Output
この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？	この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。
真ん中に映っている赤と白の物は何ですか？	赤と白の物はクレーンです。

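🔧 Technical Details

Training process

Sarashina2-Vision was created through the following three training stages:

1. Tune the projector parameters on caption datasets.
2. Tune the vision encoder and projector parameters on caption datasets.
3. Tune the projector and large language model parameters on visual instruction datasets.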
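
The three stages differ only in which components are updated and which kind of data is used. The sketch below records that schedule as a small lookup table and derives, for each stage, which components are trainable and which stay frozen. The component names (vision_encoder, projector, llm) and the freeze_plan helper are illustrative labels, not identifiers from the actual training code, which this page does not show.

```python
# Illustrative component labels; the real module names are not given on this page.
COMPONENTS = ("vision_encoder", "projector", "llm")

# Stage -> data type and the components whose parameters are tuned.
STAGES = {
    1: {"data": "caption", "trainable": {"projector"}},
    2: {"data": "caption", "trainable": {"vision_encoder", "projector"}},
    3: {"data": "visual_instruction", "trainable": {"projector", "llm"}},
}


def freeze_plan(stage: int) -> dict:
    """Map each component to True (trainable) or False (frozen) for a stage."""
    trainable = STAGES[stage]["trainable"]
    return {name: name in trainable for name in COMPONENTS}


if __name__ == "__main__":
    for stage in sorted(STAGES):
        print(stage, STAGES[stage]["data"], freeze_plan(stage))
```

In a PyTorch training loop, a plan like this would typically be applied by setting requires_grad on each component's parameters. The projector is the only component tuned in all three stages.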

📚 Documentation

Evaluation results

Model	Model Size	JMMMU^*1	Heron-Bench^*2	JDocQA
heron-chat-git-ja-stablelm-base-7b-v1	7B	0.294	0.461	0.069
llava-calm2-siglip	7B	0.07	0.521	0.084
Llama-3-EvoVLM-JP-v2	8B	0.389	0.509	0.103
Asagi-14B	14B	0.302	0.433	0.06
llm-jp-3-vila-14b	14B	0.23	0.665	0.176
EZO-InternVL2-26B	26B	0.389	0.609	0.196
Sarashina2-Vision-8B	8B	0.393	0.648	0.229
Sarashina2-Vision-14B	14B	0.433	0.644	0.245
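
To read the table programmatically, the sketch below copies the scores from the three benchmark columns shown above and reports the top-scoring model for each one. The best_per_benchmark helper is hypothetical, and the comparison assumes that higher is better for all three columns.

```python
BENCHMARKS = ("JMMMU", "Heron-Bench", "JDocQA")

# Scores copied from the evaluation table above, in BENCHMARKS order.
SCORES = {
    "heron-chat-git-ja-stablelm-base-7b-v1": (0.294, 0.461, 0.069),
    "llava-calm2-siglip": (0.07, 0.521, 0.084),
    "Llama-3-EvoVLM-JP-v2": (0.389, 0.509, 0.103),
    "Asagi-14B": (0.302, 0.433, 0.06),
    "llm-jp-3-vila-14b": (0.23, 0.665, 0.176),
    "EZO-InternVL2-26B": (0.389, 0.609, 0.196),
    "Sarashina2-Vision-8B": (0.393, 0.648, 0.229),
    "Sarashina2-Vision-14B": (0.433, 0.644, 0.245),
}


def best_per_benchmark(scores: dict) -> dict:
    """Return {benchmark: top-scoring model}, assuming higher scores are better."""
    return {
        bench: max(scores, key=lambda model: scores[model][i])
        for i, bench in enumerate(BENCHMARKS)
    }


if __name__ == "__main__":
    for bench, model in best_per_benchmark(SCORES).items():
        print(f"{bench}: {model}")
```

On the three columns shown here, Sarashina2-Vision-14B leads JMMMU and JDocQA, while llm-jp-3-vila-14b has the highest Heron-Bench score.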