Sarashina2-vision-14b Open Source Japanese Vision-Language Model - Excellent Image Encoding and Outstanding Performance in Benchmark Tests

Sarashina2 Vision 14b

Developed by sbintuitions

Sarashina2-Vision-14B is a large Japanese visual language model developed by SB Intuitions, combining Sarashina2-13B with Qwen2-VL-7B's image encoder, achieving excellent performance in multiple benchmarks.

Image-to-Text

Transformers

Supports Multiple LanguagesOpen Source License:MIT #Japanese Visual Question Answering #Multimodal Reasoning #High-Precision Image Understanding

Downloads 192

Release Time : 3/9/2025

Model Overview

This model is a multimodal vision-language model capable of understanding and generating text related to images, suitable for tasks such as image analysis and visual question answering.

Model Features

High-Performance Vision-Language Model

Achieves top-tier scores in multiple benchmarks, outperforming similar models.

Multimodal Support

Capable of processing both image and text inputs, integrating vision and language.

Multi-Stage Training

Optimizes model performance through a three-stage learning process, including adjustments to the projector, visual encoder, and large language model.

Model Capabilities

Image Analysis

Visual Question Answering

Multimodal Understanding

Text Generation

Use Cases

Image Understanding

Recognizing Famous Buildings

Identify famous buildings in photos and describe their locations.

Can accurately recognize landmarks like Tokyo Tower and describe their locations.

Object Recognition

Identify specific objects in photos.

Can accurately recognize objects such as cranes.

Visual Question Answering

Answering Questions About Images

Answer user questions based on image content.

Can generate detailed and accurate responses.

🚀 Sarashina2-Vision-14B

Sarashina2-Vision-14B is a Japanese Large Vision Language Model trained by SB Intuitions. It combines the power of Sarashina2-13B and the Image Encoder of Qwen2-VL-7B. As of 2025/03/07, it has achieved the highest scores in 4 benchmarks compared to other Japanese VLMs.

🚀 Quick Start

📦 Installation

First, you need to install the necessary dependencies:

pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate

💻 Usage Examples

Basic Usage

The following script demonstrates how to load the model and perform inference:

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Define model path
model_path = "sbintuitions/sarashina2-vision-14b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

message = [{"role": "user", "content": "この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？"}]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？
### Assistant:"""

sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-14b/resolve/main/sample.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
stopping_criteria = processor.get_stopping_criteria(["\n###"])

# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.0,
    do_sample=False,
    stopping_criteria=stopping_criteria,
)
generated_ids = [
    output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。"""

Example

Here is an example of input and output:

Prompt	Output
この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？	この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京の街並みの右側に写っています。
真ん中に映っている赤と白の物は何ですか？	赤と白の物はクレーンです。

🔧 Technical Details

Training

Sarashina2-Vision is developed through a three-stage learning process:

Tune the parameters in the projector using caption datasets.
Tune the parameters in the Vision Encoder and projector using caption datasets.
Tune the parameters in the projector and LLM using Visual Instruction datasets.

Evaluation Results

The following table shows the evaluation results of different models on several benchmarks:

Model	Model Size	JMMMU^*1	Heron-Bench^*2	JDocQA
heron-chat-git-ja-stablelm-base-7b-v1	7B	0.294	0.461	0.069
llava-calm2-siglip	7B	0.07	0.521	0.084
Llama-3-EvoVLM-JP-v2	8B	0.389	0.509	0.103
Asagi-14B	14B	0.302	0.433	0.06
llm-jp-3-vila-14b	14B	0.23	0.665	0.176
EZO-InternVL2-26B	26B	0.389	0.609	0.196
Sarashina2-Vision-8B	8B	0.393	0.648	0.229
Sarashina2-Vision-14B	14B	0.433	0.644	0.245

Evaluated only single image samples (1,286 samples). If answer extraction failed, it was treated as incorrect (score 0) instead of making a random choice to eliminate stochasticity.
GPT-4o (gpt-4o-2024-08-06) was used for LLM-as-a-Judge.

📚 Documentation

Ethical Considerations and Limitations

Sarashina2-Vision may generate some meaningless sequences, inaccurate instances, or biased/objectionable outputs. Developers are advised to tune the models based on human preferences and safety considerations before use.

📄 License

This project is licensed under the MIT License.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご