Sarashina2-vision-8B Open-Source Visual Language Model - Japan AI Based on Multi-Models to Boost Image Understanding Applications

Sarashina2 Vision 8b

Developed by sbintuitions

Sarashina2-Vision-8B is a large Japanese vision-language model trained by SB Intuitions, based on the Sarashina2-7B and Qwen2-VL-7B image encoders, achieving excellent performance in multiple benchmarks.

Image-to-Text

Transformers

Supports Multiple LanguagesOpen Source License:MIT #Japanese Visual Question Answering #Multimodal Reasoning #Architecture Recognition

Downloads 1,233

Release Time : 3/9/2025

Model Overview

This model is a multimodal vision-language model capable of understanding and generating text descriptions related to images, suitable for both Japanese and English environments.

Model Features

Multimodal Support

Combines visual and language processing capabilities to understand and generate text descriptions related to images.

High Performance

Achieves top scores in multiple benchmarks, outperforming similar models.

Japanese Optimization

Specially optimized for Japanese environments, suitable for Japanese vision-language tasks.

Model Capabilities

Image Understanding

Text Generation

Multimodal Reasoning

Visual Question Answering

Use Cases

Visual Question Answering

Recognizing Famous Architecture

Identify famous buildings in images and describe their locations.

Can accurately identify and describe famous buildings such as Tokyo Tower in images.

Image Description

Describing Image Content

Generate detailed textual descriptions of images.

Can generate accurate and detailed image descriptions.

🚀 Sarashina2-Vision-8B

Sarashina2-Vision-8B is a Japanese Large Vision Language Model trained by SB Intuitions. It combines the power of Sarashina2-7B and the Image Encoder of Qwen2-VL-7B, achieving top scores in 4 benchmarks (as of 2025/03/07) among other Japanese VLMs.

🚀 Quick Start

✨ Features

Based on Sarashina2-7B and Image Encoder of Qwen2-VL-7B.
Achieved the highest level of scores in 4 benchmarks (as of 2025/03/07) compared to other Japanese VLMs.

📦 Installation

1. Install dependencies

pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate

💻 Usage Examples

Basic Usage

The following script loads the model and allows inference.

import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Define model path
model_path = "sbintuitions/sarashina2-vision-8b"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

message = [{"role": "user", "content": "この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？"}]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

### Human: この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？
### Assistant:"""

sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-8b/resolve/main/sample.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")
inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
stopping_criteria = processor.get_stopping_criteria(["\n###"])

# Inference: Generation of the output
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.0,
    do_sample=False,
    stopping_criteria=stopping_criteria,
)
generated_ids = [
    output_ids[len(input_ids) :] for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京のランドマークであり、この写真では、高層ビル群の向こう側に写っています。"""

Example

Prompt	Output
この写真に写っているもので、最も有名と考えられる建築物は何でどこに写っていますか？	この写真に写っているもので、最も有名と考えられる建築物は東京タワーです。東京タワーは、東京のランドマークであり、この写真では、高層ビル群の向こう側に写っています。
真ん中に映っている赤と白の物は何ですか？	真ん中に映っている赤と白のものはクレーンです。

🔧 Technical Details

Sarashina2-Vision is created through the following three-stage learning process:

Tune the parameters in the projector by caption datasets.
Tune the parameters in the Vision Encoder and projector by caption datasets.
Tune the parameters in the projector and LLM by Visual Instruction datasets.

📚 Documentation

Evaluation Results

Model	Model Size	JMMMU^*1	Heron-Bench^*2	JDocQA
heron-chat-git-ja-stablelm-base-7b-v1	7B	0.294	0.461	0.069
llava-calm2-siglip	7B	0.07	0.521	0.084
Llama-3-EvoVLM-JP-v2	8B	0.389	0.509	0.103
Asagi-14B	14B	0.302	0.433	0.06
llm-jp-3-vila-14b	14B	0.23	0.665	0.176
EZO-InternVL2-26B	26B	0.389	0.609	0.196
Sarashina2-Vision-8B	8B	0.393	0.648	0.229
Sarashina2-Vision-14B	14B	0.433	0.644	0.245

Evaluated only single image samples (1,286 samples). If answer extraction failed, we treated it as incorrect (score 0) instead of making a random choice to eliminate stochasticity.
GPT-4o (gpt-4o-2024-08-06) was used for LLM-as-a-Judge.

⚠️ Important Note

Sarashina2-Vision might generate some meaningless sequences, some inaccurate instances or biased/objectionable outputs. Before using Sarashina2-Vision, we would like developers to tune models based on human preferences and safety considerations.

📄 License

MIT License

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご