Sarashina2-Vision-8B
Sarashina2-Vision-8B is a Japanese Large Vision Language Model trained by SB Intuitions. It combines Sarashina2-7B with the image encoder of Qwen2-VL-7B, and achieves top scores on four benchmarks (as of 2025/03/07) compared with other Japanese VLMs.
Quick Start
Features
- Based on Sarashina2-7B and the image encoder of Qwen2-VL-7B (a conceptual sketch of this composition follows this list).
- Achieves the highest scores on four benchmarks (as of 2025/03/07) compared with other Japanese VLMs.
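As a rough picture of how these components fit together, the sketch below wires an image encoder, a projector, and a decoder-only LLM in the usual VLM pattern. This is a conceptual illustration only, not the model's actual implementation; the class, module names, dimensions, and the single-linear projector are all assumptions made for readability.

```python
# Conceptual sketch only: how an image encoder, a projector, and an LLM can be
# composed into a VLM. Module names and dimensions are illustrative, not the
# actual attribute names or sizes used by Sarashina2-Vision-8B.
import torch
import torch.nn as nn

class ToyVisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # image encoder (Qwen2-VL-7B's in the real model)
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps image features into the LLM embedding space
        self.llm = llm                                   # decoder-only LLM (Sarashina2-7B in the real model)

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        image_feats = self.vision_encoder(pixel_values)  # (batch, n_image_tokens, vision_dim)
        image_embeds = self.projector(image_feats)       # (batch, n_image_tokens, llm_dim)
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)     # LLM attends over image and text tokens together
```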
Installation
1. Install dependencies
```sh
pip install -U transformers==4.47.0 torch torchvision pillow protobuf sentencepiece accelerate
```
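Optionally, the environment can be sanity-checked before running the usage example; this snippet only prints the installed transformers version and whether a CUDA device is visible (the example below assumes one).

```python
# Optional sanity check for the environment set up above.
import torch
import transformers

print(transformers.__version__)   # expected: 4.47.0, matching the pin above
print(torch.cuda.is_available())  # the usage example below assumes a CUDA device
```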
Usage Examples
Basic Usage
The following script loads the model and runs inference on the sample image from the model repository.
```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "sbintuitions/sarashina2-vision-8b"

# Load the processor and model; trust_remote_code is required for the custom
# Sarashina2-Vision processor and model classes.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Prompt (English: "Of the things in this photo, what is the most famous building,
# and where is it located?")
message = [{"role": "user", "content": "ããŽåįãĢåãŖãĻããããŽã§ãæãæåã¨čããããåģēį¯įŠã¯äŊã§ãŠããĢåãŖãĻããžããīŧ"}]
text_prompt = processor.apply_chat_template(message, add_generation_prompt=True)
"""text_prompt: <s><|prefix|><|file|><|suffix|>A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
### Human: ããŽåįãĢåãŖãĻããããŽã§ãæãæåã¨čããããåģēį¯įŠã¯äŊã§ãŠããĢåãŖãĻããžããīŧ
### Assistant:"""

# Download the sample image bundled with the model repository.
sample_image_url = "https://huggingface.co/sbintuitions/sarashina2-vision-8b/resolve/main/sample.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw).convert("RGB")

inputs = processor(
    text=[text_prompt],
    images=[image],
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Stop generation when the model starts a new "###" turn marker.
stopping_criteria = processor.get_stopping_criteria(["\n###"])

# Greedy decoding (do_sample=False).
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.0,
    do_sample=False,
    stopping_criteria=stopping_criteria,
)

# Strip the prompt tokens so that only newly generated tokens are decoded.
generated_ids = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)
]
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)
print(output_text[0])
"""ããŽåįãĢåãŖãĻããããŽã§ãæãæåã¨čããããåģēį¯įŠã¯æąäēŦãŋã¯ãŧã§ããæąäēŦãŋã¯ãŧã¯ãæąäēŦãŽãŠãŗãããŧã¯ã§ãããããŽåįã§ã¯ãéĢåą¤ããĢįž¤ãŽåããå´ãĢåãŖãĻããžãã"""
```
Example
| Prompt | Output |
|---|---|
| ããŽåįãĢåãŖãĻããããŽã§ãæãæåã¨čããããåģēį¯įŠã¯äŊã§ãŠããĢåãŖãĻããžããīŧ<br>(Of the things in this photo, what is the most famous building, and where is it located?) | ããŽåįãĢåãŖãĻããããŽã§ãæãæåã¨čããããåģēį¯įŠã¯æąäēŦãŋã¯ãŧã§ããæąäēŦãŋã¯ãŧã¯ãæąäēŦãŽãŠãŗãããŧã¯ã§ãããããŽåįã§ã¯ãéĢåą¤ããĢįž¤ãŽåããå´ãĢåãŖãĻããžãã<br>(The most famous building in this photo is Tokyo Tower. Tokyo Tower is a landmark of Tokyo. In this photo, it appears to the left of the group of high-rise buildings.) |
| įãä¸ãĢæ ãŖãĻããčĩ¤ã¨įŊãŽįŠã¯äŊã§ããīŧ<br>(What is the red and white object in the center of the image?) | įãä¸ãĢæ ãŖãĻããčĩ¤ã¨įŊãŽããŽã¯ã¯ãŦãŧãŗã§ãã<br>(The red and white thing in the center of the image is a crane.) |
Technical Details
Sarashina2-Vision is trained through the following three-stage process (a rough sketch of the per-stage trainable parameters follows the list):
- Tune the parameters of the projector using caption datasets.
- Tune the parameters of the vision encoder and the projector using caption datasets.
- Tune the parameters of the projector and the LLM using visual instruction datasets.
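As a rough illustration of which components are updated at each stage, the sketch below toggles `requires_grad` per stage. The attribute names (`vision_encoder`, `projector`, `llm`) are placeholders matching the composition sketch earlier, not the model's real attribute names, and this is not the actual training code.

```python
# Hypothetical sketch of stage-wise parameter freezing; attribute names
# (vision_encoder, projector, llm) are placeholders, not real attribute names.
def set_trainable_for_stage(model, stage: int) -> None:
    stages = {
        1: {"projector"},                    # stage 1: projector only
        2: {"vision_encoder", "projector"},  # stage 2: vision encoder + projector
        3: {"projector", "llm"},             # stage 3: projector + LLM
    }
    trainable = stages[stage]
    for name in ("vision_encoder", "projector", "llm"):
        for param in getattr(model, name).parameters():
            param.requires_grad = name in trainable
```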
Documentation
Evaluation Results
- Only single-image samples (1,286 samples) were evaluated. When answer extraction failed, the sample was scored as incorrect (score 0) rather than resolved by a random choice, in order to eliminate stochasticity (a toy illustration of this scoring rule follows the list).
- GPT-4o (gpt-4o-2024-08-06) was used as the judge for LLM-as-a-Judge evaluation.
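As a toy illustration of the scoring rule above (assuming a multiple-choice format; the extraction logic below is hypothetical and not the evaluation harness actually used):

```python
# Hypothetical illustration of the scoring rule described above: if no answer
# can be extracted from the model output, the sample scores 0 instead of
# falling back to a random choice.
import re
from typing import Optional

def extract_choice(output_text: str) -> Optional[str]:
    """Toy extractor: look for a single choice letter A-D in the output."""
    match = re.search(r"\b([A-D])\b", output_text)
    return match.group(1) if match else None

def score(output_text: str, gold_choice: str) -> int:
    predicted = extract_choice(output_text)
    if predicted is None:  # extraction failed -> incorrect, no random fallback
        return 0
    return int(predicted == gold_choice)
```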
Important Note
Sarashina2-Vision may generate meaningless sequences, inaccurate statements, or biased/objectionable outputs. Before using Sarashina2-Vision, we ask developers to tune the model based on human preferences and safety considerations.
License
MIT License