🚀 Asagi-4B: A Large-scale Japanese Vision & Language Model
Asagi-4B is a large-scale Japanese Vision & Language Model (VLM). It is trained on a large Japanese dataset drawn from diverse sources and offers high-quality image-text processing capabilities.
🚀 Quick Start
Requirements
transformers==4.45.1
accelerate==0.34.2
torch==2.4.0
torchvision==0.19.0
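Before loading the model, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch using only the standard library; the helper name `find_mismatches` and the decision to ignore local build tags such as `+cu121` are assumptions made for illustration, not part of the original instructions.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the Requirements list above.
PINNED = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.4.0",
    "torchvision": "0.19.0",
}


def find_mismatches(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Assumption: a local build tag such as "2.4.0+cu121" counts as a match.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at the listed version. Other versions may also work, but the source only lists these.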
How to use
import requests
import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoProcessor, GenerationConfig
transformers.set_seed(42)
model_path = "MIL-UT/Asagi-4B"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
generation_config = GenerationConfig(
    do_sample=True,
    num_beams=5,
    max_new_tokens=256,
    temperature=0.7,
    repetition_penalty=1.5
)
prompt = ("The following is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
          "### Instruction:\n<image>\nPlease look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail.\n\n### Response:\n")
# sample image
sample_image_url = "https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw)
inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
inputs_text = processor.tokenizer(prompt, return_tensors="pt")
inputs['input_ids'] = inputs_text['input_ids']
inputs['attention_mask'] = inputs_text['attention_mask']
# cast float32 tensors (e.g. pixel values) to the model's dtype
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs[k] = v.to(model.dtype)
inputs = {k: inputs[k].to(model.device) for k in inputs if k != "token_type_ids"}
generate_ids = model.generate(
    **inputs,
    generation_config=generation_config
)
generated_text = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
# do not print the prompt
if "<image>" in prompt:
    prompt = prompt.replace("<image>", " ")
generated_text = generated_text.replace(prompt, "")
print(f"Generated text: {generated_text}")
# >>> Generated text: This photo was taken in Shibuya, Tokyo. It shows the cityscape of Shibuya at night, with high-rise buildings lined up and streetlights and neon signs shining.
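The prompt in the example above follows a fixed template: a preamble, an `### Instruction:` section containing the `<image>` placeholder, and an empty `### Response:` section. A minimal sketch of two helpers that reuse that template for other instructions is shown here. The names `build_prompt` and `strip_prompt` are hypothetical, and the source does not state whether the model requires exactly this wording.

```python
# Preamble copied from the example prompt above.
PREAMBLE = (
    "The following is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the template used by the example above."""
    return f"{PREAMBLE}### Instruction:\n<image>\n{instruction}\n\n### Response:\n"


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt, mirroring the replace() calls in the example."""
    echoed = prompt.replace("<image>", " ")
    return generated_text.replace(echoed, "")


if __name__ == "__main__":
    prompt = build_prompt("Please describe the content of this image in detail.")
    print(prompt)
```

`build_prompt` would replace the hard-coded `prompt` string, and `strip_prompt` would replace the final three lines before the `print` call.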
✨ Features
This repository provides Asagi-4B, a large-scale Japanese Vision & Language Model (VLM). Asagi-4B has been trained on an extensive Japanese dataset, incorporating a diverse range of data sources.
A significant portion of the training data is synthesized using models such as the Japanese large language model (CALM3-22B-Chat) and the English Vision & Language Model (Phi3.5-vision-instruct).
Importantly, we do not use LLMs that restrict the usage of their outputs in the license terms (e.g., GPT-4) to synthesize the training data.
Model components | Model / Architecture | Parameters |
---|---|---|
Vision encoder | siglip-so400m-patch14-384 | 428M |
Projector | 2-layer MLP | 64M |
LLM | llm-jp-3-3.7b-instruct | 3.7B |
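As a quick sanity check on the "4B" in the model name, the component sizes in the table can be added up. The sketch below uses the rounded figures from the table, so the total is approximate. The weight-memory estimate assumes 2 bytes per parameter, which matches the `torch.bfloat16` setting in the usage example, and covers weights only (no activations or cache).

```python
# Component sizes copied from the table above (rounded, in parameters).
COMPONENTS = {
    "vision_encoder": 428_000_000,  # siglip-so400m-patch14-384
    "projector": 64_000_000,        # 2-layer MLP
    "llm": 3_700_000_000,           # llm-jp-3-3.7b-instruct
}
BYTES_PER_PARAM = 2  # assumption: weights stored in bfloat16

total = sum(COMPONENTS.values())
weights_gb = total * BYTES_PER_PARAM / 1e9

print(f"total: about {total / 1e9:.2f}B parameters")
print(f"bfloat16 weights: about {weights_gb:.1f} GB")
for name, count in COMPONENTS.items():
    print(f"{name}: {count / total:.1%} of parameters")
```

The total comes to roughly 4.2B parameters, with the LLM accounting for most of them.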
💻 Usage Examples
Basic Usage
# See the code under "How to use" above.
Advanced Usage
Example 1

Prompt | Output |
---|---|
Please look at this image and answer the following instruction in detail and specifically. Please tell me in detail about the content of this photo. | This photo was taken of the Rainbow Bridge in Tokyo. The bridge is painted white and gracefully arches over the calm water. In the background, there is an urban landscape with high-rise buildings. The sky is a clear blue, dotted with white clouds. |
Example 2

Prompt | Output |
---|---|
Please look at this image and answer the following question in detail and specifically. Please evaluate this photo. | This photo captures an autumn landscape with Tokyo Tower in the background. Tokyo Tower is characterized by its red and white striped pattern, and there are maple trees with red leaves around it. The sky is partially cloudy, and the sunlight shines through the clouds, creating a beautiful contrast. In the foreground, you can see the silhouettes of streetlights and buildings, giving a sense of an urban atmosphere. |
Example 3

Prompt | Output |
---|---|
Please look at this image and answer the following question in detail and specifically. Please explain in detail about the shrine in this photo. | The shrine in this photo reflects the traditional Japanese architectural style. It is characterized by white walls and bright orange roofs, creating a sacred atmosphere. At the entrance of the shrine, there is a vermilion torii, which is a symbolic gate indicating the entrance to a sacred space in Shinto. In front of the torii, there is a small shrine hall, which serves as a place for worshipers to offer prayers. There is a rope beside the shrine hall, suggesting that it is a sacred place. The shrine is located among modern buildings, symbolizing the coexistence of tradition and modernity in Japanese culture. |
Example 4

Prompt | Output |
---|---|
Please look at this image and answer the following instruction in detail and specifically. What type of garden is this among Japanese gardens? And what's the reason? | This is a type of garden called a "rock garden" among Japanese gardens. A rock garden is a form that skillfully arranges natural rocks to pursue tranquility and beauty and is one of the traditional Japanese garden designs. |
Example 5

Prompt | Output |
---|---|
Please look at this image and answer the following question in detail. Please write a synopsis of a novel set in the location of this image. | Once upon a time, in a small village, there was a stone staircase leading to the center of the village. That staircase was the entrance to a sacred place for the villagers, and they climbed the staircase to offer their prayers at the sacred place. Around the staircase, there were lush green plants, spreading a scenery that made people feel in harmony with nature. Every time the villagers climbed the staircase, they showed their gratitude for nature's blessings and respect for the sacred place. |
🔧 Technical Details
Training Data
Dataset | Preprocess | Stage1 | Stage2 | Size |
---|---|---|---|---|
ROIS (Ours) | Synthesized | ✓ | ✓ | 8.4M |
Japanese image text pairs | Synthesized | ✓ | ✓ | 4.4M |
Wikipedia | Synthesized | ✓ | ✓ | 2.5M |
Open Images | Translated | ✓ | ✓ | 680K |
DCI | Translated | ✓ | ✓ | 7K |
CommonCatalog CC-BY | Translated | ✓ | ✓ | 3.5M |
LLaVA-Pretrain-JA | - | ✓ | ✓ | 550K |
STAIR Captions | - | ✓ | ✓ | 410K |
Flickr-JP | - | ✓ | ✓ | 160K |
YJ Captions | - | ✓ | ✓ | 130K |
Japanese Pascal | - | ✓ | ✓ | 5K |
ArtBench | Synthesized | - | ✓ | 100K |
GQA | Translated | - | ✓ | 1.9M |
VQA v2 | Translated | - | ✓ | 880K |
A-OKVQA | Translated | - | ✓ | 34K |
OK-VQA | Translated | - | ✓ | 18K |
Japanese Visual Genome | Translated | - | ✓ | 1.6M |
PangeaInstruct | - | - | ✓ | 93K |
Note: ROIS (Ours) is a newly collected dataset crawled from the web specifically for this project. The dataset consists of image and raw text pairs, which are used to synthesize the training data.
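To get a sense of the overall scale, the Size column can be totalled. The sketch below copies the size strings from the table and converts the `K`/`M` suffixes to integers. The helper `parse_size` is hypothetical, and the interpretation of "Size" as a count of training samples is an assumption, since the source does not define the unit.

```python
def parse_size(text: str) -> int:
    """Convert strings such as '8.4M' or '680K' to an integer count."""
    units = {"K": 1_000, "M": 1_000_000}
    return round(float(text[:-1]) * units[text[-1]])


# Size strings copied from the training data table above.
SIZES = {
    "ROIS (Ours)": "8.4M",
    "Japanese image text pairs": "4.4M",
    "Wikipedia": "2.5M",
    "Open Images": "680K",
    "DCI": "7K",
    "CommonCatalog CC-BY": "3.5M",
    "LLaVA-Pretrain-JA": "550K",
    "STAIR Captions": "410K",
    "Flickr-JP": "160K",
    "YJ Captions": "130K",
    "Japanese Pascal": "5K",
    "ArtBench": "100K",
    "GQA": "1.9M",
    "VQA v2": "880K",
    "A-OKVQA": "34K",
    "OK-VQA": "18K",
    "Japanese Visual Genome": "1.6M",
    "PangeaInstruct": "93K",
}

total = sum(parse_size(size) for size in SIZES.values())
largest = max(SIZES, key=lambda name: parse_size(SIZES[name]))
print(f"total: about {total / 1e6:.1f}M across {len(SIZES)} datasets")
print(f"largest: {largest} ({parse_size(SIZES[largest]) / total:.0%} of the total)")
```

Because the table values are rounded, the total is an estimate rather than an exact count.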
Evaluation
We evaluated our model using Heron-Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500. We used the eval-mm library for this evaluation.
Here, models with "†" are not trained with GPT-generated data. Bold numbers indicate the best performance among all models, and underlined numbers indicate the best performance among models not trained with GPT-generated data.
Model | LM Size | Heron-Bench (LLM (%)) | JA-VLM-Bench-In-the-Wild (ROUGE-L) | JA-VLM-Bench-In-the-Wild (LLM (/5.0)) | JA-VG-VQA-500 (ROUGE-L) | JA-VG-VQA-500 (LLM (/5.0)) |
---|---|---|---|---|---|---|
Japanese InstructBLIP Alpha† | 7B | 14.0 | 20.8 | 2.42 | - | - |
Japanese Stable VLM† | 7B | 24.2 | 23.3 | 2.47 | - | - |
LLaVA-CALM2-SigLIP† | 7B | 43.3 | 47.2 | 3.15 | 17.4 | 3.21 |
Llama-3-EvoVLM-JP-v2 | 8B | 39.3 | 41.4 | 2.92 | 23.5 | 2.96 |
VILA-jp | 13B | 57.2 | 52.3 | 3.69 | 16.2 | 3.62 |
Asagi-2B† | 1.8B | 44.7 | 48.8 | 3.26 | 53.7 | 3.69 |
Asagi-4B† | 3.7B | 49.3 | 49.6 | 3.38 | 55.6 | 3.78 |
Asagi-8B† | 7.2B | 54.7 | 49.4 | 3.45 | 56.43 | 3.84 |
Asagi-14B† | 13B | 55.8 | 50.8 | 3.44 | 56.8 | 3.84 |
GPT-4o | - | 87.6 | 37.6 | 3.85 | 12.1 | 3.58 |
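For readers unfamiliar with the ROUGE-L columns, the metric is based on the longest common subsequence (LCS) between a reference answer and a model answer. The following is a minimal sketch of the F1 variant over pre-tokenized lists. It is an illustration of the general technique only: the tokenizer, the exact variant, and the scaling used by eval-mm are not stated in the source and may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two token lists (0.0 when there is no overlap)."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical, already-tokenized answers used only for illustration.
    reference = ["shibuya", "at", "night"]
    candidate = ["shibuya", "at", "dusk"]
    print(f"ROUGE-L F1: {rouge_l_f1(reference, candidate):.3f}")
```

Japanese text has no whitespace word boundaries, so a real evaluation needs a tokenizer before this step; the choice of tokenizer can change the score noticeably.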
📄 License
The model is released under the Apache-2.0 license.
⚠️ Important Note
The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
Model Card Authors
Kohei Uehara






