🚀 Asagi-14B: A Large-scale Japanese Vision & Language Model
This repository offers Asagi-14B, a large-scale Japanese Vision & Language Model (VLM). Asagi-14B is trained on a vast Japanese dataset with diverse data sources, enabling it to handle image-text-to-text tasks effectively.
🚀 Quick Start
Requirements
transformers==4.45.1
accelerate==0.34.2
torch==2.4.0
torchvision==0.19.0
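To confirm that an environment matches the pinned versions above before loading the model, a small check along the following lines may help. This is a minimal sketch: the helper name `find_mismatches` and the choice to ignore local build tags such as `+cu121` are assumptions made for illustration, not part of the Asagi release.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the Requirements section.
PINNED = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.4.0",
    "torchvision": "0.19.0",
}


def find_mismatches(pinned, get_version=version):
    """Return {package: (wanted, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    `get_version` is injectable so the check can be exercised without
    touching the real environment (hypothetical convenience).
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Assumption: a local build tag such as "+cu121" does not matter here.
        base = installed.split("+")[0] if installed else None
        if base != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

Running the script prints one line per package whose installed version differs from the pinned one, and nothing when the environment matches.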
How to use
import requests
import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoProcessor, GenerationConfig

# fix the seed so that sampling is reproducible
transformers.set_seed(42)
model_path = "MIL-UT/Asagi-14B"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

generation_config = GenerationConfig(
    do_sample=True,
    num_beams=5,
    max_new_tokens=256,
    temperature=0.7,
    repetition_penalty=1.5
)

prompt = ("The following is an instruction describing a task. Write a response that appropriately fulfills the request.\n\n"
          "### Instruction:\n<image>\nPlease look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail.\n\n### Response:\n")

# sample image
sample_image_url = "https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw)

inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
# replace the processor's text inputs with the plain tokenizer output
inputs_text = processor.tokenizer(prompt, return_tensors="pt")
inputs['input_ids'] = inputs_text['input_ids']
inputs['attention_mask'] = inputs_text['attention_mask']
# cast float32 tensors (e.g. pixel values) to the model dtype
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs[k] = v.to(model.dtype)
# move everything to the model device and drop token_type_ids
inputs = {k: inputs[k].to(model.device) for k in inputs if k != "token_type_ids"}

generate_ids = model.generate(
    **inputs,
    generation_config=generation_config
)
generated_text = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# do not print the prompt
if "<image>" in prompt:
    prompt = prompt.replace("<image>", " ")
generated_text = generated_text.replace(prompt, "")

print(f"Generated text: {generated_text}")
# >>> Generated text: This photo captures the night view of Shibuya Scramble Crossing. Shibuya Scramble Crossing is a famous intersection located in Shibuya Ward, Tokyo, known as a place where many people come and go. The photo shows high-rise buildings, neon signs, streetlights, etc., allowing you to feel the vitality and energy of the city.
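When asking several questions, it can be convenient to wrap the prompt template and the prompt-stripping step from the script above in two small helpers. The sketch below only rearranges the strings already shown; the names `build_prompt` and `strip_prompt` are hypothetical and are not provided by the repository.

```python
# Instruction preamble copied from the usage script above.
PREAMBLE = (
    "The following is an instruction describing a task. "
    "Write a response that appropriately fulfills the request.\n\n"
)


def build_prompt(question: str) -> str:
    """Insert a question into the instruction template, keeping the <image> slot."""
    return f"{PREAMBLE}### Instruction:\n<image>\n{question}\n\n### Response:\n"


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt from decoded output.

    Mirrors the script above: the <image> placeholder is replaced by a
    single space before the prompt is removed from the decoded text.
    """
    echoed = prompt.replace("<image>", " ")
    return generated_text.replace(echoed, "")
```

With these helpers, `prompt = build_prompt("Where was this photo taken?")` can be passed to the processor, and `strip_prompt(generated_text, prompt)` stands in for the final `replace` calls.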
Example

Prompt | Output |
---|---|
Please look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail. | This photo captures the night view of Shibuya Scramble Crossing. Shibuya Scramble Crossing is a famous intersection located in Shibuya Ward, Tokyo, known as a place where many people come and go. The photo shows high-rise buildings, neon signs, streetlights, etc., allowing you to feel the vitality and energy of the city. |
More Examples

Prompt | Output |
---|---|
Please look at this image and answer the following instruction in detail and specifically. Please tell me in detail about the content of this photo. | This photo captures an urban landscape with the Rainbow Bridge in Tokyo as the backdrop. The bridge is painted white and gray and elegantly arches over the water. The sky is a clear blue with a few clouds floating. In the distance, you can see the cityscape with high-rise buildings lined up. |

Prompt | Output |
---|---|
Please look at this image and answer the following question in detail and specifically. Please evaluate this photo. | This photo is very beautiful. The contrast between the red and white of Tokyo Tower is vivid, and the contrast with the blue sky and white clouds is impressive. The surrounding lush green trees and streetlights create an urban atmosphere, and the overall composition is well-balanced. |

Prompt | Output |
---|---|
Please look at this image and answer the following question in detail and specifically. Please explain in detail about the shrine in this photo. | The shrine in this photo reflects the traditional Japanese architectural style. It is characterized by white walls and orange pillars, and the roof is covered with green tiles. At the entrance of the shrine, there is a vermilion torii gate, which is a symbolic gate indicating the entrance to a sacred place. Shime-nawa ropes are decorated on both sides of the torii gate, creating a sacred atmosphere. In front of the shrine, there are stone pavements, which are used to show respect when worshipers step into the sacred place. This shrine, although located in a modern urban area, plays an important role in passing on traditional Japanese culture and beliefs to the present. |

Prompt | Output |
---|---|
Please look at this image and answer the following instruction in detail and specifically. What type of garden does this belong to among Japanese gardens? And what is the reason? | This belongs to a type of garden called a pond-circuit style garden among Japanese gardens. A pond-circuit style garden is designed with a pond at the center and allows visitors to stroll around it. The purpose of this type of garden is to maximize the beauty of nature and provide tranquility and peace to visitors. |

Prompt | Output |
---|---|
Please look at this image and answer the following question in detail. Please write the plot of a novel set in the location of this image. | The title of the novel set in the location of this image is "The Secret of the Stairs". The story unfolds on an old stone staircase in a quiet alley away from the city's noise. The protagonist has a mysterious experience where the past and the present intersect while climbing the stairs. Along the way, there are mossy stone walls that give a sense of history and traditional wooden buildings lined up. Every time the protagonist climbs the stairs, they touch the memories of past events and people and face their own past. And when they reach the top of the stairs, the hidden secret is revealed. |
✨ Features
This repository provides Asagi-14B, a large-scale Japanese Vision & Language Model (VLM). Asagi-14B has been trained on an extensive Japanese dataset, incorporating a diverse range of data sources.
A significant portion of the training data is synthesized using models such as the Japanese large language model (CALM3-22B-Chat) and the English Vision & Language Model (Phi3.5-vision-instruct).
Importantly, we do not use LLMs that restrict the usage of their outputs in the license terms (e.g., GPT-4) to synthesize the training data.
Property | Details |
---|---|
Model Type | Asagi-14B, a large-scale Japanese Vision & Language Model (VLM) |
Training Data | A large part is synthesized using models like CALM3-22B-Chat and Phi3.5-vision-instruct, and also includes various datasets such as ROIS, Japanese image text pairs, Wikipedia, etc. |
Model Components | Vision encoder: siglip-so400m-patch14-384 (428M parameters); Projector: 2-layer MLP (64M parameters); LLM: llm-jp-3-13b-instruct (13B parameters) |
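The component sizes listed above give a rough idea of the total parameter count and of the memory needed just to hold the weights in bfloat16, the dtype used in the usage script. The following is a back-of-the-envelope sketch: the parameter counts are the rounded figures from the table, and activations, the KV cache, and framework overhead are not included.

```python
# Rounded parameter counts taken from the "Model Components" row.
COMPONENT_PARAMS = {
    "vision_encoder": 428_000_000,   # siglip-so400m-patch14-384
    "projector": 64_000_000,         # 2-layer MLP
    "llm": 13_000_000_000,           # llm-jp-3-13b-instruct
}

BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each parameter in 2 bytes


def total_params(components):
    """Sum the parameter counts of all components."""
    return sum(components.values())


def weight_memory_gib(n_params, bytes_per_param=BYTES_PER_PARAM_BF16):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


if __name__ == "__main__":
    n = total_params(COMPONENT_PARAMS)
    print(f"total parameters: {n / 1e9:.2f}B")
    print(f"bf16 weights: {weight_memory_gib(n):.1f} GiB")
```

Under these assumptions the total comes to about 13.5B parameters, or roughly 25 GiB of weights in bfloat16, which is why the usage script relies on `device_map="auto"` to place the model across available devices.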
🔧 Technical Details
Training Data
Dataset | Preprocess | Stage1 | Stage2 | Size |
---|---|---|---|---|
ROIS (Ours) | Synthesized | ✓ | ✓ | 8.4M |
Japanese image text pairs | Synthesized | ✓ | ✓ | 4.4M |
Wikipedia | Synthesized | ✓ | ✓ | 2.5M |
Open Images | Translated | ✓ | ✓ | 680K |
DCI | Translated | ✓ | ✓ | 7K |
CommonCatalog CC-BY | Translated | ✓ | ✓ | 3.5M |
LLaVA-Pretrain-JA | - | ✓ | ✓ | 550K |
STAIR Captions | - | ✓ | ✓ | 410K |
Flickr-JP | - | ✓ | ✓ | 160K |
YJ Captions | - | ✓ | ✓ | 130K |
Japanese Pascal | - | ✓ | ✓ | 5K |
ArtBench | Synthesized | - | ✓ | 100K |
GQA | Translated | - | ✓ | 1.9M |
VQA v2 | Translated | - | ✓ | 880K |
A-OKVQA | Translated | - | ✓ | 34K |
OK-VQA | Translated | - | ✓ | 18K |
Japanese Visual Genome | Translated | - | ✓ | 1.6M |
PangeaInstruct | - | - | ✓ | 93K |
Note: ROIS (Ours) is a newly collected dataset crawled from the web specifically for this project. The dataset consists of image and raw text pairs, which are used to synthesize the training data.
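To get a feel for the scale of each training stage, the sizes in the table can be summed per stage. The sketch below makes two assumptions that should be checked against the table: datasets with two checkmarks are used in both Stage1 and Stage2, and datasets with a single checkmark are used in Stage2 only. The listed sizes are rounded, so the totals are approximate.

```python
from decimal import Decimal

# (dataset, size, used in Stage1, used in Stage2)
# Assumption: single-checkmark rows in the table are Stage2-only.
DATASETS = [
    ("ROIS (Ours)", "8.4M", True, True),
    ("Japanese image text pairs", "4.4M", True, True),
    ("Wikipedia", "2.5M", True, True),
    ("Open Images", "680K", True, True),
    ("DCI", "7K", True, True),
    ("CommonCatalog CC-BY", "3.5M", True, True),
    ("LLaVA-Pretrain-JA", "550K", True, True),
    ("STAIR Captions", "410K", True, True),
    ("Flickr-JP", "160K", True, True),
    ("YJ Captions", "130K", True, True),
    ("Japanese Pascal", "5K", True, True),
    ("ArtBench", "100K", False, True),
    ("GQA", "1.9M", False, True),
    ("VQA v2", "880K", False, True),
    ("A-OKVQA", "34K", False, True),
    ("OK-VQA", "18K", False, True),
    ("Japanese Visual Genome", "1.6M", False, True),
    ("PangeaInstruct", "93K", False, True),
]

UNITS = {"K": 1_000, "M": 1_000_000}


def parse_size(text):
    """Convert a size such as '8.4M' or '680K' to an integer count."""
    return int(Decimal(text[:-1]) * UNITS[text[-1]])


def stage_totals(rows):
    """Return (Stage1 total, Stage2 total) over the given rows."""
    stage1 = sum(parse_size(size) for _name, size, in_s1, _in_s2 in rows if in_s1)
    stage2 = sum(parse_size(size) for _name, size, _in_s1, in_s2 in rows if in_s2)
    return stage1, stage2


if __name__ == "__main__":
    s1, s2 = stage_totals(DATASETS)
    print(f"Stage1: ~{s1 / 1e6:.1f}M samples, Stage2: ~{s2 / 1e6:.1f}M samples")
```

Under these assumptions, Stage1 covers about 20.7M samples and Stage2 about 25.4M.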
📚 Documentation
Evaluation
We evaluated our model using Heron-Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500. We used the eval-mm library for this evaluation.
Here, models with "†" are not trained with GPT-generated data. Bold numbers indicate the best performance among all models, and underlined numbers indicate the best performance among models not trained with GPT-generated data.
Model | LM Size | Heron-Bench (LLM (%)) | JA-VLM-Bench-In-the-Wild (ROUGE-L) | JA-VLM-Bench-In-the-Wild (LLM (/5.0)) | JA-VG-VQA-500 (ROUGE-L) | JA-VG-VQA-500 (LLM (/5.0)) |
---|---|---|---|---|---|---|
Japanese InstructBLIP Alpha† | 7B | 14.0 | 20.8 | 2.42 | - | - |
Japanese Stable VLM† | 7B | 24.2 | 23.3 | 2.47 | - | - |
LLaVA-CALM2-SigLIP† | 7B | 43.3 | 47.2 | 3.15 | 17.4 | 3.21 |
Llama-3-EvoVLM-JP-v2 | 8B | 39.3 | 41.4 | 2.92 | 23.5 | 2.96 |
VILA-jp | 13B | 57.2 | 52.3 | 3.69 | 16.2 | 3.62 |
Asagi-2B† | 1.8B | 44.7 | 48.8 | 3.26 | 53.7 | 3.69 |
Asagi-4B† | 3.7B | 49.3 | 49.6 | 3.38 | 55.6 | 3.78 |
Asagi-8B† | 7.2B | 54.7 | 49.4 | 3.45 | 56.43 | 3.84 |
Asagi-14B† | 13B | 55.8 | 50.8 | 3.44 | 56.8 | 3.84 |
GPT-4o | - | 87.6 | 37.6 | 3.85 | 12.1 | 3.58 |
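For readers unfamiliar with the ROUGE-L columns, the metric is based on the longest common subsequence (LCS) between a reference answer and a model answer. The sketch below computes a plain ROUGE-L F1 over pre-tokenized sequences. It is an illustration of the metric only, not the eval-mm implementation: tokenization (which matters a great deal for Japanese), the precision/recall weighting, and the scaling to the 0-100 range used in the table are all assumptions left to the caller.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two token lists, in the range [0, 1]."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    ref = "the cat sat on the mat".split()
    cand = "the cat lay on the mat".split()
    print(f"ROUGE-L F1: {rouge_l_f1(ref, cand):.3f}")
```

In this toy example the two sentences share a five-token subsequence out of six tokens each, so precision, recall, and F1 are all 5/6.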
📄 License
The model is licensed under Apache-2.0.
⚠️ Important Note
The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
Model Card Authors
Kohei Uehara






