🚀 Asagi-8B: A Large-scale Japanese Vision & Language Model
This repository provides Asagi-8B, a large-scale Japanese Vision & Language Model (VLM). Asagi-8B is trained on an extensive Japanese dataset drawn from diverse sources. A significant portion of the training data was synthesized using other models, such as the Japanese large language model CALM3-22B-Chat and the English Vision & Language Model Phi3.5-vision-instruct. Notably, no LLM whose license restricts the use of its outputs (e.g., GPT-4) was used for data synthesis.
✨ Features
- Large-scale Japanese VLM: Asagi-8B is specifically designed for Japanese language processing in vision and language tasks.
- Diverse Training Data: Utilizes a wide range of data sources, including synthesized and translated datasets.
- Skipped Stage-1 Training: Unlike other models in the Asagi series, Asagi-8B skips stage-1 training.
📦 Installation
Requirements
transformers==4.45.1
accelerate==0.34.2
torch==2.4.0
torchvision==0.19.0
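To check whether an environment matches these pins, a small sketch like the following may help. It uses only the standard library's `importlib.metadata`; the names `PINS`, `installed_version`, and `find_mismatches` are illustrative and not part of the Asagi repository.

```python
from importlib import metadata

# Pins copied from the requirements list above.
PINS = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.4.0",
    "torchvision": "0.19.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    Local build suffixes such as "+cu121" are ignored when comparing.
    """
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        base = found.split("+")[0] if found else None
        if base != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINS).items():
        print(f"{package}: expected {expected}, found {found}")
```

An empty result means every pinned package is installed at the listed version. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.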
💻 Usage Examples
Basic Usage
import requests
import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoProcessor, GenerationConfig
transformers.set_seed(42)
model_path = "MIL-UT/Asagi-8B"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
generation_config = GenerationConfig(
    do_sample=True,
    num_beams=5,
    max_new_tokens=256,
    temperature=0.7,
    repetition_penalty=1.5
)
prompt = ("The following is an instruction describing a task. Write a response that appropriately fulfills the request.\n\n"
          "### Instruction:\n<image>\nPlease look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail.\n\n### Response:\n")
# sample image
sample_image_url = "https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw)
inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
# Overwrite the text inputs with the plain tokenizer output for the same prompt
inputs_text = processor.tokenizer(prompt, return_tensors="pt")
inputs['input_ids'] = inputs_text['input_ids']
inputs['attention_mask'] = inputs_text['attention_mask']
# Cast float32 tensors (e.g. pixel values) to the model dtype (bfloat16 here)
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs[k] = v.to(model.dtype)
# Move every tensor to the model device and drop token_type_ids
inputs = {k: inputs[k].to(model.device) for k in inputs if k != "token_type_ids"}
generate_ids = model.generate(
    **inputs,
    generation_config=generation_config
)
generated_text = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
# do not print the prompt
if "<image>" in prompt:
    prompt = prompt.replace("<image>", " ")
generated_text = generated_text.replace(prompt, "")
print(f"Generated text: {generated_text}")
# >>> Generated text: This photo captures the night view of Ginza, Tokyo. Ginza is one of the representative bustling streets in Tokyo, lined with many high-rise buildings shining with the lights of neon signs and streetlights. Particularly striking is the "Ginza Crystal Building" towering in the center, with its characteristic clock tower being quite impressive. The cars and people moving between the buildings give a sense of the city's vitality.
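The prompt in the script above follows a fixed template: a preamble, an `### Instruction:` section containing the `<image>` placeholder and the question, and an empty `### Response:` section. A minimal sketch of helpers that fill this template and strip the echoed prompt is shown below. `build_prompt` and `strip_prompt` are hypothetical names, and the template text is copied from the example rather than from an official specification.

```python
PREAMBLE = (
    "The following is an instruction describing a task. "
    "Write a response that appropriately fulfills the request."
)


def build_prompt(question, image_token="<image>"):
    """Wrap a question in the instruction template used in the example above."""
    return (
        f"{PREAMBLE}\n\n"
        f"### Instruction:\n{image_token}\n{question}\n\n"
        "### Response:\n"
    )


def strip_prompt(generated_text, prompt, image_token="<image>"):
    """Remove the echoed prompt from decoded output, as the example script does."""
    echoed = prompt.replace(image_token, " ")
    return generated_text.replace(echoed, "")
```

With these helpers, only the question changes between requests, for instance `build_prompt("Please evaluate this photo.")`. Whether the decoded text echoes the prompt exactly as `strip_prompt` expects depends on the tokenizer round trip, so the result is worth checking on real output.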
Example
Prompt | Output |
---|---|
Please look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail. | This photo captures the night view of Ginza, Tokyo. Ginza is one of the representative bustling streets in Tokyo, lined with many high-rise buildings shining with the lights of neon signs and streetlights. Particularly striking is the "Ginza Crystal Building" towering in the center, with its characteristic clock tower being quite impressive. The cars and people moving between the buildings give a sense of the city's vitality. |
Note: This image was actually taken in Shibuya, not Ginza.
More Examples
Prompt | Output |
---|---|
Please look at this image and answer the following instruction in detail and specifically. Please tell me in detail about the content of this photo. | This photo captures the Rainbow Bridge in the Odaiba area of Tokyo. The Rainbow Bridge is an important transportation route connecting Minato Ward and Koto Ward in Tokyo, with its characteristic white towers and cables being quite impressive. In the background, there are rows of high-rise buildings, spreading out the urban landscape of Tokyo. The sky is partially covered with clouds, suggesting a morning or evening time. |
Prompt | Output |
---|---|
Please look at this image and answer the following question in detail and specifically. Please evaluate this photo. | This photo captures the magnificent scenery of the Tokyo Skytree, with its red and white structure towering towards the sky. Since it was taken from a low angle, the tower looks even more imposing. In the background, there is a clear blue sky dotted with white clouds, and in the foreground, there are lush green trees and streetlights, with natural elements blending into the urban scenery. Overall, it is a beautiful photo where artificial objects and nature are perfectly in harmony. |
Prompt | Output |
---|---|
Please look at this image and answer the following question in detail and specifically. Please explain in detail about the shrine in this photo. | The shrine in this photo reflects the traditional Japanese architectural style. The contrast between the vermilion torii and the white wall is beautiful, creating a sacred atmosphere. The torii is located at the entrance of the shrine, serving the role of separating the sacred and secular worlds. The roof of the shrine is tiled with green tiles, which is one of the traditional Japanese architectural styles. Also, there are modern buildings lined up around the shrine, symbolizing the coexistence of tradition and modernity in the Japanese landscape. |
Prompt | Output |
---|---|
Please look at this image and answer the following instruction in detail and specifically. What kind of style does this Japanese garden belong to? And what is the reason? | This garden belongs to the style called "pond-circuit style garden". A pond-circuit style garden refers to a garden with a pond or water scenery, also known as an aquatic garden or a garden with water. This style of garden is designed to maximize the beauty of nature, with the pond and water scenery being the central elements of the garden. |
Prompt | Output |
---|---|
Please look at this image and answer the following question in detail. Please write a synopsis of a novel set in the location of this image. | The title of the novel set in the location of this image is "The Secret of the Stone Steps". The story begins with the protagonist escaping the city's noise and climbing an old stone staircase. There are lush green plants growing around the staircase, and there are buildings that give a sense of history lined up. As the protagonist climbs the staircase, they have a mysterious experience of traveling between the past and the future, and finally uncover the secret hidden under the staircase. |
📚 Documentation
Model Components
Property | Details |
---|---|
Vision encoder | siglip-so400m-patch14-384 (428M parameters) |
Projector | 2-layer MLP (43M parameters) |
LLM | llm-jp-3-7.2b-instruct (7.2B parameters) |
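As a quick sanity check on the figures in the table, the three component sizes can be added up. The sketch below uses the rounded counts exactly as listed, so the total is approximate; the dictionary keys are illustrative labels.

```python
# Parameter counts as listed in the table above (rounded figures).
COMPONENTS = {
    "vision_encoder": 428_000_000,  # siglip-so400m-patch14-384
    "projector": 43_000_000,        # 2-layer MLP
    "llm": 7_200_000_000,           # llm-jp-3-7.2b-instruct
}

total = sum(COMPONENTS.values())
print(f"total: about {total / 1e9:.2f}B parameters")  # about 7.67B

# Share of each component in the total parameter count.
for name, count in COMPONENTS.items():
    print(f"{name}: {100 * count / total:.1f}%")
```

The total of roughly 7.67B is consistent with the "8B" in the model name, and the language model accounts for the large majority of the parameters.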
Training Details
Note: Stage-1 training was skipped for Asagi-8B, so the Stage1 column below applies only to the other models in the Asagi series.
Dataset | Preprocess | Stage1 | Stage2 | Size |
---|---|---|---|---|
ROIS (Ours) | Synthesized | ✓ | ✓ | 8.4M |
Japanese image text pairs | Synthesized | ✓ | ✓ | 4.4M |
Wikipedia | Synthesized | ✓ | ✓ | 2.5M |
Open Images | Translated | ✓ | ✓ | 680K |
DCI | Translated | ✓ | ✓ | 7K |
CommonCatalog CC-BY | Translated | ✓ | ✓ | 3.5M |
LLaVA-Pretrain-JA | - | ✓ | ✓ | 550K |
STAIR Captions | - | ✓ | ✓ | 410K |
Flickr-JP | - | ✓ | ✓ | 160K |
YJ Captions | - | ✓ | ✓ | 130K |
Japanese Pascal | - | ✓ | ✓ | 5K |
ArtBench | Synthesized | - | ✓ | 100K |
GQA | Translated | - | ✓ | 1.9M |
VQA v2 | Translated | - | ✓ | 880K |
A-OKVQA | Translated | - | ✓ | 34K |
OK-VQA | Translated | - | ✓ | 18K |
Japanese Visual Genome | Translated | - | ✓ | 1.6M |
PangeaInstruct | - | - | ✓ | 93K |
Note: ROIS (Ours) is a newly collected dataset crawled from the web specifically for this project. The dataset consists of image and raw text pairs, which are used to synthesize the training data.
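Summing the Size column of the table above gives a rough sense of the overall data scale. The sketch below assumes the column counts training samples and uses the rounded values as listed (in thousands), so the total is approximate.

```python
# Sizes from the table above, in thousands of samples (rounded as listed).
SIZES_K = {
    "ROIS (Ours)": 8400,
    "Japanese image text pairs": 4400,
    "Wikipedia": 2500,
    "Open Images": 680,
    "DCI": 7,
    "CommonCatalog CC-BY": 3500,
    "LLaVA-Pretrain-JA": 550,
    "STAIR Captions": 410,
    "Flickr-JP": 160,
    "YJ Captions": 130,
    "Japanese Pascal": 5,
    "ArtBench": 100,
    "GQA": 1900,
    "VQA v2": 880,
    "A-OKVQA": 34,
    "OK-VQA": 18,
    "Japanese Visual Genome": 1600,
    "PangeaInstruct": 93,
}

total_k = sum(SIZES_K.values())
largest = max(SIZES_K, key=SIZES_K.get)
print(f"{len(SIZES_K)} datasets, about {total_k / 1000:.1f}M samples in total")
print(f"largest: {largest} ({100 * SIZES_K[largest] / total_k:.0f}% of the total)")
```

Under these assumptions the listed datasets add up to roughly 25.4M samples, with ROIS contributing about a third.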
Evaluation
We evaluated our model using Heron-Bench, JA-VLM-Bench-in-the-Wild, and JA-VG-VQA-500. We used the eval-mm library for this evaluation.
Here, models with "†" are not trained with GPT-generated data. Bold numbers indicate the best performance among all models, and underlined numbers indicate the best performance among models not trained with GPT-generated data.
Model | LM Size | Heron-Bench (LLM (%)) | JA-VLM-Bench-In-the-Wild (ROUGE-L) | JA-VLM-Bench-In-the-Wild (LLM (/5.0)) | JA-VG-VQA-500 (ROUGE-L) | JA-VG-VQA-500 (LLM (/5.0)) |
---|---|---|---|---|---|---|
Japanese InstructBLIP Alpha† | 7B | 14.0 | 20.8 | 2.42 | - | - |
Japanese Stable VLM† | 7B | 24.2 | 23.3 | 2.47 | - | - |
LLaVA-CALM2-SigLIP† | 7B | 43.3 | 47.2 | 3.15 | 17.4 | 3.21 |
Llama-3-EvoVLM-JP-v2 | 8B | 39.3 | 41.4 | 2.92 | 23.5 | 2.96 |
VILA-jp | 13B | 57.2 | 52.3 | 3.69 | 16.2 | 3.62 |
Asagi-2B† | 1.8B | 44.7 | 48.8 | 3.26 | 53.7 | 3.69 |
Asagi-4B† | 3.7B | 49.3 | 49.6 | 3.38 | 55.6 | 3.78 |
Asagi-8B† | 7.2B | 54.7 | 49.4 | 3.45 | 56.43 | 3.84 |
Asagi-14B† | 13B | 55.8 | 50.8 | 3.44 | 56.8 | 3.84 |
GPT-4o | - | 87.6 | 37.6 | 3.85 | 12.1 | 3.58 |
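For readers unfamiliar with the ROUGE-L columns, the metric is based on the longest common subsequence (LCS) between a reference answer and a model answer. The following is a minimal sketch of an LCS-based F1 score over whitespace-separated tokens. It is not the eval-mm implementation: whitespace splitting is an assumption made for illustration, and Japanese text would need a proper tokenizer before such a score is meaningful.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_l_f1("the cat sat on the mat", "the cat is on the mat")` shares a five-token subsequence out of six tokens on each side, giving a score of 5/6. Because the metric rewards token overlap with short references, a long, detailed answer can score low on ROUGE-L while still being rated well by an LLM judge.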
🔧 Technical Details
The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
📄 License
This project is licensed under the Apache-2.0 license.
Model Card Authors
Kohei Uehara






