Asagi-4B Open-source Japanese Vision-Language Model: Relying on Diverse Data for Precise Service in Japanese Application Scenarios

Asagi 4B

Developed by MIL-UT

Asagi-4B is a large-scale Japanese Vision-Language Model (VLM) trained on extensive Japanese datasets, incorporating diverse data sources.

Image-to-Text

Transformers

Japanese

Open Source License: Apache-2.0

#Japanese Visual Language Understanding #Multimodal Synthetic Data Training #High-Precision Image Captioning

Release Time: 1/27/2025

Model Overview

This model is a Japanese vision-language model that takes an image and a text prompt as input and generates Japanese text about the image, making it suitable for image understanding and image-conditioned text generation tasks.

Model Features

Japanese Vision-Language Model

A vision-language model specifically optimized for Japanese, capable of understanding and generating Japanese text related to images.

Diverse Training Data

Incorporates multiple data sources, including synthetic and translated data, covering a wide range of Japanese application scenarios.

High Performance

On Heron-Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500, Asagi-4B scores higher than the listed 7B models that were not trained with GPT-generated data (see the Evaluation table).

Model Capabilities

Image Understanding

Japanese Text Generation

Visual Question Answering

Image Caption Generation

Use Cases

Visual Question Answering

Image Content Question Answering

Answers detailed questions about image content, such as the location where the image was taken, descriptions of the content, etc.

Can generate detailed and specific answers, accurately describing the image content.

Image Caption Generation

Image Content Description

Generates detailed descriptions of images, including scenes, objects, and backgrounds.

The generated natural language descriptions are accurate and detailed, matching the image content.

🚀 Asagi-4B: A Large-scale Japanese Vision & Language Model

Asagi-4B is a large-scale Japanese Vision & Language Model (VLM). It is trained on a large Japanese dataset drawn from diverse sources and offers high-quality image-text processing capabilities.

🚀 Quick Start

Requirements

transformers==4.45.1
accelerate==0.34.2
torch==2.4.0
torchvision==0.19.0
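
One way to confirm that an environment matches these pins is to compare the installed versions against them. The following is a minimal sketch, not part of the Asagi repository: `installed_version` and `check_pins` are hypothetical helper names, and the lookup relies only on the standard-library `importlib.metadata`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Requirements list above.
REQUIRED = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.4.0",
    "torchvision": "0.19.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_pins(required, lookup=installed_version):
    """Map each package whose version differs from its pin to (pinned, found)."""
    mismatches = {}
    for package, pinned in required.items():
        found = lookup(package)
        # Local build tags such as "2.4.0+cu121" are treated as matching "2.4.0".
        if found is None or found.split("+")[0] != pinned:
            mismatches[package] = (pinned, found)
    return mismatches


if __name__ == "__main__":
    for package, (pinned, found) in check_pins(REQUIRED).items():
        print(f"{package}: pinned {pinned}, found {found}")
```

An empty result means every pinned package is installed at the listed version. Other versions may also work; the list above only records the versions given on this page.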

How to use

import requests
import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoProcessor, GenerationConfig

transformers.set_seed(42)
model_path = "MIL-UT/Asagi-4B"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

generation_config = GenerationConfig(
    do_sample=True,
    num_beams=5,
    max_new_tokens=256,
    temperature=0.7,
    repetition_penalty=1.5
)

prompt = ("The following is an instruction that describes a task. Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n<image>\nPlease look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail.\n\n### Response:\n")

# sample image
sample_image_url = "https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw)

inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
inputs_text = processor.tokenizer(prompt, return_tensors="pt")
inputs['input_ids'] = inputs_text['input_ids']
inputs['attention_mask'] = inputs_text['attention_mask']
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs[k] = v.to(model.dtype)
inputs = {k: inputs[k].to(model.device) for k in inputs if k != "token_type_ids"}

generate_ids = model.generate(
    **inputs,
    generation_config=generation_config
)
generated_text = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# do not print the prompt
if "<image>" in prompt:
    prompt = prompt.replace("<image>", " ")
generated_text = generated_text.replace(prompt, "")

print(f"Generated text: {generated_text}")

# >>> Generated text:  This photo was taken in Shibuya, Tokyo. It shows the cityscape of Shibuya at night, with high-rise buildings lined up and streetlights and neon signs shining.
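
The prompt in the script above follows a fixed instruction template with an `<image>` placeholder, and the script removes the echoed prompt from the decoded output. The sketch below isolates those two steps so other instructions can be substituted. `build_prompt` and `strip_prompt` are hypothetical helper names introduced here for illustration, and the `decoded` string is a stand-in for real model output, not something the model produced.

```python
# Template copied from the prompt string in the example above.
PROMPT_TEMPLATE = (
    "The following is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n<image>\n{instruction}\n\n### Response:\n"
)


def build_prompt(instruction):
    """Insert a user instruction into the template (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


def strip_prompt(generated_text, prompt):
    """Remove the echoed prompt from decoded output, as the example above does."""
    # The example replaces the "<image>" placeholder with a space
    # before deleting the prompt from the decoded text.
    echoed = prompt.replace("<image>", " ")
    return generated_text.replace(echoed, "")


prompt = build_prompt("Please describe the content of this image in detail.")

# Stand-in for decoded model output: the echoed prompt followed by an answer.
decoded = prompt.replace("<image>", " ") + "A night view of Shibuya."
print(strip_prompt(decoded, prompt))
```

The value returned by `build_prompt` can be passed as `text=` to the processor in place of the hard-coded `prompt` string.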

✨ Features

This repository provides Asagi-4B, a large-scale Japanese Vision & Language Model (VLM). Asagi-4B has been trained on an extensive Japanese dataset, incorporating a diverse range of data sources.

A significant portion of the training data is synthesized using models such as the Japanese large language model (CALM3-22B-Chat) and the English Vision & Language Model (Phi3.5-vision-instruct).

Importantly, we do not use LLMs that restrict the usage of their outputs in the license terms (e.g., GPT-4) to synthesize the training data.

Model components	Model / Architecture	Parameters
Vision encoder	siglip-so400m-patch14-384	428M
Projector	2-layer MLP	64M
LLM	llm-jp-3-3.7b-instruct	3.7B
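
As a rough check on the model size, the component counts in the table can be summed and converted to a weight-memory estimate for the `torch.bfloat16` setting used in the Quick Start script. This is a back-of-the-envelope sketch: the table values are rounded, and the estimate covers weights only, not activations or the generation cache.

```python
# Parameter counts in millions, copied from the component table above.
COMPONENTS_M = {
    "vision_encoder": 428,  # siglip-so400m-patch14-384
    "projector": 64,        # 2-layer MLP
    "llm": 3700,            # llm-jp-3-3.7b-instruct
}

BYTES_PER_BF16_PARAM = 2  # bfloat16 stores each parameter in 16 bits

total_m = sum(COMPONENTS_M.values())
weight_bytes = total_m * 1_000_000 * BYTES_PER_BF16_PARAM

print(f"Total parameters: about {total_m / 1000:.2f}B")       # about 4.19B
print(f"bfloat16 weights: about {weight_bytes / 1e9:.1f} GB")  # about 8.4 GB
```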

💻 Usage Examples

Basic Usage

See the script under "How to use" in the Quick Start section above.

Advanced Usage

Example 1

Prompt	Output
Please look at this image and answer the following instruction in detail and specifically. Please tell me in detail about the content of this photo.	This photo was taken of the Rainbow Bridge in Tokyo. The bridge is painted white and gracefully arches over the calm water. In the background, there is an urban landscape with high-rise buildings. The sky is a clear blue, dotted with white clouds.

Example 2

Prompt	Output
Please look at this image and answer the following question in detail and specifically. Please evaluate this photo.	This photo captures an autumn landscape with Tokyo Tower in the background. Tokyo Tower is characterized by its red and white striped pattern, and there are maple trees with red leaves around it. The sky is partially cloudy, and the sunlight shines through the clouds, creating a beautiful contrast. In the foreground, you can see the silhouettes of streetlights and buildings, giving a sense of an urban atmosphere.

Example 3

Prompt	Output
Please look at this image and answer the following question in detail and specifically. Please explain in detail about the shrine in this photo.	The shrine in this photo reflects the traditional Japanese architectural style. It is characterized by white walls and bright orange roofs, creating a sacred atmosphere. At the entrance of the shrine, there is a vermilion torii, which is a symbolic gate indicating the entrance to a sacred space in Shinto. In front of the torii, there is a small shrine hall, which serves as a place for worshipers to offer prayers. There is a rope beside the shrine hall, suggesting that it is a sacred place. The shrine is located among modern buildings, symbolizing the coexistence of tradition and modernity in Japanese culture.

Example 4

Prompt	Output
Please look at this image and answer the following instruction in detail and specifically. What type of garden is this among Japanese gardens? And what's the reason?	This is a type of garden called a "rock garden" among Japanese gardens. A rock garden is a form that skillfully arranges natural rocks to pursue tranquility and beauty and is one of the traditional Japanese garden designs.

Example 5

Prompt	Output
Please look at this image and answer the following question in detail. Please write a synopsis of a novel set in the location of this image.	Once upon a time, in a small village, there was a stone staircase leading to the center of the village. That staircase was the entrance to a sacred place for the villagers, and they climbed the staircase to offer their prayers at the sacred place. Around the staircase, there were lush green plants, spreading a scenery that made people feel in harmony with nature. Every time the villagers climbed the staircase, they showed their gratitude for nature's blessings and respect for the sacred place.

🔧 Technical Details

Training Data

Dataset	Preprocess	Stage1	Stage2	Size
ROIS (Ours)	Synthesized	✓	✓	8.4M
Japanese image text pairs	Synthesized	✓	✓	4.4M
Wikipedia	Synthesized	✓	✓	2.5M
Open Images	Translated	✓	✓	680K
DCI	Translated	✓	✓	7K
CommonCatalog CC-BY	Translated	✓	✓	3.5M
LLaVA-Pretrain-JA		✓	✓	550K
STAIR Captions		✓	✓	410K
Flickr-JP		✓	✓	160K
YJ Captions		✓	✓	130K
Japanese Pascal		✓	✓	5K
ArtBench	Synthesized		✓	100K
GQA	Translated		✓	1.9M
VQA v2	Translated		✓	880K
A-OKVQA	Translated		✓	34K
OK-VQA	Translated		✓	18K
Japanese Visual Genome	Translated		✓	1.6M
PangeaInstruct			✓	93K

Note: ROIS (Ours) is a newly collected dataset crawled from the web specifically for this project. The dataset consists of image and raw text pairs, which are used to synthesize the training data.
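
To get a sense of how much data each training stage sees, the sizes in the table can be totaled per stage. The sketch below assumes the Size column counts training samples (the table does not state the unit) and uses the rounded figures as listed, so the totals are approximate.

```python
# (dataset, used in Stage1, used in Stage2, approximate size) from the table above.
DATASETS = [
    ("ROIS (Ours)", True, True, 8_400_000),
    ("Japanese image text pairs", True, True, 4_400_000),
    ("Wikipedia", True, True, 2_500_000),
    ("Open Images", True, True, 680_000),
    ("DCI", True, True, 7_000),
    ("CommonCatalog CC-BY", True, True, 3_500_000),
    ("LLaVA-Pretrain-JA", True, True, 550_000),
    ("STAIR Captions", True, True, 410_000),
    ("Flickr-JP", True, True, 160_000),
    ("YJ Captions", True, True, 130_000),
    ("Japanese Pascal", True, True, 5_000),
    ("ArtBench", False, True, 100_000),
    ("GQA", False, True, 1_900_000),
    ("VQA v2", False, True, 880_000),
    ("A-OKVQA", False, True, 34_000),
    ("OK-VQA", False, True, 18_000),
    ("Japanese Visual Genome", False, True, 1_600_000),
    ("PangeaInstruct", False, True, 93_000),
]

stage1_total = sum(size for _, in_s1, _, size in DATASETS if in_s1)
stage2_total = sum(size for _, _, in_s2, size in DATASETS if in_s2)
stage2_only = sum(size for _, in_s1, in_s2, size in DATASETS if in_s2 and not in_s1)

print(f"Stage1 total: {stage1_total:,}")
print(f"Stage2 total: {stage2_total:,}")
print(f"Added in Stage2 only: {stage2_only:,}")
```

Every Stage1 dataset is reused in Stage2, so the Stage2 total is the Stage1 total plus the datasets that appear only in Stage2.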

Evaluation

We evaluated our model using Heron-Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500. We used the eval-mm library for this evaluation.

Here, models with "†" are not trained with GPT-generated data. Bold numbers indicate the best performance among all models, and underlined numbers indicate the best performance among models not trained with GPT-generated data.

Model	LM Size	Heron-Bench (LLM (%))	JA-VLM-Bench-In-the-Wild (ROUGE-L)	JA-VLM-Bench-In-the-Wild (LLM (/5.0))	JA-VG-VQA-500 (ROUGE-L)	JA-VG-VQA-500 (LLM (/5.0))
Japanese InstructBLIP Alpha†	7B	14.0	20.8	2.42	-	-
Japanese Stable VLM†	7B	24.2	23.3	2.47	-	-
LLaVA-CALM2-SigLIP†	7B	43.3	47.2	3.15	17.4	3.21
Llama-3-EvoVLM-JP-v2	8B	39.3	41.4	2.92	23.5	2.96
VILA-jp	13B	57.2	52.3	3.69	16.2	3.62
Asagi-2B†	1.8B	44.7	48.8	3.26	53.7	3.69
Asagi-4B†	3.7B	49.3	49.6	3.38	55.6	3.78
Asagi-8B†	7.2B	54.7	49.4	3.45	56.43	3.84
Asagi-14B†	13B	55.8	50.8	3.44	56.8	3.84
GPT-4o	-	87.6	37.6	3.85	12.1	3.58
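
The bold and underline conventions described above can be recomputed from the table itself. The sketch below is illustrative only: `best_models` is a hypothetical helper, the scores are transcribed from the rows above, and `None` stands for cells shown as "-".

```python
METRICS = ["heron_llm", "wild_rouge_l", "wild_llm", "vg_rouge_l", "vg_llm"]

# (model, marked with "†", scores in METRICS order) transcribed from the table above.
ROWS = [
    ("Japanese InstructBLIP Alpha", True, (14.0, 20.8, 2.42, None, None)),
    ("Japanese Stable VLM", True, (24.2, 23.3, 2.47, None, None)),
    ("LLaVA-CALM2-SigLIP", True, (43.3, 47.2, 3.15, 17.4, 3.21)),
    ("Llama-3-EvoVLM-JP-v2", False, (39.3, 41.4, 2.92, 23.5, 2.96)),
    ("VILA-jp", False, (57.2, 52.3, 3.69, 16.2, 3.62)),
    ("Asagi-2B", True, (44.7, 48.8, 3.26, 53.7, 3.69)),
    ("Asagi-4B", True, (49.3, 49.6, 3.38, 55.6, 3.78)),
    ("Asagi-8B", True, (54.7, 49.4, 3.45, 56.43, 3.84)),
    ("Asagi-14B", True, (55.8, 50.8, 3.44, 56.8, 3.84)),
    ("GPT-4o", False, (87.6, 37.6, 3.85, 12.1, 3.58)),
]


def best_models(rows, metric_index, dagger_only=False):
    """Return (best_score, [models]) for one metric, keeping ties."""
    scored = [
        (scores[metric_index], name)
        for name, dagger, scores in rows
        if scores[metric_index] is not None and (dagger or not dagger_only)
    ]
    top = max(score for score, _ in scored)
    return top, [name for score, name in scored if score == top]


for index, metric in enumerate(METRICS):
    overall = best_models(ROWS, index)                      # "bold" entry
    no_gpt = best_models(ROWS, index, dagger_only=True)     # "underlined" entry
    print(f"{metric}: best overall {overall}, best without GPT data {no_gpt}")
```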

📄 License

The model is released under the Apache-2.0 license.

⚠️ Important Note

The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

Model Card Authors

Kohei Uehara
