Asagi-8B Open-source Japanese Vision and Language Model - Supports Diverse Applications Based on Extensive Datasets

Asagi 8B

Developed by MIL-UT

Asagi-8B is a large-scale Japanese Vision-Language Model (VLM) trained on extensive Japanese datasets, integrating diverse data sources.

Image-to-Text

Transformers

Language: Japanese
License: Apache-2.0 (open source)
Tags: #Japanese Visual Language Model #Multimodal Q&A #Image Scene Understanding

Downloads 58

Release Time: 2/19/2025

Model Overview

Asagi-8B is a Japanese Vision-Language Model capable of handling image-to-text tasks, suitable for multimodal applications in Japanese environments.

Model Features

Multimodal Capability

Combines visual encoders and language models to understand and generate Japanese text related to images.

Japanese Optimization

Specifically trained on Japanese data, suitable for vision-language tasks in Japanese environments.

Synthetic Data Training

Much of the training data is synthesized with models whose licenses do not restrict the use of their outputs; LLMs with such restrictions (e.g., GPT-4) are not used for data synthesis.

Model Capabilities

Image Caption Generation

Visual Question Answering

Multimodal Reasoning

Use Cases

Visual Question Answering

Image Content Description: Generates detailed Japanese descriptions of an image, covering scenes, objects, and activities.

Location Identification: Suggests where a photo was likely taken and describes the surroundings in detail.

Creative Writing

Story Generation: Creates storylines based on the content of an image.

🚀 Asagi-8B: A Large-scale Japanese Vision & Language Model

This repository provides Asagi-8B, a large-scale Japanese Vision & Language Model (VLM). Asagi-8B is trained on an extensive Japanese dataset drawn from diverse data sources. A significant portion of the training data was synthesized using other models, such as the Japanese large language model CALM3-22B-Chat and the English Vision & Language Model Phi3.5-vision-instruct. Notably, no LLM whose license terms restrict the use of its outputs (e.g., GPT-4) was used for data synthesis.

✨ Features

Large-scale Japanese VLM: Asagi-8B is specifically designed for Japanese language processing in vision and language tasks.
Diverse Training Data: Utilizes a wide range of data sources, including synthesized and translated datasets.
Skipped Stage-1 Training: Unlike other models in the Asagi series, Asagi-8B skips stage-1 training.

📦 Installation

Requirements

transformers==4.45.1
accelerate==0.34.2
torch==2.4.0
torchvision==0.19.0
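
The pinned versions above can be compared against the local environment before loading the model. The following is a minimal sketch using only the standard library; `check_pins` and its `get_version` parameter are hypothetical names introduced here for illustration, not part of the repository.

```python
from importlib import metadata

# Pins copied from the requirements list above.
PINNED = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.4.0",
    "torchvision": "0.19.0",
}


def check_pins(pinned=PINNED, get_version=metadata.version):
    """Return {package: (expected, found)} for each mismatched or missing package."""
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed
        # Ignore local build tags such as "+cu121" that some wheels carry.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems
```

Calling `check_pins()` with no arguments inspects the current environment and returns an empty dictionary when every pin matches. Whether other versions also work is not stated in the source, so a mismatch is a hint rather than a guaranteed failure.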

💻 Usage Examples

Basic Usage

import requests
import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoProcessor, GenerationConfig

transformers.set_seed(42)
model_path = "MIL-UT/Asagi-8B"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

generation_config = GenerationConfig(
    do_sample=True,
    num_beams=5,
    max_new_tokens=256,
    temperature=0.7,
    repetition_penalty=1.5
)

prompt = ("The following is an instruction describing a task. Write a response that appropriately fulfills the request.\n\n"
            "### Instruction:\n<image>\nPlease look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail.\n\n### Response:\n")

# sample image
sample_image_url = "https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw)

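# Preprocess the prompt and image together
inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
# Overwrite the processor's text tensors with the tokenizer's output for the same prompt
inputs_text = processor.tokenizer(prompt, return_tensors="pt")
inputs['input_ids'] = inputs_text['input_ids']
inputs['attention_mask'] = inputs_text['attention_mask']
# Cast float32 tensors (e.g., pixel values) to the model's dtype (bfloat16 here)
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs[k] = v.to(model.dtype)
# Move everything to the model's device and drop token_type_ids
inputs = {k: inputs[k].to(model.device) for k in inputs if k != "token_type_ids"}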

generate_ids = model.generate(
    **inputs,
    generation_config=generation_config
)
generated_text = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# do not print the prompt
if "<image>" in prompt:
    prompt = prompt.replace("<image>", " ")
generated_text = generated_text.replace(prompt, "")

print(f"Generated text: {generated_text}")

# >>> Generated text:  This photo captures the night view of Ginza, Tokyo. Ginza is one of the representative bustling streets in Tokyo, lined with many high-rise buildings shining with the lights of neon signs and streetlights. Particularly striking is the "Ginza Crystal Building" towering in the center, with its characteristic clock tower being quite impressive. The cars and people moving between the buildings give a sense of the city's vitality.
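
The prompt in the example follows a fixed template: a preamble, an `### Instruction:` block containing the `<image>` placeholder, and an `### Response:` marker. A minimal sketch of helpers that build this template and strip the echoed prompt from the decoded output is shown below. `build_prompt` and `strip_prompt` are hypothetical names, and the template wording follows the English rendering in this document; since the model is Japanese, the exact strings should be checked against the upstream model card.

```python
# Template text copied from the prompt in the usage example above.
PREAMBLE = (
    "The following is an instruction describing a task. "
    "Write a response that appropriately fulfills the request.\n\n"
)


def build_prompt(instruction: str, with_image: bool = True) -> str:
    """Wrap an instruction in the Instruction/Response template (hypothetical helper)."""
    image_slot = "<image>\n" if with_image else ""
    return f"{PREAMBLE}### Instruction:\n{image_slot}{instruction}\n\n### Response:\n"


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt from decoded text, mirroring the example's cleanup."""
    # The example replaces "<image>" with a single space before matching.
    return generated_text.replace(prompt.replace("<image>", " "), "")
```

With these helpers, the example's `prompt` could be produced by `build_prompt("...")` and the final cleanup by `strip_prompt(generated_text, prompt)`.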

Example

Example Image

Prompt	Output
Please look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail.	This photo captures the night view of Ginza, Tokyo. Ginza is one of the representative bustling streets in Tokyo, lined with many high-rise buildings shining with the lights of neon signs and streetlights. Particularly striking is the "Ginza Crystal Building" towering in the center, with its characteristic clock tower being quite impressive. The cars and people moving between the buildings give a sense of the city's vitality.

Note: This image was actually taken in Shibuya, not Ginza.

More Examples

Bridge Image

Prompt	Output
Please look at this image and answer the following instruction in detail and specifically. Please tell me in detail about the content of this photo.	This photo captures the Rainbow Bridge in the Odaiba area of Tokyo. The Rainbow Bridge is an important transportation route connecting Minato Ward and Koto Ward in Tokyo, with its characteristic white towers and cables being quite impressive. In the background, there are rows of high-rise buildings, spreading out the urban landscape of Tokyo. The sky is partially covered with clouds, suggesting a morning or evening time.

Tower Image

Prompt	Output
Please look at this image and answer the following question in detail and specifically. Please evaluate this photo.	This photo captures the magnificent scenery of the Tokyo Skytree, with its red and white structure towering towards the sky. Since it was taken from a low angle, the tower looks even more imposing. In the background, there is a clear blue sky dotted with white clouds, and in the foreground, there are lush green trees and streetlights, with natural elements blending into the urban scenery. Overall, it is a beautiful photo where artificial objects and nature are perfectly in harmony.

Shrine Image

Prompt	Output
Please look at this image and answer the following question in detail and specifically. Please explain in detail about the shrine in this photo.	The shrine in this photo reflects the traditional Japanese architectural style. The contrast between the vermilion torii and the white wall is beautiful, creating a sacred atmosphere. The torii is located at the entrance of the shrine, serving the role of separating the sacred and secular worlds. The roof of the shrine is tiled with green tiles, which is one of the traditional Japanese architectural styles. Also, there are modern buildings lined up around the shrine, symbolizing the coexistence of tradition and modernity in the Japanese landscape.

Garden Image

Prompt	Output
Please look at this image and answer the following instruction in detail and specifically. What kind of style does this Japanese garden belong to? And what is the reason?	This garden belongs to the style called "pond-circuit style garden". A pond-circuit style garden refers to a garden with a pond or water scenery, also known as an aquatic garden or a garden with water. This style of garden is designed to maximize the beauty of nature, with the pond and water scenery being the central elements of the garden.

Slope Image

Prompt	Output
Please look at this image and answer the following question in detail. Please write a synopsis of a novel set in the location of this image.	The title of the novel set in the location of this image is "The Secret of the Stone Steps". The story begins with the protagonist escaping the city's noise and climbing an old stone staircase. There are lush green plants growing around the staircase, and there are buildings that give a sense of history lined up. As the protagonist climbs the staircase, they have a mysterious experience of traveling between the past and the future, and finally uncover the secret hidden under the staircase.

📚 Documentation

Model Components

Property	Details
Vision encoder	siglip-so400m-patch14-384 (428M parameters)
Projector	2-layer MLP (43M parameters)
LLM	llm-jp-3-7.2b-instruct (7.2B parameters)
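
As a quick check on the "8B" in the model name, the component sizes in the table can be added up. This is a rough sketch based on the rounded figures listed above, so the total is approximate.

```python
# Parameter counts as listed in the Model Components table (rounded figures).
COMPONENTS = {
    "vision_encoder": 428_000_000,  # siglip-so400m-patch14-384
    "projector": 43_000_000,        # 2-layer MLP
    "llm": 7_200_000_000,           # llm-jp-3-7.2b-instruct
}

total_params = sum(COMPONENTS.values())
llm_share = 100 * COMPONENTS["llm"] / total_params

print(f"total: {total_params / 1e9:.2f}B parameters")
print(f"LLM share: {llm_share:.1f}%")
```

The sum comes to roughly 7.67B parameters, with the language model accounting for the large majority.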

Training Details

Note: we skipped stage-1 training for Asagi-8B.

Dataset	Preprocess	Stage1	Stage2	Size
ROIS (Ours)	Synthesized	✓	✓	8.4M
Japanese image text pairs	Synthesized	✓	✓	4.4M
Wikipedia	Synthesized	✓	✓	2.5M
Open Images	Translated	✓	✓	680K
DCI	Translated	✓	✓	7K
CommonCatalog CC-BY	Translated	✓	✓	3.5M
LLaVA-Pretrain-JA		✓	✓	550K
STAIR Captions		✓	✓	410K
Flickr-JP		✓	✓	160K
YJ Captions		✓	✓	130K
Japanese Pascal		✓	✓	5K
ArtBench	Synthesized		✓	100K
GQA	Translated		✓	1.9M
VQA v2	Translated		✓	880K
A-OKVQA	Translated		✓	34K
OK-VQA	Translated		✓	18K
Japanese Visual Genome	Translated		✓	1.6M
PangeaInstruct			✓	93K

Note: ROIS (Ours) is a newly collected dataset crawled from the web specifically for this project. The dataset consists of image and raw text pairs, which are used to synthesize the training data.
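
To get a sense of scale, the per-stage totals can be tallied from the table. The sketch below copies the sizes as listed (in thousands of samples); because those figures are rounded, the totals are approximate. `stage_totals` is a hypothetical helper name.

```python
# Sizes in thousands of samples, copied from the training table.
# The second field records whether the dataset is ticked for stage 1;
# every dataset in the table is ticked for stage 2.
DATASETS = {
    "ROIS (Ours)": (8400, True),
    "Japanese image text pairs": (4400, True),
    "Wikipedia": (2500, True),
    "Open Images": (680, True),
    "DCI": (7, True),
    "CommonCatalog CC-BY": (3500, True),
    "LLaVA-Pretrain-JA": (550, True),
    "STAIR Captions": (410, True),
    "Flickr-JP": (160, True),
    "YJ Captions": (130, True),
    "Japanese Pascal": (5, True),
    "ArtBench": (100, False),
    "GQA": (1900, False),
    "VQA v2": (880, False),
    "A-OKVQA": (34, False),
    "OK-VQA": (18, False),
    "Japanese Visual Genome": (1600, False),
    "PangeaInstruct": (93, False),
}


def stage_totals(datasets=DATASETS):
    """Return (stage-1 total, stage-2 total) in thousands of samples."""
    stage1 = sum(size for size, in_stage1 in datasets.values() if in_stage1)
    stage2 = sum(size for size, _ in datasets.values())
    return stage1, stage2


stage1_k, stage2_k = stage_totals()
print(f"stage 1: {stage1_k / 1000:.1f}M, stage 2: {stage2_k / 1000:.1f}M")
```

Since stage-1 training was skipped for Asagi-8B, only the stage-2 total (about 25.4M samples by this tally) applies to this model.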

Evaluation

We evaluated our model using Heron-Bench, JA-VLM-Bench-in-the-Wild, and JA-VG-VQA-500. We used the eval-mm library for this evaluation.

Here, models with "†" are not trained with GPT-generated data. Bold numbers indicate the best performance among all models, and underlined numbers indicate the best performance among models not trained with GPT-generated data.

Model	LM Size	Heron-Bench (LLM (%))	JA-VLM-Bench-In-the-Wild (ROUGE-L)	JA-VLM-Bench-In-the-Wild (LLM (/5.0))	JA-VG-VQA-500 (ROUGE-L)	JA-VG-VQA-500 (LLM (/5.0))
Japanese InstructBLIP Alpha†	7B	14.0	20.8	2.42	-	-
Japanese Stable VLM†	7B	24.2	23.3	2.47	-	-
LLaVA-CALM2-SigLIP†	7B	43.3	47.2	3.15	17.4	3.21
Llama-3-EvoVLM-JP-v2	8B	39.3	41.4	2.92	23.5	2.96
VILA-jp	13B	57.2	52.3	3.69	16.2	3.62
Asagi-2B†	1.8B	44.7	48.8	3.26	53.7	3.69
Asagi-4B†	3.7B	49.3	49.6	3.38	55.6	3.78
Asagi-8B†	7.2B	54.7	49.4	3.45	56.43	3.84
Asagi-14B†	13B	55.8	50.8	3.44	56.8	3.84
GPT-4o	-	87.6	37.6	3.85	12.1	3.58
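
The bold and underline markings described above do not appear in this plain-text table, but they can be recomputed from the numbers. The following sketch copies the scores from the table and finds the best model per column, optionally restricted to models marked "†". It assumes higher is better for every metric; `best_models` is a hypothetical helper name.

```python
# Scores copied from the evaluation table; None marks a "-" entry.
# Column order: 0 Heron-Bench LLM (%), 1 JA-VLM-Bench-In-the-Wild ROUGE-L,
# 2 JA-VLM-Bench-In-the-Wild LLM (/5.0), 3 JA-VG-VQA-500 ROUGE-L,
# 4 JA-VG-VQA-500 LLM (/5.0).
SCORES = {
    "Japanese InstructBLIP Alpha†": (14.0, 20.8, 2.42, None, None),
    "Japanese Stable VLM†": (24.2, 23.3, 2.47, None, None),
    "LLaVA-CALM2-SigLIP†": (43.3, 47.2, 3.15, 17.4, 3.21),
    "Llama-3-EvoVLM-JP-v2": (39.3, 41.4, 2.92, 23.5, 2.96),
    "VILA-jp": (57.2, 52.3, 3.69, 16.2, 3.62),
    "Asagi-2B†": (44.7, 48.8, 3.26, 53.7, 3.69),
    "Asagi-4B†": (49.3, 49.6, 3.38, 55.6, 3.78),
    "Asagi-8B†": (54.7, 49.4, 3.45, 56.43, 3.84),
    "Asagi-14B†": (55.8, 50.8, 3.44, 56.8, 3.84),
    "GPT-4o": (87.6, 37.6, 3.85, 12.1, 3.58),
}


def best_models(scores, column, only_dagger=False):
    """Return (best score, sorted names achieving it) for one column."""
    candidates = {
        name: row[column]
        for name, row in scores.items()
        if row[column] is not None and (not only_dagger or name.endswith("†"))
    }
    top = max(candidates.values())
    return top, sorted(name for name, value in candidates.items() if value == top)
```

For example, `best_models(SCORES, 0)` gives the overall leader on Heron-Bench, while `best_models(SCORES, 0, only_dagger=True)` restricts the comparison to models not trained with GPT-generated data. Ties are returned as a list of all tied models.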

🔧 Technical Details

The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

📄 License

This project is licensed under the Apache-2.0 license.

Model Card Authors

Kohei Uehara
