Asagi-14B Open-Source Japanese Vision and Language Model - Free to Use for Japanese Content Understanding

Asagi-14B

Developed by MIL-UT

Asagi-14B is a large-scale Japanese Vision and Language Model (VLM) trained on a wide range of Japanese datasets, integrating diverse data sources.

Image-to-Text

Transformers

Japanese

Open Source License: Apache-2.0

#Japanese Visual Language Model #Multimodal Image Understanding #High-Precision Japanese Descriptions

Release Time: 1/9/2025

Model Overview

Asagi-14B is a vision-language model combining a visual encoder and a large language model, primarily used for image-text-to-text tasks with support for Japanese.

Model Features

Multimodal Capability

Combines a visual encoder and language model to understand and generate text descriptions based on images.

Japanese Optimization

Specifically trained and optimized for Japanese, suitable for vision and language tasks in Japanese contexts.

Data Synthesis

A significant portion of the training data is synthesized by models, and none of the models used for synthesis has a license that restricts the use of its outputs.

Model Capabilities

Image Caption Generation

Visual Question Answering

Multimodal Reasoning

Japanese Text Generation

Use Cases

Image Understanding and Description

Scene Recognition

Recognizes scenes in images and generates detailed descriptions.

For example, identifies that a photo was taken at a famous location in Tokyo and describes its details.

Object Recognition

Recognizes objects in images and generates descriptions.

For example, identifies buildings, natural landscapes, etc., in the image.

Visual Question Answering

Image Content Q&A

Answers detailed questions about image content.

For example, answers questions about the shooting location of the image or objects within it.

🚀 Asagi-14B: A Large-scale Japanese Vision & Language Model

This repository offers Asagi-14B, a large-scale Japanese Vision & Language Model (VLM). Asagi-14B is trained on a vast Japanese dataset with diverse data sources, enabling it to handle image-text-to-text tasks effectively.

🚀 Quick Start

Requirements

transformers==4.45.1
accelerate==0.34.2
torch==2.4.0
torchvision==0.19.0
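
As a quick sanity check before loading the model, the pinned versions above can be compared with what is installed. The following is a minimal sketch using only the standard library; the helper name `check_pins` is hypothetical and not part of the repository, and treating local build tags (for example a `+cu121` suffix) as matching is an assumption.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Requirements list above.
PINNED = {
    "transformers": "4.45.1",
    "accelerate": "0.34.2",
    "torch": "2.4.0",
    "torchvision": "0.19.0",
}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Assumption: a local build tag such as "2.4.0+cu121" counts as "2.4.0".
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means every pinned package is installed at the listed version.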

How to use

import requests
import torch
import transformers
from PIL import Image
from transformers import AutoModel, AutoProcessor, GenerationConfig

transformers.set_seed(42)
model_path = "MIL-UT/Asagi-14B"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

generation_config = GenerationConfig(
    do_sample=True,
    num_beams=5,
    max_new_tokens=256,
    temperature=0.7,
    repetition_penalty=1.5
)

prompt = ("The following is an instruction describing a task. Write a response that appropriately fulfills the request.\n\n"
            "### Instruction:\n<image>\nPlease look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail.\n\n### Response:\n")

# sample image
sample_image_url = "https://raw.githubusercontent.com/uehara-mech/uehara-mech.github.io/refs/heads/master/images/shibuya.jpg"
image = Image.open(requests.get(sample_image_url, stream=True).raw)

inputs = processor(
    text=prompt, images=image, return_tensors="pt"
)
inputs_text = processor.tokenizer(prompt, return_tensors="pt")
inputs['input_ids'] = inputs_text['input_ids']
inputs['attention_mask'] = inputs_text['attention_mask']
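# Cast float32 tensors (typically the image tensors) to the model's dtype.
for k, v in inputs.items():
    if v.dtype == torch.float32:
        inputs[k] = v.to(model.dtype)
# Move every tensor to the model's device, dropping token_type_ids.
inputs = {k: inputs[k].to(model.device) for k in inputs if k != "token_type_ids"}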

generate_ids = model.generate(
    **inputs,
    generation_config=generation_config
)
generated_text = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

# do not print the prompt
if "<image>" in prompt:
    prompt = prompt.replace("<image>", " ")
generated_text = generated_text.replace(prompt, "")

print(f"Generated text: {generated_text}")

# >>> Generated text:  This photo captures the night view of Shibuya Scramble Crossing. Shibuya Scramble Crossing is a famous intersection located in Shibuya Ward, Tokyo, known as a place where many people come and go. The photo shows high-rise buildings, neon signs, streetlights, etc., allowing you to feel the vitality and energy of the city.
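
To ask a different question about an image, only the instruction part of the prompt needs to change. The sketch below factors the template from the example above into two small helpers. The names `build_prompt` and `strip_prompt` are hypothetical and not part of the repository, and the template text is copied exactly as it appears in the example.

```python
PREAMBLE = (
    "The following is an instruction describing a task. "
    "Write a response that appropriately fulfills the request.\n\n"
)


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the template used by the example above."""
    return f"{PREAMBLE}### Instruction:\n<image>\n{instruction}\n\n### Response:\n"


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the echoed prompt, mirroring the post-processing in the example."""
    echoed = prompt.replace("<image>", " ")
    return generated_text.replace(echoed, "")


if __name__ == "__main__":
    prompt = build_prompt("Please explain in detail about the shrine in this photo.")
    print(prompt)
```

The string returned by `build_prompt` can be passed as `text=` to the processor in place of the hard-coded `prompt`, and `strip_prompt` applied to the decoded output.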

Example

Prompt	Output
Please look at this image and answer the following question in detail and specifically. Where was this photo taken? Also, please describe the content of the image in detail.	This photo captures the night view of Shibuya Scramble Crossing. Shibuya Scramble Crossing is a famous intersection located in Shibuya Ward, Tokyo, known as a place where many people come and go. The photo shows high-rise buildings, neon signs, streetlights, etc., allowing you to feel the vitality and energy of the city.

More Examples

Prompt	Output
Please look at this image and answer the following instruction in detail and specifically. Please tell me in detail about the content of this photo.	This photo captures an urban landscape with the Rainbow Bridge in Tokyo as the backdrop. The bridge is painted white and gray and elegantly arches over the water. The sky is a clear blue with a few clouds floating. In the distance, you can see the cityscape with high-rise buildings lined up.

Prompt	Output
Please look at this image and answer the following question in detail and specifically. Please evaluate this photo.	This photo is very beautiful. The contrast between the red and white of Tokyo Tower is vivid, and the contrast with the blue sky and white clouds is impressive. The surrounding lush green trees and streetlights create an urban atmosphere, and the overall composition is well-balanced.

Prompt	Output
Please look at this image and answer the following question in detail and specifically. Please explain in detail about the shrine in this photo.	The shrine in this photo reflects the traditional Japanese architectural style. It is characterized by white walls and orange pillars, and the roof is covered with green tiles. At the entrance of the shrine, there is a vermilion torii gate, which is a symbolic gate indicating the entrance to a sacred place. Shime-nawa ropes are decorated on both sides of the torii gate, creating a sacred atmosphere. In front of the shrine, there are stone pavements, which are used to show respect when worshipers step into the sacred place. This shrine, although located in a modern urban area, plays an important role in passing on traditional Japanese culture and beliefs to the present.

Prompt	Output
Please look at this image and answer the following instruction in detail and specifically. What type of garden does this belong to among Japanese gardens? And what is the reason?	This belongs to a type of garden called a pond-circuit style garden among Japanese gardens. A pond-circuit style garden is designed with a pond at the center and allows visitors to stroll around it. The purpose of this type of garden is to maximize the beauty of nature and provide tranquility and peace to visitors.

Prompt	Output
Please look at this image and answer the following question in detail. Please write the plot of a novel set in the location of this image.	The title of the novel set in the location of this image is "The Secret of the Stairs". The story unfolds on an old stone staircase in a quiet alley away from the city's noise. The protagonist has a mysterious experience where the past and the present intersect while climbing the stairs. Along the way, there are mossy stone walls that give a sense of history and traditional wooden buildings lined up. Every time the protagonist climbs the stairs, they touch the memories of past events and people and face their own past. And when they reach the top of the stairs, the hidden secret is revealed.

✨ Features

This repository provides Asagi-14B, a large-scale Japanese Vision & Language Model (VLM). Asagi-14B has been trained on an extensive Japanese dataset, incorporating a diverse range of data sources.

A significant portion of the training data is synthesized using models such as the Japanese large language model (CALM3-22B-Chat) and the English Vision & Language Model (Phi3.5-vision-instruct).

Importantly, we do not use LLMs that restrict the usage of their outputs in the license terms (e.g., GPT-4) to synthesize the training data.

Property	Details
Model Type	Asagi-14B, a large-scale Japanese Vision & Language Model (VLM)
Training Data	A large part is synthesized using models such as CALM3-22B-Chat and Phi3.5-vision-instruct. Sources include ROIS, Japanese image-text pairs, Wikipedia, and other datasets.
Model components	Vision encoder: siglip-so400m-patch14-384 (428M parameters); Projector: 2-layer MLP (64M parameters); LLM: llm-jp-3-13b-instruct (13B parameters)
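
For a rough sense of the overall model size, the component figures in the table can simply be added up. This is a minimal sketch that uses the rounded counts exactly as listed, so the total is approximate.

```python
# Rounded component sizes as listed in the table above.
COMPONENT_PARAMS = {
    "vision_encoder": 428_000_000,  # siglip-so400m-patch14-384
    "projector": 64_000_000,        # 2-layer MLP
    "llm": 13_000_000_000,          # llm-jp-3-13b-instruct
}

total = sum(COMPONENT_PARAMS.values())
print(f"Approximate total: {total / 1e9:.2f}B parameters")  # 13.49B
for name, count in COMPONENT_PARAMS.items():
    print(f"{name}: {count / total:.1%} of the total")
```

The language model accounts for the overwhelming majority of the parameters; the vision encoder and projector together add roughly 0.5B.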

🔧 Technical Details

Training Data

Dataset	Preprocess	Stage1	Stage2	Size
ROIS (Ours)	Synthesized	✓	✓	8.4M
Japanese image text pairs	Synthesized	✓	✓	4.4M
Wikipedia	Synthesized	✓	✓	2.5M
Open Images	Translated	✓	✓	680K
DCI	Translated	✓	✓	7K
CommonCatalog CC-BY	Translated	✓	✓	3.5M
LLaVA-Pretrain-JA		✓	✓	550K
STAIR Captions		✓	✓	410K
Flickr-JP		✓	✓	160K
YJ Captions		✓	✓	130K
Japanese Pascal		✓	✓	5K
ArtBench	Synthesized		✓	100K
GQA	Translated		✓	1.9M
VQA v2	Translated		✓	880K
A-OKVQA	Translated		✓	34K
OK-VQA	Translated		✓	18K
Japanese Visual Genome	Translated		✓	1.6M
PangeaInstruct			✓	93K

Note: ROIS (Ours) is a newly collected dataset crawled from the web specifically for this project. The dataset consists of image and raw text pairs, which are used to synthesize the training data.
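
The per-stage data volume can be estimated by summing the Size column over the datasets checked for each stage. The sketch below assumes that Size counts training samples (the table does not state the unit) and works with the rounded figures as listed, so the totals are approximate.

```python
# (dataset, used_in_stage1, used_in_stage2, size_in_thousands), from the table above.
DATASETS = [
    ("ROIS (Ours)", True, True, 8_400),
    ("Japanese image text pairs", True, True, 4_400),
    ("Wikipedia", True, True, 2_500),
    ("Open Images", True, True, 680),
    ("DCI", True, True, 7),
    ("CommonCatalog CC-BY", True, True, 3_500),
    ("LLaVA-Pretrain-JA", True, True, 550),
    ("STAIR Captions", True, True, 410),
    ("Flickr-JP", True, True, 160),
    ("YJ Captions", True, True, 130),
    ("Japanese Pascal", True, True, 5),
    ("ArtBench", False, True, 100),
    ("GQA", False, True, 1_900),
    ("VQA v2", False, True, 880),
    ("A-OKVQA", False, True, 34),
    ("OK-VQA", False, True, 18),
    ("Japanese Visual Genome", False, True, 1_600),
    ("PangeaInstruct", False, True, 93),
]


def stage_total(rows, stage):
    """Sum sizes (in thousands) of the datasets used in the given stage (1 or 2)."""
    index = 1 if stage == 1 else 2
    return sum(row[3] for row in rows if row[index])


for stage in (1, 2):
    print(f"Stage {stage}: ~{stage_total(DATASETS, stage) / 1000:.1f}M")
# Stage 1: ~20.7M
# Stage 2: ~25.4M
```

Under these assumptions, Stage 2 reuses all Stage 1 data and adds about 4.6M entries from the seven datasets that appear only in Stage 2.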

📚 Documentation

Evaluation

We evaluated our model using Heron-Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500. We used the eval-mm library for this evaluation.

Here, models with "†" are not trained with GPT-generated data. Bold numbers indicate the best performance among all models, and underlined numbers indicate the best performance among models not trained with GPT-generated data.

Model	LM Size	Heron-Bench (LLM (%))	JA-VLM-Bench-In-the-Wild (ROUGE-L)	JA-VLM-Bench-In-the-Wild (LLM (/5.0))	JA-VG-VQA-500 (ROUGE-L)	JA-VG-VQA-500 (LLM (/5.0))
Japanese InstructBLIP Alpha†	7B	14.0	20.8	2.42	-	-
Japanese Stable VLM†	7B	24.2	23.3	2.47	-	-
LLaVA-CALM2-SigLIP†	7B	43.3	47.2	3.15	17.4	3.21
Llama-3-EvoVLM-JP-v2	8B	39.3	41.4	2.92	23.5	2.96
VILA-jp	13B	57.2	52.3	3.69	16.2	3.62
Asagi-2B†	1.8B	44.7	48.8	3.26	53.7	3.69
Asagi-4B†	3.7B	49.3	49.6	3.38	55.6	3.78
Asagi-8B†	7.2B	54.7	49.4	3.45	56.43	3.84
Asagi-14B†	13B	55.8	50.8	3.44	56.8	3.84
GPT-4o	-	87.6	37.6	3.85	12.1	3.58
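
Two of the columns above report ROUGE-L, which scores a generated answer by its longest common subsequence (LCS) with a reference answer. The following is a minimal sketch of the standard LCS-based F1 form of the metric, not the eval-mm implementation. It splits on whitespace, which is an assumption that does not hold for Japanese text, where a proper tokenizer would be needed. It returns a value in the range 0 to 1, while the table appears to report scores on a 0 to 100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("a photo of Tokyo Tower", "a photo of Tokyo Tower"))  # 1.0
```

Because the score depends on token overlap with the reference, a fluent but differently worded answer can score low, which is one reason the table also reports LLM-based scores.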

📄 License

The model is licensed under Apache-2.0.

⚠️ Important Note

The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.

Model Card Authors

Kohei Uehara
