🚀 Japanese Stable VLM
A vision-language instruction-following model that generates Japanese descriptions for input images, optionally conditioned on input text such as questions.
Please note: for commercial usage of this model, please see https://stability.ai/license
For Japanese inquiries regarding commercial use, please contact partners-jp@stability.ai.
🚀 Quick Start
This section shows how to use Japanese Stable VLM. The following Python code walks through the basic steps for generating a Japanese description of an input image.
import torch
from transformers import AutoTokenizer, AutoModelForVision2Seq, AutoImageProcessor
from PIL import Image
import requests
# Instruction templates for each supported task (kept in Japanese, as the model expects).
TASK2INSTRUCTION = {
    # "Describe the image in detail."
    "caption": "画像を詳細に述べてください。",
    # "Describe the image in detail, using the given words."
    "tag": "与えられた単語を使って、画像を詳細に述べてください。",
    # "Answer the question based on the given image."
    "vqa": "与えられた画像を下に、質問に答えてください。",
}
def build_prompt(task="caption", input=None, sep="\n\n### "):
    assert (
        task in TASK2INSTRUCTION
    ), f"Please choose from {list(TASK2INSTRUCTION.keys())}"
    if task in ["tag", "vqa"]:
        # "tag" and "vqa" require extra input (tag words or a question).
        assert input is not None, "Please fill in `input`!"
        if task == "tag" and isinstance(input, list):
            # Join tag words with the Japanese comma "、".
            input = "、".join(input)
    else:
        assert input is None, f"`{task}` mode doesn't accept an `input`"
    # System message: "Below is a combination of an instruction describing a task
    # and contextual input. Write a response that appropriately satisfies the request."
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]  # "instruction", "response"
    instruction = TASK2INSTRUCTION[task]
    msgs = [": \n" + instruction, ": \n"]
    if input:
        roles.insert(1, "入力")  # "input"
        msgs.insert(1, ": \n" + input)
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p
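# For reference, build_prompt(task="caption") returns (shown with visible "\n"):
# "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n### 指示: \n画像を詳細に述べてください。\n\n### 応答: \n"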
# Load the model, image processor, and tokenizer, and move the model to GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForVision2Seq.from_pretrained("stabilityai/japanese-stable-vlm", trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained("stabilityai/japanese-stable-vlm")
tokenizer = AutoTokenizer.from_pretrained("stabilityai/japanese-stable-vlm")
model.to(device)
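# Optional (an assumption, not specified in this card): load in half precision
# on GPU to reduce memory, e.g.
#   model = AutoModelForVision2Seq.from_pretrained(
#       "stabilityai/japanese-stable-vlm", trust_remote_code=True, torch_dtype=torch.float16
#   )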
# Download a sample image and build the "caption" prompt.
url = "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = build_prompt(task="caption")

# Preprocess the image, tokenize the prompt, and merge both into one input dict.
inputs = processor(images=image, return_tensors="pt")
text_encoding = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
inputs.update(text_encoding)
# Deterministic beam-search decoding (do_sample=False), up to 128 new tokens.
outputs = model.generate(
    **inputs.to(device, dtype=model.dtype),
    do_sample=False,
    num_beams=5,
    max_new_tokens=128,
    min_length=1,
    repetition_penalty=1.5,  # discourage repeated phrases
)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
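The same pipeline covers the `tag` and `vqa` tasks; only the prompt changes. A minimal sketch (the Japanese question and tag words below are invented examples, not from this model card):

# VQA: answer a question about the image (invented example question).
prompt = build_prompt(task="vqa", input="この写真には何が写っていますか？")  # "What is in this photo?"

# Tagging: describe the image using the given words (invented example tags).
prompt = build_prompt(task="tag", input=["猫", "ソファ"])  # "cat", "sofa"

# Then tokenize the new prompt and call model.generate() exactly as above.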
✨ Features
Japanese Stable VLM is a vision-language instruction-following model. It generates Japanese descriptions for input images, optionally conditioned on input text such as questions.
📚 Documentation
Model Details
Training
This model is a vision-language instruction-following model with the LLaVA 1.5 architecture. It uses [stabilityai/japanese-stablelm-instruct-gamma-7b](https://huggingface.co/stabilityai/japanese-stablelm-instruct-gamma-7b) as the language model and [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) as the image encoder. Training followed the two-stage LLaVA recipe: in the first stage, the MLP projection was trained from scratch; in the second stage, both the language model and the MLP projection were trained further.
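For intuition, here is a minimal sketch of the LLaVA-1.5-style wiring described above (the dimensions, patch count, and stand-in tensors are illustrative assumptions, not the actual implementation):

import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Two-layer MLP (as in LLaVA 1.5) that maps image-encoder features into
    the language model's embedding space; trained from scratch in stage 1."""
    def __init__(self, vision_dim=1024, lm_dim=4096):  # assumed CLIP ViT-L/14 and 7B LM dims
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features):           # (batch, num_patches, vision_dim)
        return self.mlp(patch_features)           # (batch, num_patches, lm_dim)

# Projected image tokens are prepended to the text token embeddings, and the
# concatenated sequence is fed to the language model.
projector = LlavaStyleProjector()
patch_features = torch.randn(1, 256, 1024)        # stand-in for CLIP image features
text_embeddings = torch.randn(1, 32, 4096)        # stand-in for LM token embeddings
inputs_embeds = torch.cat([projector(patch_features), text_embeddings], dim=1)
print(inputs_embeds.shape)                        # torch.Size([1, 288, 4096])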
Training Dataset
The training dataset includes the following public datasets:
- [CC12M](https://github.com/google-research-datasets/conceptual-12m) with captions translated into Japanese
- MS-COCO with STAIR Captions
- [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa)
Use and Limitations
Intended Use
This model is intended to be used by the open-source community in vision-language applications.
Limitations and bias
Despite data filtering, the training dataset may have contained offensive or inappropriate content. We recommend that users exercise reasonable caution when using this model in production systems. Do not use the model for any applications that may cause harm or distress to individuals or groups.
How to cite
@misc{JapaneseStableVLM,
    url    = {https://huggingface.co/stabilityai/japanese-stable-vlm},
    title  = {Japanese Stable VLM},
    author = {Shing, Makoto and Akiba, Takuya}
}
Contact
- For questions and comments about the model, please join Stable Community Japan.
- For future announcements and information about Stability AI models, research, and events, please follow https://twitter.com/StabilityAI_JP.
- For business and partnership inquiries, please contact partners-jp@stability.ai. For Japanese inquiries regarding business and partnerships, please contact sales-jp@stability.ai.
📄 License
This model is licensed under the STABILITY AI COMMUNITY LICENSE.
⚠️ Important Note
By clicking "Agree", you agree to the License Agreement and acknowledge Stability AI's Privacy Policy.