🚀 Japanese Stable VLM
A vision-language instruction-following model that generates Japanese descriptions for input images, optionally conditioned on input text such as questions.
Please note: for commercial usage of this model, please see https://stability.ai/license
For Japanese inquiries regarding commercial use, please contact partners-jp@stability.ai.
🚀 Quick Start
This section shows how to use Japanese Stable VLM. The following Python code walks through the basic steps for generating a Japanese description of an input image.
import torch
from transformers import AutoTokenizer, AutoModelForVision2Seq, AutoImageProcessor
from PIL import Image
import requests
# Instruction templates for each supported task (kept in Japanese, as the model expects).
TASK2INSTRUCTION = {
    # "Describe the image in detail."
    "caption": "画像を詳細に述べてください。",
    # "Describe the image in detail, using the given words."
    "tag": "与えられた単語を使って、画像を詳細に述べてください。",
    # "Answer the question based on the given image."
    "vqa": "与えられた画像を下に、質問に答えてください。",
}
def build_prompt(task="caption", input=None, sep="\n\n### "):
    assert (
        task in TASK2INSTRUCTION
    ), f"Please choose from {list(TASK2INSTRUCTION.keys())}"
    if task in ["tag", "vqa"]:
        # "tag" and "vqa" require extra input (tag words or a question).
        assert input is not None, "Please fill in `input`!"
        if task == "tag" and isinstance(input, list):
            # Join tag words with the Japanese comma "、".
            input = "、".join(input)
    else:
        assert input is None, f"`{task}` mode doesn't accept an `input`"
    # System message: "Below is a combination of an instruction describing a task
    # and contextual input. Write a response that appropriately satisfies the request."
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]  # "instruction", "response"
    instruction = TASK2INSTRUCTION[task]
    msgs = [": \n" + instruction, ": \n"]
    if input:
        roles.insert(1, "入力")  # "input"
        msgs.insert(1, ": \n" + input)
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p
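# For reference, build_prompt(task="caption") returns (shown with visible "\n"):
# "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。\n\n### 指示: \n画像を詳細に述べてください。\n\n### 応答: \n"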
# Load the model, image processor, and tokenizer, and move the model to GPU if available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForVision2Seq.from_pretrained("stabilityai/japanese-stable-vlm", trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained("stabilityai/japanese-stable-vlm")
tokenizer = AutoTokenizer.from_pretrained("stabilityai/japanese-stable-vlm")
model.to(device)
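# Optional (an assumption, not specified in this card): load in half precision
# on GPU to reduce memory, e.g.
#   model = AutoModelForVision2Seq.from_pretrained(
#       "stabilityai/japanese-stable-vlm", trust_remote_code=True, torch_dtype=torch.float16
#   )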
# Download a sample image and build the "caption" prompt.
url = "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = build_prompt(task="caption")

# Preprocess the image, tokenize the prompt, and merge both into one input dict.
inputs = processor(images=image, return_tensors="pt")
text_encoding = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
inputs.update(text_encoding)
# Deterministic beam-search decoding (do_sample=False), up to 128 new tokens.
outputs = model.generate(
    **inputs.to(device, dtype=model.dtype),
    do_sample=False,
    num_beams=5,
    max_new_tokens=128,
    min_length=1,
    repetition_penalty=1.5,  # discourage repeated phrases
)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
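The same pipeline covers the `tag` and `vqa` tasks; only the prompt changes. A minimal sketch (the Japanese question and tag words below are invented examples, not from this model card):

# VQA: answer a question about the image (invented example question).
prompt = build_prompt(task="vqa", input="この写真には何が写っていますか？")  # "What is in this photo?"

# Tagging: describe the image using the given words (invented example tags).
prompt = build_prompt(task="tag", input=["猫", "ソファ"])  # "cat", "sofa"

# Then tokenize the new prompt and call model.generate() exactly as above.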
✨ Features
Japanese Stable VLM is a vision-language instruction-following model. It generates Japanese descriptions for input images, optionally conditioned on input text such as questions.
📚 Documentation
Model Details
Training
This model is a vision-language instruction-following model with the LLaVA 1.5 architecture. It uses [stabilityai/japanese-stablelm-instruct-gamma-7b](https://huggingface.co/stabilityai/japanese-stablelm-instruct-gamma-7b) as the language model and [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) as the image encoder. Training followed the two-stage LLaVA recipe: in the first stage, the MLP projection was trained from scratch; in the second stage, both the language model and the MLP projection were trained further.
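For intuition, here is a minimal sketch of the LLaVA-1.5-style wiring described above (the dimensions, patch count, and stand-in tensors are illustrative assumptions, not the actual implementation):

import torch
import torch.nn as nn

class LlavaStyleProjector(nn.Module):
    """Two-layer MLP (as in LLaVA 1.5) that maps image-encoder features into
    the language model's embedding space; trained from scratch in stage 1."""
    def __init__(self, vision_dim=1024, lm_dim=4096):  # assumed CLIP ViT-L/14 and 7B LM dims
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features):           # (batch, num_patches, vision_dim)
        return self.mlp(patch_features)           # (batch, num_patches, lm_dim)

# Projected image tokens are prepended to the text token embeddings, and the
# concatenated sequence is fed to the language model.
projector = LlavaStyleProjector()
patch_features = torch.randn(1, 256, 1024)        # stand-in for CLIP image features
text_embeddings = torch.randn(1, 32, 4096)        # stand-in for LM token embeddings
inputs_embeds = torch.cat([projector(patch_features), text_embeddings], dim=1)
print(inputs_embeds.shape)                        # torch.Size([1, 288, 4096])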
Training Dataset
The training dataset includes the following public datasets:
- [CC12M](https://github.com/google-research-datasets/conceptual-12m) with captions translated into Japanese
- MS-COCO with STAIR Captions
- [Japanese Visual Genome VQA dataset](https://github.com/yahoojapan/ja-vg-vqa)
Use and Limitations
Intended Use
This model is intended to be used by the open-source community in vision-language applications.
Limitations and bias
Despite data filtering, the training dataset may have contained offensive or inappropriate content. We recommend that users exercise reasonable caution when using this model in production systems. Do not use the model for any applications that may cause harm or distress to individuals or groups.
How to cite
@misc{JapaneseStableVLM,
    url    = {https://huggingface.co/stabilityai/japanese-stable-vlm},
    title  = {Japanese Stable VLM},
    author = {Shing, Makoto and Akiba, Takuya}
}
Contact
- For questions and comments about the model, please join Stable Community Japan.
- For future announcements and information about Stability AI models, research, and events, please follow https://twitter.com/StabilityAI_JP.
- For business and partnership inquiries, please contact partners-jp@stability.ai. For Japanese inquiries regarding business and partnerships, please contact sales-jp@stability.ai.
📄 License
This model is licensed under the STABILITY AI COMMUNITY LICENSE.
⚠️ Important Note
By clicking "Agree", you agree to the License Agreement and acknowledge Stability AI's Privacy Policy.