japanese-instructblip-alpha: an open-source vision-language model that generates Japanese descriptions of images

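Japanese InstructBLIP Alpha

Developed by stabilityai

A vision-language instruction-following model that generates Japanese descriptions for an input image and, optionally, input text.

Image-to-Text

Transformers

Japanese | License: Other | #Japanese image captioning #Vision-language instruction following #Multimodal AI

Downloads: 141

Released: 8/15/2023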

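Model overview

Japanese InstructBLIP Alpha is a vision-language model built on the InstructBLIP architecture and tuned for Japanese. Given an image and an optional text prompt, it generates a descriptive response in Japanese.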

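Model features

Japanese optimization

Tuned specifically for Japanese, so descriptions are generated directly in Japanese.

Multimodal input

Accepts an image together with text, which allows more flexible interaction.

Instruction following

Understands user instructions and generates output that follows them.

Lightweight training

Only the Q-Former is trained; the vision encoder and the LLM stay frozen.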

Model capabilities

Image captioning

Visual question answering

Multimodal understanding

Japanese text generation

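Use cases

Content generation

Image captioning

Generates a detailed Japanese description of the input image.

For example, given a photo of Tokyo Skytree, the model outputs '桜と東京スカイツリー' ("cherry blossoms and Tokyo Skytree").

Assistive tools

Visual question answering

Answers specific questions about the content of an image.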

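🚀 Japanese InstructBLIP Alpha

Japanese InstructBLIP Alpha is a vision-language instruction-following model that generates Japanese descriptions for an input image and, optionally, input text such as a question.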

🚀 Quick start

First, install the additional dependencies listed in requirements.txt:

pip install sentencepiece einops
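
Before running the quick-start code, it can help to confirm that its imports resolve. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` list are illustrative assumptions, not part of the model's repository.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Top-level import names used by the quick-start code (PIL is provided by Pillow).
# This list is an assumption derived from the imports shown on this page.
REQUIRED = ["torch", "transformers", "sentencepiece", "einops", "PIL", "requests"]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All quick-start dependencies are importable.")
```

Anything reported as missing would need to be installed with pip before the model can be loaded.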

import torch
from transformers import LlamaTokenizer, AutoModelForVision2Seq, BlipImageProcessor
from PIL import Image
import requests

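# Helper that formats the input as an instruction-style prompt.
# With an empty `prompt` it yields an image-captioning request; otherwise the
# text is added as an extra "入力" (input) section, e.g. a question about the image.
def build_prompt(prompt="", sep="\n\n### "):
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]
    user_query = "与えられた画像について、詳細に述べてください。"
    msgs = [": \n" + user_query, ": "]
    if prompt:
        # Insert the optional user text between the instruction and the response slot.
        roles.insert(1, "入力")
        msgs.insert(1, ": \n" + prompt)
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p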

# load model
model = AutoModelForVision2Seq.from_pretrained("stabilityai/japanese-instructblip-alpha", trust_remote_code=True)
processor = BlipImageProcessor.from_pretrained("stabilityai/japanese-instructblip-alpha")
tokenizer = LlamaTokenizer.from_pretrained("novelai/nerdstash-tokenizer-v1", additional_special_tokens=['▁▁'])
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# prepare inputs
url = "https://images.unsplash.com/photo-1582538885592-e70a5d7ab3d3?ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D&auto=format&fit=crop&w=1770&q=80"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
prompt = "" # input empty string for image captioning. You can also input questions as prompts 
prompt = build_prompt(prompt)
inputs = processor(images=image, return_tensors="pt")
text_encoding = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
text_encoding["qformer_input_ids"] = text_encoding["input_ids"].clone()
text_encoding["qformer_attention_mask"] = text_encoding["attention_mask"].clone()
inputs.update(text_encoding)

# generate
outputs = model.generate(
    **inputs.to(device, dtype=model.dtype),
    num_beams=5,
    max_new_tokens=32,
    min_length=1,
)
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()
print(generated_text)
# 桜と東京スカイツリー
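
The quick-start code passes an empty prompt, which produces a captioning request. For visual question answering, the question goes in as the prompt. The sketch below repeats the `build_prompt` helper so it runs on its own and prints the resulting prompt text; the example question is a hypothetical one chosen for illustration, and no model is loaded.

```python
def build_prompt(prompt="", sep="\n\n### "):
    """Format an instruction-style prompt; `prompt` is an optional question."""
    sys_msg = "以下は、タスクを説明する指示と、文脈のある入力の組み合わせです。要求を適切に満たす応答を書きなさい。"
    p = sys_msg
    roles = ["指示", "応答"]
    user_query = "与えられた画像について、詳細に述べてください。"
    msgs = [": \n" + user_query, ": "]
    if prompt:
        # The question becomes an extra "入力" (input) section.
        roles.insert(1, "入力")
        msgs.insert(1, ": \n" + prompt)
    for role, msg in zip(roles, msgs):
        p += sep + role + msg
    return p


# Hypothetical question: "Are the cherry blossoms in bloom?"
question = "桜は咲いていますか？"

caption_prompt = build_prompt()       # captioning: instruction + response slot
vqa_prompt = build_prompt(question)   # VQA: instruction + input + response slot

print(vqa_prompt)
```

Both prompts end with the open "応答" (response) slot that the model is asked to complete; only the VQA prompt contains the "入力" section. In the quick-start code, setting `prompt = question` before calling `build_prompt(prompt)` would send the question to the model.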

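✨ Key features

Vision-language processing: takes image and text inputs and generates Japanese descriptions of the image.
Instruction following: generates output according to the instruction given in the input.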

📦 Installation

As described in the Quick start section, install the additional dependencies listed in requirements.txt:

pip install sentencepiece einops

📚 Documentation

Model details

Property	Details
Developer	Stability AI
Model type	InstructBLIP
Language	Japanese
License	JAPANESE STABLELM RESEARCH LICENSE AGREEMENT

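Training

Japanese InstructBLIP Alpha uses the InstructBLIP architecture, which consists of three components: a frozen vision image encoder, a Q-Former, and a frozen large language model (LLM). The vision encoder and the Q-Former were initialized from Salesforce/instructblip-vicuna-7b. The frozen LLM is Japanese-StableLM-Instruct-Alpha-7B. During training, only the Q-Former was updated.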
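
The "train only the Q-Former" setup can be illustrated in PyTorch by disabling gradients for every other parameter. This is a sketch on a toy module, not the authors' training code: the class `ToyInstructBlip`, its attribute names (`vision_encoder`, `qformer`, `language_model`), the layer sizes, and the learning rate are all assumptions made for illustration.

```python
import torch
from torch import nn


class ToyInstructBlip(nn.Module):
    """Tiny stand-in with three components; names and sizes are hypothetical."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)   # stands in for the frozen vision encoder
        self.qformer = nn.Linear(8, 4)          # stands in for the trainable Q-Former
        self.language_model = nn.Linear(4, 4)   # stands in for the frozen LLM


def freeze_all_but_qformer(model):
    """Enable gradients only for parameters under the `qformer` submodule."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith("qformer.")
    return [name for name, param in model.named_parameters() if param.requires_grad]


toy = ToyInstructBlip()
trainable = freeze_all_but_qformer(toy)

# Only the still-trainable parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in toy.parameters() if p.requires_grad], lr=1e-4
)
print(trainable)
```

Freezing the two large components means the optimizer tracks state only for the Q-Former weights, which is what makes this style of training comparatively light.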

Training datasets

The training data includes the following public datasets:

CC12M, with captions translated into Japanese.
MS-COCO, paired with STAIR Captions.
Japanese Visual Genome VQA dataset.

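Use and limitations

Intended use

This model is intended for use by the open-source community in chat-like applications, in adherence with the research license.

Limitations and bias

Although the datasets above help steer the base language model toward a "safer" distribution of text, not all bias and toxicity can be mitigated through fine-tuning. Users should be aware that such issues may appear in generated responses. Do not treat model outputs as a substitute for human judgment or as a source of truth, and use the model with care.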

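🔧 Technical details

Japanese InstructBLIP Alpha uses the InstructBLIP architecture. Only the Q-Former was trained, using several public datasets, so that the model can generate Japanese descriptions from image and text inputs.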

📄 License

This model is released under the JAPANESE STABLELM RESEARCH LICENSE AGREEMENT.

📚 Citation

@misc{JapaneseInstructBLIPAlpha, 
    url    = {https://huggingface.co/stabilityai/japanese-instructblip-alpha}, 
    title  = {Japanese InstructBLIP Alpha}, 
    author = {Shing, Makoto and Akiba, Takuya}
}

📚 References

@misc{dai2023instructblip,
    title         = {InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning}, 
    author        = {Wenliang Dai and Junnan Li and Dongxu Li and Anthony Meng Huat Tiong and Junqi Zhao and Weisheng Wang and Boyang Li and Pascale Fung and Steven Hoi},
    year          = {2023},
    eprint        = {2305.06500},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CV}
}