mblip-mt0-xl: an open-source multilingual vision-language model for image captioning and visual question answering in 96 languages


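Developed by Gregor

mBLIP is a multilingual vision-language model based on the BLIP-2 architecture. It supports image captioning and visual question answering in 96 languages.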

Image-to-Text

Transformers

Multilingual | License: MIT | #multilingual visual question answering #zero-shot image captioning #cross-modal alignment

Downloads: 374

Released: 7/10/2023

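Model Overview

mBLIP is a BLIP-2 model consisting of a Vision Transformer (ViT), a Query-Transformer (Q-Former), and a large language model (LLM). The vision components are re-aligned to a multilingual LLM (mt0-xl) using a multilingual task mixture, and the resulting model supports image captioning and visual question answering.

Model Features

Multilingual support

Supports image understanding and text generation tasks in 96 languages

Efficient alignment

Re-aligns the vision and language components using a multilingual task mixture

Zero-shot capability

Can perform conditional text generation in a zero-shot setting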

Model Capabilities

Image-to-text

Multilingual image captioning

Visual question answering

Multilingual understanding

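Use Cases

Content generation

Multilingual image captioning

Generates descriptions of an image in different languages

Captions can be generated in 96 languages

Question answering systems

Multilingual visual question answering

Answers questions about the content of an image

Questions and answers are supported in 96 languages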

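🚀 mBLIP mT0-XL

mBLIP mT0-XL is a model for multilingual vision tasks. It is based on the BLIP-2 architecture and can perform tasks such as image captioning and visual question answering in 96 languages.

🚀 Quick Start

This model can be used for tasks such as image captioning and visual question answering in 96 languages. To use the raw model for zero-shot conditional text generation, or to fine-tune it for downstream applications, see our code repository.

✨ Key Features

Multilingual support: mBLIP can perform tasks such as image captioning and visual question answering in 96 languages.
Architecture: based on BLIP-2, consisting of a Vision Transformer (ViT), a Query-Transformer (Q-Former), and a large language model (LLM).
Flexible use: the raw model can be used directly for zero-shot inference, or fine-tuned for a specific task.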

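📦 Installation

The source documentation gives no dedicated installation steps. The usage examples below import requests, PIL, torch, and transformers; the GPU examples additionally note `pip install accelerate`, and the int8 example notes `pip install accelerate bitsandbytes`.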

💻 Usage Examples

Basic Usage

Running the model on CPU

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor (image preprocessing + tokenizer) and the model weights
processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")
model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl")

# Download a demo image and convert it to RGB
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# The text prompt tells the model what to do and in which language to answer
question = "Describe the image in German."
inputs = processor(raw_image, question, return_tensors="pt")

# Generate token ids and decode them back to text
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
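
The steps in the CPU example can be wrapped in a small helper so that one loaded model serves several prompts. This is a minimal sketch, not part of the original model card: the function name `generate_text` and its `max_new_tokens` default are illustrative choices, and it assumes the inputs and the model are on the same device, as in the CPU example.

```python
def generate_text(model, processor, image, prompt, max_new_tokens=40):
    """Run one image/prompt pair through an already loaded model.

    `generate_text` is a hypothetical helper name; `model` and `processor`
    are expected to behave like the objects created in the CPU example.
    """
    # Preprocess the image and tokenize the prompt in one call
    inputs = processor(image, prompt, return_tensors="pt")
    # Cap the output length so a runaway generation cannot loop for long
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Turn the first generated sequence back into plain text
    return processor.decode(out[0], skip_special_tokens=True)


# Usage with the objects from the CPU example:
# for prompt in ("Describe the image in German.", "Describe the image in Spanish."):
#     print(generate_text(model, processor, raw_image, prompt))
```

Passing the model and processor as arguments keeps the helper free of global state, which also makes it easy to exercise with stand-in objects.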

Advanced Usage

Running the model on GPU in full precision

# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")
# device_map="auto" lets accelerate place the weights on the available device(s)
model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "Describe the image in German."
# Move the preprocessed inputs to the GPU
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

Running the model on GPU in half precision (`bfloat16`)

# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")
# Load the weights directly as bfloat16
model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl", torch_dtype=torch.bfloat16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "Describe the image in German."
# Move the inputs to the GPU and cast the floating-point tensors to bfloat16
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.bfloat16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
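
The example above uses `bfloat16` rather than standard `float16`. The following sketch, which uses only the Python standard library and is not taken from the model card, illustrates the numerical difference between the two 16-bit formats: `float16` has a small exponent range and overflows above 65504, while `bfloat16` keeps the exponent range of `float32` at the cost of precision. The helper names `to_fp16` and `to_bf16` are illustrative, and `to_bf16` simulates the format by truncating a `float32` (real hardware typically rounds instead).

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16(x):
    """Round-trip x through IEEE half precision; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the fp16 range
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x):
    """Simulate bfloat16 by keeping the top 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Truncation (not rounding): drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


for value in (1.0, 1000.5, 70000.0):
    print(f"{value}: fp16={to_fp16(value)} bf16={to_bf16(value)}")
```

Values past `FP16_MAX` become infinite in `float16` but stay finite, if coarser, in `bfloat16`.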

Running the model on GPU in 8-bit precision (`int8`)

⚠️ Important

The results in the paper use int8 only for the weights of the large language model (LLM), whereas this code loads all weights as int8. We found that this gives slightly worse results; HuggingFace does not currently support int8 for only part of a model.

# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Gregor/mblip-mt0-xl")
# load_in_8bit=True quantizes the weights to int8 via bitsandbytes
model = Blip2ForConditionalGeneration.from_pretrained("Gregor/mblip-mt0-xl", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "Describe the image in German."
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.bfloat16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
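
The three GPU variants differ mainly in how many bytes each weight occupies. The sketch below is a back-of-the-envelope estimate under stated assumptions: `EXAMPLE_PARAMS` is a hypothetical round number, not the parameter count of this model, and the estimate covers weights only (no activations, no framework overhead, and no modules that an 8-bit loader may keep in higher precision).

```python
# Storage cost per weight for the precisions used in the examples above
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Hypothetical parameter count, chosen only to illustrate the arithmetic
EXAMPLE_PARAMS = 4_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(EXAMPLE_PARAMS, dtype):.1f} GiB")
```

Halving the bytes per weight halves this lower bound, which is the main reason to prefer `bfloat16` or `int8` on memory-constrained GPUs.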

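📚 Documentation

Model Description

mBLIP is a BLIP-2 based model consisting of three sub-models: a Vision Transformer (ViT), a Query-Transformer (Q-Former), and a large language model (LLM).

The Q-Former and ViT are both initialized from an English BLIP-2 checkpoint (blip2-flan-t5-xl) and then re-aligned to the multilingual LLM (mt0-xl) using a multilingual task mixture.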
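
To make the data flow between the three sub-models concrete, the sketch below traces tensor shapes through a BLIP-2 style pipeline: image patches from the ViT, a fixed set of query tokens from the Q-Former, and a projection into the LLM embedding space. All default sizes are assumptions based on commonly cited BLIP-2 and mt0-xl settings; none of them is stated in this document, and `trace_shapes` is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Shapes:
    image_tokens: tuple  # (sequence length, width) produced by the ViT
    query_tokens: tuple  # (sequence length, width) produced by the Q-Former
    llm_prefix: tuple    # (sequence length, width) fed to the LLM


def trace_shapes(image_size=224, patch_size=14, vit_width=1408,
                 num_queries=32, qformer_width=768, llm_width=2048):
    """Trace shapes only; every default value here is an assumption."""
    patches = (image_size // patch_size) ** 2
    image_tokens = (patches + 1, vit_width)      # +1 for the class token
    # The Q-Former cross-attends to the image tokens with learned queries,
    # so its output length is fixed regardless of the number of patches.
    query_tokens = (num_queries, qformer_width)
    # A linear projection maps each query output to the LLM hidden size.
    llm_prefix = (num_queries, llm_width)
    return Shapes(image_tokens, query_tokens, llm_prefix)


print(trace_shapes())
```

The point of the design is visible in the shapes: the LLM receives a short, fixed-length visual prefix, which is why the vision side can be re-aligned to a different LLM by retraining a comparatively small part of the model.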

[Figure: mBLIP architecture]

This allows the model to be used for the following tasks:

Image captioning
Visual question answering (VQA)

in 96 languages.

Supported Languages

mBLIP was trained on the following 96 languages: af, am, ar, az, be, bg, bn, ca, ceb, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fil, fr, ga, gd, gl, gu, ha, hi, ht, hu, hy, id, ig, is, it, iw, ja, jv, ka, kk, km, kn, ko, ku, ky, lb, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, no, ny, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th, tr, uk, ur, uz, vi, xh, yi, yo, zh, zu
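
A caller may want to check a language code against this list before building a prompt. The sketch below is one way to do that; the function `caption_prompt` and the `LANGUAGE_NAMES` table are hypothetical, the table is a deliberately tiny subset, and the prompt wording simply follows the "Describe the image in German." pattern from the usage examples.

```python
# The 96 training language codes listed above
SUPPORTED_CODES = set("""
af am ar az be bg bn ca ceb cs cy da de el en eo es et eu fa fi fil fr ga gd
gl gu ha hi ht hu hy id ig is it iw ja jv ka kk km kn ko ku ky lb lo lt lv mg
mi mk ml mn mr ms mt my ne nl no ny pa pl ps pt ro ru sd si sk sl sm sn so sq
sr st su sv sw ta te tg th tr uk ur uz vi xh yi yo zh zu
""".split())

# Hypothetical subset of code-to-name mappings; extend as needed
LANGUAGE_NAMES = {"de": "German", "en": "English", "es": "Spanish"}


def caption_prompt(code):
    """Build a captioning prompt for a supported language code."""
    if code not in SUPPORTED_CODES:
        raise ValueError(f"{code!r} is not one of the training languages")
    if code not in LANGUAGE_NAMES:
        raise KeyError(f"no English name registered for {code!r}")
    return f"Describe the image in {LANGUAGE_NAMES[code]}."


print(caption_prompt("de"))
```

Keeping the supported set separate from the name table makes the two failure modes distinct: a code the model was never trained on versus a code this sketch simply has no name for.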

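Direct Use and Downstream Use

You can use the raw model for conditional text generation given an image and a prompt text in a zero-shot setting, or fine-tune it for downstream applications. When fine-tuning, we strongly recommend applying low-rank adaptation (LoRA) to the large language model (LLM) and using bf16 as the data type, because standard fp16 can cause NaN losses.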
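
The LoRA recommendation is easier to appreciate with a quick parameter count. LoRA freezes a weight matrix W of shape (d_out, d_in) and trains only a low-rank update B·A, where A has shape (rank, d_in) and B has shape (d_out, rank). The sketch below compares the two counts; the layer size and rank are hypothetical values chosen for illustration, not the configuration used by the authors.

```python
def full_params(d_out, d_in):
    """Trainable weights when the whole matrix W is fine-tuned."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable weights for the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only
d_out, d_in, rank = 2048, 2048, 8

full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(f"full fine-tuning: {full} weights")
print(f"LoRA (rank {rank}): {lora} weights ({100 * lora / full:.2f}% of full)")
```

Because only the small factors receive gradients and optimizer state, the memory needed for fine-tuning the LLM drops sharply compared with updating every weight.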

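Bias, Risks, Limitations, and Ethical Considerations

While mBLIP can in theory work with up to 100 languages, in practice we expect the best results when it is prompted in high-resource languages such as English, German, or Spanish.

mBLIP inherits the risks, limitations, and biases of the models used to initialize it. The model has not been tested in real-world applications and should not be deployed directly in any application. Researchers should first carefully assess the safety and fairness of the model in the specific context in which it will be deployed.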


📄 License

This project is released under the MIT License.

📖 Citation

If you use our model, please cite the following:

@article{geigle2023mblip,
  author       = {Gregor Geigle and
                  Abhay Jain and
                  Radu Timofte and
                  Goran Glava\v{s}},
  title        = {mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs},
  journal      = {arXiv},
  volume       = {abs/2307.06930},
  year         = {2023},
  url          = {https://arxiv.org/abs/2307.06930},
  eprinttype   = {arXiv},
  eprint       = {2307.06930},
}