BLIP-2 Open-Source Vision-Language Model: Free Image-to-Text Generation

Blip2 Test

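Developed by advaitadasein

BLIP-2 is a vision-language model built on OPT-2.7b. It keeps the image encoder and the large language model frozen and trains only a Querying Transformer (Q-Former) to generate text from images.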

Image-to-Text

Transformers

Language: English
License: MIT
Tags: #image-captioning #visual-question-answering #multimodal-pretraining

Downloads: 18

Released: 9/15/2023

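Model Overview

BLIP-2 is a vision-language model that can perform tasks such as image captioning and visual question answering. It connects an image encoder to a large language model through a Querying Transformer, which lets the language model condition its output on visual input.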

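Model Features

Frozen pre-trained models

The image encoder and the large language model stay frozen; only the lightweight Querying Transformer is trained, which makes training more efficient.

Cross-modal understanding

The Querying Transformer bridges the vision and language modalities to convert images into text.

Versatile applications

Supports image captioning, visual question answering, and chat-like interaction.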

Model Capabilities

Image captioning

Visual question answering (VQA)

Image-grounded conversation

Cross-modal understanding

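Use Cases

Content Generation

Automatic image annotation

Generates detailed text descriptions for images

Can support visually impaired users or content management systems

Intelligent Interaction

Visual question answering system

Answers natural-language questions about image content

Can serve as an intelligent assistant in settings such as education and retail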

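🚀 BLIP-2, OPT-2.7b, pre-trained only

This BLIP-2 model uses OPT-2.7b, a large language model with 2.7 billion parameters. It was introduced by Li et al. in the paper BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models and first released in this repository.

Disclaimer: The team that released BLIP-2 did not write a model card for this model, so this model card was written by the Hugging Face team.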

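✨ Key Features

Suitable for a range of vision tasks, including image captioning, visual question answering, and chat-like conversation.
Combines an image encoder, a Querying Transformer (Q-Former), and a large language model; the query embeddings bridge the gap between the embedding spaces of the image encoder and the large language model.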

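📚 Detailed Documentation

Model Description

BLIP-2 consists of 3 models: a CLIP-like image encoder, a Querying Transformer (Q-Former), and a large language model.

The authors initialize the weights of the image encoder and the large language model from pre-trained checkpoints and keep them frozen while training the Querying Transformer. The Querying Transformer is a BERT-like Transformer encoder that maps a set of "query tokens" to query embeddings, which bridge the gap between the embedding spaces of the image encoder and the large language model.

The model's goal is simply to predict the next text token, given the query embeddings and the previous text.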
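
The objective described above can be written as a standard autoregressive language-modeling loss. The notation is an assumption made for illustration and is not taken from the model card: $Z$ stands for the query embeddings produced by the Querying Transformer, $y_1, \dots, y_T$ for the text tokens, and $\theta$ for the trainable parameters (the image encoder and the language model are frozen).

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid Z,\, y_{<t}\right)
```

Each term scores one text token given the query embeddings and all earlier tokens, which matches the "query embeddings and previous text" conditioning stated above.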

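[Figure: model architecture]

This makes the model usable for tasks such as:

Image captioning
Visual question answering (VQA)
Chat-like conversation, by feeding the image and the previous conversation to the model as a prompt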
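
The chat-like use in the last item amounts to flattening the earlier turns into a single text prompt. The following is a minimal sketch of that step only. The helper `build_chat_prompt` and the "Question: ... Answer: ..." template are assumptions chosen for illustration; this model card does not specify a prompt format.

```python
from typing import List, Optional, Tuple


def build_chat_prompt(
    history: List[Tuple[str, str]], question: Optional[str] = None
) -> str:
    """Flatten earlier (question, answer) turns plus a new question into one prompt.

    The "Question: ... Answer: ..." template is an assumption for illustration,
    not a format confirmed by this model card.
    """
    turns = [f"Question: {q.strip()} Answer: {a.strip()}" for q, a in history]
    if question is not None:
        # Leave the final answer open so the model completes it.
        turns.append(f"Question: {question.strip()} Answer:")
    return " ".join(turns)


# Hypothetical conversation state: one earlier turn, then a follow-up question.
history = [("how many dogs are in the picture?", "one")]
prompt = build_chat_prompt(history, "what is the dog doing?")
print(prompt)
```

The resulting string would be passed as the text argument next to the image, in the same position as `question` in the usage examples in this card. After each generation, appending the new (question, answer) pair to `history` keeps the conversation context for the next turn.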

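Direct Use and Downstream Use

You can use the raw model for conditional text generation given an image and optional text. See the model hub to look for fine-tuned versions on a task that interests you.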

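Bias, Risks, Limitations, and Ethical Considerations

BLIP2-OPT uses off-the-shelf OPT as its language model, so it inherits the same risks and limitations described in Meta's model card:

"Like other large language models for which the diversity (or lack thereof) of training data induces downstream impact on the quality of our model, OPT-175B has limitations in terms of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and hallucination. In general, OPT-175B is not immune from the plethora of issues that plague modern large language models."

BLIP2 is fine-tuned on image-text datasets (e.g. LAION) collected from the internet. As a result, the model itself may generate inappropriate content or replicate biases inherent in the underlying data.

BLIP2 has not been tested in real-world applications and should not be deployed directly in any application. Researchers should first carefully assess the safety and fairness of the model in the specific context in which it will be deployed.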

💻 Usage Examples

Basic Usage

For code examples, refer to the documentation.

Running the model on CPU

import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor (image preprocessing plus tokenizer) and the model.
# With no device arguments, the weights stay on the CPU.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Download the demo image and convert it to RGB.
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# Pass the image together with a text prompt for conditional generation.
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

# Generate token IDs, then decode them back to text.
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

Advanced Usage

Running the model on GPU in full precision

# pip install accelerate
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

Running the model on GPU in half precision (`float16`)

# pip install accelerate
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())

Running the model on GPU in 8-bit precision (`int8`)

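# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import BitsAndBytesConfig, Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
# Recent transformers releases deprecate passing load_in_8bit=True directly,
# so the 8-bit setting is supplied through a quantization config instead.
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", quantization_config=BitsAndBytesConfig(load_in_8bit=True), device_map="auto")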

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
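
To make the trade-off between the three precision settings concrete, the sketch below estimates the storage needed for the language model's weights alone. It is a back-of-the-envelope calculation under stated assumptions: it uses the 2.7 billion parameter figure given in this card, counts 4, 2, and 1 bytes per parameter, and ignores the image encoder, the Querying Transformer, activations, and any layers that an 8-bit loader keeps in higher precision. Real memory use will be higher.

```python
# Rough weight-storage estimate for the language model only (assumption:
# every parameter is stored at the stated width, nothing else is counted).
PARAMS = 2.7e9  # OPT-2.7b parameter count as stated in this card

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, dtype):.1f} GB")
```

Under these assumptions the estimate halves with each step down in precision, which is the main reason to prefer `float16` or `int8` loading on GPUs with limited memory.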