vsft-llava-1.5-7b-hf-trl: Open-Source Multimodal Model for Image Understanding and Dialogue Generation

Vsft Llava 1.5 7b Hf Trl

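Developed by HuggingFaceH4

A multimodal vision-language model obtained by visual supervised fine-tuning (VSFT) of LLaVA-1.5-7B. It supports image understanding and dialogue generation.

Image-to-Text

Transformers

English #multimodal dialogue #visual instruction tuning #image question answering

Downloads: 65

Released: April 11, 2024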

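Model overview

This model is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It can understand image content and carry on a natural-language conversation about it.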

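Model features

Multi-image support

Handles multiple images in a single prompt, enabling more complex multimodal understanding.

Instruction following

Instruction-tuned to follow user instructions and give detailed, helpful answers.

Visual supervised fine-tuning

Trained with VSFT on 260k image and conversation pairs to strengthen visual understanding.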

Model capabilities

Image content understanding

Multimodal dialogue generation

Visual question answering

Image caption generation

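Use cases

Education

Explaining scientific diagrams

Helps students understand the labels and concepts in a scientific diagram.

Identifies the elements of the diagram and explains what they mean.

Content analysis

Image description

Generates detailed text descriptions of images for visually impaired users.

Provides accurate and detailed descriptions of image content.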

🚀 HuggingFaceH4/vsft-llava-1.5-7b-hf-trl Vision-Language Model

HuggingFaceH4/vsft-llava-1.5-7b-hf-trl is a vision-language model obtained by applying VSFT to the llava-hf/llava-1.5-7b-hf model, using 260k image and conversation pairs from the HuggingFaceH4/llava-instruct-mix-vsft dataset.

Check out our Spaces demo!

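🚀 Quick start

This model supports multi-image and multi-prompt generation, meaning you can pass multiple images in a prompt. Make sure to follow the correct prompt template (USER: xxx\nASSISTANT:) and add the <image> token wherever you want to query an image.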

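✨ Key features

Multimodal processing: supports multi-image and multi-prompt generation.
Prompt template: prompts must follow the USER: xxx\nASSISTANT: template and use the <image> token to query images.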
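The template is easy to get subtly wrong (a missing newline or a missing <image> token), so a small helper can assemble it. The following is a minimal sketch, not part of the model card: `build_prompt`, `SYSTEM_PROMPT`, and `num_images` are hypothetical names, the system sentence is copied from the usage examples on this page, and placing each <image> token on its own line before the question is an assumption for the multi-image case.

```python
# Hypothetical helper (not from the model card) for the USER/ASSISTANT template.
SYSTEM_PROMPT = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(question: str, num_images: int = 1, system: str = SYSTEM_PROMPT) -> str:
    """Format a single-turn prompt with one <image> token per image."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Assumption: each image placeholder sits on its own line before the question.
    image_tokens = "<image>\n" * num_images
    return f"{system} USER: {image_tokens}{question}\nASSISTANT:"


print(build_prompt("What are these?"))
```

With the default of one image, the result matches the prompt strings used in the usage examples; pass `num_images=2` when two images accompany the prompt.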

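📚 Documentation

Model details

Attribute	Details
Model type	LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the Transformer architecture.
Model date	The model was trained on April 11, 2024.
Example training script	Train your own VLM with our TRL example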

How to use the model

💻 Usage examples

Basic usage

Using pipeline:

from transformers import pipeline
from PIL import Image
import requests

model_id = "HuggingFaceH4/vsft-llava-1.5-7b-hf-trl"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"

# Download the demo image and open it with PIL.
image = Image.open(requests.get(url, stream=True).raw)
# The prompt follows the USER/ASSISTANT template; <image> marks where the image is queried.
prompt = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
# Output reported in the model card:
# {"generated_text": "\nUSER: What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT: Lava"}

Advanced usage

Using pure transformers:

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "HuggingFaceH4/vsft-llava-1.5-7b-hf-trl"

prompt = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

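# Load the model in half precision and move it to GPU 0.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
).to(0)

processor = AutoProcessor.from_pretrained(model_id)

raw_image = Image.open(requests.get(image_file, stream=True).raw)
# Pass text and images by keyword so the call does not depend on positional argument order.
inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(0, torch.float16)

# Greedy decoding; [2:] skips the first two tokens of the sequence before decoding.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))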
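The generated text echoes the prompt before the answer: the pipeline output shown in the basic example ends with "ASSISTANT: Lava". When only the reply is needed, it can be split off after the last ASSISTANT: marker. This is a minimal sketch; `extract_answer` is a hypothetical helper, not part of transformers, and it assumes the answer itself does not contain the marker.

```python
def extract_answer(generated_text: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last marker, or the whole text if no marker is present."""
    # rpartition yields ('', '', text) when the marker is missing, so tail is the full text.
    _, _, tail = generated_text.rpartition(marker)
    return tail.strip()


# Sample string copied from the pipeline output shown in the basic example.
sample = (
    "\nUSER: What does the label 15 represent? "
    "(1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT: Lava"
)
print(extract_answer(sample))  # prints: Lava
```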

Model optimization

4-bit quantization

4-bit quantization through the bitsandbytes library:

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,  # added relative to the previous example: load the weights in 4-bit
)

First install bitsandbytes (pip install bitsandbytes) and make sure a CUDA-capable GPU is available.
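Back-of-the-envelope arithmetic gives a feel for what 4-bit loading saves. This is a rough sketch only: it assumes the nominal 7 billion parameters suggested by "7b" in the model name, and it counts weight storage alone, ignoring quantization metadata, any layers kept in higher precision, activations, and the KV cache. `weight_memory_gib` and `NOMINAL_PARAMS` are hypothetical names.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NOMINAL_PARAMS = 7e9  # assumption: taken from "7b" in the model name

fp16_gib = weight_memory_gib(NOMINAL_PARAMS, 16)
int4_gib = weight_memory_gib(NOMINAL_PARAMS, 4)
print(f"float16: {fp16_gib:.1f} GiB, 4-bit: {int4_gib:.1f} GiB")
```

Under these assumptions the weights take roughly 13 GiB in float16 and roughly 3.3 GiB in 4-bit, a fourfold reduction; actual GPU memory use will be higher.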

Using Flash-Attention 2

Use Flash-Attention 2 to further speed up generation:

model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    attn_implementation="flash_attention_2",  # current spelling of the deprecated use_flash_attention_2=True
).to(0)

First install flash-attn; see the original Flash Attention repository for installation instructions.
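Because flash-attn is an optional dependency, a script can check for it before requesting it. The following is a sketch under stated assumptions: `pick_attn_implementation` is a hypothetical helper, it assumes a transformers version that accepts the `attn_implementation` argument, and it only checks that the package is importable, not that the GPU supports it.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if the flash_attn package is installed, else "eager"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "eager"


attn = pick_attn_implementation()
print(attn)
# Hypothetical usage:
# LlavaForConditionalGeneration.from_pretrained(model_id, attn_implementation=attn, ...)
```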

📄 License

Llama 2 is licensed under the LLAMA 2 Community License. Copyright is held by Meta Platforms, Inc.

🔖 Citation

@misc{vonwerra2022trl,
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang},
  title = {TRL: Transformer Reinforcement Learning},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/trl}}
}