vsft-llava-1.5-7b-hf-trl: Open-Source Multimodal Model

Developed by HuggingFaceH4

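A multimodal vision-language model trained from LLaVA-1.5-7B with visual supervised fine-tuning (VSFT); it supports image understanding and dialogue generation.

Image-to-Text

Transformers

English #Multimodal dialogue #Visual instruction fine-tuning #Image question answering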

Downloads 65

Release Time : 4/11/2024

Model Overview

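This model is an open-source chatbot built on LLaMA/Vicuna and fine-tuned on GPT-generated multimodal instruction-following data. It can interpret image content and converse about it in natural language.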

Model Features

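Multi-image support

Accepts multiple images in a single prompt, enabling more complex multimodal understanding.

Instruction following

Instruction-tuned, so it can follow user instructions and give detailed, helpful answers.

Visual supervised fine-tuning

Trained with VSFT on 260k image-conversation pairs to strengthen visual understanding.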

Model Capabilities

Image content understanding

Multimodal dialogue generation

Visual question answering

Image captioning

Use Cases

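Education

Interpreting scientific diagrams

Helps students understand the labels and concepts in scientific diagrams.

Can identify the elements of a diagram and explain what they mean.

Content analysis

Image description

Generates detailed text descriptions of images for visually impaired users.

Provides accurate, detailed descriptions of image content.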

🚀 HuggingFaceH4/vsft-llava-1.5-7b-hf-trl

HuggingFaceH4/vsft-llava-1.5-7b-hf-trl is a multimodal language model that combines images and text and specializes in image-to-text generation tasks.

🚀 Quick Start

HuggingFaceH4/vsft-llava-1.5-7b-hf-trl is a vision-language model. It was created by running VSFT on the llava-hf/llava-1.5-7b-hf model with 260k image-conversation pairs from the HuggingFaceH4/llava-instruct-mix-vsft dataset.

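✨ Key Features

The model supports multi-image and multi-prompt generation, meaning you can pass multiple images in a single prompt.
Use the correct prompt template (USER: xxx\nASSISTANT:) and add the <image> token at each position where you want the model to look at an image.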
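
As a minimal sketch of the template described above, the helper below assembles a prompt with one <image> token per image. The name `build_prompt` and its parameters are hypothetical (they are not part of this model card or of transformers), and placing all image tokens before the question is an assumption. The system sentence is the one used in the usage examples in this card.

```python
SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)


def build_prompt(question: str, num_images: int = 1, system: str = SYSTEM) -> str:
    """Return a USER/ASSISTANT prompt with `num_images` <image> placeholders."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Assumption: every image token goes before the question, one per line.
    image_tokens = "<image>\n" * num_images
    return f"{system} USER: {image_tokens}{question}\nASSISTANT:"


prompt = build_prompt("What are these?")
print(prompt)
```

Pass one image to the processor or pipeline per <image> token in the prompt, for example `build_prompt("What differs between the two pictures?", num_images=2)` together with two images.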

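📚 Documentation

Model details

Attribute	Details
Model type	LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an autoregressive language model based on the transformer architecture.
Model date	The model was trained on April 11, 2024.
Example training script	Train your own VLM using the TRL example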

Model usage

Basic usage

# Using `pipeline`:
import requests
from PIL import Image
from transformers import pipeline

model_id = "HuggingFaceH4/vsft-llava-1.5-7b-hf-trl"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"

# Download the demo image and open it with PIL.
image = Image.open(requests.get(url, stream=True).raw)
# The prompt follows the USER/ASSISTANT template; <image> marks where the image is queried.
prompt = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)
# Example output:
# {"generated_text": "\nUSER: What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT: Lava"}
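
The generated text shown above echoes the user turn before the answer. The sketch below shows one way the reply could be isolated, assuming the text contains the ASSISTANT: marker from the prompt template. `extract_answer` is a hypothetical helper, not a transformers function.

```python
def extract_answer(generated_text: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the last `marker`, or the whole text if it is absent."""
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


# Sample string copied from the example output in this card.
sample = (
    "\nUSER: What does the label 15 represent? "
    "(1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT: Lava"
)
print(extract_answer(sample))  # prints "Lava"
```

Splitting on the last marker keeps the helper usable for multi-turn prompts, where only the final assistant turn is newly generated.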

Advanced usage

# Using pure `transformers`:
import requests
from PIL import Image

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "HuggingFaceH4/vsft-llava-1.5-7b-hf-trl"

prompt = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat are these?\nASSISTANT:"
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

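processor = AutoProcessor.from_pretrained(model_id)

# Fetch the image and build the model inputs; keyword arguments avoid relying on positional order.
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(text=prompt, images=raw_image, return_tensors='pt').to(0, torch.float16)

# Greedy decoding; the first two tokens are skipped when decoding the result.
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))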

Model optimization

4-bit quantization with the `bitsandbytes` library

First install bitsandbytes (pip install bitsandbytes) and make sure you have access to a CUDA-compatible GPU device. Then change the snippet above as follows:

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   load_in_4bit=True
)
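
To give a rough sense of why 4-bit loading helps, the back-of-the-envelope calculation below compares weight storage at 16 and 4 bits per parameter. It assumes roughly 7 billion parameters (read off the "7b" in the model name) and ignores the vision encoder, quantization metadata, activations, and the KV cache, so real memory use will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: about 7B parameters, per the model name

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"float16: ~{fp16_gb:.1f} GB, 4-bit: ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink to about a quarter of their float16 size, which is the main reason to try quantization on GPUs with limited memory.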

Use Flash-Attention 2 to further speed up generation

First install flash-attn; see the original Flash Attention repository for installation instructions. Then change the snippet above as follows:

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True,
+   use_flash_attention_2=True
).to(0)
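
Recent transformers releases accept `attn_implementation="flash_attention_2"` in `from_pretrained` and treat `use_flash_attention_2` as deprecated. The sketch below gathers the loading options from this section into one keyword dictionary so the choice is made in one place. `build_load_kwargs` is a hypothetical helper, and the keywords should be checked against the installed transformers version.

```python
def build_load_kwargs(dtype, flash_attention: bool = False, four_bit: bool = False) -> dict:
    """Collect keyword arguments for LlavaForConditionalGeneration.from_pretrained."""
    kwargs = {"torch_dtype": dtype, "low_cpu_mem_usage": True}
    if flash_attention:
        # Newer spelling of the Flash-Attention 2 switch (assumption: recent transformers).
        kwargs["attn_implementation"] = "flash_attention_2"
    if four_bit:
        # Same flag as in the bitsandbytes snippet in this card.
        kwargs["load_in_4bit"] = True
    return kwargs


# "float16" is a placeholder so the sketch runs without torch; pass torch.float16 in practice.
print(build_load_kwargs("float16", flash_attention=True))
```

With torch and transformers installed, the result would be passed as `LlavaForConditionalGeneration.from_pretrained(model_id, **build_load_kwargs(torch.float16, flash_attention=True))`. As in the snippets above, `.to(0)` follows the non-quantized load and is omitted for the 4-bit one.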

📄 License

Citation

@misc{vonwerra2022trl,
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang},
  title = {TRL: Transformer Reinforcement Learning},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/trl}}
}