Llama-3.1-8B-Dragonfly-v2: an open-source multimodal model for joint image and text understanding

Llama 3.1 8B Dragonfly V2

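Developed by togethercomputer

Dragonfly is a multimodal vision-language model trained by instruction tuning on top of Llama 3.1. It takes images and text together as input and generates text.

Image-to-text

PyTorch

English #multimodal vision-language #high-resolution image understanding #art image analysis

Downloads: 113

Release date: 10/10/2024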

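Model overview

The model is intended mainly for research on vision-language tasks. It accepts combined image and text input and generates relevant text descriptions or answers.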

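Model features

Multi-resolution image processing

Adopts the LLaVA-UHD approach to high-resolution image processing to capture fine visual detail.

Instruction tuning

Instruction-tuned on top of Llama 3.1 to improve understanding of complex vision-language tasks.

Multimodal fusion

Combines CLIP visual features with the Llama language model so that image and text information interact closely.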
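
The page does not spell out how a high-resolution image is sliced, so the following is only a rough sketch of the general idea behind multi-resolution encoding: the image is resized to a multiple of a fixed patch size and tiled into equal crops, each of which can be encoded separately alongside a low-resolution global view. The helper name `grid_crop_boxes` and the 336-pixel patch size (borrowed from the name of the CLIP checkpoint `clip-vit-large-patch14-336`) are assumptions for illustration, not the actual Dragonfly or LLaVA-UHD implementation.

```python
import math


def grid_crop_boxes(width, height, patch=336):
    """Return the padded size and (left, top, right, bottom) crop boxes.

    Hypothetical helper: the image is assumed to be resized so that both
    sides are multiples of `patch`, which makes every crop the same size.
    """
    cols = max(1, math.ceil(width / patch))
    rows = max(1, math.ceil(height / patch))
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * patch, r * patch, (c + 1) * patch, (r + 1) * patch))
    return (cols * patch, rows * patch), boxes


resized, boxes = grid_crop_boxes(1000, 600)
print(resized, len(boxes))
```

Each box can be passed to `PIL.Image.Image.crop` after resizing the image to the returned size.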

Model capabilities

Image content understanding

Visual question answering

Image captioning

Multimodal reasoning

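Use cases

Art and creative work

Artwork analysis

Analyzes the content, style, and creative intent of artworks.

Can identify artistic styles and generate insightful analysis.

Education

Visually assisted learning

Explains complex concepts through images.

Provides intuitive, easy-to-follow multimodal explanations.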

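🚀 Dragonfly Model Card

Dragonfly is a multimodal vision-language model trained by instruction tuning on Llama 3.1. It is intended primarily for research on large vision-language models, and its target users are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

⚠️ Important

Use of this model is permitted under the Llama 3.1 Community License Agreement.

✨ Key features

Dragonfly is an autoregressive vision-language model based on the transformer architecture. It takes images and text as input and generates natural-language responses.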

Attribute	Details
Developed by	Together AI
Model type	Autoregressive vision-language model based on the transformer architecture
License	Llama 3.1 Community License Agreement
Finetuned from model	Llama 3.1
Repository	https://github.com/togethercomputer/Dragonfly
Paper	https://arxiv.org/abs/2406.00977

📦 Installation

Create a conda environment and install the required packages:

conda env create -f environment.yml
conda activate dragonfly_env

Install Flash Attention:

pip install flash-attn --no-build-isolation

Finally, install the package from the current directory in editable mode:

pip install --upgrade -e .
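
After installation, a quick check along the following lines can confirm that the packages are importable in the active environment. This is a minimal sketch using only the standard library. The import names `torch`, `transformers`, and `dragonfly` are taken from the usage code on this page, and `flash_attn` is assumed to be the import name of the `flash-attn` package. The helper `missing_packages` is hypothetical.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names are assumptions based on the install commands and usage code.
required = ["torch", "transformers", "flash_attn", "dragonfly"]
missing = missing_packages(required)
print("missing:", missing if missing else "none")
```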

💻 Usage example

Basic usage

Question: What is so funny about this image?

[Image: Monalisa Dog]

Load the required packages.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed

Instantiate the tokenizer, processor, and model.

device = torch.device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")

model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
model = model.to(torch.bfloat16)
model = model.to(device)

Load and process the image.

image = Image.open("./test_images/skateboard.png")
image = image.convert("RGB")
images = [image]
# images = [None]  # if you do not want to pass any images

text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
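
The prompt string above wraps the user question in the chat header tokens and ends with an open assistant header so that the model continues from there. A small helper can build the same string for other questions. This is a sketch, and `build_prompt` is a hypothetical name rather than part of the Dragonfly codebase.

```python
def build_prompt(question):
    """Wrap a user question in the same header tokens as the prompt shown above."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


text_prompt = build_prompt("What is so funny about this image?")
print(repr(text_prompt))
```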

Finally, generate a response from the model.

temperature = 0

with torch.inference_mode():
    generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.encode("<|eot_id|>"), do_sample=temperature > 0, temperature=temperature, use_cache=True)

generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
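
In the call above, `do_sample` is derived from `temperature`, so a temperature of 0 means greedy decoding. Some versions of `transformers` may warn when a temperature is passed while sampling is disabled, so one option is to build the keyword arguments conditionally. The helper below is a hypothetical sketch, not part of the Dragonfly code. It leaves out `eos_token_id`, which still has to be supplied as in the original call.

```python
def generation_kwargs(temperature=0.0, max_new_tokens=1024):
    """Build keyword arguments for `model.generate`; greedy when temperature is 0."""
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "use_cache": True,
        "do_sample": temperature > 0,
    }
    # Only pass a temperature when sampling is actually enabled.
    if temperature > 0:
        kwargs["temperature"] = temperature
    return kwargs


print(generation_kwargs())
print(generation_kwargs(temperature=0.7))
```

These arguments could then be unpacked into the call, for example `model.generate(**inputs, **generation_kwargs(0.7), eos_token_id=...)`.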

Example response:

The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci.
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer's expectations and familiarity with the
original painting. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a
humorous effect that is likely to elicit laughter<|eot_id|>
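
Because decoding uses `skip_special_tokens=False`, the text keeps markers such as `<|eot_id|>`. The following sketch strips them to leave only the reply. It assumes the decoded string may or may not echo the prompt, and `extract_response` is a hypothetical helper name.

```python
EOT = "<|eot_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_response(decoded):
    """Return only the assistant's reply from a decoded generation string."""
    # Keep what follows the last assistant header, if the prompt is echoed back.
    _, _, reply = decoded.rpartition(ASSISTANT_HEADER)
    # Cut at the first end-of-turn marker and trim surrounding whitespace.
    return reply.split(EOT, 1)[0].strip()


sample = (
    "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nHello there<|eot_id|>"
)
print(extract_response(sample))
```

With the usage code above, this would be applied to each element, for example `extract_response(generation_text[0])`.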

📚 Documentation

For training details and evaluation, see the "Implementation" and "Results" sections of the paper.

🏆 Credits

We acknowledge the following resources, which contributed significantly to the development of Dragonfly.

Meta Llama 3.1: We used the Llama 3.1 model as the foundational language model.
CLIP: The vision backbone is OpenAI's CLIP model.
Our codebase builds on the following two codebases:
- Otter: A Multi-Modal Model with In-Context Instruction Tuning
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

📄 License

This model is provided under the Llama 3.1 Community License Agreement.

📖 BibTeX

@misc{thapa2024dragonfly,
      title={Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models}, 
      author={Rahul Thapa and Kezhen Chen and Ian Covert and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
      year={2024},
      eprint={2406.00977},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}