Llama-3.1-8B-Dragonfly-v2: An Open-Source Multimodal Model for Joint Image and Text Understanding

Llama 3.1 8B Dragonfly V2

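Developed by togethercomputer

Dragonfly is a multimodal vision-language model instruction-tuned from Llama 3.1. It takes images and text as joint input and generates text in response.

Image-to-Text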

PyTorch

English #multimodal vision-language #high-resolution image understanding #art image analysis

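Downloads: 113

Released: 10/10/2024

Model Overview

The model is intended mainly for research on vision-language tasks. It accepts combined image and text input and generates a text description or answer.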

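Model Features

Multi-resolution image processing

Uses the LLaVA-UHD high-resolution image processing scheme to capture fine visual detail.

Instruction tuning

Instruction-tuned from Llama 3.1 to improve performance on complex vision-language tasks.

Multimodal fusion

Combines CLIP visual features with the Llama language model so that image and text information can interact.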
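
To make the idea of multi-resolution processing concrete, the sketch below splits a large image into fixed-size crop boxes that a vision encoder could process one at a time. It is an illustrative assumption only: the tile size of 336 is taken from the CLIP checkpoint name used later in this card (`clip-vit-large-patch14-336`), the function name `tile_boxes` is hypothetical, and the simple grid shown here is not the actual LLaVA-UHD slicing strategy or the Dragonfly implementation.

```python
import math


def tile_boxes(width, height, tile=336):
    """Return (left, upper, right, lower) crop boxes that cover the image in a grid.

    Hypothetical helper for illustration; edge tiles are clipped to the image size.
    """
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * tile
            upper = r * tile
            right = min(left + tile, width)
            lower = min(upper + tile, height)
            boxes.append((left, upper, right, lower))
    return boxes


# A 1000x700 image needs 3 columns and 3 rows of 336-pixel tiles.
boxes = tile_boxes(1000, 700)
print(len(boxes))
```

Each box uses the same coordinate order as `PIL.Image.Image.crop`, so a crop could be obtained with `image.crop(box)`.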

Model Capabilities

Image content understanding

Visual question answering

Image captioning

Multimodal reasoning

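Use Cases

Art and creative work

Artwork analysis

Analyze the content, style, and intent of a work of art.

Identifies the artistic style and produces a written analysis of the work.

Education

Visually assisted learning

Use images to help explain complex concepts.

Provides intuitive explanations that combine image and text.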

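🚀 Dragonfly Model Card

Dragonfly is a multimodal vision-language model trained by instruction tuning on Llama 3.1. It is intended mainly for research on large vision-language models, and its target users are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

✨ Key Features

Multimodal fusion: accepts both image and text input and lets information flow across the two modalities.
Transformer-based architecture: an autoregressive model that generates text token by token.
Instruction tuning: instruction-tuned on top of Llama 3.1 to improve performance on instruction-following tasks.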

📦 Installation

Create a Conda environment and install the required packages

conda env create -f environment.yml
conda activate dragonfly_env

Install Flash Attention

pip install flash-attn --no-build-isolation

As the final step, run the following command from the repository root

pip install --upgrade -e .
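
After the install steps, a quick check can confirm that the packages imported later in this card are visible to the active environment. The sketch below is not part of the Dragonfly repository. The helper name `missing_packages` is hypothetical, and the package list is assumed from the install commands and import statements in this card.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level import names assumed from this card's install steps and imports.
required = ["torch", "PIL", "transformers", "flash_attn", "dragonfly"]
missing = missing_packages(required)
print("missing:", missing if missing else "none")
```

If `dragonfly` is reported missing, the editable install (`pip install --upgrade -e .`) was probably run outside the repository root or in a different environment.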

💻 Usage Example

Basic usage

Import the required packages

import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed

Instantiate the tokenizer, processor, and model

device = torch.device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")

model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
model = model.to(torch.bfloat16)
model = model.to(device)

Load and preprocess the image

image = Image.open("./test_images/skateboard.png")
image = image.convert("RGB")
images = [image]
# images = [None]  # use this instead if you do not want to pass an image

text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
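
The prompt string above wraps the user message in header and end-of-turn markers. A small helper can build the same string for any question, which avoids hand-editing the markers. The function name `build_prompt` is hypothetical and covers only the single-turn layout shown in this card.

```python
def build_prompt(user_message):
    """Wrap a user message in the header and turn markers used by the prompt in this card."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# Reproduces the prompt string used in the example.
text_prompt = build_prompt("What is so funny about this image?")
print(text_prompt)
```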

Generate the model response

temperature = 0

with torch.inference_mode():
    generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"), do_sample=temperature > 0, temperature=temperature, use_cache=True)

generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
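
The call above passes `temperature=0` together with `do_sample=False`. Greedy decoding ignores the temperature, and some versions of `transformers` emit a warning when a sampling-only parameter is set while sampling is disabled. One way to keep the two settings consistent is to derive both from a single value. The helper name `sampling_kwargs` is hypothetical and not part of Dragonfly or `transformers`.

```python
def sampling_kwargs(temperature):
    """Return decoding keyword arguments that match the requested temperature.

    A temperature of 0 (or below) selects greedy decoding and omits `temperature`;
    a positive value enables sampling at that temperature.
    """
    if temperature > 0:
        return {"do_sample": True, "temperature": temperature}
    return {"do_sample": False}


print(sampling_kwargs(0))
print(sampling_kwargs(0.7))
```

The result could then be unpacked into the generation call, for example `model.generate(**inputs, max_new_tokens=1024, use_cache=True, **sampling_kwargs(temperature))`, in place of the separate `do_sample` and `temperature` arguments.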

Example response

The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci.
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer 's expectations and familiarity with the
original paintings. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a
humerous effect that is likely to elicit laughter<|eot_id|>
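
The sample ends with a literal `<|eot_id|>` marker because decoding was called with `skip_special_tokens=False`. The sketch below shows one way to reduce a decoded string to the assistant's reply. The function name `extract_reply` is hypothetical. Whether the decoded output also contains the prompt is an assumption that may vary, so the function handles both cases.

```python
def extract_reply(
    decoded,
    header="<|start_header_id|>assistant<|end_header_id|>",
    end="<|eot_id|>",
):
    """Return the assistant reply from a decoded generation string.

    Keeps the text after the last assistant header (or the whole string if the
    header is absent), then cuts at the first end-of-turn marker.
    """
    reply = decoded.rpartition(header)[2]
    reply = reply.split(end, 1)[0]
    return reply.strip()


decoded = (
    "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nHello there<|eot_id|>"
)
print(extract_reply(decoded))  # prints "Hello there"
```

Since `processor.batch_decode` returns a list of strings, it would be applied per item, for example `extract_reply(generation_text[0])`.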

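📚 Documentation

Model details

Attribute	Details
Developed by	Together AI
Model type	Autoregressive vision-language model based on the Transformer architecture
License	Llama 3.1 Community License Agreement
Fine-tuned from	Llama 3.1

Model sources

Repository: https://github.com/togethercomputer/Dragonfly
Paper: https://arxiv.org/abs/2406.00977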

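Uses

Dragonfly is intended mainly for research on large vision-language models. Its target users are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

Training details

For training details, see the Implementation section of the paper.

Evaluation

For evaluation details, see the Results section of the paper.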

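🏆 Acknowledgements

We thank the following resources for their important contributions to the development of Dragonfly:

Meta Llama 3.1: used as the base language model.
CLIP: used as the vision backbone.
The codebase builds on the following two projects:
- Otter: A Multi-Modal Model with In-Context Instruction Tuning
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images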

📄 License

This model is released under the Llama 3.1 Community License Agreement, and use of the model is subject to its terms.

📖 BibTeX Citation

@misc{thapa2024dragonfly,
      title={Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models}, 
      author={Rahul Thapa and Kezhen Chen and Ian Covert and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
      year={2024},
      eprint={2406.00977},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}