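🚀 Dragonfly Model Card
Dragonfly is a multimodal vision-language model trained by instruction tuning on Llama 3.1. It is intended primarily for research on large vision-language models, and is aimed at researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.
✨ Key Features
- Multimodal fusion: accepts both image and text inputs and lets information flow between the two modalities.
- Transformer-based architecture: an autoregressive model with strong language generation ability.
- Instruction tuning: instruction-tuned on top of Llama 3.1 to improve performance on instruction-following tasks.
📦 Installation
Create a Conda environment and install the required packages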
conda env create -f environment.yml
conda activate dragonfly_env
Install Flash Attention
pip install flash-attn --no-build-isolation
As a final step, run the following command from the repository root
pip install --upgrade -e .
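After installation, it can help to confirm that the packages are visible to the active environment before loading the model. The following is a minimal sketch, not part of the Dragonfly repository: `find_missing_modules` is a hypothetical helper, and the module list is an assumption taken from the imports in the usage example rather than an official requirements list.

```python
import importlib.util

# Top-level modules the usage example relies on. This list is an assumption
# drawn from that example, not an official list of requirements.
REQUIRED_MODULES = ["torch", "PIL", "transformers", "flash_attn", "dragonfly"]


def find_missing_modules(names):
    """Return the entries of `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Treat lookup failures the same as "not installed".
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```

If any names are reported as missing, rerunning the installation steps inside the activated `dragonfly_env` environment is the first thing to try.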
💻 Usage Example
Basic Usage
Load the required packages
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed
Instantiate the tokenizer, processor, and model
device = torch.device("cuda:0")
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")
model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
model = model.to(torch.bfloat16)
model = model.to(device)
Load and process the image
image = Image.open("./test_images/skateboard.png")
image = image.convert("RGB")
images = [image]
text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
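The prompt string above wraps the question in Llama 3 style header and end-of-turn markers. When trying several questions, a small helper can build that string instead of editing it by hand. This is a sketch only: `build_prompt` is a hypothetical name that does not come from the Dragonfly repository, and it covers a single user turn.

```python
def build_prompt(user_message: str) -> str:
    """Wrap one user message in the header and end-of-turn markers used above."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# Reproduces the prompt from the example above.
text_prompt = build_prompt("What is so funny about this image?")
```

The result can be passed to the processor as `text=[text_prompt]`, exactly as in the example.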
Generate the model response
temperature = 0
with torch.inference_mode():
    generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.encode("<|eot_id|>"), do_sample=temperature > 0, temperature=temperature, use_cache=True)
generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
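In the call above, `do_sample` is derived from `temperature`, so a temperature of 0 selects greedy decoding. The temperature value itself has no effect in that case, and some versions of `transformers` may warn when it is passed alongside `do_sample=False`. One way to keep the two settings consistent is to build the keyword arguments in a single place. `generation_kwargs` below is a hypothetical helper written for illustration, not part of the model's API.

```python
def generation_kwargs(temperature: float, max_new_tokens: int = 1024) -> dict:
    """Build sampling-related keyword arguments for `model.generate`."""
    kwargs = {"max_new_tokens": max_new_tokens, "use_cache": True}
    if temperature > 0:
        # Sampling: temperature controls how random the output is.
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    else:
        # Greedy decoding: temperature is left out because it is unused.
        kwargs["do_sample"] = False
    return kwargs
```

The returned dictionary can be unpacked into the call next to the inputs, for example `model.generate(**inputs, eos_token_id=..., **generation_kwargs(0))`, with `eos_token_id` set as in the example above.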
Example response
The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci.
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer 's expectations and familiarity with the
original paintings. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a
humerous effect that is likely to elicit laughter<|eot_id|>
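Because decoding uses `skip_special_tokens=False`, the text ends with the `<|eot_id|>` marker, as seen above. A short post-processing step can strip it. The sketch below also drops everything up to the last assistant header in case the decoded string includes the prompt; whether it does is an assumption, since the card does not say. `extract_response` is a hypothetical helper name.

```python
EOT_TOKEN = "<|eot_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"


def extract_response(decoded: str) -> str:
    """Return the assistant text without the prompt or the end-of-turn marker."""
    # Keep what follows the last assistant header. If the header is absent,
    # rpartition returns the whole string as the tail.
    _, _, tail = decoded.rpartition(ASSISTANT_HEADER)
    # Cut at the first end-of-turn marker in the remaining text.
    return tail.split(EOT_TOKEN, 1)[0].strip()
```

Applied to the output of the example, this would be `extract_response(generation_text[0])`, since `batch_decode` returns one string per sequence.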
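📚 Documentation
Model Details
Model Sources
- Repository: https://github.com/togethercomputer/Dragonfly
- Paper: https://arxiv.org/abs/2406.00977
Uses
Dragonfly is intended primarily for research on large vision-language models. Its target users are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.
Training Details
For more training details, see the Implementation section of the paper.
Evaluation
For more evaluation details, see the Results section of the paper.
🏆 Acknowledgements
We thank the following resources for their important contributions to the development of Dragonfly:
📄 License
This model is released under the Llama 3.1 Community License Agreement. Users may use the model under the terms of that agreement.
📖 BibTeX Citation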
@misc{thapa2024dragonfly,
title={Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models},
author={Rahul Thapa and Kezhen Chen and Ian Covert and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
year={2024},
eprint={2406.00977},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Model Card Authors
Rahul Thapa, Kezhen Chen, Rahul Chalamala
Model Card Contact
Rahul Thapa (rahulthapa@together.ai), Kezhen Chen (kezhen@together.ai)