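🚀 Dragonfly Model Card
Dragonfly is a multimodal vision-language model trained by instruction tuning on Llama 3.1. It is intended primarily for research on large vision-language models, and its target audience is researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.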
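✨ Key Features
- Multimodal fusion: accepts both image and text inputs, enabling information to flow across the two modalities.
- Transformer-based architecture: generates text autoregressively, giving the model strong language generation ability.
- Instruction tuning: instruction-tuned on top of Llama 3.1 to improve performance on specific tasks.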
📦 Installation
Create a Conda environment and install the required packages
conda env create -f environment.yml
conda activate dragonfly_env
Install Flash Attention
pip install flash-attn --no-build-isolation
As a final step, run the following command
pip install --upgrade -e .
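After installation, a quick check that the key packages can be found may catch a broken environment early. The snippet below is a minimal sketch: the helper name `check_packages` is hypothetical, and the module names (`torch`, `flash_attn`, `dragonfly`) are assumed from the install commands above and the imports used in the usage example.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Module names are assumptions based on the install steps and usage example.
    for name, found in check_packages(["torch", "flash_attn", "dragonfly"]).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

This only checks that each module is discoverable; it does not import them or verify GPU support.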
💻 Usage Example
Basic Usage
Load the required packages
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed
Instantiate the tokenizer, processor, and model
device = torch.device("cuda:0")
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")
model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
model = model.to(torch.bfloat16)
model = model.to(device)
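Casting the model to `torch.bfloat16` stores each parameter in 2 bytes, which gives a rough lower bound on the GPU memory needed for the weights. The sketch below works through that arithmetic. The parameter count of about 8e9 is an assumption read off the "8B" in the checkpoint name, and the helper `weight_memory_gib` is hypothetical; the vision encoder, activations, and the KV cache are not counted.

```python
def weight_memory_gib(num_params, bytes_per_param=2):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: roughly 8e9 parameters, based on the "8B" in the checkpoint name.
# bfloat16 uses 2 bytes per parameter.
estimate = weight_memory_gib(8e9, bytes_per_param=2)
print(f"~{estimate:.1f} GiB for weights in bfloat16")
```

Under these assumptions the weights alone take about 15 GiB, so actual usage during generation will be higher.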
Load and process the image
image = Image.open("./test_images/skateboard.png")
image = image.convert("RGB")
images = [image]
text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
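The prompt string above wraps the question in chat header tokens: a user header, the message, an end-of-turn marker, and an open assistant header. To ask a different question without retyping those markers, a small helper can build the string. This is a sketch only; `build_prompt` is a hypothetical name and covers just the single-turn layout used in this example.

```python
def build_prompt(user_message):
    """Wrap a single user message in the chat markers used by the example prompt."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# Reproduces the prompt string from the example.
prompt = build_prompt("What is so funny about this image?")
```

The result can be passed to the processor in place of `text_prompt`, for example `processor(text=[prompt], images=images, ...)`.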
Generate the model response
temperature = 0
with torch.inference_mode():
    generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.encode("<|eot_id|>"), do_sample=temperature > 0, temperature=temperature, use_cache=True)
generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
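With `temperature = 0`, the expression `temperature > 0` is `False`, so `do_sample` is disabled and decoding is greedy; the temperature value itself is then not used. Some versions of `transformers` may warn when a temperature is passed while sampling is off (an assumption that can vary by version). One way to avoid this is to include `temperature` only when sampling, as in the sketch below. The helper `build_generation_kwargs` is hypothetical and not part of the Dragonfly code base.

```python
def build_generation_kwargs(temperature, max_new_tokens=1024):
    """Build keyword arguments for model.generate, adding temperature only when sampling."""
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "do_sample": temperature > 0,
        "use_cache": True,
    }
    if temperature > 0:
        # Temperature only matters when sampling is enabled.
        kwargs["temperature"] = temperature
    return kwargs


# Intended use (requires the model and inputs from the example):
# model.generate(**inputs, eos_token_id=tokenizer.encode("<|eot_id|>"),
#                **build_generation_kwargs(temperature))
```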
Example response
The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci.
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer 's expectations and familiarity with the
original paintings. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a
humerous effect that is likely to elicit laughter<|eot_id|>
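Because decoding uses `skip_special_tokens=False`, the text ends with the `<|eot_id|>` marker. The sketch below strips the chat markers from a decoded string. The helper `extract_answer` is hypothetical; it also handles the case where the prompt is echoed before the answer, which is an assumption rather than confirmed behaviour of `model.generate` here.

```python
EOT = "<|eot_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_answer(decoded_text):
    """Return the assistant's reply without chat markers."""
    # If the prompt is echoed, keep only what follows the last assistant header.
    _, _, tail = decoded_text.rpartition(ASSISTANT_HEADER)
    # Cut at the first end-of-turn marker and trim surrounding whitespace.
    return tail.split(EOT, 1)[0].strip()
```

Since `processor.batch_decode` is applied to a batch, one way to use it is `answers = [extract_answer(t) for t in generation_text]`.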
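📚 Documentation
Model Details
Model Sources
- Repository: https://github.com/togethercomputer/Dragonfly
- Paper: https://arxiv.org/abs/2406.00977
Uses
Dragonfly is intended primarily for research on large vision-language models. Its target users are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.
Training Details
For more training details, see the Implementation section of the paper.
Evaluation
For more evaluation details, see the Results section of the paper.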
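🏆 Acknowledgements
We thank the following resources for their important contributions to the development of Dragonfly:
📄 License
This model is released under the Llama 3.1 Community License Agreement, and it may be used in accordance with the terms of that agreement.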
📖 BibTeX Citation
@misc{thapa2024dragonfly,
title={Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models},
author={Rahul Thapa and Kezhen Chen and Ian Covert and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
year={2024},
eprint={2406.00977},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Model Card Authors
Rahul Thapa, Kezhen Chen, Rahul Chalamala
Model Card Contact
Rahul Thapa (rahulthapa@together.ai), Kezhen Chen (kezhen@together.ai)