Llama-3.1-8B-Dragonfly-v2: an open-source multimodal model for joint image-text understanding and generation

Llama 3.1 8B Dragonfly V2

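Developed by togethercomputer

Dragonfly is a multimodal vision-language model trained by instruction fine-tuning on Llama 3.1. It supports joint understanding of images and text and generates text in response.

Image-to-Text

PyTorch

English #multimodal vision-language #high-resolution image understanding #art image analysis

Downloads: 113

Released: 10/10/2024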

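Model Overview

The model is intended mainly for research on vision-language tasks. It accepts joint image and text input and generates a text description or answer.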

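Model Features

Multi-resolution image processing

Uses the LLaVA-UHD high-resolution image processing scheme to capture fine visual detail.

Instruction fine-tuning

Instruction-tuned on top of Llama 3.1 to improve its handling of complex vision-language tasks.

Multimodal fusion

Combines CLIP visual features with the Llama language model so that image and text information can interact.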
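
To make the multi-resolution idea concrete, the sketch below computes crop boxes that tile a high-resolution image into fixed-size patches. It is an illustration only, not the actual slicing logic of Dragonfly or LLaVA-UHD. The helper name `tile_boxes`, the 336-pixel tile size, and the requirement that the image was already resized to a multiple of the tile size are all assumptions made for this example.

```python
def tile_boxes(width, height, tile=336):
    """Return (left, upper, right, lower) crop boxes covering a width x height image.

    Hypothetical helper: assumes width and height are positive multiples of `tile`.
    """
    if width <= 0 or height <= 0 or width % tile or height % tile:
        raise ValueError("width and height must be positive multiples of the tile size")
    boxes = []
    for top in range(0, height, tile):      # rows, top to bottom
        for left in range(0, width, tile):  # columns, left to right
            boxes.append((left, top, left + tile, top + tile))
    return boxes


# A 672 x 672 image yields a 2 x 2 grid of 336 x 336 tiles.
boxes = tile_boxes(672, 672)
print(len(boxes))  # 4
print(boxes[0])    # (0, 0, 336, 336)
```

Each box has the layout that Pillow's `Image.crop` accepts, so the tiles could be cropped and encoded alongside a downsized copy of the whole image.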

Model Capabilities

Image content understanding

Visual question answering

Image captioning

Multimodal reasoning

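Use Cases

Art and creativity

Artwork analysis

Analyzes the content, style, and creative intent of an artwork

Intended result: identifies the artistic style and generates an insightful analysis

Education

Visual learning aids

Uses images to help explain complex concepts

Intended result: provides intuitive, easy-to-follow multimodal explanations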

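🚀 Dragonfly Model Card

Dragonfly is a multimodal vision-language model trained by instruction fine-tuning on Llama 3.1. It is intended primarily for research on large vision-language models, and its target users are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

✨ Key Features

Multimodal fusion: accepts image and text input and lets information flow between the two modalities.
Transformer architecture: generates text autoregressively.
Instruction fine-tuning: instruction-tuned on top of Llama 3.1 to improve performance on targeted tasks.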

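📦 Installation

Create a Conda environment and install the required packages. Run the commands from the root of a clone of the Dragonfly repository, which is where environment.yml is expected to be.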

conda env create -f environment.yml
conda activate dragonfly_env

Install Flash Attention

pip install flash-attn --no-build-isolation

As the last step, install the package from the repository in editable mode

pip install --upgrade -e .
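
After installation, a quick check can confirm that the modules the usage example imports are visible to the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical, and the module names in `required` are assumptions inferred from the install commands and from the imports in the usage example.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed module names: flash-attn installs as `flash_attn`, Pillow as `PIL`.
required = ["torch", "PIL", "transformers", "flash_attn", "dragonfly"]
missing = missing_packages(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

The check only looks the modules up without importing them, so it does not verify that a GPU or a compatible CUDA build is available.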

💻 Usage Example

Basic usage

Load the required packages

import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed

Instantiate the tokenizer, processor, and model

device = torch.device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")

model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
model = model.to(torch.bfloat16)
model = model.to(device)

Load and process the image

image = Image.open("./test_images/skateboard.png")
image = image.convert("RGB")
images = [image]
# images = [None]  # use this instead if you do not want to pass an image

text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
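
The prompt string above wraps a single user message in Llama 3.1 chat markers. A small helper, sketched below, builds the same string for any question so the markers do not have to be typed by hand. The name `build_prompt` is hypothetical and is not part of the Dragonfly codebase. The helper covers only the single-turn layout shown in the example.

```python
def build_prompt(user_message):
    """Wrap one user message in the chat markers used by the example prompt."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


text_prompt = build_prompt("What is so funny about this image?")
print(repr(text_prompt))
```

The result can be passed to the processor as `text=[text_prompt]`, exactly like the hand-written string.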

Generate the model response

temperature = 0  # 0 selects greedy decoding; use a value above 0 to sample

with torch.inference_mode():
    generation_output = model.generate(
        **inputs, max_new_tokens=1024, use_cache=True,
        eos_token_id=tokenizer.encode("<|eot_id|>"),  # stop at the end-of-turn token
        do_sample=temperature > 0, temperature=temperature,
    )

generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
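
The example switches between greedy decoding and sampling through the `temperature` variable. Greedy decoding does not use the temperature, and some versions of transformers may warn when a sampling-only parameter is set while sampling is off. The sketch below builds the keyword arguments so that `temperature` is passed only when sampling. The helper name `generation_kwargs` is hypothetical, and the defaults simply mirror the values in the example.

```python
def generation_kwargs(temperature, max_new_tokens=1024):
    """Build keyword arguments for model.generate from a temperature setting."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    kwargs = {"max_new_tokens": max_new_tokens, "use_cache": True}
    if temperature > 0:
        # Sampling: pass the temperature through.
        kwargs.update(do_sample=True, temperature=temperature)
    else:
        # Greedy decoding: leave temperature out because it is unused.
        kwargs["do_sample"] = False
    return kwargs


print(generation_kwargs(0))
print(generation_kwargs(0.7))
```

Under these assumptions the call would look like `model.generate(**inputs, eos_token_id=tokenizer.encode("<|eot_id|>"), **generation_kwargs(temperature))`.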

Example response

The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci.
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer's expectations and familiarity with the
original paintings. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a
humorous effect that is likely to elicit laughter<|eot_id|>
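
Because the example decodes with `skip_special_tokens=False`, the response ends with the literal `<|eot_id|>` marker. The sketch below removes it after decoding. The helper name `strip_eot` is hypothetical, and the sketch assumes the decoded string contains only the model's reply, so the first marker is the one that ends the reply.

```python
EOT = "<|eot_id|>"


def strip_eot(text, marker=EOT):
    """Return the text before the first end-of-turn marker, without trailing whitespace."""
    return text.split(marker, 1)[0].rstrip()


raw = "is likely to elicit laughter<|eot_id|>"
print(strip_eot(raw))  # is likely to elicit laughter
```

Since `batch_decode` returns a list of strings, the helper would be applied per item, for example `[strip_eot(t) for t in generation_text]`.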

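📚 Documentation

Model Details

Attribute	Details
Developed by	Together AI
Model type	Autoregressive vision-language model based on the Transformer architecture
License	Llama 3.1 Community License
Fine-tuned from	Llama 3.1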

Model Sources

Repository: https://github.com/togethercomputer/Dragonfly
Paper: https://arxiv.org/abs/2406.00977

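Uses

Dragonfly is intended primarily for research on large vision-language models. Its target users are researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

Training Details

See the Implementation section of the paper for training details.

Evaluation

See the Results section of the paper for evaluation details.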

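🏆 Acknowledgements

We thank the following resources for their important contributions to the development of Dragonfly:

Meta Llama 3.1: used as the base language model.
CLIP: used as the vision backbone.
The codebase builds on the following two projects: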
- Otter: A Multi-Modal Model with In-Context Instruction Tuning
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

📄 License

This model is released under the Llama 3.1 Community License. Users may use the model in accordance with that license.

📖 BibTeX Citation

@misc{thapa2024dragonfly,
      title={Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models}, 
      author={Rahul Thapa and Kezhen Chen and Ian Covert and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
      year={2024},
      eprint={2406.00977},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}