Llama-3.1-8B-Dragonfly-v2 Open-source Multimodal Model - Joint Image-Text Understanding with Text Generation

Llama 3.1 8B Dragonfly V2

Developed by togethercomputer

Dragonfly is a multimodal vision-language model instruction-tuned from Llama 3.1. It takes images and text as input and generates text.

Image-to-Text

PyTorch

English #Multimodal Vision-Language #High-Resolution Image Understanding #Artistic Image Analysis

Downloads 113

Release Time : 10/10/2024

Model Overview

This model is intended primarily for vision-language research. It processes joint image-text inputs and generates textual descriptions or answers.

Model Features

Multi-Resolution Image Processing

Uses the LLaVA-UHD high-resolution image processing approach to capture fine visual detail

Instruction Fine-Tuning Optimization

Instruction-tuned from Llama 3.1 to improve comprehension of complex vision-language tasks

Multimodal Fusion

Integrates CLIP visual features with the Llama language model for image-text interaction
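To make the multi-resolution idea listed above more concrete, here is a minimal sketch of how a high-resolution image could be divided into fixed-size tiles before being passed to a vision encoder. This is not Dragonfly's or LLaVA-UHD's actual slicing logic. The function names are hypothetical, and the 336-pixel tile size is an assumption taken from the `clip-vit-large-patch14-336` checkpoint used in the usage example.

```python
import math

def tile_grid(width, height, tile=336):
    """Return (cols, rows): how many tiles of side `tile` cover the image."""
    cols = max(1, math.ceil(width / tile))
    rows = max(1, math.ceil(height / tile))
    return cols, rows

def tile_boxes(width, height, tile=336):
    """Return crop boxes (left, upper, right, lower) in row-major order.

    Boxes on the right and bottom edges are clipped to the image size.
    """
    cols, rows = tile_grid(width, height, tile)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, upper = c * tile, r * tile
            boxes.append((left, upper, min(left + tile, width), min(upper + tile, height)))
    return boxes
```

Each box can be passed to `PIL.Image.Image.crop` to obtain one tile. A real pipeline would typically also keep a downscaled view of the whole image alongside the tiles.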

Model Capabilities

Image content understanding

Visual question answering

Image caption generation

Multimodal reasoning

Use Cases

Art & Creativity

Artwork Analysis

Analyze artwork content, style and creative intent

Identifies artistic styles and generates a written analysis

Education

Visual-Assisted Learning

Explain complex concepts through visual aids

Provides text explanations grounded in the accompanying image

🚀 Dragonfly Model Card

Dragonfly is a multimodal visual-language model, enabling image-text-to-text generation.

🚀 Quick Start

💿 Installation

Create a conda environment and install necessary packages

conda env create -f environment.yml
conda activate dragonfly_env

Install flash attention

pip install flash-attn --no-build-isolation

As a final step, please run the following command.

pip install --upgrade -e .
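After the three installation steps, a quick check that the expected packages are importable can save time before loading the model. The sketch below uses only the Python standard library. The list of module names is an assumption based on the imports in the usage example further down, and `missing_modules` is a hypothetical helper, not part of the Dragonfly codebase.

```python
import importlib.util

# Assumed top-level module names, taken from the imports in the usage example.
REQUIRED_MODULES = ["torch", "PIL", "transformers", "flash_attn", "dragonfly"]

def missing_modules(names):
    """Return the names from `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

The check only confirms that each package can be located. It does not verify versions or that a CUDA device is available.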

🧠 Inference

If you have successfully completed the installation process, then you should be able to follow the steps below.

Question: What is so funny about this image?

[Image: Monalisa Dog]

💻 Usage Examples

Basic Usage

Load necessary packages

import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer

from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed

Instantiate the tokenizer, processor, and model.

device = torch.device("cuda:0")

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")

model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3.1-8B-Dragonfly-v2")
model = model.to(torch.bfloat16)
model = model.to(device)

Now, let's load the image and process it.

image = Image.open("./test_images/skateboard.png")
image = image.convert("RGB")
images = [image]
# images = [None] # if you do not want to pass any images

text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nWhat is so funny about this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

inputs = processor(text=[text_prompt], images=images, max_length=4096, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
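The prompt string above wraps the question in Llama 3 style chat headers. When trying several questions, a small helper keeps that wrapping in one place. This is a minimal sketch: `build_prompt` is a hypothetical name, and it reproduces only the single-turn format shown in the example.

```python
def build_prompt(question: str) -> str:
    """Wrap a user question in the chat headers used in the example above."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{question}"
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )

text_prompt = build_prompt("What is so funny about this image?")
```

The resulting `text_prompt` is identical to the hand-written string and can be passed to the processor in the same way.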

Finally, let us generate the response from the model.

temperature = 0

with torch.inference_mode():
    generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.encode("<|eot_id|>"), do_sample=temperature > 0, temperature=temperature, use_cache=True)

generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
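The call above derives `do_sample` from the temperature and still passes `temperature=0` when decoding greedily. Some versions of `transformers` emit a warning when a sampling parameter is set while sampling is disabled. One possible way to avoid that, sketched here under the assumption that Dragonfly's `generate` accepts the same keyword arguments as in the example, is to build the arguments conditionally. The helper name `generation_kwargs` is hypothetical.

```python
def generation_kwargs(temperature: float, max_new_tokens: int = 1024) -> dict:
    """Build keyword arguments for `model.generate`.

    Sampling is enabled only for a positive temperature. For greedy
    decoding the temperature key is left out entirely.
    """
    kwargs = {"max_new_tokens": max_new_tokens, "use_cache": True}
    if temperature > 0:
        kwargs.update(do_sample=True, temperature=temperature)
    else:
        kwargs["do_sample"] = False
    return kwargs
```

The result could then be unpacked into the call, for example `model.generate(**inputs, eos_token_id=..., **generation_kwargs(0))`, with `eos_token_id` set as in the example above.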

Example Response

An example response from the model:

The humor in this image comes from the surreal juxtaposition of a dog's face with the body of the Mona Lisa, a famous painting by Leonardo da Vinci.
The Mona Lisa is known for her enigmatic smile and is often considered one of the most famous paintings in the world. By combining the dog's face with
the body of the Mona Lisa, the artist has created a whimsical and amusing image that plays on the viewer 's expectations and familiarity with the
original paintings. The contrast between the dog's natural, expressive features and the serene, mysterious expression of the Mona Lisa creates a
humerous effect that is likely to elicit laughter<|eot_id|>
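Because the example decodes with `skip_special_tokens=False`, the response ends with the `<|eot_id|>` marker. A small post-processing step can strip it. The sketch below is an illustration rather than part of the Dragonfly codebase: `extract_answer` is a hypothetical helper, and it also handles the case where the decoded text still contains the prompt, which may or may not happen depending on how the model returns its output.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"

def extract_answer(decoded: str) -> str:
    """Return the assistant's reply without chat markers."""
    # Keep only what follows the last assistant header, if one is present.
    _, _, tail = decoded.rpartition(ASSISTANT_HEADER)
    # Drop the end-of-turn marker and anything after it.
    answer, _, _ = tail.partition(END_OF_TURN)
    return answer.strip()
```

With the variables from the usage example, `extract_answer(generation_text[0])` would give the plain-text reply.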

✨ Features

Multimodal Capability: Dragonfly is a multimodal visual-language model, trained by instruction tuning on Llama 3.1, enabling image-text-to-text generation.
Research-Oriented: The primary use of Dragonfly is research on large visual-language models, mainly for researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.

📦 Installation

Installation involves creating a conda environment from environment.yml, installing flash attention, and installing the package in editable mode. See the "Quick Start" section for detailed steps.

📚 Documentation

Model Details

Property	Details
Developed by	Together AI
Model Type	An autoregressive visual-language model based on the transformer architecture
License	Llama 3.1 Community License Agreement
Finetuned from model	Llama 3.1
Repository	https://github.com/togethercomputer/Dragonfly
Paper	https://arxiv.org/abs/2406.00977

Training Details

See more details in the "Implementation" section of our paper.

Evaluation

See more details in the "Results" section of our paper.

🏆 Credits

We would like to acknowledge the following resources that were instrumental in the development of Dragonfly:

Meta Llama 3.1: We used the Llama 3.1 model as our foundational language model.
CLIP: Our vision backbone is the CLIP model from OpenAI.
Our codebase is built upon the following two codebases:
- Otter: A Multi-Modal Model with In-Context Instruction Tuning
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

📚 BibTeX

@misc{thapa2024dragonfly,
      title={Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models}, 
      author={Rahul Thapa and Kezhen Chen and Ian Covert and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
      year={2024},
      eprint={2406.00977},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Model Card Authors

Rahul Thapa, Kezhen Chen, Rahul Chalamala

Model Card Contact

Rahul Thapa (rahulthapa@together.ai), Kezhen Chen (kezhen@together.ai)

⚠️ Important Note

Users are permitted to use this model in accordance with the Llama 3.1 Community License Agreement.
