Eagle-2
Eagle-2 is a Vision-Language Model (VLM) built by combining multiple base models. It aims to empower the open-source community to develop competitive VLMs with transparent processes, and it is available in several model sizes for different scenarios.
[GitHub] [Eagle2 Tech Report] [HF Demo]
Quick Start
We provide an inference script to help you get started with the model quickly. The following input types are supported:
- pure text input
- single image input
- multiple image input
- video input
Installation
Install the dependencies
pip install transformers
pip install flash-attn
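Optionally, verify the environment before loading the model. This is only a sanity check; it assumes a CUDA-capable GPU is available, since the examples below move the model and inputs to "cuda":

import torch
import transformers

# Print library versions and confirm a CUDA device is visible.
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())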
Usage Examples
Basic Usage
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
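The same pipeline also handles pure text input. A minimal sketch, reusing the model and processor loaded above and assuming the processor accepts messages with no image or video entries:

# Pure text input: the message content contains only a text entry.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is a vision-language model?"},
        ],
    }
]
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
# With no visual content in the messages, these are expected to be empty.
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))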
Advanced Usage
Stream generation
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel, AutoTokenizer
import torch
from transformers import TextIteratorStreamer
import threading
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(
**inputs,
streamer=streamer,
max_new_tokens=1024,
do_sample=True,
top_p=0.95,
temperature=0.8
)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for new_text in streamer:
print(new_text, end="", flush=True)
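Generation runs on a background thread so the main thread can consume tokens from the TextIteratorStreamer as soon as they are produced; if you need to be sure generation has finished before reusing the model or inputs, call thread.join() after the loop.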
Multiple images
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{
"type": "image",
"image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
},
{"type": "text", "text": "Describe these two images."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Single video
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "../Eagle2-8B/space_woaudio.mp4",
},
{"type": "text", "text": "Describe this video."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Multiple videos
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "../Eagle2-8B/space_woaudio.mp4",
"nframes": 10,
},
{
"type": "video",
"video": "../Eagle2-8B/video_ocr.mp4",
"nframes": 10,
},
{"type": "text", "text": "Describe these two videos respectively."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
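In each video entry, nframes controls how many frames the processor samples from that video; sampling fewer frames reduces the number of vision tokens (and memory use) at the cost of temporal detail.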
Batch inference
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"
messages1 = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.ilankelman.org/stopsigns/australia.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
messages2 = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
},
{"type": "text", "text": "Describe this image."},
],
}
]
text_list = [processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
) for messages in [messages1, messages2]]
image_inputs, video_inputs = processor.process_vision_info([messages1, messages2])
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
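Left padding (processor.tokenizer.padding_side = "left") matters most here: in a batch, shorter prompts are padded on the left so that every prompt ends at the same position and newly generated tokens directly follow the prompt rather than a run of padding tokens.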
Features
- Model Architecture Update: the model architecture is updated to eagle_2_5_vl to support the generate feature.
- Multiple Base Models: combines multiple base models (google/paligemma-3b-mix-448, Qwen/Qwen2.5-0.5B-Instruct, google/siglip-so400m-patch14-384) through merging.
- Multilingual Support: supports multiple languages.
- Multiple Input Types: supports pure text, single image, multiple images, single video, multiple videos, and batch inference.
Documentation
Introduction
We are thrilled to release our latest Eagle2 series Vision-Language Model. Open-source Vision-Language Models (VLMs) have made significant strides in narrowing the gap with proprietary models. However, critical details about data strategies and implementation are often missing, limiting reproducibility and innovation. In this project, we focus on VLM post-training from a data-centric perspective, sharing insights into building effective data strategies from scratch. By combining these strategies with robust training recipes and model design, we introduce Eagle2, a family of performant VLMs. Our work aims to empower the open-source community to develop competitive VLMs with transparent processes.
In this repo, we are open-sourcing Eagle2-1B, a compact and efficient model designed for scenarios that require fast inference and minimal computational resources, without compromising essential performance.
Model Zoo
We provide the following models:
Model Name | LLM | Vision | Max Length | HF Link |
---|---|---|---|---|
Eagle2-1B | Qwen2.5-0.5B-Instruct | SigLIP | 16K | link |
Eagle2-2B | Qwen2.5-1.5B-Instruct | SigLIP | 16K | link |
Eagle2-9B | Qwen2.5-7B-Instruct | SigLIP+ConvNeXt | 16K | link |
Benchmark Results
Benchmark | LLaVA-OneVision-0.5B | InternVL2-1B | InternVL2.5-1B | Qwen2-VL-2B | Eagle2-1B |
---|---|---|---|---|---|
DocVQA (test) | 70.0 | 81.7 | 84.8 | 90.1 | 81.8 |
ChartQA (test) | 61.4 | 72.9 | 75.9 | 73.0 | 77.0 |
InfoVQA (test) | 41.8 | 50.9 | 56.0 | 65.5 | 54.8 |
TextVQA (val) | - | 70.0 | 72.0 | 79.7 | 76.6 |
OCRBench | 565 | 754 | 785 | 809 | 767 |
MME (sum) | 1438.0 | 1794.4 | 1950.5 | 1872.0 | 1790.2 |
RealWorldQA | 55.6 | 50.3 | 57.5 | 62.6 | 55.4 |
AI2D (test) | 57.1 | 64.1 | 69.3 | 74.7 | 70.9 |
MMMU (val) | 31.4 | 36.7 | 40.9 | 41.1 | 38.8 |
MMVet (GPT-4-Turbo) | 32.2 | 32.7 | 48.8 | 49.5 | 40.9 |
HallBench (avg) | 27.9 | 34.0 | 39.0 | 41.7 | 35.3 |
MathVista (testmini) | 33.8 | 37.7 | 43.2 | 43.0 | 45.3 |
MMStar | 37.7 | 45.7 | 50.1 | 48.0 | 48.5 |
TODO
- [ ] Support vLLM Inference
- [ ] Provide AWQ Quantization Weights
- [ ] Provide fine-tuning scripts
License
- The code is released under the Apache 2.0 license as found in the LICENSE file.
- The pretrained model weights are released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.
- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
- Model License of Qwen2.5-0.5B-Instruct: Apache-2.0
- Model License of PaliGemma: Gemma license
Technical Details
Citation
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.
Information Table
Property | Details |
---|---|
Pipeline Tag | image-text-to-text |
Library Name | transformers |
Base Model | google/paligemma-3b-mix-448, Qwen/Qwen2.5-0.5B-Instruct, google/siglip-so400m-patch14-384 |
Base Model Relation | merge |
Language | multilingual |
Tags | eagle, VLM |
License | cc-by-nc-4.0 |