Eagle2.5 8B
Eagle 2.5 is a cutting-edge vision-language model (VLM) designed for long-context multimodal learning, supporting video sequences of up to 512 frames and high-resolution images.
Model Overview
Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework that performs strongly across a wide range of benchmarks.
Model Features
Long-context processing ability
Supports processing video sequences up to 512 frames and high-resolution images, addressing the limitation that most existing VLMs focus on short-context tasks.
Information-first sampling
Optimizes visual and textual inputs through Image Area Preservation (IAP) and Automatic Degrade Sampling (ADS) to maximize use of the context length without losing information.
Progressive mixed post-training
Gradually increases the context length from 32K to 128K during training, enhancing the model's ability to handle inputs of different sizes (a minimal schedule sketch follows this feature list).
Diversity-driven data recipe
Combines open-source data with the self-curated Eagle-Video-110K dataset to provide rich and diverse training samples.
Efficiency optimization
Significantly improves the model's computational efficiency and inference speed through technologies such as GPU memory optimization, distributed context parallelism, video decoding acceleration, and inference acceleration.
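To make the progressive schedule concrete, here is a minimal, hypothetical sketch of a context-length curriculum in the spirit of the 32K-to-128K post-training described above. The stage boundaries (including the intermediate 64K stage), step counts, and helper names are illustrative assumptions, not the released training code.

```python
# Hypothetical sketch of a progressive context-length curriculum (illustrative only;
# the actual Eagle 2.5 recipe is described in the tech report).

CONTEXT_STAGES = [32_768, 65_536, 131_072]  # 32K -> 64K -> 128K tokens (64K stage is an assumption)

def stage_for_step(step: int, steps_per_stage: int = 10_000) -> int:
    """Return the maximum context length allowed at a given training step."""
    idx = min(step // steps_per_stage, len(CONTEXT_STAGES) - 1)
    return CONTEXT_STAGES[idx]

def pack_sample(tokens: list[int], step: int) -> list[int]:
    """Clip a packed multimodal token sequence to the budget of the current stage."""
    return tokens[: stage_for_step(step)]
```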
Model Capabilities
Long video understanding
High-resolution image understanding
Multimodal learning
Text generation
Image analysis
Video analysis
Use Cases
Video understanding
Long video content analysis
Analyzes video content up to 512 frames to extract key information and storylines.
Achieves state-of-the-art results on multiple video benchmarks.
Video question answering
Answers relevant questions based on video content.
Achieves an accuracy of 72.4% when using 512 input frames on Video-MME.
Image understanding
High-resolution image analysis
Processes high-resolution images to extract fine-grained details.
Performs strongly on multiple image benchmarks, comparable to Qwen2.5-VL.
Document understanding
Parses multi-page document content to extract key information.
Achieves an accuracy of 94.1% in the DocVQA test.
🚀 Eagle 2.5
Eagle 2.5 is a family of frontier vision-language models (VLMs) designed for long-context multimodal learning. It addresses the challenges of long video comprehension and high-resolution image understanding, offering a generalist framework for both.
[📂Homepage] [📂GitHub] [📜Tech Report] [🤗HF Demo]
🚀 Quick Start
Installation
pip install transformers==4.51.0
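The streaming example further below loads the model with `attn_implementation='flash_attention_2'`, which requires the flash-attn package; if you plan to use that code path, it can typically be installed with:
pip install flash-attn --no-build-isolation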
Usage Examples
Basic Usage
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the model and processor; trust_remote_code is required for the Eagle architecture
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

# A single-turn conversation with one image and one text prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and collect the vision inputs referenced in the messages
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Advanced Usage - Stream Generation
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel, AutoTokenizer
import torch
from transformers import TextIteratorStreamer
import threading

# flash_attention_2 requires the flash-attn package to be installed
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

# Run generation in a background thread and consume the decoded text as it streams in
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,
    top_p=0.95,
    temperature=0.8
)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
Advanced Usage - Multiple Images
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
            },
            {"type": "text", "text": "Describe these two images."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Advanced Usage - Single Video
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Advanced Usage - Multiple Videos
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
                "nframes": 10,
            },
            {
                "type": "video",
                "video": "../Eagle2-8B/video_ocr.mp4",
                "nframes": 10,
            },
            {"type": "text", "text": "Describe these two videos respectively."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Advanced Usage - Batch Inference
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) for messages in [messages1, messages2]]
image_inputs, video_inputs = processor.process_vision_info([messages1, messages2])
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
✨ Features
🚀Strong Results Across The Board
- SOTA on 6 out of 10 long video benchmarks
- Outperforms GPT-4o (0806) on 3/5 video tasks
- Outperforms Gemini 1.5 Pro on 4/6 video tasks
- Matches or outperforms Qwen2.5-VL-72B on multiple key datasets
- 72.4% on Video-MME with 512 input frames
- Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL.
🎯Key Innovations
- Information-First Sampling:
  - Image Area Preservation (IAP): Optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details.
  - Automatic Degrade Sampling (ADS): Dynamically balances visual and textual input, ensuring complete text retention while maximizing visual content within context-length constraints (a short sketch of IAP and ADS follows this list).
- Progressive Mixed Post-Training:
  - Gradually increases context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling.
- Diversity-Driven Data Recipe:
  - Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.
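As a rough illustration of the two sampling ideas above, the sketch below shows (a) an IAP-style search for a tile grid that preserves as much of the original image area and aspect ratio as possible, and (b) an ADS-style budget split that keeps all text tokens and spends whatever context remains on visual content. Every constant, weighting, and helper name here is a hypothetical simplification, not the released implementation.

```python
# Hypothetical sketch of IAP tile-grid selection and ADS budgeting (illustrative only).

def select_tile_grid(img_w: int, img_h: int, tile: int = 512, max_tiles: int = 12) -> tuple[int, int]:
    """Pick a (cols, rows) tiling that best preserves the image's area and aspect ratio."""
    best, best_score = (1, 1), float("-inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            canvas_w, canvas_h = cols * tile, rows * tile
            scale = min(canvas_w / img_w, canvas_h / img_h)  # fit the image inside the canvas
            area_kept = (img_w * img_h) * scale * scale / (canvas_w * canvas_h)  # canvas utilization
            aspect_err = abs(canvas_w / canvas_h - img_w / img_h)
            score = area_kept - 0.5 * aspect_err  # weighting is arbitrary in this sketch
            if score > best_score:
                best, best_score = (cols, rows), score
    return best

def ads_frame_budget(n_text_tokens: int, tokens_per_frame: int, context_len: int = 131_072) -> int:
    """Keep the full text and fill the remaining context with as many video frames as fit."""
    remaining = max(context_len - n_text_tokens, 0)
    return remaining // tokens_per_frame
```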
⚡Efficiency & Framework Optimization
- GPU Memory Optimization:
  - Integrates Triton-based fused operators that replace PyTorch's MLP, RMSNorm, and RoPE implementations.
  - Reduces GPU memory with fused linear layers + cross-entropy loss (removing intermediate logit storage) and CPU offloading of hidden states.
- Distributed Context Parallelism:
  - Adopts a two-layer communication group based on Ulysses and Ring/Context Parallelism, building on USP.
  - Implements ZigZag Llama3-style Context Parallelism with all-gather KV to reduce communication latency.
- Video Decoding Acceleration:
  - Optimizes sparse video frame sampling with rapid video metadata parsing, improving long-video decoding and reducing memory consumption (a minimal sampling sketch follows this list).
- Inference Acceleration:
  - Supports vLLM deployment with reduced memory and accelerated inference.
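For intuition, sparse frame sampling can be approximated with a standard video reader that decodes only the frames it needs. The snippet below is a minimal sketch using decord with uniform sampling; both the library choice and the sampling policy are assumptions for illustration, not the optimized decoder described above.

```python
# Minimal sketch of sparse, uniform frame sampling (assumed to use decord; illustrative only).
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 512) -> np.ndarray:
    """Decode only `num_frames` uniformly spaced frames instead of the full video."""
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    indices = np.linspace(0, total - 1, num=min(num_frames, total)).astype(int)
    return vr.get_batch(indices).asnumpy()  # (N, H, W, 3) uint8 frames
```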
📚 Documentation
Model Details
Property | Details |
---|---|
Model Type | Long-context vision-language model |
Architecture | Vision encoder: Siglip2-So400m-Patch16-512; Language model: Qwen2.5-7B-Instruct; Multimodal base architecture: LLaVA with tiling-based vision input |
Supported Inputs | Long video sequences (up to 512 frames), high-resolution images (up to 4K HD input size), multi-page documents, long text |
Training Strategy | Progressive mixed post-training, expanding from 32K to 128K context length; information-first sampling for optimal visual and textual information retention |
Training Data | Open-source video and document datasets; Eagle-Video-110K (110K long videos with dual-level annotation) |
Released Models
Model | Date | Download Link | Notes |
---|---|---|---|
Eagle2.5-8B | 2025.04.16 | HF link | Long video (512 frames), high-res support |
Video Benchmarks
Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
---|---|---|---|---|---|
MVBench (test) | - | - | 72.0 | 69.6 | 74.8 |
Perception Test (val) | - | - | - | 70.5 | 82.0 |
EgoSchema (full set) | - | 72.2 | - | 65.0 | 72.2 |
MMB-Video | 1.63 | 1.30 | 1.68 | 1.79 | 1.94 |
MLVU (val) | - | - | 68.9 | 70.2 | 77.6 |
LVBench (val) | 66.7 | 64.0 | 60.0 | 56.0 | 66.4 |
Video-MME (w/o subtitle) | 71.9 | 75.0 | 64.2 | 65.1 | 72.4 |
Video-MME (w/ subtitle) | 77.2 | 81.3 | 66.9 | 71.6 | 75.7 |
CG-Bench (Clue) | 58.6 | 50.9 | - | 44.5 | 55.8 |
CG-Bench (Long) | 44.9 | 37.8 | - | 35.5 | 46.6 |
CG-Bench (mIoU) | 5.73 | 3.85 | - | 2.48 | 13.4 |
HourVideo (Dev) | - | 37.2 | - | - | 44.5 |
HourVideo (Test) | - | 37.4 | - | - | 41.8 |
Charades-STA (mIoU) | 35.7 | - | - | 43.6 | 65.9 |
HD-EPIC | - | 37.6 | - | - | 42.9 |
HRVideoBench | - | - | - | - | 68.5 |
EgoPlan (val) | - | - | - | - | 45.3 |
Embodied Benchmarks
Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
---|---|---|---|---|---|
OpenEQA | - | - | - | - | 63.5 |
ERQA | 47.0 | 41.8 | - | - | 38.3 |
EgoPlan (val) | - | - | - | - | 45.3 |
Image Benchmarks
Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
---|---|---|---|---|---|
DocVQA (test) | 92.8 | 93.1 | 93.0 | 95.7 | 94.1 |
ChartQA (test) | 85.7 | 87.2 | 84.8 | 87.3 | 87.5 |
InfoVQA (test) | 79.2 | 81.0 | 77.6 | 82.6 | 80.4 |
TextVQA (val) | 77.4 | 78.8 | 79.1 | 84.9 | 83.7 |
OCRBench (test) | 736 | 754 | 822 | 864 | 869 |
MMStar (test) | 64.7 | 59.1 | 62.8 | 63.9 | 66.2 |
RWQA (test) | 75.4 | 67.5 | 70.1 | 68.5 | 76.7 |
AI2D (test) | 84.6 | 79.1 | 84.5 | 83.9 | 84.5 |
MMMU (val) | 69.1 | 62.2 | 56.0 | 58.6 | 55.8 |
MMBench_V1.1 (test) | 83.1 | 74.6 | 83.2 | 82.6 | 81.7 |
MMVet (GPT-4-Turbo) | 69.1 | 64.0 | 62.8 | 67.1 | 62.9 |
HallBench (avg) | 55.0 | 45.6 | 50.1 | 52.9 | 54.7 |
MathVista (testmini) | 63.8 | 63.9 | 64.4 | 68.2 | 67.8 |
Avg Score | 74.9 | 71.7 | 73.1 | 75.6 | 75.6 |
All numbers are directly extracted from Table 2 and Table 3 of the Eagle 2.5 Tech Report.
Eagle 2.5-8B matches or surpasses the performance of much larger models on long-context video and image benchmarks.
📄 License
- License: other
- License Name: nsclv1