Eagle2.5 8B
Eagle 2.5 is a cutting-edge vision-language model (VLM) designed for long-context multimodal learning, supporting video sequences of up to 512 frames and high-resolution images.
Model Overview
Eagle 2.5 addresses the challenges of long video comprehension and high-resolution image understanding, providing a generalist framework that performs strongly across a wide range of benchmarks.
Model Features
Long-context processing ability
Supports processing video sequences up to 512 frames and high-resolution images, addressing the limitation that most existing VLMs focus on short-context tasks.
Information-first sampling
Optimizes visual and textual inputs through Image Area Preservation (IAP) and Automatic Degrade Sampling (ADS) to maximize use of the context length without losing information.
Progressive mixed post-training
Gradually increases the context length from 32K to 128K during training, enhancing the model's ability to handle inputs of different sizes (a minimal schedule sketch follows this feature list).
Diversity-driven data recipe
Combines open-source data with the self-curated Eagle-Video-110K dataset to provide rich and diverse training samples.
Efficiency optimization
Significantly improves the model's computational efficiency and inference speed through technologies such as GPU memory optimization, distributed context parallelism, video decoding acceleration, and inference acceleration.
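To make the progressive schedule concrete, here is a minimal, hypothetical sketch of a context-length curriculum in the spirit of the 32K-to-128K post-training described above. The stage boundaries (including the intermediate 64K stage), step counts, and helper names are illustrative assumptions, not the released training code.

```python
# Hypothetical sketch of a progressive context-length curriculum (illustrative only;
# the actual Eagle 2.5 recipe is described in the tech report).

CONTEXT_STAGES = [32_768, 65_536, 131_072]  # 32K -> 64K -> 128K tokens (64K stage is an assumption)

def stage_for_step(step: int, steps_per_stage: int = 10_000) -> int:
    """Return the maximum context length allowed at a given training step."""
    idx = min(step // steps_per_stage, len(CONTEXT_STAGES) - 1)
    return CONTEXT_STAGES[idx]

def pack_sample(tokens: list[int], step: int) -> list[int]:
    """Clip a packed multimodal token sequence to the budget of the current stage."""
    return tokens[: stage_for_step(step)]
```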
Model Capabilities
Long video understanding
High-resolution image understanding
Multimodal learning
Text generation
Image analysis
Video analysis
Use Cases
Video understanding
Long video content analysis
Analyzes video content up to 512 frames to extract key information and storylines.
Achieves state-of-the-art results on multiple video benchmarks.
Video question answering
Answers relevant questions based on video content.
Achieves an accuracy of 72.4% when using 512 input frames on Video-MME.
Image understanding
High-resolution image analysis
Processes high-resolution images to extract fine-grained details.
Performs strongly on multiple image benchmarks, comparable to Qwen2.5-VL.
Document understanding
Parses multi-page document content to extract key information.
Achieves an accuracy of 94.1% in the DocVQA test.
🚀 Eagle 2.5
Eagle 2.5 is a family of frontier vision-language models (VLMs) designed for long-context multimodal learning. It addresses the challenges of long video comprehension and high-resolution image understanding, offering a generalist framework for both.
[📂Homepage] [📂GitHub] [📜Tech Report] [🤗HF Demo]
🚀 Quick Start
Installation
pip install transformers==4.51.0
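The streaming example further below loads the model with `attn_implementation='flash_attention_2'`, which requires the flash-attn package; if you plan to use that code path, it can typically be installed with:
pip install flash-attn --no-build-isolation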
Usage Examples
Basic Usage
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the model and processor; trust_remote_code is required for the Eagle architecture
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

# A single-turn conversation with one image and one text prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and collect the vision inputs referenced in the messages
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Advanced Usage - Stream Generation
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel, AutoTokenizer
import torch
from transformers import TextIteratorStreamer
import threading

# flash_attention_2 requires the flash-attn package to be installed
model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

# Run generation in a background thread and consume the decoded text as it streams in
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,
    top_p=0.95,
    temperature=0.8
)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
Advanced Usage - Multiple Images
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
            },
            {"type": "text", "text": "Describe these two images."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Advanced Usage - Single Video
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Advanced Usage - Multiple Videos
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
                "nframes": 10,
            },
            {
                "type": "video",
                "video": "../Eagle2-8B/video_ocr.mp4",
                "nframes": 10,
            },
            {"type": "text", "text": "Describe these two videos respectively."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Advanced Usage - Batch Inference
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle-2.5-8B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) for messages in [messages1, messages2]]
image_inputs, video_inputs = processor.process_vision_info([messages1, messages2])
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
✨ Features
🚀Strong Results Across The Board
- SOTA on 6 out of 10 long video benchmarks
- Outperforms GPT-4o (0806) on 3/5 video tasks
- Outperforms Gemini 1.5 Pro on 4/6 video tasks
- Matches or outperforms Qwen2.5-VL-72B on multiple key datasets
- 72.4% on Video-MME with 512 input frames
- Strong image understanding with consistent improvement over Eagle 2, matching Qwen2.5-VL.
🎯Key Innovations
- Information-First Sampling:
  - Image Area Preservation (IAP): Optimizes image tiling to retain most of the original image area and aspect ratio, preserving fine-grained details.
  - Automatic Degrade Sampling (ADS): Dynamically balances visual and textual input, ensuring complete text retention while maximizing visual content within context-length constraints (a short sketch of IAP and ADS follows this list).
- Progressive Mixed Post-Training:
  - Gradually increases context length during training, enhancing the model's ability to process varying input sizes and improving information density over static sampling.
- Diversity-Driven Data Recipe:
  - Combines open-source data (human-annotated and synthetic) with the self-curated Eagle-Video-110K dataset, collected via a diversity-driven strategy and annotated with both story-level and clip-level QA pairs.
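As a rough illustration of the two sampling ideas above, the sketch below shows (a) an IAP-style search for a tile grid that preserves as much of the original image area and aspect ratio as possible, and (b) an ADS-style budget split that keeps all text tokens and spends whatever context remains on visual content. Every constant, weighting, and helper name here is a hypothetical simplification, not the released implementation.

```python
# Hypothetical sketch of IAP tile-grid selection and ADS budgeting (illustrative only).

def select_tile_grid(img_w: int, img_h: int, tile: int = 512, max_tiles: int = 12) -> tuple[int, int]:
    """Pick a (cols, rows) tiling that best preserves the image's area and aspect ratio."""
    best, best_score = (1, 1), float("-inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            canvas_w, canvas_h = cols * tile, rows * tile
            scale = min(canvas_w / img_w, canvas_h / img_h)  # fit the image inside the canvas
            area_kept = (img_w * img_h) * scale * scale / (canvas_w * canvas_h)  # canvas utilization
            aspect_err = abs(canvas_w / canvas_h - img_w / img_h)
            score = area_kept - 0.5 * aspect_err  # weighting is arbitrary in this sketch
            if score > best_score:
                best, best_score = (cols, rows), score
    return best

def ads_frame_budget(n_text_tokens: int, tokens_per_frame: int, context_len: int = 131_072) -> int:
    """Keep the full text and fill the remaining context with as many video frames as fit."""
    remaining = max(context_len - n_text_tokens, 0)
    return remaining // tokens_per_frame
```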
⚡Efficiency & Framework Optimization
- GPU Memory Optimization:
  - Integrates Triton-based fused operators that replace PyTorch's MLP, RMSNorm, and RoPE implementations.
  - Reduces GPU memory with fused linear layers + cross-entropy loss (removing intermediate logit storage) and CPU offloading of hidden states.
- Distributed Context Parallelism:
  - Adopts a two-layer communication group based on Ulysses and Ring/Context Parallelism, building on USP.
  - Implements ZigZag Llama3-style Context Parallelism with all-gather KV to reduce communication latency.
- Video Decoding Acceleration:
  - Optimizes sparse video frame sampling with rapid video metadata parsing, improving long-video decoding and reducing memory consumption (a minimal sampling sketch follows this list).
- Inference Acceleration:
  - Supports vLLM deployment with reduced memory and accelerated inference.
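For intuition, sparse frame sampling can be approximated with a standard video reader that decodes only the frames it needs. The snippet below is a minimal sketch using decord with uniform sampling; both the library choice and the sampling policy are assumptions for illustration, not the optimized decoder described above.

```python
# Minimal sketch of sparse, uniform frame sampling (assumed to use decord; illustrative only).
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 512) -> np.ndarray:
    """Decode only `num_frames` uniformly spaced frames instead of the full video."""
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    indices = np.linspace(0, total - 1, num=min(num_frames, total)).astype(int)
    return vr.get_batch(indices).asnumpy()  # (N, H, W, 3) uint8 frames
```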
📚 Documentation
Model Details
Property | Details |
---|---|
Model Type | Long-context vision-language model |
Architecture | Vision encoder: Siglip2-So400m-Patch16-512; Language model: Qwen2.5-7B-Instruct; Multimodal base architecture: LLaVA with tiling-based vision input |
Supported Inputs | Long video sequences (up to 512 frames), high-resolution images (up to 4K HD input size), multi-page documents, long text |
Training Strategy | Progressive mixed post-training, expanding from 32K to 128K context length; information-first sampling for optimal visual and textual information retention |
Training Data | Open-source video and document datasets; Eagle-Video-110K (110K long videos with dual-level annotation) |
Released Models
Model | Date | Download Link | Notes |
---|---|---|---|
Eagle2.5-8B | 2025.04.16 | HF link | Long video (512 frames), high-res support |
Video Benchmarks
Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
---|---|---|---|---|---|
MVBench (test) | - | - | 72.0 | 69.6 | 74.8 |
Perception Test (val) | - | - | - | 70.5 | 82.0 |
EgoSchema (full set) | - | 72.2 | - | 65.0 | 72.2 |
MMB-Video | 1.63 | 1.30 | 1.68 | 1.79 | 1.94 |
MLVU (val) | - | - | 68.9 | 70.2 | 77.6 |
LVBench (val) | 66.7 | 64.0 | 60.0 | 56.0 | 66.4 |
Video-MME (w/o subtitle) | 71.9 | 75.0 | 64.2 | 65.1 | 72.4 |
Video-MME (w/ subtitle) | 77.2 | 81.3 | 66.9 | 71.6 | 75.7 |
CG-Bench (Clue) | 58.6 | 50.9 | - | 44.5 | 55.8 |
CG-Bench (Long) | 44.9 | 37.8 | - | 35.5 | 46.6 |
CG-Bench (mIoU) | 5.73 | 3.85 | - | 2.48 | 13.4 |
HourVideo (Dev) | - | 37.2 | - | - | 44.5 |
HourVideo (Test) | - | 37.4 | - | - | 41.8 |
Charades-STA (mIoU) | 35.7 | - | - | 43.6 | 65.9 |
HD-EPIC | - | 37.6 | - | - | 42.9 |
HRVideoBench | - | - | - | - | 68.5 |
EgoPlan (val) | - | - | - | - | 45.3 |
Embodied Benchmarks
Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
---|---|---|---|---|---|
OpenEQA | - | - | - | - | 63.5 |
ERQA | 47.0 | 41.8 | - | - | 38.3 |
EgoPlan (val) | - | - | - | - | 45.3 |
Image Benchmarks
Benchmark | GPT-4o | Gemini-1.5 Pro | InternVL2.5-8B | Qwen2.5-VL-8B | Eagle2.5-8B |
---|---|---|---|---|---|
DocVQA (test) | 92.8 | 93.1 | 93.0 | 95.7 | 94.1 |
ChartQA (test) | 85.7 | 87.2 | 84.8 | 87.3 | 87.5 |
InfoVQA (test) | 79.2 | 81.0 | 77.6 | 82.6 | 80.4 |
TextVQA (val) | 77.4 | 78.8 | 79.1 | 84.9 | 83.7 |
OCRBench (test) | 736 | 754 | 822 | 864 | 869 |
MMStar (test) | 64.7 | 59.1 | 62.8 | 63.9 | 66.2 |
RWQA (test) | 75.4 | 67.5 | 70.1 | 68.5 | 76.7 |
AI2D (test) | 84.6 | 79.1 | 84.5 | 83.9 | 84.5 |
MMMU (val) | 69.1 | 62.2 | 56.0 | 58.6 | 55.8 |
MMBench_V1.1 (test) | 83.1 | 74.6 | 83.2 | 82.6 | 81.7 |
MMVet (GPT-4-Turbo) | 69.1 | 64.0 | 62.8 | 67.1 | 62.9 |
HallBench (avg) | 55.0 | 45.6 | 50.1 | 52.9 | 54.7 |
MathVista (testmini) | 63.8 | 63.9 | 64.4 | 68.2 | 67.8 |
Avg Score | 74.9 | 71.7 | 73.1 | 75.6 | 75.6 |
All numbers are directly extracted from Table 2 and Table 3 of the Eagle 2.5 Tech Report.
Eagle 2.5-8B matches or surpasses the performance of much larger models on long-context video and image benchmarks.
📄 License
- License: other
- License Name: nsclv1