
Eagle2.5 8B

Developed by NVIDIA
Eagle 2.5 is a cutting-edge vision-language model (VLM) designed for long-context multimodal learning, supporting the processing of video sequences up to 512 frames and high-resolution images.
Downloads 2,626
Release date: 4/12/2025

Model Overview

Eagle 2.5 addresses the challenges of long-video understanding and high-resolution image understanding with a general framework, and it performs strongly across multiple benchmarks.

Model Features

Long-context processing ability
Supports processing video sequences up to 512 frames and high-resolution images, addressing the limitation that most existing VLMs focus on short-context tasks.
Information-first sampling
Optimizes visual and text inputs through Image Area Preservation (IAP) and Automatic Degrade Sampling (ADS) so that the context length is fully used while essential information is preserved.
Progressive hybrid post-training
Gradually increases the context length from 32K to 128K tokens during training so that the model handles inputs of different sizes.
Diversity-driven data recipe
Combines open-source data with the self-curated Eagle-Video-110K dataset to provide rich and diverse training samples.
Efficiency optimization
Improves computational efficiency and inference speed through techniques such as GPU memory optimization, distributed context parallelism, video decoding acceleration, and inference acceleration.
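The relationship between context length and the number of frames that can be processed can be illustrated with a small budget calculation. This is a minimal sketch, not Eagle 2.5's actual ADS implementation: the function name `frames_that_fit`, the cost of 256 tokens per frame, the 1,024-token prompt, and reading 32K and 128K as 32 × 1024 and 128 × 1024 tokens are all assumptions made for illustration.

```python
def frames_that_fit(context_len, text_tokens, tokens_per_frame, max_frames=512):
    """Return how many video frames fit in the context once the text is kept whole.

    Hypothetical helper: the text is reserved first, then as many frames as
    the remaining budget allows are admitted, capped at max_frames.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    remaining = context_len - text_tokens
    if remaining <= 0:
        return 0  # the text alone already fills the context
    return min(max_frames, remaining // tokens_per_frame)


# Assumed numbers: a 1,024-token prompt and 256 visual tokens per frame,
# evaluated at the two context lengths named in the feature list.
for context_len in (32 * 1024, 128 * 1024):
    n = frames_that_fit(context_len, text_tokens=1024, tokens_per_frame=256)
    print(f"context={context_len} tokens -> up to {n} frames")
```

Under these assumed costs, the longer context admits roughly four times as many frames, which is the kind of input-size variation that a progressively lengthened training schedule is meant to cover.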

Model Capabilities

Long video understanding
High-resolution image understanding
Multimodal learning
Text generation
Image analysis
Video analysis

Use Cases

Video understanding
Long video content analysis
Analyzes video content up to 512 frames to extract key information and storylines.
Reaches state-of-the-art (SOTA) results on multiple video benchmarks.
Video question answering
Answers relevant questions based on video content.
Achieves 72.4% accuracy on Video-MME with 512 input frames.
Image understanding
High-resolution image analysis
Processes high-resolution images to extract fine-grained details.
Performs strongly on multiple image benchmarks, with results comparable to Qwen2.5-VL.
Document understanding
Parses multi-page document content to extract key information.
Achieves 94.1% accuracy on DocVQA.
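For the long-video use cases above, a video usually has far more frames than the 512-frame limit, so a subset has to be chosen before the model sees it. The following is a minimal sketch of uniform frame sampling, assuming evenly spaced frames are acceptable; the function name `uniform_frame_indices` is hypothetical, and the source does not state that Eagle 2.5 selects frames this way.

```python
def uniform_frame_indices(total_frames, max_frames=512):
    """Pick at most max_frames evenly spaced frame indices from a video.

    Hypothetical helper: the video is split into equal segments and the
    frame nearest the middle of each segment is taken.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        return list(range(total_frames))  # short video: keep every frame
    step = total_frames / max_frames
    return [int(step * i + step / 2) for i in range(max_frames)]


# Example: a 10-minute clip at 30 fps has 18,000 frames.
indices = uniform_frame_indices(18_000)
print(len(indices), indices[:3], indices[-1])
```

The returned indices can then be passed to whichever video decoder is in use to extract only the selected frames.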