Eagle2 9B
Eagle2-9B是NVIDIA发布的最新视觉语言模型(VLM),在性能和推理速度之间实现了完美平衡。它基于Qwen2.5-7B-Instruct语言模型和Siglip+ConvNext视觉模型构建,支持多语言和多模态任务。
下载量 944
发布时间 : 1/10/2025
模型简介
Eagle2-9B是一个高性能的开源视觉语言模型,专注于从数据中心视角优化VLM后训练。它通过结合稳健的训练方案和模型设计,在多项基准测试中表现出色。
模型特点
高性能平衡
在8.9B参数规模下实现了性能与推理速度的完美平衡
多模态支持
支持文本、图像和视频输入,处理多种模态信息
长上下文处理
支持长达16K的上下文长度
基准测试领先
在多个视觉语言基准测试中表现优于同类模型
模型能力
图像理解
文本生成
多模态对话
文档问答
图表理解
视频分析
使用案例
文档处理
DocVQA文档问答
从文档图像中提取信息并回答问题
在DocVQA测试集上达到92.6分
视觉问答
TextVQA文本视觉问答
回答关于图像中文本内容的问题
在TextVQA验证集上达到83.0分
图表理解
ChartQA图表问答
理解和回答基于图表数据的问题
在ChartQA测试集上达到86.4分
🚀 Eagle-2
Eagle-2是一款最新的视觉语言模型,它结合了多种基础模型的优势,在多语言环境下具有出色的性能。该模型旨在缩小开源视觉语言模型与专有模型之间的差距,通过公开数据策略和实现细节,促进社区的可重复性和创新。
[📂 GitHub] [📜 Eagle2 Tech Report] [🗨️ Chat Demo] [🤗 HF Demo]
🚀 快速开始
我们提供了一个演示推理脚本,帮助你快速开始使用该模型。我们支持以下不同的输入类型:
- 纯文本输入
- 单张图像输入
- 多张图像输入
- 视频输入
0. 安装依赖项
pip install transformers==4.37.2
pip install flash-attn
⚠️ 重要提示
最新版本的
transformers
与该模型不兼容。
1. 准备模型工作器
点击展开
"""
A model worker executes the model.
Copied and modified from https://github.com/OpenGVLab/InternVL/blob/main/streamlit_demo/model_worker.py
"""
# Importing torch before transformers can cause `segmentation fault`
from transformers import AutoModel, AutoTokenizer, TextIteratorStreamer, AutoConfig
import argparse
import base64
import json
import os
import decord
import threading
import time
from io import BytesIO
from threading import Thread
import math
import requests
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
import numpy as np
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
SIGLIP_MEAN = (0.5, 0.5, 0.5)
SIGLIP_STD = (0.5, 0.5, 0.5)
def get_seq_frames(total_num_frames, desired_num_frames=-1, stride=-1):
"""
Calculate the indices of frames to extract from a video.
Parameters:
total_num_frames (int): Total number of frames in the video.
desired_num_frames (int): Desired number of frames to extract.
Returns:
list: List of indices of frames to extract.
"""
assert desired_num_frames > 0 or stride > 0 and not (desired_num_frames > 0 and stride > 0)
if stride > 0:
return list(range(0, total_num_frames, stride))
# Calculate the size of each segment from which a frame will be extracted
seg_size = float(total_num_frames - 1) / desired_num_frames
seq = []
for i in range(desired_num_frames):
# Calculate the start and end indices of each segment
start = int(np.round(seg_size * i))
end = int(np.round(seg_size * (i + 1)))
# Append the middle index of the segment to the list
seq.append((start + end) // 2)
return seq
def build_video_prompt(meta_list, num_frames, time_position=False):
# if time_position is True, the frame_timestamp is used.
# 1. pass time_position, 2. use env TIME_POSITION
time_position = os.environ.get("TIME_POSITION", time_position)
prefix = f"This is a video:\n"
for i in range(num_frames):
if time_position:
frame_txt = f"Frame {i+1} sampled at {meta_list[i]:.2f} seconds: <image>\n"
else:
frame_txt = f"Frame {i+1}: <image>\n"
prefix += frame_txt
return prefix
def load_video(video_path, num_frames=64, frame_cache_root=None):
if isinstance(video_path, str):
video = decord.VideoReader(video_path)
elif isinstance(video_path, dict):
assert False, 'we not support vidoe: "video_path" as input'
fps = video.get_avg_fps()
sampled_frames = get_seq_frames(len(video), num_frames)
samepld_timestamps = [i / fps for i in sampled_frames]
frames = video.get_batch(sampled_frames).asnumpy()
images = [Image.fromarray(frame) for frame in frames]
return images, build_video_prompt(samepld_timestamps, len(images), time_position=True)
def load_image(image):
if isinstance(image, str) and os.path.exists(image):
return Image.open(image)
elif isinstance(image, dict):
if 'disk_path' in image:
return Image.open(image['disk_path'])
elif 'base64' in image:
return Image.open(BytesIO(base64.b64decode(image['base64'])))
elif 'url' in image:
response = requests.get(image['url'])
return Image.open(BytesIO(response.content))
elif 'bytes' in image:
return Image.open(BytesIO(image['bytes']))
else:
raise ValueError(f'Invalid image: {image}')
else:
raise ValueError(f'Invalid image: {image}')
def build_transform(input_size, norm_type='imagenet'):
if norm_type == 'imagenet':
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
elif norm_type == 'siglip':
MEAN, STD = SIGLIP_MEAN, SIGLIP_STD
transform = T.Compose([
T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
T.ToTensor(),
T.Normalize(mean=MEAN, std=STD)
])
return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
"""
previous version mainly foucs on ratio.
We also consider area ratio here.
"""
best_factor = float('-inf')
best_ratio = (1, 1)
area = width * height
for ratio in target_ratios:
target_aspect_ratio = ratio[0] / ratio[1]
ratio_diff = abs(aspect_ratio - target_aspect_ratio)
area_ratio = (ratio[0]*ratio[1]*image_size*image_size)/ area
"""
new area > 60% of original image area is enough.
"""
factor_based_on_area_n_ratio = min((ratio[0]*ratio[1]*image_size*image_size)/ area, 0.6)* \
min(target_aspect_ratio/aspect_ratio, aspect_ratio/target_aspect_ratio)
if factor_based_on_area_n_ratio > best_factor:
best_factor = factor_based_on_area_n_ratio
best_ratio = ratio
return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
orig_width, orig_height = image.size
aspect_ratio = orig_width / orig_height
# calculate the existing image aspect ratio
target_ratios = set(
(i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
i * j <= max_num and i * j >= min_num)
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
# find the closest aspect ratio to the target
target_aspect_ratio = find_closest_aspect_ratio(
aspect_ratio, target_ratios, orig_width, orig_height, image_size)
# calculate the target width and height
target_width = image_size * target_aspect_ratio[0]
target_height = image_size * target_aspect_ratio[1]
blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
# resize the image
resized_img = image.resize((target_width, target_height))
processed_images = []
for i in range(blocks):
box = (
(i % (target_width // image_size)) * image_size,
(i // (target_width // image_size)) * image_size,
((i % (target_width // image_size)) + 1) * image_size,
((i // (target_width // image_size)) + 1) * image_size
)
# split the image
split_img = resized_img.crop(box)
processed_images.append(split_img)
assert len(processed_images) == blocks
if use_thumbnail and len(processed_images) != 1:
thumbnail_img = image.resize((image_size, image_size))
processed_images.append(thumbnail_img)
return processed_images
def split_model(model_path, device):
device_map = {}
world_size = torch.cuda.device_count()
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
num_layers = config.llm_config.num_hidden_layers
print('world_size', world_size)
num_layers_per_gpu_ = math.floor(num_layers / (world_size - 1))
num_layers_per_gpu = [num_layers_per_gpu_] * world_size
num_layers_per_gpu[device] = num_layers - num_layers_per_gpu_ * (world_size-1)
print(num_layers_per_gpu)
layer_cnt = 0
for i, num_layer in enumerate(num_layers_per_gpu):
for j in range(num_layer):
device_map[f'language_model.model.layers.{layer_cnt}'] = i
layer_cnt += 1
device_map['vision_model'] = device
device_map['mlp1'] = device
device_map['language_model.model.tok_embeddings'] = device
device_map['language_model.model.embed_tokens'] = device
device_map['language_model.output'] = device
device_map['language_model.model.norm'] = device
device_map['language_model.lm_head'] = device
device_map['language_model.model.rotary_emb'] = device
device_map[f'language_model.model.layers.{num_layers - 1}'] = device
return device_map
class ModelWorker:
def __init__(self, model_path, model_name,
load_8bit, device):
if model_path.endswith('/'):
model_path = model_path[:-1]
if model_name is None:
model_paths = model_path.split('/')
if model_paths[-1].startswith('checkpoint-'):
self.model_name = model_paths[-2] + '_' + model_paths[-1]
else:
self.model_name = model_paths[-1]
else:
self.model_name = model_name
print(f'Loading the model {self.model_name}')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
tokens_to_keep = ['<box>', '</box>', '<ref>', '</ref>']
tokenizer.additional_special_tokens = [item for item in tokenizer.additional_special_tokens if item not in tokens_to_keep]
self.tokenizer = tokenizer
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model_type = config.vision_config.model_type
self.device = torch.cuda.current_device()
if model_type == 'siglip_vision_model':
self.norm_type = 'siglip'
elif model_type == 'MOB':
self.norm_type = 'siglip'
else:
self.norm_type = 'imagenet'
if any(x in model_path.lower() for x in ['34b']):
device_map = split_model(model_path, self.device)
else:
device_map = None
if device_map is not None:
self.model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True,
device_map=device_map,
trust_remote_code=True,
load_in_8bit=load_8bit).eval()
else:
self.model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16,
trust_remote_code=True,
load_in_8bit=load_8bit).eval()
if not load_8bit and device_map is None:
self.model = self.model.to(device)
self.load_8bit = load_8bit
self.model_path = model_path
self.image_size = self.model.config.force_image_size
self.context_len = tokenizer.model_max_length
self.per_tile_len = 256
def reload_model(self):
del self.model
torch.cuda.empty_cache()
if self.device == 'auto':
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
# This can make distributed deployment work properly
self.model = AutoModel.from_pretrained(
self.model_path,
load_in_8bit=self.load_8bit,
torch_dtype=torch.bfloat16,
device_map=self.device_map,
trust_remote_code=True).eval()
else:
self.model = AutoModel.from_pretrained(
self.model_path,
load_in_8bit=self.load_8bit,
torch_dtype=torch.bfloat16,
trust_remote_code=True).eval()
if not self.load_8bit and not self.device == 'auto':
self.model = self.model.cuda()
@torch.inference_mode()
def generate(self, params):
system_message = params['prompt'][0]['content']
send_messages = params['prompt'][1:]
max_input_tiles = params['max_input_tiles']
temperature = params['temperature']
top_p = params['top_p']
max_new_tokens = params['max_new_tokens']
repetition_penalty = params['repetition_penalty']
video_frame_num = params.get('video_frame_num', 64)
do_sample = True if temperature > 0.0 else False
global_image_cnt = 0
history, pil_images, max_input_tile_list = [], [], []
for message in send_messages:
if message['role'] == 'user':
prefix = ''
if 'image' in message:
for image_data in message['image']:
pil_images.append(load_image(image_data))
prefix = prefix + f'<image {global_image_cnt + 1}><image>\n'
global_image_cnt += 1
max_input_tile_list.append(max_input_tiles)
if 'video' in message:
for video_data in message['video']:
video_frames, tmp_prefix = load_video(video_data, num_frames=video_frame_num)
pil_images.extend(video_frames)
prefix = prefix + tmp_prefix
global_image_cnt += len(video_frames)
max_input_tile_list.extend([1] * len(video_frames))
content = prefix + message['content']
history.append([content, ])
else:
history[-1].append(message['content'])
question, history = history[-1][0], history[:-1]
if global_image_cnt == 1:
question = question.replace('<image 1><image>\n', '<image>\n')
history = [[item[0].replace('<image 1><image>\n', '<image>\n'), item[1]] for item in history]
try:
assert len(max_input_tile_list) == len(pil_images), 'The number of max_input_tile_list and pil_images should be the same.'
except Exception as e:
from IPython import embed; embed()
exit()
print(f'Error: {e}')
print(f'max_input_tile_list: {max_input_tile_list}, pil_images: {pil_images}')
# raise e
old_system_message = self.model.system_message
self.model.system_message = system_message
transform = build_transform(input_size=self.image_size, norm_type=self.norm_type)
if len(pil_images) > 0:
max_input_tiles_limited_by_contect = params['max_input_tiles']
while True:
image_tiles = []
for current_max_input_tiles, pil_image in zip(max_input_tile_list, pil_images):
if self.model.config.dynamic_image_size:
tiles = dynamic_preprocess(
pil_image, image_size=self.image_size, max_num=min(current_max_input_tiles, max_input_tiles_limited_by_contect),
use_thumbnail=self.model.config.use_thumbnail)
else:
tiles = [pil_image]
image_tiles += tiles
if (len(image_tiles) * self.per_tile_len < self.context_len):
break
else:
max_input_tiles_limited_by_contect -= 2
if max_input_tiles_limited_by_contect < 1:
break
pixel_values = [transform(item) for item in image_tiles]
pixel_values = torch.stack(pixel_values).to(self.model.device, dtype=torch.bfloat16)
print(f'Split images to {pixel_values.shape}')
else:
pixel_values = None
generation_config = dict(
num_beams=1,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
temperature=temperature,
repetition_penalty=repetition_penalty,
max_length=self.context_len,
top_p=top_p,
)
response = self.model.chat(
tokenizer=self.tokenizer,
pixel_values=pixel_values,
question=question,
history=history,
return_history=False,
generation_config=generation_config,
)
self.model.system_message = old_system_message
return {'text': response, 'error_code': 0}
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('--model-path', type=str, default='nvidia/Eagle2-9B')
parser.add_argument('--model-name', type=str, default='Eagle2-9B')
parser.add_argument('--device', type=str, default='cuda')
parser.add_argument('--load-8bit', action='store_true')
args = parser.parse_args()
print(f'args: {args}')
worker = ModelWorker(
args.model_path,
args.model_name,
args.load_8bit,
args.device)
2. 准备提示信息
- 单张图像输入
prompt = [
{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': 'Describe this image in details.',
'image':[
{'url': 'https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png'}
],
}
]
- 多张图像输入
prompt = [
{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': 'Describe these two images in details.',
'image':[
{'url': 'https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png'},
{'url': 'https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-vert-500x200-2c50-d@2x.png'}
],
}
]
- 视频输入
prompt = [
{'role': 'system', 'content': 'You are a helpful assistant.'},
{'role': 'user', 'content': 'Describe this video in details.',
'video':[
'path/to/your/video.mp4'
],
}
]
3. 生成响应
params = {
'prompt': prompt,
'max_input_tiles': 24,
'temperature': 0.7,
'top_p': 1.0,
'max_new_tokens': 4096,
'repetition_penalty': 1.0,
}
worker.generate(params)
✨ 主要特性
我们很高兴发布最新的Eagle2系列视觉语言模型。开源视觉语言模型(VLM)在缩小与专有模型的差距方面取得了显著进展。然而,关于数据策略和实现的关键细节往往缺失,限制了可重复性和创新。在这个项目中,我们从以数据为中心的角度专注于VLM的后训练,分享从头构建有效数据策略的见解。通过将这些策略与强大的训练方法和模型设计相结合,我们推出了Eagle2,一系列性能出色的VLM。我们的工作旨在使开源社区能够以透明的流程开发具有竞争力的VLM。
📦 模型库
我们提供以下模型:
模型名称 | 大语言模型 | 视觉模型 | 最大长度 | Hugging Face链接 |
---|---|---|---|---|
Eagle2-1B | Qwen2.5-0.5B-Instruct | Siglip | 16K | 🤗 link |
Eagle2-2B | Qwen2.5-1.5B-Instruct | Siglip | 16K | 🤗 link |
Eagle2-9B | Qwen2.5-7B-Instruct | Siglip+ConvNext | 16K | 🤗 link |
📚 详细文档
基准测试结果
基准测试 | MiniCPM-Llama3-V-2_5 | InternVL-Chat-V1-5 | InternVL2-8B | QwenVL2-7B | Eagle2-9B |
---|---|---|---|---|---|
模型大小 | 8.5B | 25.5B | 8.1B | 8.3B | 8.9B |
DocVQAtest | 84.8 | 90.9 | 91.6 | 94.5 | 92.6 |
ChartQAtest | - | 83.8 | 83.3 | 83.0 | 86.4 |
InfoVQAtest | - | 72.5 | 74.8 | 74.3 | 77.2 |
TextVQAval | 76.6 | 80.6 | 77.4 | 84.3 | 83.0 |
OCRBench | 725 | 724 | 794 | 845 | 868 |
MMEsum | 2024.6 | 2187.8 | 2210.3 | 2326.8 | 2260 |
RealWorldQA | 63.5 | 66.0 | 64.4 | 70.1 | 69.3 |
AI2Dtest | 78.4 | 80.7 | 83.8 | - | 83.9 |
MMMUval | 45.8 | 45.2 / 46.8 | 49.3 / 51.8 | 54.1 | 56.1 |
MMBench_V11test | 79.5 | 79.4 | 80.6 | ||
MMVetGPT-4-Turbo | 52.8 | 55.4 | 54.2 | 62.0 | 62.2 |
SEED-Image | 72.3 | 76.0 | 76.2 | 77.1 | |
HallBenchavg | 42.4 | 49.3 | 45.2 | 50.6 | 49.3 |
MathVistatestmini | 54.3 | 53.5 | 58.3 | 58.2 | 63.8 |
MMstar | - | - | 60.9 | 60.7 | 62.6 |
📄 许可证
- 代码根据Apache 2.0许可证发布。
- 预训练模型权重根据知识共享署名-非商业性使用 4.0 国际许可协议发布。
- 该服务仅供研究预览,仅用于非商业用途,并受以下许可证和条款的约束:
- Qwen2.5-7B-Instruct的模型许可证:Apache-2.0
- PaliGemma的模型许可证:Gemma许可证
📚 引用
待补充
🔧 道德考量
NVIDIA认为可信AI是一项共同责任,我们已经制定了政策和实践,以支持广泛的AI应用开发。当开发者按照我们的服务条款下载或使用该模型时,应与内部模型团队合作,确保该模型满足相关行业和用例的要求,并解决意外的产品滥用问题。
请在此报告安全漏洞或NVIDIA AI相关问题。
📋 待办事项
- [ ] 支持vLLM推理
- [ ] 提供AWQ量化权重
- [ ] 提供微调脚本
📊 模型信息
属性 | 详情 |
---|---|
模型类型 | 视觉语言模型 |
基础模型 | google/paligemma-3b-mix-448、Qwen/Qwen2.5-7B-Instruct、google/siglip-so400m-patch14-384、timm/convnext_xxlarge.clip_laion2b_soup_ft_in1k |
基础模型关系 | 合并 |
语言支持 | 多语言 |
标签 | eagle、VLM |
许可证 | cc-by-nc-4.0 |
任务类型 | 图像文本到文本 |
库名称 | transformers |
Clip Vit Large Patch14
CLIP是由OpenAI开发的视觉-语言模型,通过对比学习将图像和文本映射到共享的嵌入空间,支持零样本图像分类
图像生成文本
C
openai
44.7M
1,710
Clip Vit Base Patch32
CLIP是由OpenAI开发的多模态模型,能够理解图像和文本之间的关系,支持零样本图像分类任务。
图像生成文本
C
openai
14.0M
666
Siglip So400m Patch14 384
Apache-2.0
SigLIP是基于WebLi数据集预训练的视觉语言模型,采用改进的sigmoid损失函数,优化了图像-文本匹配任务。
图像生成文本
Transformers

S
google
6.1M
526
Clip Vit Base Patch16
CLIP是由OpenAI开发的多模态模型,通过对比学习将图像和文本映射到共享的嵌入空间,实现零样本图像分类能力。
图像生成文本
C
openai
4.6M
119
Blip Image Captioning Base
Bsd-3-clause
BLIP是一个先进的视觉-语言预训练模型,擅长图像描述生成任务,支持条件式和非条件式文本生成。
图像生成文本
Transformers

B
Salesforce
2.8M
688
Blip Image Captioning Large
Bsd-3-clause
BLIP是一个统一的视觉-语言预训练框架,擅长图像描述生成任务,支持条件式和无条件式图像描述生成。
图像生成文本
Transformers

B
Salesforce
2.5M
1,312
Openvla 7b
MIT
OpenVLA 7B是一个基于Open X-Embodiment数据集训练的开源视觉-语言-动作模型,能够根据语言指令和摄像头图像生成机器人动作。
图像生成文本
Transformers 英语

O
openvla
1.7M
108
Llava V1.5 7b
LLaVA 是一款开源多模态聊天机器人,基于 LLaMA/Vicuna 微调,支持图文交互。
图像生成文本
Transformers

L
liuhaotian
1.4M
448
Vit Gpt2 Image Captioning
Apache-2.0
这是一个基于ViT和GPT2架构的图像描述生成模型,能够为输入图像生成自然语言描述。
图像生成文本
Transformers

V
nlpconnect
939.88k
887
Blip2 Opt 2.7b
MIT
BLIP-2是一个视觉语言模型,结合了图像编码器和大型语言模型,用于图像到文本的生成任务。
图像生成文本
Transformers 英语

B
Salesforce
867.78k
359
精选推荐AI模型
Llama 3 Typhoon V1.5x 8b Instruct
专为泰语设计的80亿参数指令模型,性能媲美GPT-3.5-turbo,优化了应用场景、检索增强生成、受限生成和推理任务
大型语言模型
Transformers 支持多种语言

L
scb10x
3,269
16
Cadet Tiny
Openrail
Cadet-Tiny是一个基于SODA数据集训练的超小型对话模型,专为边缘设备推理设计,体积仅为Cosmo-3B模型的2%左右。
对话系统
Transformers 英语

C
ToddGoldfarb
2,691
6
Roberta Base Chinese Extractive Qa
基于RoBERTa架构的中文抽取式问答模型,适用于从给定文本中提取答案的任务。
问答系统 中文
R
uer
2,694
98