Ovis2-1B-dev: an open-source multimodal large language model with strong video and multi-image processing and enhanced reasoning

Ovis2 1B Dev

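Developed by Isotr0py

Ovis2-1B is the latest member of the Ovis series of multimodal large language models (MLLMs), which focus on structurally aligning visual and textual embeddings. Its main features are strong performance at a small model size, enhanced reasoning, video and multi-image processing, and improved multilingual OCR.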

Image-text-to-text

Transformers

Supports multiple languages | License: Apache-2.0 | #Multimodal large language model #Vision-text alignment #Multilingual OCR enhancement

Downloads: 79

Released: 4/9/2025

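Model overview

Ovis2-1B is a multimodal large language model released by AIDC-AI, designed to structurally align visual and textual embeddings. As the successor to Ovis1.6, Ovis2 brings significant improvements in both data construction and training methods, and it is particularly suited to complex visual information and multilingual OCR tasks.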

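Model features

Small model, high performance

Optimized training strategies give small-scale models a higher capability density, so they can compete with models from larger size tiers.

Enhanced reasoning

Combining instruction tuning with preference learning significantly strengthens chain-of-thought (CoT) reasoning.

Video and multi-image processing

Video and multi-image data are included in training, which improves the handling of complex visual information across frames and images.

Multilingual OCR enhancement

Multilingual OCR is improved beyond English and Chinese, along with the extraction of structured data from complex visual elements such as tables and charts.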

Model capabilities

Image understanding

Text generation

Video understanding

Multi-image analysis

Multilingual OCR

Complex reasoning

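Use cases

Visual question answering

Image captioning

Produces a detailed description of an input image

Scores 68.4 on the MMBench-V1.1 test set

Visual reasoning

Performs logical reasoning over image content

Scores 59.4 on MathVista testmini

Document understanding

Table data extraction

Extracts structured data from complex tables

Scores 89.0 on OCRBench

Video understanding

Video content analysis

Understands actions and scenes in videos

Scores 49.5 on VideoMME (with subtitles)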

🚀 Ovis2-1B

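Ovis2-1B is a multimodal large language model that inherits the architecture of the Ovis series and significantly improves dataset curation and training methods. Its features include strong performance at a small size, enhanced reasoning, video and multi-image support, and multilingual OCR.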

🚀 Quick start

Follow the steps below to use the Ovis2-1B model: install the dependencies, then run the inference script.

pip install torch==2.4.0 transformers==4.46.2 numpy==1.25.0 pillow==10.3.0
pip install flash-attn==2.7.0.post2 --no-build-isolation

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# Single-image input
image_path = '/data/images/example_1.jpg'
images = [Image.open(image_path)]
max_partition = 9
text = 'Describe the image.'
query = f'<image>\n{text}'

## Chain-of-thought style input
# cot_suffix = "Provide a step-by-step solution to the problem, and conclude with 'the answer is' followed by the final solution."
# image_path = '/data/images/example_1.jpg'
# images = [Image.open(image_path)]
# max_partition = 9
# text = "What's the area of the shape?"
# query = f'<image>\n{text}\n{cot_suffix}'

## Multi-image input
# image_paths = [
#     '/data/images/example_1.jpg',
#     '/data/images/example_2.jpg',
#     '/data/images/example_3.jpg'
# ]
# images = [Image.open(image_path) for image_path in image_paths]
# max_partition = 4
# text = 'Describe each image.'
# query = '\n'.join([f'Image {i+1}: <image>' for i in range(len(images))]) + '\n' + text

## Video input (requires `pip install moviepy==1.0.3`)
# from moviepy.editor import VideoFileClip
# video_path = '/data/videos/example_1.mp4'
# num_frames = 12
# max_partition = 1
# text = 'Describe the video.'
# with VideoFileClip(video_path) as clip:
#     total_frames = int(clip.fps * clip.duration)
#     if total_frames <= num_frames:
#         sampled_indices = range(total_frames)
#     else:
#         stride = total_frames / num_frames
#         sampled_indices = [min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2)) for i in range(num_frames)]
#     frames = [clip.get_frame(index / clip.fps) for index in sampled_indices]
#     frames = [Image.fromarray(frame, mode='RGB') for frame in frames]
# images = frames
# query = '\n'.join(['<image>'] * len(images)) + '\n' + text

## Text-only input
# images = []
# max_partition = None
# text = 'Hello'
# query = text

# Format the conversation
prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
input_ids = input_ids.unsqueeze(0).to(device=model.device)
attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
if pixel_values is not None:
    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
pixel_values = [pixel_values]

# Generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
    output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
    print(f'Output:\n{output}')
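
The commented-out input variants in the script above differ only in how the query string is built and, for video, in which frames are sampled. The following is a minimal standalone sketch of those two steps in plain Python. It is not part of the Ovis API: the helper names `sample_frame_indices` and `build_query` are hypothetical, and the logic simply mirrors the expressions used in the script (one `<image>` placeholder per image or frame, and the midpoint of each of `num_frames` equal segments of the clip).

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list:
    """Return up to `num_frames` frame indices spread evenly over a clip."""
    if total_frames <= num_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    stride = total_frames / num_frames
    # Take the midpoint of each of the `num_frames` equal segments,
    # clamped so the index never runs past the last frame.
    return [
        min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2))
        for i in range(num_frames)
    ]


def build_query(text: str, num_images: int, numbered: bool = False) -> str:
    """Prefix `text` with one `<image>` placeholder per image or frame."""
    if num_images == 0:
        # Text-only input: the query is just the text.
        return text
    if numbered:
        # Multi-image style: "Image 1: <image>", "Image 2: <image>", ...
        tags = [f"Image {i + 1}: <image>" for i in range(num_images)]
    else:
        # Single-image and video style: bare placeholders, one per line.
        tags = ["<image>"] * num_images
    return "\n".join(tags) + "\n" + text


if __name__ == "__main__":
    print(sample_frame_indices(total_frames=120, num_frames=12))
    print(repr(build_query("Describe each image.", 3, numbered=True)))
```

Keeping these steps as pure functions makes them easy to check without loading the model or decoding a video.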

Batch inference

import torch
from PIL import Image
from transformers import AutoModelForCausalLM

# Load the model
model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
                                             torch_dtype=torch.bfloat16,
                                             multimodal_max_length=32768,
                                             trust_remote_code=True).cuda()
text_tokenizer = model.get_text_tokenizer()
visual_tokenizer = model.get_visual_tokenizer()

# Preprocess the inputs
batch_inputs = [
    ('/data/images/example_1.jpg', 'What colors dominate the image?'),
    ('/data/images/example_2.jpg', 'What objects are depicted in this image?'),
    ('/data/images/example_3.jpg', 'Is there any text in the image?')
]

batch_input_ids = []
batch_attention_mask = []
batch_pixel_values = []

for image_path, text in batch_inputs:
    image = Image.open(image_path)
    query = f'<image>\n{text}'
    prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
    batch_input_ids.append(input_ids.to(device=model.device))
    batch_attention_mask.append(attention_mask.to(device=model.device))
    batch_pixel_values.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))

batch_input_ids = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_input_ids], batch_first=True,
                                                  padding_value=0.0).flip(dims=[1])
batch_input_ids = batch_input_ids[:, -model.config.multimodal_max_length:]
batch_attention_mask = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_attention_mask],
                                                       batch_first=True, padding_value=False).flip(dims=[1])
batch_attention_mask = batch_attention_mask[:, -model.config.multimodal_max_length:]

# Generate output
with torch.inference_mode():
    gen_kwargs = dict(
        max_new_tokens=1024,
        do_sample=False,
        top_p=None,
        top_k=None,
        temperature=None,
        repetition_penalty=None,
        eos_token_id=model.generation_config.eos_token_id,
        pad_token_id=text_tokenizer.pad_token_id,
        use_cache=True
    )
    output_ids = model.generate(batch_input_ids, pixel_values=batch_pixel_values, attention_mask=batch_attention_mask,
                                **gen_kwargs)

for i in range(len(batch_inputs)):
    output = text_tokenizer.decode(output_ids[i], skip_special_tokens=True)
    print(f'Output {i + 1}:\n{output}\n')
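
The batch script left-pads each sequence by flipping it, right-padding with `pad_sequence`, and flipping the result back. This makes every prompt end at the last position, which is where a decoder-only model continues generating. The plain-Python sketch below illustrates the same layout without PyTorch. `left_pad_batch` is a hypothetical helper, and it assumes the unpadded sequences contain no pad tokens, so every real token is marked `True` in the mask.

```python
def left_pad(sequences, pad_value):
    """Pad every sequence on the left up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + list(seq) for seq in sequences]


def left_pad_batch(batch_ids, pad_id=0, max_length=None):
    """Return left-padded token ids and the matching attention mask."""
    ids = left_pad(batch_ids, pad_id)
    # Real tokens are True, padding positions are False.
    mask = left_pad([[True] * len(seq) for seq in batch_ids], False)
    if max_length is not None:
        # Keep only the last `max_length` positions, like `[:, -max_length:]`.
        ids = [row[-max_length:] for row in ids]
        mask = [row[-max_length:] for row in mask]
    return ids, mask


if __name__ == "__main__":
    ids, mask = left_pad_batch([[5, 6, 7], [8]], pad_id=0)
    print(ids)
    print(mask)
```

With right padding the shorter prompts would end in pad tokens, and generation would continue from those instead of from the prompt.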

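✨ Key features

Small model, high performance: optimized training strategies give small models a higher capability density, so they can compete with models from larger size tiers.
Enhanced reasoning: combining instruction tuning with preference learning significantly strengthens chain-of-thought (CoT) reasoning.
Video and multi-image processing: video and multi-image data are included in training, which improves the handling of complex visual information across frames and images.
Multilingual support and OCR: multilingual OCR is improved beyond English and Chinese, along with the extraction of structured data from complex visual elements such as tables and charts.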

📚 Documentation

Model zoo

Ovis MLLM	ViT	LLM	Model weights	Demo
Ovis2-1B	aimv2-large-patch14-448	Qwen2.5-0.5B-Instruct	Huggingface	Space
Ovis2-2B	aimv2-large-patch14-448	Qwen2.5-1.5B-Instruct	Huggingface	Space
Ovis2-4B	aimv2-huge-patch14-448	Qwen2.5-3B-Instruct	Huggingface	Space
Ovis2-8B	aimv2-huge-patch14-448	Qwen2.5-7B-Instruct	Huggingface	Space
Ovis2-16B	aimv2-huge-patch14-448	Qwen2.5-14B-Instruct	Huggingface	Space
Ovis2-34B	aimv2-1B-patch14-448	Qwen2.5-32B-Instruct	Huggingface	-

Performance evaluation

We evaluate Ovis2 with VLMEvalKit, the toolkit also used by the OpenCompass multimodal and reasoning leaderboards.

Image benchmarks

Benchmark	Qwen2.5-VL-3B	SAIL-VL-2B	InternVL2.5-2B-MPO	Ovis1.6-3B	InternVL2.5-1B-MPO	Ovis2-1B	Ovis2-2B
MMBench-V1.1 (test)	77.1	73.6	70.7	74.1	65.8	68.4	76.9
MMStar	56.5	56.5	54.9	52.0	49.5	52.1	56.7
MMMU (val)	51.4	44.1	44.6	46.7	40.3	36.1	45.6
MathVista (testmini)	60.1	62.8	53.4	58.9	47.7	59.4	64.1
HallusionBench	48.7	45.9	40.7	43.8	34.8	45.2	50.2
AI2D	81.4	77.4	75.1	77.8	68.5	76.4	82.7
OCRBench	83.1	83.1	83.8	80.1	84.3	89.0	87.3
MMVet	63.2	44.2	64.2	57.6	47.2	50.0	58.3
MMBench (test)	78.6	77	72.8	76.6	67.9	70.2	78.9
MMT-Bench (val)	60.8	57.1	54.4	59.2	50.8	55.5	61.7
RealWorldQA	66.5	62	61.3	66.7	57	63.9	66.0
BLINK	48.4	46.4	43.8	43.8	41	44.0	47.9
QBench	74.4	72.8	69.8	75.8	63.3	71.3	76.2
ABench	75.5	74.5	71.1	75.2	67.5	71.3	76.6
MTVQA	24.9	20.2	22.6	21.1	21.7	23.7	25.6
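
As a quick reading aid for the table above, the sketch below computes the score difference between Ovis2-1B and InternVL2.5-1B-MPO on a few rows. The numbers are copied from the table. The dictionary layout and variable names are illustrative only.

```python
# (InternVL2.5-1B-MPO, Ovis2-1B) scores copied from the image benchmark table.
scores = {
    "MMBench-V1.1 (test)": (65.8, 68.4),
    "MMStar": (49.5, 52.1),
    "MMMU (val)": (40.3, 36.1),
    "MathVista (testmini)": (47.7, 59.4),
    "OCRBench": (84.3, 89.0),
}

# A positive delta means Ovis2-1B scores higher on that benchmark.
deltas = {
    name: round(ovis - internvl, 1)
    for name, (internvl, ovis) in scores.items()
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```

On these five rows Ovis2-1B is ahead on four and behind on MMMU.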

Video benchmarks

Benchmark	Qwen2.5-VL-3B	InternVL2.5-2B	InternVL2.5-1B	Ovis2-1B	Ovis2-2B
VideoMME (without/with subtitles)	61.5/67.6	51.9/54.1	50.3/52.3	48.6/49.5	57.2/60.8
MVBench	67.0	68.8	64.3	60.32	64.9
MLVU (M-Avg/G-Avg)	68.2/-	61.4/-	57.3/-	58.5/3.66	68.6/3.86
MMBench-Video	1.63	1.44	1.36	1.26	1.57
TempCompass	64.4	-	-	51.43	62.64

📄 License

This project is licensed under the Apache License, Version 2.0 (SPDX license identifier: Apache-2.0).

📚 Citation

If you find Ovis useful, please consider citing the following paper:

@article{lu2024ovis,
  title={Ovis: Structural Embedding Alignment for Multimodal Large Language Model},
  author={Shiyin Lu and Yang Li and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Han-Jia Ye},
  year={2024},
  journal={arXiv:2405.20797}
}