MiniCPM-V-2_6: Open-Source Multimodal Large Language Model That Runs on Phones and Handles Image and Video Understanding

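Developed by FriendliAI

MiniCPM-V 2.6 is a powerful multimodal large language model that runs efficiently on devices such as phones and supports single-image, multi-image, and video understanding tasks.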

Image-Text-to-Text

Transformers

Other #Multimodal understanding #Efficient on-device inference #Multi-image reasoning

Downloads: 102

Release date: 3/5/2025

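Model Overview

MiniCPM-V 2.6 is a GPT-4V level multimodal large language model with leading performance, efficient processing, and a rich feature set, suited to single-image, multi-image, and video understanding tasks.

Model Features

Leading Performance

Surpasses widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.

Multi-Image Understanding and In-Context Learning

Can converse and reason over multiple images, reaching state-of-the-art performance on several benchmarks.

Video Understanding

Accepts video input, can hold a conversation about it, and provides dense captions for spatio-temporal information, outperforming GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B.

Strong OCR Capability

Reaches state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro.

Superior Efficiency

Has a friendly model size and state-of-the-art token density, which lets it support real-time video understanding efficiently on end devices such as the iPad.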

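Model Capabilities

Single-image understanding

Multi-image conversation and reasoning

Video understanding and dense captioning

High-resolution image processing

Multilingual support

In-context learning

OCR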

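Use Cases

Image Analysis

Image content description

Analyzes image content and generates a description

Accurately describes the objects and scenes in an image

Multi-image comparison

Compares the differences between several images

Identifies and describes the differences between images

Video Analysis

Video content description

Analyzes video content and generates a description

Accurately describes the actions and scene changes in a video

Document Processing

OCR

Extracts text from images

Reaches state-of-the-art performance on OCRBench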

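🚀 MiniCPM-V 2.6: A GPT-4V Level Multimodal Large Language Model for Single Image, Multi Image, and Video

MiniCPM-V 2.6 is a powerful multimodal large language model that runs efficiently on devices such as phones. It performs strongly on single-image, multi-image, and video understanding tasks, combining leading performance, efficient processing, and a rich feature set.

🚀 Quick Start

An online demo of MiniCPM-V 2.6 is available to try. For detailed usage instructions, see the project's GitHub repository.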

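✨ Key Features

Leading Performance

MiniCPM-V 2.6 achieves an average score of 65.2 on the latest version of OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8 billion parameters, it surpasses widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.

Multi-Image Understanding and In-Context Learning

The model can converse and reason over multiple images. It reaches state-of-the-art performance on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and also shows promising in-context learning capability.

Video Understanding

MiniCPM-V 2.6 accepts video input, can hold a conversation about it, and provides dense captions for spatio-temporal information. On Video-MME, both with and without subtitles, it outperforms GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B.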

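Strong OCR Capability and Other Features

High-resolution processing: handles images of any aspect ratio with up to 1.8 million pixels (for example, 1344x1344).
Leading OCR performance: reaches state-of-the-art performance on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro.
Trustworthy behavior: built on the latest [RLAIF-V](https://github.com/RLHF-V/RLAIF-V/) and VisCPM techniques, it behaves reliably, with a hallucination rate on Object HalBench significantly lower than that of GPT-4o and GPT-4V.
Multilingual support: supports English, Chinese, German, French, Italian, Korean, and other languages.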

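Superior Efficiency

Beyond its friendly size, MiniCPM-V 2.6 shows state-of-the-art token density (the number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8-million-pixel image, which is 75% fewer than most models. This directly improves inference speed, first-token latency, memory usage, and power consumption, and lets the model support real-time video understanding efficiently on end devices such as the iPad.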
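
The token density figure can be checked with simple arithmetic. The following is a minimal sketch that uses only the numbers quoted above (a 1344x1344 image and 640 visual tokens); the function name `token_density` is hypothetical, and the "implied baseline" is derived from the 75% claim rather than measured on any specific model.

```python
def token_density(width: int, height: int, visual_tokens: int) -> float:
    """Pixels encoded per visual token at a given resolution."""
    return (width * height) / visual_tokens


pixels = 1344 * 1344                      # 1,806,336 pixels, about 1.8 million
density = token_density(1344, 1344, 640)  # figures quoted in the text above
print(f"{pixels} pixels / 640 tokens = {density:.0f} pixels per token")

# "75% fewer tokens" implies a typical model would spend about 4x as many
# tokens on the same image. This is derived from the claim, not measured.
implied_baseline_tokens = 640 / (1 - 0.75)
print(f"implied baseline: {implied_baseline_tokens:.0f} tokens")
```

Fewer visual tokens per image mean a shorter sequence for the language model to process, which is consistent with the latency and memory benefits described above.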

Easy to Use

MiniCPM-V 2.6 can be used in several convenient ways:

Local CPU inference: [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpmv-main/examples/llava/README-minicpmv2.6.md) and [ollama](https://github.com/OpenBMB/ollama/tree/minicpm-v2.6) support efficient CPU inference on local devices.
Quantized models: quantized models in [int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf) formats are available in 16 sizes.
High-throughput inference: [vLLM](https://github.com/OpenBMB/MiniCPM-V/tree/main?tab=readme-ov-file#inference-with-vllm) supports high-throughput, memory-efficient inference.
Fine-tuning: the model can be fine-tuned on new domains and tasks.
Local WebUI demo: a local WebUI demo can be set up quickly with [Gradio](https://github.com/OpenBMB/MiniCPM-V/tree/main?tab=readme-ov-file#chat-with-our-demo-on-gradio).
Online demo: an online demo is available.

📦 Installation

Inference with Hugging Face transformers on NVIDIA GPUs was tested on Python 3.10 and requires the following dependencies:

Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord
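
Before loading the model, it can help to confirm that the installed packages match the pins listed above. The following is a minimal sketch using the standard-library `importlib.metadata`; the `PINNED` table and the `check_environment` helper are hypothetical names introduced for illustration, and stripping a local build suffix (such as `+cu121` on torch wheels) is an assumption about how exact a match is wanted.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list above; decord is listed without a version.
PINNED = {
    "Pillow": "10.1.0",
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "transformers": "4.40.0",
    "sentencepiece": "0.1.99",
    "decord": None,
}


def check_environment(pinned):
    """Map each package to (installed_version_or_None, expected, matches)."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Strip a local build suffix such as "+cu121" before comparing.
        base = installed.split("+")[0] if installed else None
        matches = installed is not None and (expected is None or base == expected)
        report[name] = (installed, expected, matches)
    return report


for name, (installed, expected, matches) in check_environment(PINNED).items():
    status = "ok" if matches else "check"
    print(f"{name}: installed={installed} expected={expected or 'any'} [{status}]")
```

A package flagged `check` is either missing or at a different version from the tested one; other versions may still work but were not part of the tested environment described above.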

💻 Usage Examples

Basic Usage

# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')

Advanced Usage

Multi-Image Chat

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

In-Context Few-Shot Learning

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
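
The few-shot message list above follows a regular pattern: each example contributes a user turn (image plus question) and an assistant turn (answer), and the query image comes last. The following is a minimal sketch of a builder for that layout. The helper `build_few_shot_msgs` is hypothetical and not part of the model's API, and plain strings stand in for PIL images so the sketch runs without image files or a GPU.

```python
def build_few_shot_msgs(examples, query_image, question):
    """Build a few-shot message list.

    examples: list of (image, answer) pairs shown to the model before the query.
    """
    msgs = []
    for image, answer in examples:
        msgs.append({'role': 'user', 'content': [image, question]})
        msgs.append({'role': 'assistant', 'content': [answer]})
    # The final user turn holds the image to be answered.
    msgs.append({'role': 'user', 'content': [query_image, question]})
    return msgs


# Placeholder strings stand in for PIL.Image objects here.
msgs = build_few_shot_msgs(
    [("<example1.jpg>", "2023.08.04"), ("<example2.jpg>", "2007.04.24")],
    "<test.jpg>",
    "production date",
)
for m in msgs:
    print(m['role'], m['content'])
```

In real use, the placeholders would be replaced by images opened with `Image.open(...).convert('RGB')`, and the resulting list passed as `msgs` to `model.chat` as in the example above.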

Video Chat

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES=64 # if cuda OOM set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = max(1, round(vr.get_avg_fps() / 1))  # frame stride for about 1 sampled frame per second; never 0
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path ="video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]}, 
]

# Set decode params for video
params={}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution >  448*448

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
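
The frame selection inside `encode_video` can be examined on its own, without decord or a video file. The following is a minimal sketch that follows the same logic (take roughly one frame per second, then thin uniformly down to the frame cap), with the stride clamped to at least 1. The function `select_frame_indices` and its arguments are hypothetical names introduced for illustration.

```python
def uniform_sample(items, n):
    """Pick n items spread evenly across the list, taking the middle of each bucket."""
    gap = len(items) / n
    return [items[int(i * gap + gap / 2)] for i in range(n)]


def select_frame_indices(total_frames, avg_fps, max_frames=64):
    """Return the frame indices that would be decoded for a video."""
    step = max(1, round(avg_fps))            # about one frame per second
    idx = list(range(0, total_frames, step))
    if len(idx) > max_frames:                # too many frames: thin them uniformly
        idx = uniform_sample(idx, max_frames)
    return idx


short_clip = select_frame_indices(300, 30.0)    # 10 seconds at 30 fps
long_clip = select_frame_indices(18000, 30.0)   # 10 minutes at 30 fps
print(len(short_clip), short_clip[:3])
print(len(long_clip), long_clip[:3])
```

A short clip keeps every one-second sample, while a long video is capped at `max_frames`. Lowering the cap reduces the number of images sent to the model, which is the adjustment the `MAX_NUM_FRAMES` comment suggests when CUDA runs out of memory.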

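📚 Detailed Documentation

Evaluation Results

Single-Image Evaluation

Single-image results are reported on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, and Object HalBench.

* We evaluate this benchmark using chain-of-thought prompting. ⁺ Token density: the number of pixels encoded into each visual token at maximum resolution, i.e. the number of pixels at maximum resolution divided by the number of visual tokens. Note: for proprietary models, token density is calculated from the image-encoding charging strategy defined in the official API documentation, which gives an upper-bound estimate.

Multi-Image Evaluation

Multi-image results are reported on Mantis Eval, BLINK Val, Mathverse mv, Sciverse mv, and MIRB.

* We evaluated the officially released checkpoint ourselves.

Video Evaluation

Video results are reported on Video-MME and Video-ChatGPT.

Few-shot results are also reported on TextVQA, VizWiz, VQAv2, and OK-VQA.

* denotes zero image shots and two additional text shots, following Flamingo. ⁺ We evaluate the pretrained checkpoint without SFT.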

Examples

MiniCPM-V 2.6 has been deployed on end devices. The accompanying demo video is a raw screen recording on an iPad Pro.

Inference with llama.cpp

MiniCPM-V 2.6 can run with llama.cpp. See our [llama.cpp fork](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv) for more details.

Int4 Quantized Version

Download the [int4 quantized version](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4) for lower GPU memory usage (7 GB).
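
To put the 7 GB figure in context, the memory needed for the weights alone can be estimated from the parameter count. The following is a back-of-the-envelope sketch, assuming the 8 billion parameter count stated earlier in this document and treating every parameter as stored at the same precision; the helper `weight_memory_gib` is a hypothetical name.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8e9  # parameter count as stated in this document

for label, bits in [("bfloat16", 16), ("int4", 4)]:
    gib = weight_memory_gib(NUM_PARAMS, bits)
    print(f"{label}: about {gib:.1f} GiB of weights")
```

The source does not break down the gap between the int4 weight estimate and the quoted 7 GB. Runtime overhead such as activations, and any components kept at higher precision, are plausible contributors, but that is an assumption rather than a documented figure.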

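📄 License

Model License

The code in this repository is released under the Apache-2.0 license.
Use of the MiniCPM-V series model weights must strictly follow MiniCPM Model License.md.
The models and weights of MiniCPM are completely free for academic research. After registering by filling out a "questionnaire", the MiniCPM-V 2.6 weights are also available for free commercial use.

Statement

As a multimodal large language model, MiniCPM-V 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Nothing generated by MiniCPM-V 2.6 represents the views or positions of the model developers.
We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risks to public opinion, or any risks and problems arising from the misdirection, misuse, dissemination, or abuse of the model.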

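🔧 Technical Details

Explore the key techniques behind MiniCPM-V 2.6 and other multimodal projects from our team: VisCPM | RLHF-V | LLaVA-UHD | RLAIF-V

📚 Citation

If you find our work helpful, please consider citing our paper and starring this project: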

@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}