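MiniCPM-V 2.6

Developed by openbmb

MiniCPM-V is a GPT-4V-level multimodal large language model for mobile devices. It supports single-image, multi-image, and video understanding, with capabilities that include visual understanding and optical character recognition (OCR).

Image-to-Text

Transformers

Other #on-device multimodal #real-time speech conversation #multimodal live streaming

Downloads: 91.52k

Released: 8/4/2024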

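Model overview

MiniCPM-V is a multimodal large language model that delivers GPT-4V-level multimodal understanding on mobile phones. It can understand and analyze single images, multiple images, and video.

Model features

On-device deployment

A multimodal large language model optimized for mobile phones so that it runs efficiently on-device.

Multimodal understanding

Understands and analyzes single-image, multi-image, and video content.

Optical character recognition

OCR capability for extracting text from images.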

Model capabilities

Image understanding

Video understanding

Optical character recognition

Multimodal dialogue

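Use cases

Content analysis

Image captioning

Analyzes an uploaded image and generates a description of its content.

Produces an accurate text description of the image.

Video content understanding

Analyzes video content and generates a summary or keyframe descriptions.

Extracts key information from the video and produces a text summary.

Document processing

Text recognition in images

Extracts text from images that contain writing.

Accurately recognizes and extracts the text in the image.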

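🚀 MiniCPM-V 2.6: A GPT-4V Level Multimodal LLM for Single Image, Multi Image and Video on Your Phone

MiniCPM-V 2.6 is a multimodal large language model that accepts single-image, multi-image, and video input. It performs well on accuracy, efficiency, and ease of use.

GitHub | Demo

📢 News

[2025.01.14] 🔥🔥 We have open-sourced MiniCPM-o 2.6, which improves significantly on MiniCPM-V 2.6 and adds support for real-time speech conversation and multimodal live streaming. Give it a try.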

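✨ MiniCPM-V 2.6 key features

MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. It is built on SigLip-400M and Qwen2-7B, with 8B parameters in total. Compared with MiniCPM-Llama3-V 2.5, it performs significantly better and introduces new features for multi-image and video understanding. Notable features of MiniCPM-V 2.6 include:

🔥 Leading performance: MiniCPM-V 2.6 achieves an average score of 65.2 across 8 popular benchmarks on the latest version of OpenCompass. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet on single-image understanding.
🖼️ Multi-image understanding and in-context learning: MiniCPM-V 2.6 can also carry out conversation and reasoning over multiple images. It achieves state-of-the-art results on popular multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse mv, and Sciverse mv, and shows strong in-context learning capability.
🎬 Video understanding: MiniCPM-V 2.6 accepts video input, can hold a conversation about it, and provides dense captions for spatial-temporal information. On Video-MME, both with and without subtitles, it outperforms GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B.
💪 Strong OCR capability and more: MiniCPM-V 2.6 can process images of any aspect ratio with up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art results on OCRBench, surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro. Building on the latest RLAIF-V and VisCPM techniques, it shows trustworthy behavior, with significantly lower hallucination rates than GPT-4o and GPT-4V on Object HalBench, and it supports multilingual use in English, Chinese, German, French, Italian, Korean, and other languages.
🚀 Superior efficiency: Beyond its friendly size, MiniCPM-V 2.6 also shows state-of-the-art token density (i.e., the number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M-pixel image, 75% fewer than most models. This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-V 2.6 can efficiently support real-time video understanding on end-side devices such as iPad.
💫 Easy to use: MiniCPM-V 2.6 can be used in several ways: (1) llama.cpp and ollama support efficient CPU inference on local devices; (2) quantized models are available in int4 and GGUF formats in 16 sizes; (3) vLLM supports high-throughput, memory-efficient inference; (4) the model can be fine-tuned on new domains and tasks; (5) a local WebUI demo can be set up quickly with Gradio; (6) an online web demo is available.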
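
The token-density claim in the efficiency item can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (a 1344x1344 image and 640 visual tokens). The "implied baseline" is our own back-of-the-envelope reading of "75% fewer than most models", not a number reported in this card.

```python
# Worked arithmetic for the token-density figures quoted in the feature list.
max_width, max_height = 1344, 1344   # example maximum resolution from the text
visual_tokens = 640                  # visual tokens produced for such an image

pixels = max_width * max_height
token_density = pixels / visual_tokens  # pixels encoded per visual token

# Assumption: "75% fewer tokens" means the comparison model emits
# visual_tokens / (1 - 0.75) tokens for the same image.
implied_baseline_tokens = visual_tokens / (1 - 0.75)

print(f"pixels: {pixels:,}")                                # pixels: 1,806,336
print(f"token density: {token_density:.1f} pixels/token")   # token density: 2822.4 pixels/token
print(f"implied baseline tokens: {implied_baseline_tokens:.0f}")  # implied baseline tokens: 2560
```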

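📊 Evaluation

Single-image results on OpenCompass, MME, MMVet, OCRBench, MMMU, MathVista, MMB, AI2D, TextVQA, DocVQA, HallusionBench, and Object HalBench:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/QVl0iPtT5aUhlvViyEpgs.png)

* We evaluate this benchmark using chain-of-thought prompting. ⁺ Token density: the number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens. Note: for proprietary models, we calculate token density from the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimate.

Multi-image results on Mantis Eval, BLINK Val, Mathverse mv, Sciverse mv, and MIRB:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/o6FGHytRhzeatmhxq0Dbi.png)

* We evaluate the officially released checkpoints ourselves.

Video results on Video-MME and Video-ChatGPT:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/jmrjoRr8SFLkrstjDmpaV.png)

Few-shot results on TextVQA, VizWiz, VQAv2, and OK-VQA:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64abc4aa6cadc7aca585dddf/zXIuiCTTe-POqKGHszdn0.png)

* denotes zero image shots and two additional text shots, following Flamingo. ⁺ We evaluate the pretraining checkpoints without SFT.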

🌟 Examples

We deployed MiniCPM-V 2.6 on end-side devices. The demo video is a raw screen recording on an iPad Pro, without editing.

💻 Demo

Click here to try the MiniCPM-V 2.6 demo.

💻 Usage

Basic usage

Inference uses Huggingface transformers on NVIDIA GPUs. The following requirements were tested on Python 3.10:

```
Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord
```

```python
# test.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
# The image is passed inside the message content, so `image` is None below.
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

# For streaming, set both sampling=True and stream=True;
# model.chat then returns a generator of text chunks.
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```

Advanced usage

#### Multi-image chat
<details>
<summary> Click to view Python code for running MiniCPM-V 2.6 with multi-image input. </summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'

# All images and the question go into a single user turn.
msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>

#### In-context few-shot learning
<details>
<summary> Click to view Python code for running MiniCPM-V 2.6 with few-shot input. </summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

# Two solved (image, answer) examples, followed by the query image.
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
</details>
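
The few-shot message list follows a regular pattern: one user turn and one assistant turn per solved example, then a final user turn with the query image. A minimal sketch of a helper that builds this list is shown here. `build_few_shot_msgs` is a hypothetical name of ours and not part of the MiniCPM-V API, and placeholder strings stand in for PIL images so the snippet runs without image files or the model.

```python
from typing import Any, Dict, List, Tuple


def build_few_shot_msgs(
    examples: List[Tuple[Any, str]],
    test_image: Any,
    question: str,
) -> List[Dict[str, Any]]:
    """Build a few-shot message list (hypothetical helper, not a library API).

    Each (image, answer) pair becomes a user turn followed by an assistant
    turn; the query image is appended as the final user turn.
    """
    msgs: List[Dict[str, Any]] = []
    for image, answer in examples:
        msgs.append({'role': 'user', 'content': [image, question]})
        msgs.append({'role': 'assistant', 'content': [answer]})
    msgs.append({'role': 'user', 'content': [test_image, question]})
    return msgs


# Placeholder strings stand in for PIL.Image objects in this sketch.
msgs = build_few_shot_msgs(
    examples=[("image1", "2023.08.04"), ("image2", "2007.04.24")],
    test_image="image_test",
    question="production date",
)
print([m['role'] for m in msgs])
# ['user', 'assistant', 'user', 'assistant', 'user']
```

With real inputs, the returned list would be passed as `msgs` to `model.chat`, as in the few-shot example.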

#### Video chat
<details>
<summary> Click to view Python code for running MiniCPM-V 2.6 with video input. </summary>

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu    # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64 # if cuda OOM set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n items spread evenly over l, taking the middle of each segment.
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    # Take roughly one frame per second, then cap at MAX_NUM_FRAMES.
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # use 1 if cuda OOM and video resolution > 448*448

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
</details>
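
The frame selection in the video example can be looked at in isolation. The sketch below reproduces the same index arithmetic without decord, so it runs on plain integers. `select_frame_indices` is a hypothetical wrapper name of ours, and the `max(1, ...)` guard against a zero step for very low frame rates is our addition rather than part of the original code.

```python
from typing import List, Sequence


def uniform_sample(items: Sequence[int], n: int) -> List[int]:
    """Pick n items spread evenly over `items`, one from the middle of each segment."""
    gap = len(items) / n
    idxs = [int(i * gap + gap / 2) for i in range(n)]
    return [items[i] for i in idxs]


def select_frame_indices(total_frames: int, avg_fps: float, max_frames: int = 64) -> List[int]:
    """Hypothetical wrapper: roughly one frame per second, capped at max_frames."""
    step = max(1, round(avg_fps))  # guard (our addition) against a zero step
    frame_idx = list(range(0, total_frames, step))
    if len(frame_idx) > max_frames:
        frame_idx = uniform_sample(frame_idx, max_frames)
    return frame_idx


# A 3-second clip at 30 fps stays under the cap: one frame per second.
print(select_frame_indices(90, 30.0))  # [0, 30, 60]

# A 100-second clip at 30 fps yields 100 candidates, thinned to 64.
long_clip = select_frame_indices(3000, 30.0)
print(len(long_clip), long_clip[0], long_clip[-1])  # 64 0 2970
```

Lowering `max_frames` here has the same effect as lowering `MAX_NUM_FRAMES` in the video example, which is what its comment suggests for CUDA out-of-memory errors.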

For more usage details, see [GitHub](https://github.com/OpenBMB/MiniCPM-V).

## 📦 Inference with llama.cpp
MiniCPM-V 2.6 can run with llama.cpp. See our fork of [llama.cpp](https://github.com/OpenBMB/llama.cpp/tree/minicpm-v2.5/examples/minicpmv) for more details.

## 📦 Int4 quantized version
Download the int4 quantized version for lower GPU memory usage (7 GB): [MiniCPM-V-2_6-int4](https://huggingface.co/openbmb/MiniCPM-V-2_6-int4).
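
As a rough sanity check on why quantization helps, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes all 8B parameters are stored at the same precision, which is a simplification on our part and probably not how the int4 checkpoint is actually laid out. It also ignores activations, the KV cache, and framework overhead, so it is not expected to reproduce the 7 GB figure quoted above.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_param / 8 / 1e9


total_params = 8e9  # total parameter count stated in this card

bf16_gb = weight_memory_gb(total_params, 16)
int4_gb = weight_memory_gb(total_params, 4)

print(f"bf16 weights: ~{bf16_gb:.0f} GB")  # bf16 weights: ~16 GB
print(f"int4 weights: ~{int4_gb:.0f} GB")  # int4 weights: ~4 GB
```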

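## 📄 License
#### Model License
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* Use of the MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, the MiniCPM-V 2.6 weights are also available for free commercial use.

#### Statement
* As a multimodal large language model, MiniCPM-V 2.6 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-V 2.6 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination, or abuse of the model.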

## 🔧 Key techniques and other multimodal projects
👏 Explore the key techniques behind MiniCPM-V 2.6 and other multimodal projects from our team:
[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD)  | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)

## 📖 Citation
If you find our work helpful, please consider citing our paper 📝 and liking this project ❤️!
```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}