MiniCPM-o-2_6: an open-source multimodal large model that runs on phones and supports vision, audio, and live-stream processing

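Developed by openbmb

MiniCPM-o 2.6 is a GPT-4o-level multimodal large model that runs on phones and supports vision, speech, and live-stream processing.

Multimodal fusion

Transformers

Other #on-device-multimodal #real-time-speech-conversation #live-stream-processing

Downloads: 178.38k

Released: 1/12/2025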

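Model overview

An end-to-end omni-modal architecture built on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, with 8B parameters in total. It delivers a significant performance improvement over MiniCPM-V 2.6 and adds real-time speech conversation and multimodal live-streaming capabilities.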

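Model features

Leading visual capability

Surpasses proprietary closed-source models such as GPT-4o-202405 and Gemini 1.5 Pro in the OpenCompass comprehensive evaluation covering 8 benchmarks.

State-of-the-art speech capability

Supports bilingual (Chinese/English) real-time speech conversation with configurable voices, and outperforms GPT-4o-realtime on audio understanding tasks such as ASR and STT translation.

Strong live-streaming capability

Accepts continuous video/audio stream input with real-time speech interaction, achieving the best real-time video understanding in the open-source community.

Strong OCR capability

Ranks first on OCRBench among models under 25B, and handles images of any aspect ratio with up to 1.8 million pixels.

Superior efficiency

Very high visual token density (2,822 pixels encoded per token), which allows multimodal live streaming to run smoothly on end-side devices such as iPad.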

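Model capabilities

Visual understanding

Speech recognition

Speech synthesis

Real-time speech conversation

Multi-image processing

Video understanding

OCR

Voice cloning

Live-stream processing

Multilingual support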

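Use cases

Intelligent assistants

Real-time voice assistant

Supports bilingual (Chinese/English) real-time speech interaction with configurable voice and emotional style.

Ranks first among open-source models on AudioArena in both semantic and acoustic ELO scores.

Multimodal customer service

Handles speech, image, and text input together to provide an integrated solution.

Outperforms GPT-4o on the MMHal-Bench trustworthiness evaluation.

Content processing

Live content analysis

Processes live video streams in real time for content understanding and interaction.

Outperforms GPT-4o-202408 on the StreamingBench live-streaming benchmark.

Document OCR

High-accuracy recognition of documents of any aspect ratio.

Ranks first on OCRBench among models under 25B.

Creative applications

Voice cloning

Supports end-to-end voice cloning and description-based voice creation.

Evaluated on the Seed-TTS test sets (SIMO 57 on Chinese, 47 on English).

Multimodal creation

Creative content generation from visual and speech input.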

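🚀 MiniCPM-o 2.6: A GPT-4o-Level Multimodal Large Language Model for Vision, Speech, and Multimodal Live Streaming on Your Phone

MiniCPM-o 2.6 is a capable multimodal large language model that runs efficiently on phones, with strong vision and speech processing and multimodal live-streaming support.

GitHub | Online Demo | Technical Blog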

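🔔 News

[2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique of MiniCPM-o, has been accepted by CVPR 2025! The code, datasets, and paper are all open-sourced.
[2025.01.24] 📢📢📢 The MiniCPM-o 2.6 technical report has been released.
[2025.01.19] ⭐️⭐️⭐️ MiniCPM-o appeared on GitHub Trending and ranked second on Hugging Face Trending!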

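✨ Key features

🔥 Leading visual capability

MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models such as GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding. It also outperforms GPT-4V and Claude 3.5 Sonnet in multi-image and video understanding, and shows strong in-context learning capability.

🎙 State-of-the-art speech capability

MiniCPM-o 2.6 supports bilingual (Chinese/English) real-time speech conversation with configurable voices. It outperforms GPT-4o-realtime on audio understanding tasks such as ASR and STT translation, and its speech conversation performance is state of the art in the open-source community in both semantic and acoustic evaluations. It also supports features such as emotion/speed/style control, end-to-end voice cloning, and role play.

🎬 Strong multimodal live-streaming capability

As a new feature, MiniCPM-o 2.6 can accept continuous video and audio streams independent of user queries, and supports real-time speech interaction. On StreamingBench, a comprehensive benchmark for real-time video understanding, omni-source (video and audio) understanding, and multimodal contextual understanding, it outperforms GPT-4o-202408 and Claude 3.5 Sonnet and is state of the art in the open-source community.

💪 Strong OCR capability and others

Building on the popular visual capabilities of the MiniCPM-V series, MiniCPM-o 2.6 can process images of any aspect ratio with up to 1.8 million pixels (e.g., 1344x1344). On OCRBench it is state of the art among models under 25B, surpassing proprietary models such as GPT-4o-202405. Based on the latest RLAIF-V and VisCPM techniques, it behaves reliably, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and it supports more than 30 languages.

🚀 Superior efficiency

In addition to its friendly size, MiniCPM-o 2.6 shows state-of-the-art token density (i.e., the number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8-million-pixel image, 75% fewer than most models. This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support multimodal live streaming on end-side devices such as iPad.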

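💫 Easy usage

MiniCPM-o 2.6 can be used in several ways:

llama.cpp support for efficient CPU inference on local devices.
Quantized models in int4 and GGUF formats, in 16 sizes.
vLLM support for high-throughput and memory-efficient inference.
Fine-tuning on new domains and tasks with LLaMA-Factory.
Quick local WebUI demo setup with Gradio.
Online web demo on a server.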

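📚 Documentation

Model architecture

End-to-end omni-modal architecture: the encoders/decoders of different modalities are connected and trained in an end-to-end fashion to fully exploit rich multimodal knowledge.
Omni-modal live-streaming mechanism:
1. The offline modality encoders/decoders are converted into online ones to handle streaming inputs/outputs.
2. A time-division multiplexing (TDM) mechanism is designed in the LLM backbone for omni-modal streaming processing. It divides parallel omni-modal streams into sequential information within small periodic time slices.
Configurable speech modeling design: a multimodal system prompt is designed that includes the traditional text system prompt and a new audio system prompt that determines the assistant's voice. This allows flexible voice configuration at inference time and also facilitates end-to-end voice cloning and description-based voice creation.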
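
The time-division multiplexing idea in item 2 can be illustrated with a minimal sketch. This is not the model's internal implementation: the function name `multiplex_streams` and the string placeholders standing in for frames and audio chunks are assumptions made for illustration. The only detail borrowed from the usage examples further down is the `<unit>` marker followed by one frame and one audio chunk per time slice.

```python
from typing import List, Sequence


def multiplex_streams(frames: Sequence[str], audio_chunks: Sequence[str]) -> List[str]:
    """Interleave two parallel streams into one sequential stream.

    Each time slice contributes a "<unit>" marker, then the frame and the
    audio chunk that belong to that slice. A slice that lacks one modality
    simply omits it.
    """
    sequence: List[str] = []
    num_slices = max(len(frames), len(audio_chunks))
    for t in range(num_slices):
        sequence.append("<unit>")
        if t < len(frames):
            sequence.append(frames[t])
        if t < len(audio_chunks):
            sequence.append(audio_chunks[t])
    return sequence


# Hypothetical placeholders: one frame and one audio chunk per time slice.
frames = ["frame_0", "frame_1", "frame_2"]
audio_chunks = ["audio_0", "audio_1", "audio_2"]
print(multiplex_streams(frames, audio_chunks))
```

The point of the sketch is only the ordering: content that arrives in parallel is serialized slice by slice, so a sequence model can consume it as an ordinary token stream.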

Evaluation

Visual understanding results

Image understanding:

Model	Size	Token Density⁺	OpenCompass	OCRBench	MathVista mini	ChartQA	MMVet	MMStar	MME	MMB1.1 test	AI2D	MMMU val	HallusionBench	TextVQA val	DocVQA test	MathVerse mini	MathVision	MMHal Score
Proprietary models
GPT-4o-20240513	-	1088	69.9	736	61.3	85.7	69.1	63.9	2328.7	82.2	84.6	69.2	55.0	-	92.8	50.2	30.4	3.6
Claude3.5-Sonnet	-	750	67.9	788	61.6	90.8	66.0	62.2	1920.0	78.5	80.2	65.9	49.9	-	95.2	-	-	3.4
Gemini 1.5 Pro	-	-	64.4	754	57.7	81.3	64.0	59.1	2110.6	73.9	79.1	60.6	45.6	73.5	86.5	-	19.2	-
GPT-4o-mini-20240718	-	1088	64.1	785	52.4	-	66.9	54.8	2003.4	76.0	77.8	60.0	46.1	-	-	-	-	3.3
Open-source models
Cambrian-34B	34B	1820	58.3	591	50.3	75.6	53.2	54.2	2049.9	77.8	79.5	50.4	41.6	76.7	75.5	-	-	-
GLM-4V-9B	13B	784	59.1	776	51.1	-	58.0	54.8	2018.8	67.9	71.2	46.9	45.0	-	-	-	-	-
Pixtral-12B	12B	256	61.0	685	56.9	81.8	58.5	54.5	-	72.7	79.0	51.1	47.0	75.7	90.7	-	-	-
DeepSeek-VL2-27B (4B)	27B	672	66.4	809	63.9	86.0	60.0	61.9	2253.0	81.2	83.8	54.0	45.3	84.2	93.3	-	-	3.0
Qwen2-VL-7B	8B	784	67.1	866	58.2	83.0	62.0	60.7	2326.0	81.8	83.0	54.1	50.6	84.3	94.5	31.9	16.3	3.2
LLaVA-OneVision-72B	72B	182	68.1	741	67.5	83.7	60.6	65.8	2261.0	85.0	85.6	56.8	49.0	80.5	91.3	39.1	-	3.5
InternVL2.5-8B	8B	706	68.3	822	64.4	84.8	62.8	62.8	2344.0	83.6	84.5	56.0	50.1	79.1	93.0	39.5	19.7	3.4
MiniCPM-V 2.6	8B	2822	65.2	852*	60.6	79.4	60.0	57.5	2348.4*	78.0	82.1	49.8*	48.1*	80.1	90.8	25.7	18.3	3.6
MiniCPM-o 2.6	8B	2822	70.2	897*	71.9*	86.9*	67.5	64.0	2372.0*	80.5	85.8	50.4*	51.9	82.0	93.5	41.4*	23.1*	3.8

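* We evaluate this benchmark using chain-of-thought prompting. For MME specifically, we used this technique only on the Cognition set.

⁺ Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., number of pixels at maximum resolution / number of visual tokens.

Note: For proprietary models, we calculate token density from the image encoding charging strategy defined in the official API documentation, which gives an upper-bound estimate.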
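
As a quick check of the token density definition, the figures quoted in this document (a 1344x1344 image encoded into 640 visual tokens) reproduce the 2822 reported in the table. The helper name `token_density` is an illustrative assumption, not part of any library.

```python
def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Pixels encoded per visual token at a given resolution."""
    if num_visual_tokens <= 0:
        raise ValueError("num_visual_tokens must be positive")
    return (width * height) / num_visual_tokens


# 1344 * 1344 = 1,806,336 pixels (about 1.8 million), encoded into 640 tokens.
density = token_density(1344, 1344, 640)
print(round(density))  # 2822
```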

Multi-image and video understanding:

Model	Size	BLINK val	Mantis Eval	MIRB	Video-MME (wo / w subs)
Proprietary models
GPT-4o-20240513	-	68.0	-	-	71.9/77.2
GPT4V	-	54.6	62.7	53.1	59.9/63.3
Open-source models
LLaVA-NeXT-Interleave 14B	14B	52.6	66.4	30.2	-
LLaVA-OneVision-72B	72B	55.4	77.6	-	66.2/69.5
MANTIS 8B	8B	49.1	59.5	34.8	-
Qwen2-VL-7B	8B	53.2	69.6*	67.6*	63.3/69.0
InternVL2.5-8B	8B	54.8	67.7	52.5	64.2/66.9
MiniCPM-V 2.6	8B	53.0	69.1	53.8	60.9/63.6
MiniCPM-o 2.6	8B	56.7	71.9	58.6	63.9/67.9

* We evaluated the officially released checkpoints ourselves.

Audio understanding and speech conversation results

Audio understanding:

Task	Size	ASR (zh)			ASR (en)			AST		Emotion recognition
Metric		CER↓			WER↓			BLEU↑		ACC↑
Dataset		AISHELL-1	Fleurs zh	WenetSpeech test-net	LibriSpeech test-clean	GigaSpeech	TED-LIUM	CoVoST en2zh	CoVoST zh2en	MELD emotion
Proprietary models
GPT-4o-Realtime	-	7.3*	5.4*	28.9*	2.6*	12.9*	4.8*	37.1*	15.7*	33.2*
Gemini 1.5 Pro	-	4.5*	5.9*	14.3*	2.9*	10.6*	3.0*	47.3*	22.6*	48.4*
Open-source models
Qwen2-Audio-7B	8B	-	7.5	-	1.6	-	-	45.2	24.4	55.3
Qwen2-Audio-7B-Instruct	8B	2.6*	6.9*	10.3*	3.1*	9.7*	5.9*	39.5*	22.9*	17.4*
GLM-4-Voice-Base	9B	2.5	-	-	2.8	-	-	-	-	-
MiniCPM-o 2.6	8B	1.6	4.4	6.9	1.7	8.7	3.0	48.2	27.2	52.4

* We evaluated the officially released checkpoints ourselves.
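
For readers unfamiliar with the CER and WER columns, the sketch below shows the standard definition: edit distance between reference and hypothesis, divided by the reference length, counted over characters (CER) or words (WER). It is a generic illustration, not the evaluation code behind the table; the text normalization used in the actual evaluation is not described here, and the function names are assumptions. The table reports these rates as percentages.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance / reference length."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```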

Speech generation:

Task	Size	SpeechQA
Metric		ACC↑			G-Eval (10 point)↑	Semantic ELO score↑	Acoustic ELO score↑	Overall ELO score↑	UTMOS↑	ASR-WER↓
Dataset		Speech Llama Q.	Speech Web Q.	Speech Trivia QA	Speech AlpacaEval	AudioArena
Proprietary models
GPT-4o-Realtime		71.7	51.6	69.7	7.4	1157	1203	1200	4.2	2.3
Open-source models
GLM-4-Voice	9B	50.0	32.0	36.4	5.1	999	1147	1035	4.1	11.7
Llama-Omni	8B	45.3	22.9	10.7	3.9	960	878	897	3.2	24.3
Moshi	7B	43.7	23.8	16.7	2.4	871	808	875	2.8	8.2
Mini-Omni	1B	22.0	12.8	6.9	2.5	926	803	865	3.4	10.0
MiniCPM-o 2.6	8B	61.0	40.0	40.2	5.1	1088	1163	1131	4.2	9.8

All results are from AudioEvals; the evaluation methods and further details can be found in [UltraEval-Audio](https://github.com/OpenBMB/UltraEval-Audio).

End-to-end voice cloning

Task	Voice cloning
Metric	SIMO↑	SIMO↑
Dataset	Seed-TTS test-zh	Seed-TTS test-en
F5-TTS	76	67
CosyVoice	75	64
FireRedTTS	63	46
MiniCPM-o 2.6	57	47

Multimodal live-streaming results

Multimodal live streaming: results on StreamingBench

Model	Size	Real-time video understanding	Omni-source understanding	Contextual understanding	Overall
Proprietary models
Gemini 1.5 Pro	-	77.4	67.8	51.1	70.3
GPT-4o-202408	-	74.5	51.0	48.0	64.1
Claude-3.5-Sonnet	-	74.0	41.4	37.8	59.7
Open-source models
VILA-1.5	8B	61.5	37.5	26.7	49.5
LongVA	7B	63.1	35.9	30.2	50.7
LLaVA-Next-Video-34B	34B	69.8	41.7	34.3	56.7
Qwen2-VL-7B	8B	71.2	40.7	33.1	57.0
InternVL2-8B	8B	70.1	42.7	34.1	57.0
VITA-1.5	8B	70.9	40.8	35.8	57.4
LLaVA-OneVision-7B	8B	74.3	40.8	31.0	58.4
InternLM-XC2.5-OL-7B	8B	75.4	46.2	33.6	60.8
MiniCPM-V 2.6	8B	72.4	40.2	33.4	57.7
MiniCPM-o 2.6	8B	79.9	53.4	38.5	66.0

Examples

We deploy MiniCPM-o 2.6 on end devices. The demo videos are raw-speed recordings made on an iPad Pro and on the web demo.

Online demo

Try the online demo of MiniCPM-o 2.6.

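💻 Usage

Basic usage

Inference uses Huggingface transformers on NVIDIA GPUs. Please make sure transformers==4.44.2 is installed, as other versions may have compatibility issues. We are investigating this issue. Requirements tested on Python 3.10: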

Pillow==10.1.0
torch==2.3.1
torchaudio==2.3.1
torchvision==0.18.1
transformers==4.44.2
librosa==0.9.0
soundfile==0.12.1
vector-quantize-pytorch==1.18.5
vocos==0.1.0
decord
moviepy

Model initialization

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the omni model with default settings; init_vision/init_audio/init_tts default to True
# To load a vision-only model, set init_audio=False and init_tts=False
# To load an audio-only model, set init_vision=False
model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,
    attn_implementation='sdpa', # sdpa or flash_attention_2
    torch_dtype=torch.bfloat16,
    init_vision=True,
    init_audio=True,
    init_tts=True
)

model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

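# In every mode except vision-only, the TTS processor and vocos also need to be initialized
model.init_tts()

If you are using an older version of PyTorch, you might hit the error "weight_norm_fwd_first_dim_kernel" not implemented for 'BFloat16'. In that case, convert the TTS module to float32.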

model.tts.float()

Advanced usage

Omni mode

We provide two inference modes: chat and streaming.

Chat inference

import math
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip
import tempfile
import librosa
import soundfile as sf

def get_video_chunk_content(video_path, flatten=True):
    video = VideoFileClip(video_path)
    print('video_duration:', video.duration)
    
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=True) as temp_audio_file:
        temp_audio_file_path = temp_audio_file.name
        video.audio.write_audiofile(temp_audio_file_path, codec="pcm_s16le", fps=16000)
        audio_np, sr = librosa.load(temp_audio_file_path, sr=16000, mono=True)
    num_units = math.ceil(video.duration)
    
    # One unit = 1 frame + 1 second of audio
    contents = []
    for i in range(num_units):
        frame = video.get_frame(i+1)
        image = Image.fromarray((frame).astype(np.uint8))
        audio = audio_np[sr*i:sr*(i+1)]
        if flatten:
            contents.extend(["<unit>", image, audio])
        else:
            contents.append(["<unit>", image, audio])
    
    return contents

video_path="assets/Skiing.mp4"
# If you use a voice-cloning prompt, set ref_audio
ref_audio_path = 'assets/demo.wav'
ref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)
sys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode='omni', language='en')
# Or use the default prompt
# sys_msg = model.get_sys_prompt(mode='omni', language='en')

contents = get_video_chunk_content(video_path)
msg = {"role":"user", "content": contents}
msgs = [sys_msg, msg]

# Set generate_audio=True and output_audio_path to save the TTS result
generate_audio = True
output_audio_path = 'output.wav'

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.5,
    max_new_tokens=4096,
    omni_input=True, # set omni_input=True for omni inference
    use_tts_template=True,
    generate_audio=generate_audio,
    output_audio_path=output_audio_path,
    max_slice_nums=1,
    use_image_id=False,
    return_dict=True
)
print(res)

## Expected answer: The person in the picture is skiing down a snowy slope.
# import IPython
# IPython.display.Audio('output.wav')

Streaming inference

# A new conversation must reset the session first, which resets the KV cache
model.reset_session()

contents = get_video_chunk_content(video_path, flatten=False)
session_id = '123'
generate_audio = True

# 1. Prefill the system prompt
res = model.streaming_prefill(
    session_id=session_id,
    msgs=[sys_msg], 
    tokenizer=tokenizer
)

# 2. Prefill the video/audio chunks
for content in contents:
    msgs = [{"role":"user", "content": content}]
    res = model.streaming_prefill(
        session_id=session_id,
        msgs=msgs, 
        tokenizer=tokenizer
    )

# 3. Generate
res = model.streaming_generate(
    session_id=session_id,
    tokenizer=tokenizer,
    temperature=0.5,
    generate_audio=generate_audio
)

audios = []
text = ""

if generate_audio:
    for r in res:
        audio_wav = r.audio_wav
        sampling_rate = r.sampling_rate
        txt = r.text

        audios.append(audio_wav)
        text += txt
        
    res = np.concatenate(audios)
    sf.write("output.wav", res, samplerate=sampling_rate)
    print("text:", text)
    print("audio saved to output.wav")
else:
    for r in res:
        text += r['text']
    print("text:", text)

Speech and audio mode

Model initialization

import torch
import librosa
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16) # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

model.init_tts()
model.tts.float()

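Mimick

The mimick task reflects the model's end-to-end speech modeling capability. The model takes audio input, outputs an ASR transcription, and then reconstructs the original audio with high similarity. The higher the similarity between the reconstructed and the original audio, the stronger the model's foundational capability in end-to-end speech modeling.

mimick_prompt = "Please repeat each user's speech, including voice style and speech content."
audio_input, _ = librosa.load('./assets/input_examples/Trump_WEF_2018_10s.mp3', sr=16000, mono=True) # load the audio to be mimicked

# You can also try `./assets/input_examples/cxk_original.wav`, 
# `./assets/input_examples/fast-pace.wav`, 
# `./assets/input_examples/chi-english-1.wav` 
# `./assets/input_examples/exciting-emotion.wav` 
# to test different aspects of speech-related features.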

msgs = [{'role': 'user', 'content': [mimick_prompt, audio_input]}]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    temperature=0.3,
    generate_audio=True,
    output_audio_path='output_mimick.wav', # save the TTS result to output_audio_path
)

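General speech conversation with configurable voices

A common usage scenario for MiniCPM-o 2.6 is role-playing a specific character based on an audio prompt. It mimics the character's voice to some extent and acts like the character in text, including language style. In this mode, MiniCPM-o 2.6 sounds more natural and human-like. Custom audio prompts can be used to customize the character's voice in an end-to-end manner.

ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_roleplay', language='en')

# Round one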
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_1.wav',
)

# Round two
# list.append mutates the list in place and returns None, so do not assign its result
msgs.append({'role': 'assistant', 'content': res})
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_roleplay_round_2.wav',
)
print(res)
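
Multi-turn conversations extend the message list with the assistant reply and then the next user message. One detail worth keeping in mind is that Python's `list.append` mutates the list and returns `None`, so its result should never be assigned back to the history. A minimal, model-free sketch of one way to manage the history follows; the helper name `add_turn` and the string placeholders are assumptions for illustration, and the message layout (`role`/`content` dictionaries) follows the examples in this document.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def add_turn(msgs: List[Message], assistant_reply: Any,
             next_user_content: List[Any]) -> List[Message]:
    """Return a new history with the assistant reply and the next user turn appended."""
    return msgs + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_content},
    ]


# Placeholder strings stand in for audio arrays and model replies.
history: List[Message] = [{"role": "user", "content": ["first question"]}]
history = add_turn(history, "first answer", ["second question"])
print([m["role"] for m in history])  # ['user', 'assistant', 'user']

# For contrast: append returns None, so `history = history.append(...)` would lose the list.
scratch: List[Message] = []
print(scratch.append({"role": "user", "content": []}))  # None
```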

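Speech conversation as an AI assistant

An enhanced feature of MiniCPM-o 2.6 is acting as an AI assistant, though with a limited choice of voices. In this mode, MiniCPM-o 2.6 is less human-like and more like a voice assistant, and it follows instructions more closely. For demos, we suggest using assistant_female_voice, assistant_male_voice, and assistant_default_female_voice. Other voices may work but are not as stable as the default voices.

Note that assistant_female_voice and assistant_male_voice are more stable but sound robotic, while assistant_default_female_voice is more human-like but unstable: its voice often changes across turns. We suggest trying the stable voices assistant_female_voice and assistant_male_voice.

ref_audio, _ = librosa.load('./assets/input_examples/assistant_female_voice.wav', sr=16000, mono=True) # or use `./assets/input_examples/assistant_male_voice.wav`
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='audio_assistant', language='en') 
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]} # load the user's audio question

# Round one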
msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_1.wav',
)

# Round two
# list.append mutates the list in place and returns None, so do not assign its result
msgs.append({'role': 'assistant', 'content': res})
user_question = {'role': 'user', 'content': [librosa.load('xxx.wav', sr=16000, mono=True)[0]]}
msgs.append(user_question)
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_assistant_round_2.wav',
)
print(res)

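Instruction-to-speech

MiniCPM-o 2.6 can also do instruction-to-speech, i.e., voice creation. Describe a voice in detail and the model will generate a voice that matches the description. For more instruction-to-speech sample instructions, refer to VoxInstruct.

instruction = 'Speak like a charming male superstar, radiating confidence and style in every word.'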

msgs = [{'role': 'user', 'content': [instruction]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_creation.wav',
)

Voice cloning

MiniCPM-o 2.6 can also do zero-shot text-to-speech, i.e., voice cloning. In this mode the model acts like a TTS model.

ref_audio, _ = librosa.load('./assets/input_examples/icl_20.wav', sr=16000, mono=True) # load the reference audio
sys_prompt = model.get_sys_prompt(ref_audio=ref_audio, mode='voice_cloning', language='en')
text_prompt = "Please read the text below."
user_question = {'role': 'user', 'content': [text_prompt, "content that you want to read"]}

msgs = [sys_prompt, user_question]
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_voice_cloning.wav',
)

Handling various audio understanding tasks

MiniCPM-o 2.6 can also handle various audio understanding tasks, such as ASR, speaker analysis, general audio captioning, and sound scene tagging.

For audio-to-text tasks, you can use the following prompts:

ASR in Chinese (same as AST en2zh): 請仔細聽這段音頻片段，並將其內容逐字記錄。
ASR in English (same as AST zh2en): Please listen to the audio snippet carefully and transcribe the content.
Speaker analysis: Based on the speaker's content, speculate on their gender, condition, age range, and health status.
General audio captioning: Summarize the main content of the audio.
General sound scene tagging: Utilize one keyword to convey the audio's content or the associated scene.

task_prompt = "Please listen to the audio snippet carefully and transcribe the content." + "\n" # can be changed to any of the other prompts
audio_input, _ = librosa.load('./assets/input_examples/audio_understanding.mp3', sr=16000, mono=True) # load the audio to be captioned

msgs = [{'role': 'user', 'content': [task_prompt, audio_input]}]

res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    max_new_tokens=128,
    use_tts_template=True,
    generate_audio=True,
    temperature=0.3,
    output_audio_path='result_audio_understanding.wav',
)
print(res)
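
Switching between the task prompts listed above is easier with a small lookup table. The sketch below is only a convenience pattern: the task keys, `AUDIO_TASK_PROMPTS`, and `build_audio_msgs` are hypothetical names introduced here, while the prompt strings and the message layout (prompt text plus a trailing newline, followed by the audio array) are copied from this section.

```python
from typing import Any, Dict, List

# Prompt strings as listed in this section; the dictionary keys are made up.
AUDIO_TASK_PROMPTS: Dict[str, str] = {
    "asr_zh": "請仔細聽這段音頻片段，並將其內容逐字記錄。",
    "asr_en": "Please listen to the audio snippet carefully and transcribe the content.",
    "speaker_analysis": "Based on the speaker's content, speculate on their gender, "
                        "condition, age range, and health status.",
    "caption": "Summarize the main content of the audio.",
    "scene_tag": "Utilize one keyword to convey the audio's content or the associated scene.",
}


def build_audio_msgs(task: str, audio_input: Any) -> List[Dict[str, Any]]:
    """Build a single-turn message list for an audio-to-text task."""
    if task not in AUDIO_TASK_PROMPTS:
        raise ValueError(f"unknown task {task!r}; choose from {sorted(AUDIO_TASK_PROMPTS)}")
    task_prompt = AUDIO_TASK_PROMPTS[task] + "\n"
    return [{"role": "user", "content": [task_prompt, audio_input]}]


# A short list of zeros stands in for the array returned by librosa.load.
msgs = build_audio_msgs("caption", [0.0, 0.0, 0.0])
print(msgs[0]["content"][0])
```

The resulting `msgs` has the same shape as the list passed to `model.chat` in the example above.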

Vision-only mode

The inference methods of MiniCPM-o-2_6 are the same as those of MiniCPM-V-2_6.

Chat with a single image

# test.py
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

## To use streaming, make sure sampling=True and stream=True
## model.chat will then return a generator
res = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)
generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')

Chat with multiple images

Python code for running MiniCPM-o 2.6 with multiple images as input:

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

In-context few-shot learning

Python code for running MiniCPM-o 2.6 with few-shot input:

question = "production date" 
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')
msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]
answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)

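Chat with video

Python code for running MiniCPM-o 2.6 with video input:

from decord import VideoReader, cpu  # pip install decord

MAX_NUM_FRAMES=64 # if CUDA runs out of memory, set a smaller number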
def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)