🚀 A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
MiniCPM-o 2.6 is an advanced model that integrates vision, speech, and multimodal live-streaming capabilities, offering high-performance and efficient solutions for various applications.
GitHub | Online Demo | Technical Blog
News
- [2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique of MiniCPM-o, was accepted by CVPR 2025! The code, dataset, and paper are open-sourced!
- [2025.01.24] 📢📢📢 The MiniCPM-o 2.6 technical report is released! See Here.
- [2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!
🚀 Quick Start
MiniCPM-o 2.6 can be easily used in various ways (a minimal inference sketch follows the list):
- llama.cpp support for efficient CPU inference on local devices.
- int4 and GGUF format quantized models in 16 sizes.
- vLLM support for high-throughput and memory-efficient inference.
- Fine-tuning on new domains and tasks with LLaMA-Factory.
- Quick local WebUI demo setup with Gradio.
- Online web demo on server.
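For local inference with Hugging Face Transformers, the sketch below follows the chat-style pattern used across the MiniCPM series. It is a minimal, hedged example: the repo id `openbmb/MiniCPM-o-2_6` is the official checkpoint, while the image path and prompt are placeholders, and the exact loading flags (attention implementation, which sub-modules to initialize) are documented on the model card.

```python
# Minimal single-image chat sketch with Hugging Face Transformers.
# Assumes a CUDA GPU; 'example.jpg' is a placeholder image path.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,        # the checkpoint ships custom modeling code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')
msgs = [{'role': 'user', 'content': [image, 'Describe this image.']}]

# chat() is the high-level inference entry point exposed by the custom code;
# verify its exact arguments against the model card.
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```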
✨ Features
MiniCPM-o 2.6 Features
MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. It is built end-to-end based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, with a total of 8B parameters. It shows significant performance improvement over MiniCPM-V 2.6 and introduces new features for real-time speech conversation and multimodal live streaming.
- 🔥 Leading Visual Capability
  - MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding. It also outperforms GPT-4V and Claude 3.5 Sonnet in multi-image and video understanding, and shows promising in-context learning capability.
- 🎙 State-of-the-art Speech Capability
  - MiniCPM-o 2.6 supports bilingual real-time speech conversation with configurable voices in English and Chinese. It outperforms GPT-4o-realtime on audio understanding tasks such as ASR and speech-to-text (STT) translation, and shows state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, and role play (see the speech sketch after this list).
- 🎬 Strong Multimodal Live Streaming Capability
  - As a new feature, MiniCPM-o 2.6 can accept continuous video and audio streams independent of user queries and supports real-time speech interaction. It outperforms GPT-4o-202408 and Claude 3.5 Sonnet, and shows state-of-the-art performance in the open-source community on StreamingBench, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
- 💪 Strong OCR Capability and Others
  - Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art performance on OCRBench among models under 25B parameters, surpassing proprietary models such as GPT-4o-202405.
  - Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behaviors, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports multilingual capabilities in more than 30 languages.
- 🚀 Superior Efficiency
  - In addition to its friendly size, MiniCPM-o 2.6 also shows state-of-the-art token density (i.e., the number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M-pixel image, 75% fewer than most models. This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support multimodal live streaming on end-side devices such as an iPad.
- 💫 Easy Usage
  - MiniCPM-o 2.6 can be easily used in various ways, as described in the Quick Start section.
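A hedged sketch of the speech path, following the audio-input pattern from the MiniCPM-o model card: an audio clip goes into the message content, and flags request a spoken reply in addition to text. The file names are placeholders, and the parameter names (`use_tts_template`, `generate_audio`, `output_audio_path`) are assumptions to verify against the model card.

```python
# Speech conversation sketch: audio in, text + synthesized speech out.
# Builds on the `model`/`tokenizer` from the Quick Start sketch above.
import librosa

model.init_tts()  # load the TTS decoder used for spoken replies

# 16 kHz mono audio is the conventional input format for the audio encoder.
audio, _ = librosa.load('question.wav', sr=16000, mono=True)  # placeholder file
msgs = [{'role': 'user', 'content': [audio, 'Answer the question in the audio.']}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    use_tts_template=True,          # assumed flag: format the reply for speech
    generate_audio=True,            # assumed flag: also synthesize audio
    output_audio_path='reply.wav',  # assumed: where the spoken reply is written
)
print(answer)
```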
Model Architecture
- End-to-end Omni-modal Architecture
  - Different modality encoders/decoders are connected and trained in an end-to-end fashion to fully exploit rich multimodal knowledge.
- Omni-modal Live Streaming Mechanism
  - (1) We change the offline modality encoders/decoders into online ones for streaming inputs/outputs.
  - (2) We devise a time-division multiplexing (TDM) mechanism for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (see the sketch after this list).
- Configurable Speech Modeling Design
  - We devise a multimodal system prompt, including a traditional text system prompt and a new audio system prompt that determines the assistant's voice. This enables flexible voice configuration at inference time and also facilitates end-to-end voice cloning and description-based voice creation.
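To make the TDM idea concrete, here is a small conceptual sketch (not the actual implementation): parallel per-modality streams are cut into small periodic time slices and flattened into a single sequence that the LLM backbone consumes in order. All names here are illustrative.

```python
# Conceptual time-division multiplexing (TDM): parallel omni-modality
# streams are split into periodic time slices and serialized for the LLM.
from typing import Dict, List

def tdm_interleave(streams: Dict[str, List], slice_len: int) -> List:
    """Interleave parallel per-modality token streams into one sequence.

    streams   -- modality name -> time-ordered tokens (e.g. video, audio)
    slice_len -- tokens each modality contributes per time slice
    """
    sequence = []
    n_slices = max((len(s) + slice_len - 1) // slice_len for s in streams.values())
    for t in range(n_slices):                 # walk periodic time slices
        for name, tokens in streams.items():  # fixed modality order per slice
            chunk = tokens[t * slice_len:(t + 1) * slice_len]
            sequence.extend((name, tok) for tok in chunk)
    return sequence

# Example: two parallel streams become V V A A V V A A ...
print(tdm_interleave({'video': list('VVVV'), 'audio': list('AAAA')}, 2))
```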
📚 Documentation
Evaluation
Visual understanding results
Image Understanding:
| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | | | | | | | | |
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini 1.5 Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| **Open Source** | | | | | | | | | | | | | | | | | | |
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
+ Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
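As a quick sanity check of this definition, the table's figures follow directly from it; for example, MiniCPM-o 2.6's density of 2822 comes from a 1344x1344 maximum-resolution image (the ~1.8M-pixel example above) encoded into 640 visual tokens:

```python
# Token density = pixels at maximum resolution / number of visual tokens.
max_pixels = 1344 * 1344        # 1,806,336 pixels (~1.8M-pixel example)
num_visual_tokens = 640         # tokens MiniCPM-o 2.6 produces for that image
print(round(max_pixels / num_visual_tokens))  # -> 2822, matching the table
```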
Multi-image and Video Understanding:
| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| **Open-source** | | | | | |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-OneVision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
* We evaluate officially released checkpoints by ourselves.
Audio understanding and speech conversation results.
Audio Understanding:
| Task | Size | ASR (zh) | | | ASR (en) | | | AST | | Emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| Metric | | CER↓ | | | WER↓ | | | BLEU↑ | | ACC↑ |
| Dataset | | AISHELL-1 | Fleurs zh | WenetSpeech test-net | LibriSpeech test-clean | GigaSpeech | TED-LIUM | CoVoST en2zh | CoVoST zh2en | MELD emotion |
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini 1.5 Pro | - | 4.5* | 5.9* | 14.3 | | | | | | |