🚀 A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
MiniCPM-o 2.6 is an advanced model that integrates vision, speech, and multimodal live-streaming capabilities, offering high-performance and efficient solutions for various applications.
GitHub | Online Demo | Technical Blog
News
- [2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique of MiniCPM-o, was accepted by CVPR 2025! The code, dataset, and paper are open-sourced!
- [2025.01.24] 📢📢📢 The MiniCPM-o 2.6 technical report is released! See Here.
- [2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!
🚀 Quick Start
MiniCPM-o 2.6 can be easily used in various ways (a minimal inference sketch follows the list):
- llama.cpp support for efficient CPU inference on local devices.
- int4 and GGUF format quantized models in 16 sizes.
- vLLM support for high-throughput and memory-efficient inference.
- Fine-tuning on new domains and tasks with LLaMA-Factory.
- Quick local WebUI demo setup with Gradio.
- Online web demo on server.
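For local inference with Hugging Face Transformers, the sketch below follows the chat-style pattern used across the MiniCPM series. It is a minimal, hedged example: the repo id `openbmb/MiniCPM-o-2_6` is the official checkpoint, while the image path and prompt are placeholders, and the exact loading flags (attention implementation, which sub-modules to initialize) are documented on the model card.

```python
# Minimal single-image chat sketch with Hugging Face Transformers.
# Assumes a CUDA GPU; 'example.jpg' is a placeholder image path.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    'openbmb/MiniCPM-o-2_6',
    trust_remote_code=True,        # the checkpoint ships custom modeling code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-o-2_6', trust_remote_code=True)

image = Image.open('example.jpg').convert('RGB')
msgs = [{'role': 'user', 'content': [image, 'Describe this image.']}]

# chat() is the high-level inference entry point exposed by the custom code;
# verify its exact arguments against the model card.
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```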
✨ Features
MiniCPM-o 2.6 Features
MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. It is built end-to-end based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, with a total of 8B parameters. It shows significant performance improvement over MiniCPM-V 2.6 and introduces new features for real-time speech conversation and multimodal live streaming.
- 🔥 Leading Visual Capability
  - MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. With only 8B parameters, it surpasses widely used proprietary models like GPT-4o-202405, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding. It also outperforms GPT-4V and Claude 3.5 Sonnet in multi-image and video understanding, and shows promising in-context learning capability.
- 🎙 State-of-the-art Speech Capability
  - MiniCPM-o 2.6 supports bilingual real-time speech conversation with configurable voices in English and Chinese. It outperforms GPT-4o-realtime on audio understanding tasks such as ASR and speech-to-text (STT) translation, and shows state-of-the-art performance on speech conversation in both semantic and acoustic evaluations in the open-source community. It also allows for fun features such as emotion/speed/style control, end-to-end voice cloning, and role play (see the speech sketch after this list).
- 🎬 Strong Multimodal Live Streaming Capability
  - As a new feature, MiniCPM-o 2.6 can accept continuous video and audio streams independent of user queries and supports real-time speech interaction. It outperforms GPT-4o-202408 and Claude 3.5 Sonnet, and shows state-of-the-art performance in the open-source community on StreamingBench, a comprehensive benchmark for real-time video understanding, omni-source (video & audio) understanding, and multimodal contextual understanding.
- 💪 Strong OCR Capability and Others
  - Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344). It achieves state-of-the-art performance on OCRBench among models under 25B parameters, surpassing proprietary models such as GPT-4o-202405.
  - Based on the latest RLAIF-V and VisCPM techniques, it features trustworthy behaviors, outperforming GPT-4o and Claude 3.5 Sonnet on MMHal-Bench, and supports multilingual capabilities in more than 30 languages.
- 🚀 Superior Efficiency
  - In addition to its friendly size, MiniCPM-o 2.6 also shows state-of-the-art token density (i.e., the number of pixels encoded into each visual token). It produces only 640 tokens when processing a 1.8M-pixel image, 75% fewer than most models. This directly improves inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support multimodal live streaming on end-side devices such as an iPad.
- 💫 Easy Usage
  - MiniCPM-o 2.6 can be easily used in various ways, as described in the Quick Start section.
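A hedged sketch of the speech path, following the audio-input pattern from the MiniCPM-o model card: an audio clip goes into the message content, and flags request a spoken reply in addition to text. The file names are placeholders, and the parameter names (`use_tts_template`, `generate_audio`, `output_audio_path`) are assumptions to verify against the model card.

```python
# Speech conversation sketch: audio in, text + synthesized speech out.
# Builds on the `model`/`tokenizer` from the Quick Start sketch above.
import librosa

model.init_tts()  # load the TTS decoder used for spoken replies

# 16 kHz mono audio is the conventional input format for the audio encoder.
audio, _ = librosa.load('question.wav', sr=16000, mono=True)  # placeholder file
msgs = [{'role': 'user', 'content': [audio, 'Answer the question in the audio.']}]

answer = model.chat(
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    use_tts_template=True,          # assumed flag: format the reply for speech
    generate_audio=True,            # assumed flag: also synthesize audio
    output_audio_path='reply.wav',  # assumed: where the spoken reply is written
)
print(answer)
```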
Model Architecture
- End-to-end Omni-modal Architecture
  - Different modality encoders/decoders are connected and trained in an end-to-end fashion to fully exploit rich multimodal knowledge.
- Omni-modal Live Streaming Mechanism
  - (1) We change the offline modality encoders/decoders into online ones for streaming inputs/outputs.
  - (2) We devise a time-division multiplexing (TDM) mechanism for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices (see the sketch after this list).
- Configurable Speech Modeling Design
  - We devise a multimodal system prompt, including a traditional text system prompt and a new audio system prompt that determines the assistant's voice. This enables flexible voice configuration at inference time and also facilitates end-to-end voice cloning and description-based voice creation.
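To make the TDM idea concrete, here is a small conceptual sketch (not the actual implementation): parallel per-modality streams are cut into small periodic time slices and flattened into a single sequence that the LLM backbone consumes in order. All names here are illustrative.

```python
# Conceptual time-division multiplexing (TDM): parallel omni-modality
# streams are split into periodic time slices and serialized for the LLM.
from typing import Dict, List

def tdm_interleave(streams: Dict[str, List], slice_len: int) -> List:
    """Interleave parallel per-modality token streams into one sequence.

    streams   -- modality name -> time-ordered tokens (e.g. video, audio)
    slice_len -- tokens each modality contributes per time slice
    """
    sequence = []
    n_slices = max((len(s) + slice_len - 1) // slice_len for s in streams.values())
    for t in range(n_slices):                 # walk periodic time slices
        for name, tokens in streams.items():  # fixed modality order per slice
            chunk = tokens[t * slice_len:(t + 1) * slice_len]
            sequence.extend((name, tok) for tok in chunk)
    return sequence

# Example: two parallel streams become V V A A V V A A ...
print(tdm_interleave({'video': list('VVVV'), 'audio': list('AAAA')}, 2))
```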
📚 Documentation
Evaluation
Visual understanding results
Image Understanding:
| Model | Size | Token Density+ | OpenCompass | OCRBench | MathVista mini | ChartQA | MMVet | MMStar | MME | MMB1.1 test | AI2D | MMMU val | HallusionBench | TextVQA val | DocVQA test | MathVerse mini | MathVision | MMHal Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary** | | | | | | | | | | | | | | | | | | |
| GPT-4o-20240513 | - | 1088 | 69.9 | 736 | 61.3 | 85.7 | 69.1 | 63.9 | 2328.7 | 82.2 | 84.6 | 69.2 | 55.0 | - | 92.8 | 50.2 | 30.4 | 3.6 |
| Claude3.5-Sonnet | - | 750 | 67.9 | 788 | 61.6 | 90.8 | 66.0 | 62.2 | 1920.0 | 78.5 | 80.2 | 65.9 | 49.9 | - | 95.2 | - | - | 3.4 |
| Gemini 1.5 Pro | - | - | 64.4 | 754 | 57.7 | 81.3 | 64.0 | 59.1 | 2110.6 | 73.9 | 79.1 | 60.6 | 45.6 | 73.5 | 86.5 | - | 19.2 | - |
| GPT-4o-mini-20240718 | - | 1088 | 64.1 | 785 | 52.4 | - | 66.9 | 54.8 | 2003.4 | 76.0 | 77.8 | 60.0 | 46.1 | - | - | - | - | 3.3 |
| **Open Source** | | | | | | | | | | | | | | | | | | |
| Cambrian-34B | 34B | 1820 | 58.3 | 591 | 50.3 | 75.6 | 53.2 | 54.2 | 2049.9 | 77.8 | 79.5 | 50.4 | 41.6 | 76.7 | 75.5 | - | - | - |
| GLM-4V-9B | 13B | 784 | 59.1 | 776 | 51.1 | - | 58.0 | 54.8 | 2018.8 | 67.9 | 71.2 | 46.9 | 45.0 | - | - | - | - | - |
| Pixtral-12B | 12B | 256 | 61.0 | 685 | 56.9 | 81.8 | 58.5 | 54.5 | - | 72.7 | 79.0 | 51.1 | 47.0 | 75.7 | 90.7 | - | - | - |
| DeepSeek-VL2-27B (4B) | 27B | 672 | 66.4 | 809 | 63.9 | 86.0 | 60.0 | 61.9 | 2253.0 | 81.2 | 83.8 | 54.0 | 45.3 | 84.2 | 93.3 | - | - | 3.0 |
| Qwen2-VL-7B | 8B | 784 | 67.1 | 866 | 58.2 | 83.0 | 62.0 | 60.7 | 2326.0 | 81.8 | 83.0 | 54.1 | 50.6 | 84.3 | 94.5 | 31.9 | 16.3 | 3.2 |
| LLaVA-OneVision-72B | 72B | 182 | 68.1 | 741 | 67.5 | 83.7 | 60.6 | 65.8 | 2261.0 | 85.0 | 85.6 | 56.8 | 49.0 | 80.5 | 91.3 | 39.1 | - | 3.5 |
| InternVL2.5-8B | 8B | 706 | 68.3 | 822 | 64.4 | 84.8 | 62.8 | 62.8 | 2344.0 | 83.6 | 84.5 | 56.0 | 50.1 | 79.1 | 93.0 | 39.5 | 19.7 | 3.4 |
| MiniCPM-V 2.6 | 8B | 2822 | 65.2 | 852* | 60.6 | 79.4 | 60.0 | 57.5 | 2348.4* | 78.0 | 82.1 | 49.8* | 48.1* | 80.1 | 90.8 | 25.7 | 18.3 | 3.6 |
| MiniCPM-o 2.6 | 8B | 2822 | 70.2 | 897* | 71.9* | 86.9* | 67.5 | 64.0 | 2372.0* | 80.5 | 85.8 | 50.4* | 51.9 | 82.0 | 93.5 | 41.4* | 23.1* | 3.8 |
* We evaluate this benchmark using chain-of-thought prompting. Specifically, for MME, we used this technique only for the Cognition set.
+ Token Density: the number of pixels encoded into each visual token at maximum resolution, i.e., # pixels at maximum resolution / # visual tokens.
Note: For proprietary models, we calculate token density based on the image encoding charging strategy defined in the official API documentation, which provides an upper-bound estimation.
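As a quick sanity check of this definition, the table's figures follow directly from it; for example, MiniCPM-o 2.6's density of 2822 comes from a 1344x1344 maximum-resolution image (the ~1.8M-pixel example above) encoded into 640 visual tokens:

```python
# Token density = pixels at maximum resolution / number of visual tokens.
max_pixels = 1344 * 1344        # 1,806,336 pixels (~1.8M-pixel example)
num_visual_tokens = 640         # tokens MiniCPM-o 2.6 produces for that image
print(round(max_pixels / num_visual_tokens))  # -> 2822, matching the table
```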
Multi-image and Video Understanding:
| Model | Size | BLINK val | Mantis Eval | MIRB | Video-MME (wo / w subs) |
|---|---|---|---|---|---|
| **Proprietary** | | | | | |
| GPT-4o-20240513 | - | 68.0 | - | - | 71.9/77.2 |
| GPT4V | - | 54.6 | 62.7 | 53.1 | 59.9/63.3 |
| **Open-source** | | | | | |
| LLaVA-NeXT-Interleave 14B | 14B | 52.6 | 66.4 | 30.2 | - |
| LLaVA-OneVision-72B | 72B | 55.4 | 77.6 | - | 66.2/69.5 |
| MANTIS 8B | 8B | 49.1 | 59.5 | 34.8 | - |
| Qwen2-VL-7B | 8B | 53.2 | 69.6* | 67.6* | 63.3/69.0 |
| InternVL2.5-8B | 8B | 54.8 | 67.7 | 52.5 | 64.2/66.9 |
| MiniCPM-V 2.6 | 8B | 53.0 | 69.1 | 53.8 | 60.9/63.6 |
| MiniCPM-o 2.6 | 8B | 56.7 | 71.9 | 58.6 | 63.9/67.9 |
* We evaluate officially released checkpoints by ourselves.
Audio understanding and speech conversation results.
Audio Understanding:
| Task | Size | ASR (zh) | | | ASR (en) | | | AST | | Emotion |
|---|---|---|---|---|---|---|---|---|---|---|
| Metric | | CER↓ | | | WER↓ | | | BLEU↑ | | ACC↑ |
| Dataset | | AISHELL-1 | Fleurs zh | WenetSpeech test-net | LibriSpeech test-clean | GigaSpeech | TED-LIUM | CoVoST en2zh | CoVoST zh2en | MELD emotion |
| **Proprietary** | | | | | | | | | | |
| GPT-4o-Realtime | - | 7.3* | 5.4* | 28.9* | 2.6* | 12.9* | 4.8* | 37.1* | 15.7* | 33.2* |
| Gemini 1.5 Pro | - | 4.5* | 5.9* | 14.3 | | | | | | |