
MiniCPM-o 2.6

Developed by openbmb
MiniCPM-o 2.6 is a GPT-4o-level multimodal large language model that runs on mobile devices and supports vision, voice, and live stream processing.
Downloads 178.38k
Release Date: 1/12/2025

Model Overview

MiniCPM-o 2.6 is built on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B, combined in an end-to-end omni-modal architecture with 8B parameters in total. It improves significantly on MiniCPM-V 2.6 and adds real-time voice conversation and multimodal live stream processing.
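
As a rough sanity check on the 8B figure, the sketch below adds up the nominal sizes that appear in the component names. It assumes those name suffixes can be read as approximate parameter counts; the real checkpoints may differ, so treat the result as a ballpark only. The dictionary and function names are illustrative.

```python
# Nominal component sizes in millions of parameters, read from the
# suffixes in the component names (an assumption, not measured counts).
NOMINAL_PARAMS_MILLIONS = {
    "SigLip-400M": 400,          # vision encoder
    "Whisper-medium-300M": 300,  # audio encoder
    "ChatTTS-200M": 200,         # speech decoder
    "Qwen2.5-7B": 7000,          # language model backbone
}


def total_billions(parts: dict[str, int]) -> float:
    """Sum nominal sizes given in millions and return the total in billions."""
    return sum(parts.values()) / 1000


total = total_billions(NOMINAL_PARAMS_MILLIONS)
print(f"Nominal total: {total:.1f}B parameters")
```

Under these assumptions the nominal total rounds to the 8B stated above, with the language model backbone accounting for most of it.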

Model Features

Top-Tier Visual Capabilities
Outperforms commercial closed-source models such as GPT-4o-202405 and Gemini 1.5 Pro in the OpenCompass comprehensive evaluation, which covers 8 major benchmarks.
Leading Voice Technology
Supports real-time bilingual (Chinese-English) voice conversations with configurable tones, and surpasses GPT-4o-realtime on audio understanding tasks such as automatic speech recognition (ASR) and speech-to-text (STT) translation.
Powerful Live Stream Processing
Supports continuous video and audio stream input with real-time voice interaction, and achieves the best real-time video understanding results among open-source models.
Exceptional OCR Capabilities
Ranks first on OCRBench among models under 25B parameters. Handles images of any aspect ratio, up to 1.8 million pixels.
Ultimate Efficiency
High visual token density (2822 pixels per token), enabling smooth multimodal live streaming on end-side devices such as the iPad.
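
The token density figure can be turned into a visual token budget with simple arithmetic. The sketch below is illustrative only: it divides a pixel count by the stated 2822 pixels per token, and it assumes 1344x1344 as an example resolution near the 1.8 million pixel limit (that resolution is not given on this page). The function name is hypothetical.

```python
def implied_visual_tokens(total_pixels: int, pixels_per_token: float) -> int:
    """Estimate how many visual tokens an image needs at a given token density."""
    return round(total_pixels / pixels_per_token)


PIXELS_PER_TOKEN = 2822      # density stated in the feature list
MAX_PIXELS = 1_800_000       # stated upper bound on image size

# Token budget at the stated 1.8 million pixel limit.
tokens_at_limit = implied_visual_tokens(MAX_PIXELS, PIXELS_PER_TOKEN)

# Assumed example resolution close to that limit (not from this page).
tokens_square = implied_visual_tokens(1344 * 1344, PIXELS_PER_TOKEN)

print(tokens_at_limit, tokens_square)
```

At this density, a maximum-size image costs only a few hundred visual tokens, which is what keeps per-frame cost low enough for live streaming on a tablet.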

Model Capabilities

Visual Understanding
Speech Recognition
Speech Synthesis
Real-time Voice Conversation
Multi-Image Processing
Video Understanding
OCR
Voice Cloning
Live Stream Processing
Multilingual Support

Use Cases

Smart Assistant
  Real-time Voice Assistant
    Supports real-time bilingual (Chinese-English) voice interaction with configurable tones and emotional styles.
    Ranked first in both semantic and audio quality evaluations in AudioArena.
  Multimodal Customer Service
    Processes voice, image, and text inputs simultaneously to provide comprehensive solutions.
    Outperformed GPT-4o in the MMHal-Bench trustworthiness evaluation.
Content Processing
  Live Stream Content Analysis
    Processes live video streams in real time for content understanding and interaction.
    Surpassed GPT-4o-202408 on the StreamingBench benchmark.
  Document OCR
    High-precision recognition of documents with any aspect ratio.
    Ranked first on OCRBench among models under 25B parameters.
Creative Applications
  Voice Cloning
    Supports end-to-end voice cloning and tone generation from a text description.
    Performed strongly on the Seed-TTS test set.
  Multimodal Creation
    Generates creative content based on visual and voice inputs.