
MiniCPM-V 2.6

Developed by jchevallard
MiniCPM-V 2.6 is the latest and most powerful multimodal large model in the MiniCPM-V series, supporting single-image, multi-image, and video understanding with leading performance and extreme efficiency.
Downloads: 118
Release Time: 8/30/2024

Model Overview

MiniCPM-V 2.6 is a multimodal large model built on SigLip-400M and Qwen2-7B, with a total of 8 billion parameters. It supports single-image, multi-image, and video understanding, featuring powerful OCR and multilingual capabilities, suitable for various vision and language tasks.

Model Features

Leading Performance
In the OpenCompass evaluation (an average over eight popular benchmarks), MiniCPM-V 2.6 achieved a score of 65.2, surpassing widely used proprietary models such as GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding.
Multi-image Understanding and Context Learning
Supports cross-image dialogue and reasoning, achieving SOTA results on multi-image benchmarks such as Mantis-Eval, BLINK, Mathverse-mv, and Sciverse-mv, while also showing promising in-context learning capability.
Video Understanding
Supports video input for spatiotemporal dialogue and dense descriptions. Outperforms GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on the Video-MME benchmark.
Powerful OCR and Other Capabilities
Processes images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344), achieving SOTA results on OCRBench and surpassing proprietary models such as GPT-4o, GPT-4V, and Gemini 1.5 Pro.
Extreme Efficiency
Features state-of-the-art token density (the number of image pixels encoded per visual token), processing a 1.8-million-pixel image with only 640 visual tokens, 75% fewer than most models, which directly improves inference speed, first-token latency, memory usage, and power consumption.
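For context, that token density works out to roughly 1344 x 1344 ≈ 1.8 million pixels divided by 640 visual tokens, i.e. about 2,800 pixels encoded per token.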
Ready-to-Use
Offers multiple usage methods, including local CPU inference, quantized models, vLLM inference, fine-tuning for new domains/tasks, fast local WebUI deployment, and online demos.
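
A minimal sketch of the local inference path, following the usage pattern published in the upstream openbmb/MiniCPM-V-2_6 repository on Hugging Face: it loads the model with Transformers and calls the chat interface exposed by the model's custom code. The image path, prompt, and CUDA/bfloat16 setup are placeholder assumptions.

```python
# Single-image chat sketch for MiniCPM-V 2.6 (Hugging Face id: openbmb/MiniCPM-V-2_6).
# Assumes a CUDA GPU with bfloat16 support; the image path and question are placeholders.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6",
    trust_remote_code=True,      # the checkpoint ships custom modeling/chat code
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-2_6", trust_remote_code=True)

image = Image.open("menu.jpg").convert("RGB")       # placeholder image
question = "What dishes are listed on this menu?"   # placeholder prompt
msgs = [{"role": "user", "content": [image, question]}]

# model.chat() is the conversational entry point provided by the custom model code.
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```

The same call covers the OCR use case by swapping in a prompt such as "Transcribe all text in this image"; the quantized (int4/GGUF) and vLLM paths follow the upstream documentation.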

Model Capabilities

Single-image understanding
Multi-image understanding
Video understanding
OCR
Multilingual support
In-context learning
Cross-image dialogue and reasoning
Spatiotemporal dialogue
Dense descriptions

Use Cases

Image Understanding
OCR: Recognize text in images. Achieved SOTA results on OCRBench.
Multi-image Comparison: Compare similarities and differences across multiple images (a usage sketch follows this section). Achieved SOTA results on multi-image benchmarks such as Mantis-Eval and BLINK.
Video Understanding
Video Content Analysis: Analyze spatiotemporal information in videos. Outperformed GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on the Video-MME benchmark.
Multilingual Applications
Multilingual Menu Translation: Translate multilingual menus in images. Supports multiple languages including Chinese, English, German, French, Italian, and Korean.
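
For the multi-image comparison use case above, a minimal sketch assuming the same openbmb/MiniCPM-V-2_6 checkpoint and chat interface as in the earlier example; the image paths and prompt are placeholders, and multiple images are simply passed together in a single user turn.

```python
# Multi-image comparison sketch; assumes the same upstream chat interface as above.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6", trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-2_6", trust_remote_code=True)

# Placeholder images to compare.
image_a = Image.open("product_a.jpg").convert("RGB")
image_b = Image.open("product_b.jpg").convert("RGB")
question = "Compare these two images and describe their similarities and differences."

# Multiple images are placed together in one user turn; the model reasons across them.
msgs = [{"role": "user", "content": [image_a, image_b, question]}]
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```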