MiniCPM-V
MiniCPM-V (i.e., OmniLMM-3B) is an efficient multimodal model. It's built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. It offers high efficiency, promising performance, and bilingual support, making it suitable for various deployment scenarios.
Quick Start
- Demo: Click here to try out the Demo of MiniCPM-V.
- Mobile Deployment: Currently, MiniCPM-V can be deployed on mobile phones with Android and Harmony operating systems. Try it here.
Features
News
- [2025.01.14] We open-source MiniCPM-o 2.6, with a significant performance improvement over MiniCPM-V 2.6 and support for real-time speech-to-speech conversation and multimodal live streaming. Try it now.
- [2024.08.06] We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single-image, multi-image, and video understanding. It advances the popular features of MiniCPM-Llama3-V 2.5 and supports real-time video understanding on iPad.
- [2024.05.20] GPT-4V-level multimodal model MiniCPM-Llama3-V 2.5 is out.
- [2024.04.11] MiniCPM-V 2.0 is out.
Key Features
- High Efficiency: MiniCPM-V can be deployed efficiently on most GPU cards, personal computers, and even end devices such as mobile phones. For visual encoding, it compresses image representations into 64 tokens via a perceiver resampler, far fewer than MLP-based LMMs typically use (see the sketch after this list). This results in much lower memory cost and higher inference speed.
- Promising Performance: MiniCPM-V achieves state-of-the-art performance on multiple benchmarks (including MMMU, MME, and MMBench) among models of comparable size, surpassing existing LMMs built on Phi-2. It even performs comparably to or better than the 9.6B Qwen-VL-Chat.
- Bilingual Support: MiniCPM-V is the first end-deployable LMM supporting bilingual multimodal interaction in English and Chinese. This is achieved by generalizing multimodal capabilities across languages, a technique from the ICLR 2024 spotlight paper.
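The 64-token compression above works by letting a small, fixed set of learned query vectors cross-attend to the full sequence of visual patch features, so the language model only ever sees 64 visual tokens regardless of image resolution. Below is a minimal, illustrative PyTorch sketch of that idea; the class name, hidden size, head count, and patch count are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch: a fixed set of learned query tokens cross-attends to the
    (much longer) sequence of visual features, compressing it to a constant
    number of tokens (64 in MiniCPM-V). Hyper-parameters are illustrative."""
    def __init__(self, dim=1152, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats):  # visual_feats: (batch, n_patches, dim)
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (batch, 64, dim)
        out, _ = self.attn(q, visual_feats, visual_feats)  # queries attend to patches
        return self.norm(out)                              # (batch, 64, dim)

# Usage: e.g., 1024 patch features are compressed to 64 tokens.
feats = torch.randn(1, 1024, 1152)
print(PerceiverResampler()(feats).shape)  # torch.Size([1, 64, 1152])
```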
Evaluation
| Model | Size | MME | MMB dev (en) | MMB dev (zh) | MMMU val | CMMMU val |
|---|---|---|---|---|---|---|
| LLaVA-Phi | 3.0B | 1335 | 59.8 | - | - | - |
| MobileVLM | 3.0B | 1289 | 59.6 | - | - | - |
| Imp-v1 | 3B | 1434 | 66.5 | - | - | - |
| Qwen-VL-Chat | 9.6B | 1487 | 60.6 | 56.7 | 35.9 | 30.7 |
| CogVLM | 17.4B | 1438 | 63.7 | 53.8 | 32.1 | - |
| MiniCPM-V | 3B | 1452 | 67.9 | 65.3 | 37.2 | 32.1 |
Usage Examples
Basic Usage
Inference using Hugging Face Transformers on NVIDIA GPUs or Macs with MPS (Apple silicon or AMD GPUs). Requirements tested on Python 3.10:
```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.36.0
sentencepiece==0.1.99
```
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer (trust_remote_code is required for the custom model code).
model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='cuda', dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True)
model.eval()

# Prepare an image and a single-turn question.
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': question}]

# Run chat inference; sampling with temperature 0.7 gives more varied answers.
res, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7
)
print(res)
```
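The snippet above targets CUDA. Since inference is also supported on Macs with MPS, the device and dtype can be selected at runtime instead of hard-coding `cuda`. A minimal sketch, assuming float16 is acceptable on the MPS backend (bfloat16 support there is limited); the fallback choices are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Pick a backend: CUDA on NVIDIA GPUs, MPS on Apple silicon, CPU as a fallback.
if torch.cuda.is_available():
    device, dtype = 'cuda', torch.bfloat16
elif torch.backends.mps.is_available():
    device, dtype = 'mps', torch.float16   # assumption: fp16 on MPS, since bf16 support is limited
else:
    device, dtype = 'cpu', torch.float32

model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, torch_dtype=dtype)
model = model.to(device=device, dtype=dtype)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True)
model.eval()
```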
Please refer to GitHub for more details about usage.
License
Model License
- The code in this repo is released under the Apache-2.0 License.
- Usage of the MiniCPM-V series model weights must strictly follow the MiniCPM Model License.md.
- The models and weights of MiniCPM are completely free for academic research. After filling out a "questionnaire" for registration, they are also available for free commercial use.
Statement
- As an LMM, MiniCPM-V generates content by learning from a large amount of data, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-V does not represent the views and positions of the model developers.
- We will not be liable for any problems arising from the use of the MiniCPM-V open-source model, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or improper dissemination of the model.