MiniCPM-V
MiniCPM-V (i.e., OmniLMM-3B) is an efficient multimodal model. It's built on SigLip-400M and MiniCPM-2.4B, connected by a perceiver resampler. It offers high efficiency, promising performance, and bilingual support, making it suitable for various deployment scenarios.
Quick Start
- Demo: Click here to try out the Demo of MiniCPM-V.
- Mobile Deployment: Currently, MiniCPM-V can be deployed on mobile phones with Android and Harmony operating systems. Try it here.
Features
News
- [2025.01.14] We open-source MiniCPM-o 2.6, with a significant performance improvement over MiniCPM-V 2.6 and support for real-time speech-to-speech conversation and multimodal live streaming. Try it now.
- [2024.08.06] We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single-image, multi-image, and video understanding. It advances the popular features of MiniCPM-Llama3-V 2.5 and supports real-time video understanding on iPad.
- [2024.05.20] GPT-4V-level multimodal model MiniCPM-Llama3-V 2.5 is out.
- [2024.04.11] MiniCPM-V 2.0 is out.
Key Features
- High Efficiency: MiniCPM-V can be deployed efficiently on most GPU cards, personal computers, and even end devices such as mobile phones. For visual encoding, it compresses image representations into 64 tokens via a perceiver resampler, far fewer than MLP-based LMMs typically use (see the sketch after this list). This results in much lower memory cost and higher inference speed.
- Promising Performance: MiniCPM-V achieves state-of-the-art performance on multiple benchmarks (including MMMU, MME, and MMBench) among models of comparable size, surpassing existing LMMs built on Phi-2. It even performs comparably to or better than the 9.6B Qwen-VL-Chat.
- Bilingual Support: MiniCPM-V is the first end-deployable LMM supporting bilingual multimodal interaction in English and Chinese. This is achieved by generalizing multimodal capabilities across languages, a technique from the ICLR 2024 spotlight paper.
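The 64-token compression above works by letting a small, fixed set of learned query vectors cross-attend to the full sequence of visual patch features, so the language model only ever sees 64 visual tokens regardless of image resolution. Below is a minimal, illustrative PyTorch sketch of that idea; the class name, hidden size, head count, and patch count are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch: a fixed set of learned query tokens cross-attends to the
    (much longer) sequence of visual features, compressing it to a constant
    number of tokens (64 in MiniCPM-V). Hyper-parameters are illustrative."""
    def __init__(self, dim=1152, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_feats):  # visual_feats: (batch, n_patches, dim)
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (batch, 64, dim)
        out, _ = self.attn(q, visual_feats, visual_feats)  # queries attend to patches
        return self.norm(out)                              # (batch, 64, dim)

# Usage: e.g., 1024 patch features are compressed to 64 tokens.
feats = torch.randn(1, 1024, 1152)
print(PerceiverResampler()(feats).shape)  # torch.Size([1, 64, 1152])
```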
Evaluation
| Model | Size | MME | MMB dev (en) | MMB dev (zh) | MMMU val | CMMMU val |
|---|---|---|---|---|---|---|
| LLaVA-Phi | 3.0B | 1335 | 59.8 | - | - | - |
| MobileVLM | 3.0B | 1289 | 59.6 | - | - | - |
| Imp-v1 | 3B | 1434 | 66.5 | - | - | - |
| Qwen-VL-Chat | 9.6B | 1487 | 60.6 | 56.7 | 35.9 | 30.7 |
| CogVLM | 17.4B | 1438 | 63.7 | 53.8 | 32.1 | - |
| MiniCPM-V | 3B | 1452 | 67.9 | 65.3 | 37.2 | 32.1 |
Usage Examples
Basic Usage
Inference using Hugging Face Transformers on NVIDIA GPUs or Macs with MPS (Apple silicon or AMD GPUs). Requirements tested on Python 3.10:
```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.36.0
sentencepiece==0.1.99
```
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer (trust_remote_code is required for the custom model code).
model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to(device='cuda', dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True)
model.eval()

# Prepare an image and a single-turn question.
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': question}]

# Run chat inference; sampling with temperature 0.7 gives more varied answers.
res, context, _ = model.chat(
    image=image,
    msgs=msgs,
    context=None,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7
)
print(res)
```
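The snippet above targets CUDA. Since inference is also supported on Macs with MPS, the device and dtype can be selected at runtime instead of hard-coding `cuda`. A minimal sketch, assuming float16 is acceptable on the MPS backend (bfloat16 support there is limited); the fallback choices are illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Pick a backend: CUDA on NVIDIA GPUs, MPS on Apple silicon, CPU as a fallback.
if torch.cuda.is_available():
    device, dtype = 'cuda', torch.bfloat16
elif torch.backends.mps.is_available():
    device, dtype = 'mps', torch.float16   # assumption: fp16 on MPS, since bf16 support is limited
else:
    device, dtype = 'cpu', torch.float32

model = AutoModel.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True, torch_dtype=dtype)
model = model.to(device=device, dtype=dtype)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V', trust_remote_code=True)
model.eval()
```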
Please refer to GitHub for more details about usage.
License
Model License
- The code in this repo is released under the Apache-2.0 License.
- Usage of the MiniCPM-V series model weights must strictly follow the MiniCPM Model License.md.
- The models and weights of MiniCPM are completely free for academic research. After filling out a "questionnaire" for registration, they are also available for free commercial use.
Statement
- As an LMM, MiniCPM-V generates content by learning from a large amount of data, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-V does not represent the views and positions of the model developers.
- We will not be liable for any problems arising from the use of the MiniCPM-V open-source model, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misdirection, misuse, or improper dissemination of the model.