A GPT-4V Level Multimodal LLM on Your Phone
Bring a GPT-4V level multimodal large language model to your mobile device, offering high-performance multimodal processing capabilities.
GitHub | Demo | WeChat
Quick Start
This README provides a detailed introduction to MiniCPM-Llama3-V 2.5, including its features, evaluation results, deployment methods, and usage examples. You can quickly understand and start using this model by referring to the following content.
Features
MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series, built on SigLip-400M and Llama3-8B-Instruct with 8B parameters in total. It shows a significant performance improvement over MiniCPM-V 2.0, with the following notable features:
- Leading Performance: It achieves an average score of 65.1 on OpenCompass, surpassing widely used proprietary models such as GPT-4V-1106, Gemini Pro, Claude 3, and Qwen-VL-Max with only 8B parameters.
- Strong OCR Capabilities: It can process images with any aspect ratio and up to 1.8 million pixels, scoring over 700 on OCRBench and outperforming proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max, and Gemini Pro. It also provides enhanced full-text OCR extraction, table-to-markdown conversion, and other capabilities.
- Trustworthy Behavior: Using the latest RLAIF-V method, it achieves a 10.3% hallucination rate on Object HalBench, lower than GPT-4V-1106.
- Multilingual Support: Thanks to Llama 3's multilingual capabilities and cross-lingual generalization techniques, it supports over 30 languages.
- Efficient Deployment: It uses model quantization, CPU, NPU, and compilation optimizations for high-efficiency deployment on edge devices. For phones with Qualcomm chips, it integrates the NPU acceleration framework QNN into llama.cpp, achieving a 150-fold speedup in on-device image encoding and a 3-fold increase in language decoding speed.
- Easy Usage: It can be used in multiple ways, including efficient CPU inference on local devices via llama.cpp and ollama, GGUF quantized models in 16 sizes, efficient LoRA fine-tuning with only 2 V100 GPUs, streaming output, local WebUI demos built with Gradio and Streamlit, and interactive demos on HuggingFace Spaces.
Evaluation
- Results on TextVQA, DocVQA, OCRBench, OpenCompass MultiModal Avg, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, and Object HalBench.
- Evaluation results on the multilingual LLaVA Bench.
Examples
- Deployment on end devices: the demo video is a raw, unedited screen recording on a Xiaomi 14 Pro.
Installation
Mobile phone deployment is coming soon.
Usage Examples
Basic Usage
Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements (tested on Python 3.10):
Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the model in half precision on the GPU; trust_remote_code is required for the custom chat interface
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, torch_dtype=torch.float16)
model = model.to(device='cuda')
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

# Prepare the input image and the conversation history
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': question}]

res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
)
print(res)
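If deterministic output is preferred, sampling can be disabled. The snippet below is a minimal sketch assuming that sampling=False makes model.chat fall back to deterministic decoding (e.g. beam search) instead of temperature sampling.

# Minimal sketch: deterministic decoding
# (assumption: sampling=False switches model.chat from temperature sampling to deterministic decoding)
res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=False,
)
print(res)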
Advanced Usage
# Streaming output: stream=True makes model.chat return a generator of text chunks
res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
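To continue a conversation, the previous reply can be appended to the message history before asking a follow-up question. This is a minimal sketch assuming model.chat accepts assistant turns in msgs; the follow-up question is purely illustrative.

# Minimal multi-turn sketch (assumption: model.chat accepts assistant turns in msgs)
msgs.append({'role': 'assistant', 'content': generated_text})  # reply from the previous turn
msgs.append({'role': 'user', 'content': 'Describe the main object in more detail.'})  # illustrative follow-up

follow_up = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
)
print(follow_up)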
Please refer to GitHub for more usage details.
Documentation
Inference with llama.cpp
MiniCPM-Llama3-V 2.5 can run with llama.cpp now! See our fork of llama.cpp for more details.
Int4 quantized version
Download the int4 quantized version for lower GPU memory usage (8 GB): MiniCPM-Llama3-V-2_5-int4.
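The int4 checkpoint can be loaded with the same Transformers interface as the fp16 model. The snippet below is a minimal sketch under that assumption; 4-bit checkpoints typically also need bitsandbytes and accelerate installed, which is stated here as an assumption rather than a documented requirement.

# Minimal sketch: load the int4 quantized checkpoint
# (assumptions: same loading pattern as the fp16 model; bitsandbytes/accelerate available)
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
model.eval()
# model.chat is then used exactly as in the basic usage example above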
MiniCPM-V 2.0
See the information about MiniCPM-V 2.0 here.
License
Model License
- The code in this repo is released under the Apache-2.0 License.
- The usage of MiniCPM-V series model weights must strictly follow MiniCPM Model License.md.
- The models and weights of MiniCPM are completely free for academic research. After filling out a registration questionnaire, they are also available free of charge for commercial use.
Statement
- As an LLM, MiniCPM-Llama3-V 2.5 generates content by learning from large amounts of text, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-Llama3-V 2.5 does not represent the views and positions of the model developers.
- We will not be liable for any problems arising from the use of the MiniCPM-V open-source models, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the misguidance, misuse, dissemination, or improper use of the models.
Technical Details
Welcome to explore the key techniques of MiniCPM-V 2.6 and other multimodal projects of our team:
VisCPM | RLHF-V | LLaVA-UHD | RLAIF-V
Citation
If you find our work helpful, please consider citing our papers and liking this project!
@article{yao2024minicpmv,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and Chen, Qianyu and Zhou, Huarong and Zou, Zhensheng and Zhang, Haoye and Hu, Shengding and Zheng, Zhi and Zhou, Jie and Cai, Jie and Han, Xu and Zeng, Guoyang and Li, Dahai and Liu, Zhiyuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
News
Pinned
- [2025.01.14] We open-source MiniCPM-o 2.6, with significant performance improvements over MiniCPM-V 2.6, and support for real-time speech-to-speech conversation and multimodal live streaming. Try it now!
- [2024.08.10] MiniCPM-Llama3-V 2.5 is now fully supported by official llama.cpp! GGUF models of various sizes are available here.
- [2024.08.06] We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single-image, multi-image, and video understanding. It advances the popular features of MiniCPM-Llama3-V 2.5 and supports real-time video understanding on iPad. Try it now!
- [2024.08.03] The MiniCPM-Llama3-V 2.5 technical report is released! See here.
- [2024.07.19] MiniCPM-Llama3-V 2.5 now supports vLLM! See here.
- [2024.05.28] We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics here.
- [2024.05.23] MiniCPM-V tops GitHub Trending and HuggingFace Trending! Our demo, recommended by Hugging Face Gradio's official account, is available here. Come and try it out!
- [2024.05.20] We open-source MiniCPM-Llama3-V 2.5. It has improved OCR capability, supports 30+ languages, and is the first end-side MLLM to achieve GPT-4V-level performance! We provide efficient inference and simple fine-tuning. Try it now!