A GPT-4V Level Multimodal LLM on Your Phone
Bring a GPT-4V level multimodal large language model to your mobile device, offering high-performance multimodal processing capabilities.
GitHub | Demo | WeChat
Quick Start
This README provides a detailed introduction to MiniCPM-Llama3-V 2.5, including its features, evaluation results, deployment methods, and usage examples. You can quickly understand and start using this model by referring to the following content.
Features
MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series, built on SigLip-400M and Llama3-8B-Instruct with 8B parameters in total. It shows a significant performance improvement over MiniCPM-V 2.0, with the following notable features:
- Leading Performance: It achieves an average score of 65.1 on OpenCompass, surpassing widely used proprietary models such as GPT-4V-1106, Gemini Pro, Claude 3, and Qwen-VL-Max with only 8B parameters.
- Strong OCR Capabilities: It can process images with any aspect ratio and up to 1.8 million pixels, scoring over 700 on OCRBench and outperforming proprietary models such as GPT-4o, GPT-4V-0409, Qwen-VL-Max, and Gemini Pro. It also provides enhanced full-text OCR extraction, table-to-markdown conversion, and other capabilities.
- Trustworthy Behavior: Using the latest RLAIF-V method, it achieves a 10.3% hallucination rate on Object HalBench, lower than GPT-4V-1106.
- Multilingual Support: Thanks to Llama 3's multilingual capabilities and cross-lingual generalization techniques, it supports over 30 languages.
- Efficient Deployment: It uses model quantization, CPU, NPU, and compilation optimizations for high-efficiency deployment on edge devices. For phones with Qualcomm chips, it integrates the NPU acceleration framework QNN into llama.cpp, achieving a 150-fold speedup in on-device image encoding and a 3-fold increase in language decoding speed.
- Easy Usage: It can be used in multiple ways, including efficient CPU inference on local devices via llama.cpp and ollama, GGUF quantized models in 16 sizes, efficient LoRA fine-tuning with only 2 V100 GPUs, streaming output, local WebUI demos built with Gradio and Streamlit, and interactive demos on HuggingFace Spaces.
Evaluation
- Results on TextVQA, DocVQA, OCRBench, OpenCompass MultiModal Avg, MME, MMBench, MMMU, MathVista, LLaVA Bench, RealWorld QA, and Object HalBench.
- Evaluation results on the multilingual LLaVA Bench.
Examples
- Deployment on end devices: the demo video is a raw, unedited screen recording on a Xiaomi 14 Pro.
Installation
Mobile phone deployment is coming soon.
Usage Examples
Basic Usage
Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements (tested on Python 3.10):
Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load the model in half precision on the GPU; trust_remote_code is required for the custom chat interface
model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, torch_dtype=torch.float16)
model = model.to(device='cuda')
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

# Prepare the input image and the conversation history
image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': question}]

res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
)
print(res)
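If deterministic output is preferred, sampling can be disabled. The snippet below is a minimal sketch assuming that sampling=False makes model.chat fall back to deterministic decoding (e.g. beam search) instead of temperature sampling.

# Minimal sketch: deterministic decoding
# (assumption: sampling=False switches model.chat from temperature sampling to deterministic decoding)
res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=False,
)
print(res)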
Advanced Usage
# Streaming output: stream=True makes model.chat return a generator of text chunks
res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
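To continue a conversation, the previous reply can be appended to the message history before asking a follow-up question. This is a minimal sketch assuming model.chat accepts assistant turns in msgs; the follow-up question is purely illustrative.

# Minimal multi-turn sketch (assumption: model.chat accepts assistant turns in msgs)
msgs.append({'role': 'assistant', 'content': generated_text})  # reply from the previous turn
msgs.append({'role': 'user', 'content': 'Describe the main object in more detail.'})  # illustrative follow-up

follow_up = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
)
print(follow_up)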
Please refer to GitHub for more usage details.
Documentation
Inference with llama.cpp
MiniCPM-Llama3-V 2.5 can run with llama.cpp now! See our fork of llama.cpp for more details.
Int4 quantized version
Download the int4 quantized version for lower GPU memory usage (8 GB): MiniCPM-Llama3-V-2_5-int4.
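The int4 checkpoint can be loaded with the same Transformers interface as the fp16 model. The snippet below is a minimal sketch under that assumption; 4-bit checkpoints typically also need bitsandbytes and accelerate installed, which is stated here as an assumption rather than a documented requirement.

# Minimal sketch: load the int4 quantized checkpoint
# (assumptions: same loading pattern as the fp16 model; bitsandbytes/accelerate available)
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5-int4', trust_remote_code=True)
model.eval()
# model.chat is then used exactly as in the basic usage example above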
MiniCPM-V 2.0
See the information about MiniCPM-V 2.0 here.
License
Model License
- The code in this repo is released under the Apache-2.0 License.
- The usage of MiniCPM-V series model weights must strictly follow MiniCPM Model License.md.
- The models and weights of MiniCPM are completely free for academic research. After filling out a registration questionnaire, they are also available free of charge for commercial use.
Statement
- As an LLM, MiniCPM-Llama3-V 2.5 generates content by learning from large amounts of text, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-Llama3-V 2.5 does not represent the views and positions of the model developers.
- We will not be liable for any problems arising from the use of the MiniCPM-V open-source models, including but not limited to data security issues, public opinion risks, or any risks and problems arising from the misguidance, misuse, dissemination, or improper use of the models.
Technical Details
Welcome to explore the key techniques of MiniCPM-V 2.6 and other multimodal projects of our team:
VisCPM | RLHF-V | LLaVA-UHD | RLAIF-V
Citation
If you find our work helpful, please consider citing our papers and liking this project!
@article{yao2024minicpmv,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and Chen, Qianyu and Zhou, Huarong and Zou, Zhensheng and Zhang, Haoye and Hu, Shengding and Zheng, Zhi and Zhou, Jie and Cai, Jie and Han, Xu and Zeng, Guoyang and Li, Dahai and Liu, Zhiyuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2408.01800},
  year={2024}
}
News
Pinned
- [2025.01.14] We open-source MiniCPM-o 2.6, with significant performance improvements over MiniCPM-V 2.6, and support for real-time speech-to-speech conversation and multimodal live streaming. Try it now!
- [2024.08.10] MiniCPM-Llama3-V 2.5 is now fully supported by official llama.cpp! GGUF models of various sizes are available here.
- [2024.08.06] We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single-image, multi-image, and video understanding. It advances the popular features of MiniCPM-Llama3-V 2.5 and supports real-time video understanding on iPad. Try it now!
- [2024.08.03] The MiniCPM-Llama3-V 2.5 technical report is released! See here.
- [2024.07.19] MiniCPM-Llama3-V 2.5 now supports vLLM! See here.
- [2024.05.28] We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics here.
- [2024.05.23] MiniCPM-V tops GitHub Trending and HuggingFace Trending! Our demo, recommended by Hugging Face Gradio's official account, is available here. Come and try it out!
- [2024.05.20] We open-source MiniCPM-Llama3-V 2.5. It has improved OCR capability, supports 30+ languages, and is the first end-side MLLM to achieve GPT-4V-level performance! We provide efficient inference and simple fine-tuning. Try it now!