MiniCPM-V-2_6-RK3588-1.1.4
This version of MiniCPM-V-2_6 is optimized for the RK3588 NPU, offering high performance through hardware-specific quantization.
Dataset and Language Information

| Property | Details |
| --- | --- |
| Datasets | openbmb/RLAIF-V-Dataset |
| Language | Multilingual |
| Library Name | transformers |
| Pipeline Tag | image-text-to-text |
| Tags | minicpm-v, vision, ocr, multi-image, video, custom_code |
Quick Start
This version of MiniCPM-V-2_6 has been converted to run on the RK3588 NPU using w8a8, w8a8_g128, w8a8_g256, and w8a8_g512 quantization. The model has been optimized with a LoRA and is compatible with RKLLM version 1.1.4.
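As background on how such RKLLM artifacts are usually produced, below is a minimal, hypothetical export sketch using Rockchip's rkllm-toolkit. The calls (`RKLLM`, `load_huggingface`, `build`, `export_rkllm`) and parameter names follow the examples in the rknn-llm repository and may differ between toolkit releases; the output path and quantization choice are illustrative only, and the SigLIP vision encoder of MiniCPM-V is exported separately with rknn-toolkit2 rather than through this path.

```python
# Hypothetical export sketch with Rockchip's rkllm-toolkit (rknn-llm).
# Paths, parameters, and defaults are assumptions; consult the toolkit
# release matching the runtime (1.1.4 here) for the exact API.
from rkllm.api import RKLLM

llm = RKLLM()
# Load the language-model half of MiniCPM-V-2_6 from a Hugging Face checkpoint.
llm.load_huggingface(model='openbmb/MiniCPM-V-2_6')
# Quantize for the RK3588 NPU; other variants use 'w8a8_g128', 'w8a8_g256', 'w8a8_g512'.
llm.build(do_quantization=True, quantized_dtype='w8a8', target_platform='rk3588')
# Write the NPU-ready artifact.
llm.export_rkllm('./MiniCPM-V-2_6-rk3588-w8a8.rkllm')
```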
Useful Links
Features
Original Model Card for MiniCPM-V-2_6
MiniCPM-V 2.6 is the latest and most capable model in the MiniCPM-V series. Built on SigLip-400M and Qwen2-7B with a total of 8B parameters, it offers significant performance improvements and new features.
- 🔥 Leading Performance: Achieves an average score of 65.2 on the latest OpenCompass, surpassing popular proprietary models like GPT-4o mini, GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet for single-image understanding with only 8B parameters.
- 🖼️ Multi-Image Understanding and In-context Learning: Can perform conversation and reasoning over multiple images, achieving state-of-the-art performance on multi-image benchmarks and showing promising in-context learning capability.
- 🎬 Video Understanding: Accepts video inputs, performing conversation and providing dense captions for spatial-temporal information. Outperforms GPT-4V, Claude 3.5 Sonnet, and LLaVA-NeXT-Video-34B on Video-MME with and without subtitles.
- 💪 Strong OCR Capability and Others: Can process images with any aspect ratio and up to 1.8 million pixels, achieving state-of-the-art performance on OCRBench. Features trustworthy behavior with lower hallucination rates and supports multilingual capabilities.
- 🚀 Superior Efficiency: Shows state-of-the-art token density, improving inference speed, latency, memory usage, and power consumption. Can efficiently support real-time video understanding on end-side devices such as iPad.
- 💫 Easy Usage: Can be used in various ways, including local CPU inference, quantized models in different formats, high-throughput inference, fine-tuning, local WebUI demo setup, and an online web demo.
Evaluation
Single Image Results
* Evaluated using chain-of-thought prompting.
+ Token Density: number of pixels encoded into each visual token at maximum resolution.
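To make the Token Density definition above concrete, here is a small worked example. The 1344x1344 maximum resolution and 640 visual tokens used below are assumed, commonly cited figures for MiniCPM-V 2.6, included purely for illustration.

```python
# Token density = pixels at maximum resolution / number of visual tokens.
# 1344x1344 (~1.8 MP) and 640 tokens are assumed figures for MiniCPM-V 2.6.
max_pixels = 1344 * 1344
visual_tokens = 640
print(f"token density ~ {max_pixels / visual_tokens:.0f} pixels per visual token")  # ~2822
```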
Multi-image Results
* We evaluated the officially released checkpoint ourselves.
Video Results
Click to view few-shot results on TextVQA, VizWiz, VQAv2, OK-VQA.
* denotes zero image shot and two additional text shots following Flamingo.
+ Evaluated using the pretraining checkpoint without SFT.
Examples
Click to view more cases.
The model is deployed on end-side devices. The demo video is a raw screen recording on an iPad Pro, without any editing.
Documentation
Demo
Click here to try the Demo of MiniCPM-V 2.6.
Usage Examples
Basic Usage
Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements (tested on Python 3.10):
```
Pillow==10.1.0
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
decord
```
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(res)

# For streaming output, set sampling=True and stream=True;
# model.chat then returns a generator of text chunks.
res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
```
Advanced Usage
Chat with Multiple Images
Click to show Python code running MiniCPM-V 2.6 with multiple image inputs.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

image1 = Image.open('image1.jpg').convert('RGB')
image2 = Image.open('image2.jpg').convert('RGB')
question = 'Compare image 1 and image 2, tell me about the differences between image 1 and image 2.'
msgs = [{'role': 'user', 'content': [image1, image2, question]}]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
In-context Few-shot Learning
Click to view Python code running MiniCPM-V 2.6 with few-shot input.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

question = "production date"
image1 = Image.open('example1.jpg').convert('RGB')
answer1 = "2023.08.04"
image2 = Image.open('example2.jpg').convert('RGB')
answer2 = "2007.04.24"
image_test = Image.open('test.jpg').convert('RGB')

msgs = [
    {'role': 'user', 'content': [image1, question]}, {'role': 'assistant', 'content': [answer1]},
    {'role': 'user', 'content': [image2, question]}, {'role': 'assistant', 'content': [answer2]},
    {'role': 'user', 'content': [image_test, question]}
]

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer
)
print(answer)
```
Chat with Video
Click to view Python code running MiniCPM-V 2.6 with video input.
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # if cuda OOM set a smaller number

def encode_video(video_path):
    # Sample ~1 frame per second, capped at MAX_NUM_FRAMES roughly uniform frames.
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]
    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    return [Image.fromarray(v.astype('uint8')) for v in frames]

frames = encode_video('video_test.mp4')
question = 'Describe the video'
msgs = [{'role': 'user', 'content': frames + [question]}]
answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    use_image_id=False,  # recommended decode params for video input
    max_slice_nums=2     # use 1 if cuda OOM and video resolution > 448*448
)
print(answer)
```