MiniCPM4
MiniCPM4 is a highly efficient large language model (LLM) specifically designed for end-side devices. It achieves remarkable efficiency improvements through systematic innovation in model architecture, training data, training algorithms, and inference systems, offering high-performance text generation capabilities in resource-constrained environments.

GitHub Repo | Technical Report
Quick Start
To get started quickly, pick the inference backend that fits your needs (CPM.cu, vLLM, SGLang, or Transformers), install its dependencies from the Installation section, and run the matching code example from the Usage Examples section below to try MiniCPM4's text generation capabilities.
Features
What's New
- [2025.06.06] The MiniCPM4 series is released! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips! You can find the technical report here.
MiniCPM4 Series
The MiniCPM4 series comprises highly efficient large language models (LLMs) designed explicitly for end-side devices, achieving efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.
- [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship of MiniCPM4, with 8B parameters, trained on 8T tokens.
- [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
- [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Eagle head trained with QAT for FRSpec, efficiently integrating speculation and quantization to achieve ultra acceleration for MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Extreme ternary quantization applied to MiniCPM4-0.5B, compressing model parameters into ternary values and achieving a 90% reduction in bit width.
- [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Extreme ternary quantization applied to MiniCPM3-1B, compressing model parameters into ternary values and achieving a 90% reduction in bit width.
- [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, accepts users' queries as input and autonomously generates trustworthy, long-form survey papers.
- [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, accepts users' queries and available MCP tools as input and autonomously calls relevant MCP tools to satisfy users' requirements.
Introduction
MiniCPM4 is an extremely efficient edge-side large model that has undergone efficient optimization across four dimensions: model architecture, learning algorithms, training data, and inference systems, achieving ultimate efficiency improvements.
- Efficient Model Architecture:
  - InfLLM v2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention mechanism in which each token only needs to compute relevance with less than 5% of tokens when processing 128K long texts, significantly reducing computational overhead for long texts (see the illustrative sketch after this list).
- Efficient Learning Algorithms:
  - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for downstream task performance, enabling more precise model training configuration search.
  - BitCPM -- Ultimate Ternary Quantization: Compresses model parameters to ternary values, achieving a 90% reduction in model bit width.
  - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing combined with a multi-token prediction training strategy.
- High-Quality Training Data:
  - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data-cleaning strategies based on efficient data verification, and open-sources the high-quality Chinese and English pre-training dataset [UltraFineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb).
  - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale, high-quality supervised fine-tuning datasets covering knowledge-intensive, reasoning-intensive, instruction-following, long-text-understanding, and tool-calling data.
- Efficient Inference System:
  - FRSpec -- Lightweight Speculative Sampling: Accelerates speculative sampling by pruning the draft model's vocabulary.
  - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities.
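To make the sparse attention idea above concrete, here is a minimal, self-contained sketch of block-level top-k attention: keys and values are grouped into fixed-size blocks, each query scores the blocks by their mean key, and full attention is computed only over the selected blocks. This is an illustration of the general technique only, not the actual InfLLM v2 kernel; the block_topk_attention helper and all shapes are hypothetical.
# Illustrative block-level top-k sparse attention (NOT the real InfLLM v2 kernel).
# Each query scores key blocks by their mean key and attends only to the top-k blocks.
import torch
import torch.nn.functional as F

def block_topk_attention(q, k, v, block_size=64, topk=4):
    # q: (n_q, d), k/v: (n_kv, d); single head, no batching, for clarity only.
    n_kv, d = k.shape
    n_blocks = (n_kv + block_size - 1) // block_size
    pad = n_blocks * block_size - n_kv
    k_blocks = F.pad(k, (0, 0, 0, pad)).view(n_blocks, block_size, d)
    v_blocks = F.pad(v, (0, 0, 0, pad)).view(n_blocks, block_size, d)

    # Coarse relevance: query vs. mean key of each block.
    block_keys = k_blocks.mean(dim=1)                      # (n_blocks, d)
    block_scores = q @ block_keys.T                        # (n_q, n_blocks)
    topk = min(topk, n_blocks)
    sel = block_scores.topk(topk, dim=-1).indices          # (n_q, topk)

    # Dense attention restricted to the selected blocks per query.
    out = torch.empty_like(q)
    for i in range(q.shape[0]):
        ks = k_blocks[sel[i]].reshape(-1, d)               # (topk * block_size, d)
        vs = v_blocks[sel[i]].reshape(-1, d)
        attn = F.softmax((q[i] @ ks.T) / d ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out

q = torch.randn(8, 128)
k = torch.randn(4096, 128)
v = torch.randn(4096, 128)
print(block_topk_attention(q, k, v).shape)  # torch.Size([8, 128])
In InfLLM v2 the block selection itself is trainable; the pure-Python loop here is only for readability, while the real implementation lives in the infllmv2_cuda_impl library referenced later in this card.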
Installation
Using Quantized Eagle Speculative Decoding with [vLLM](https://github.com/vllm-project/vllm)
For now, you need to install the latest version of vLLM.
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
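After installation, a quick import check (not part of the original instructions) confirms that the nightly wheel is the one being picked up:
# Verify that vLLM is importable and print its version (illustrative check only).
import vllm
print(vllm.__version__)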
Inference with CPM.cu
You can install CPM.cu by running the following command:
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install
Inference with [SGLang](https://github.com/sgl-project/sglang)
For now, you need to install our forked version of SGLang.
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
Usage Examples
Using Quantized Eagle Speculative Decoding with [vLLM](https://github.com/vllm-project/vllm)
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "openbmb/MiniCPM4-8B-marlin-vLLM"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_batched_tokens=32768,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
    speculative_config={
        "method": "eagle",
        "model": "openbmb/MiniCPM4-8B-marlin-Eagle-vLLM",
        "num_speculative_tokens": 2,
        "max_model_len": 32768,
    },
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Inference with Quantized MiniCPM4-8B using [vLLM](https://github.com/vllm-project/vllm)
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "openbmb/MiniCPM4-8B-marlin-vLLM"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_batched_tokens=32768,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Inference with CPM.cu
MiniCPM4 natively supports context lengths of up to 32,768 tokens. To reproduce the long-text acceleration effect reported in the paper, we recommend using the LongRoPE factors that have been validated. Change the `rope_scaling` field in the `config.json` file as follows to enable LongRoPE.
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"original_max_position_embeddings": 32768
}
}
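If you prefer to apply this change programmatically rather than editing the file by hand, the following hedged sketch downloads a local copy of the checkpoint with huggingface_hub and patches config.json; it assumes the huggingface_hub package is installed, the local_dir path is arbitrary, and you should paste the long_factor/short_factor lists from the JSON above where indicated.
# Hedged sketch: patch the rope_scaling field of a local MiniCPM4-8B copy to enable LongRoPE.
import json
from huggingface_hub import snapshot_download

local_dir = snapshot_download("openbmb/MiniCPM4-8B", local_dir="MiniCPM4-8B-local")

rope_scaling = {
    "rope_type": "longrope",
    "long_factor": [],   # paste the long_factor list from the JSON above
    "short_factor": [],  # paste the short_factor list from the JSON above
    "original_max_position_embeddings": 32768,
}

config_path = f"{local_dir}/config.json"
with open(config_path) as f:
    config = json.load(f)
config["rope_scaling"] = rope_scaling
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
print("Patched", config_path)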
After modification, you can run the following command to reproduce the long-context acceleration effect (the script will automatically download the model weights from Hugging Face).
python3 tests/test_generate.py
Inference with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)
path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
# User can directly use the chat interface
# response, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(response)
# User can also use the generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)
model_outputs = model.generate(
    model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
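If you would rather see tokens as they are produced instead of waiting for the full completion, a small variant of the example above (reusing the model, tokenizer, and model_inputs defined there) can pass a TextStreamer to generate; this is an optional illustration, not part of the original example.
# Optional variant: stream decoded text to stdout as it is generated.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7,
    streamer=streamer,  # prints decoded tokens incrementally as they are generated
)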
MiniCPM4-8B supports InfLLM v2, a sparse attention mechanism designed for efficient long-sequence inference. It requires the `infllmv2_cuda_impl` library.
You can install it by running the following command:
git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install
To enable InfLLM v2, you need to add the `sparse_config` field to `config.json`:
{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}
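As a quick sanity check (not part of the original instructions), you can confirm that a locally patched config.json exposes the new field before loading the full model; the local path below is hypothetical.
# Check that the sparse_config field is visible to transformers (illustrative only).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("MiniCPM4-8B-local", trust_remote_code=True)  # hypothetical local copy
print(getattr(config, "sparse_config", None))  # should print the dict above once config.json is patched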
Inference with [SGLang](https://github.com/sgl-project/sglang)
You can start the inference server by running the following command:
python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml
Then you can use the chat interface by running the following command:
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
response = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
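The same OpenAI-compatible endpoint also supports streaming; below is a minimal sketch reusing the client from above (illustrative, not from the original card).
# Stream the completion chunk by chunk from the running SGLang server.
stream = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()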
Documentation
Evaluation Results
On two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 demonstrates significantly faster processing speed than similar-size models on long-text tasks. As text length increases, MiniCPM4's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, MiniCPM4 achieves approximately a 7x decoding speed improvement over Qwen3-8B.
Comprehensive Evaluation
MiniCPM4 launches end-side versions with 8B and 0.5B parameter scales, both achieving best-in-class performance in their respective categories.
Long Text Evaluation
MiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long-text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.

Statement
- As a language model, MiniCPM generates content by learning from a vast amount of text.
- However, it does not possess the ability to comprehend or express personal opinions or value judgments.
License
This project is licensed under the Apache-2.0 license.

