🚀 MiniCPM4
MiniCPM4 is an efficient large language model designed specifically for end-side devices. Through systematic innovation across four key dimensions (model architecture, training data, training algorithms, and inference systems), it delivers extreme efficiency gains while maintaining best-in-class performance at the same scale, achieving more than a 5x generation speedup on typical end-side chips.

🚀 Quick Start
The MiniCPM4 series has been released. You can pick the inference approach that fits your use case from the options below.
✨ Key Features
Rich model family
- MiniCPM4-8B: The flagship model of MiniCPM4, with 8 billion parameters, trained on 8T tokens.
- MiniCPM4-0.5B: The small version of MiniCPM4, with 0.5 billion parameters, trained on 1T tokens.
- MiniCPM4-8B-Eagle-FRSpec: Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
- MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu: Eagle head trained with QAT for FRSpec, effectively combining speculation and quantization to achieve ultra acceleration for MiniCPM4-8B.
- MiniCPM4-8B-Eagle-vLLM: Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- MiniCPM4-8B-marlin-Eagle-vLLM: Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- BitCPM4-0.5B: Applies extreme ternary quantization to MiniCPM4-0.5B, compressing model parameters into ternary values for a 90% reduction in bit width.
- BitCPM4-1B: Applies extreme ternary quantization to MiniCPM3-1B, compressing model parameters into ternary values for a 90% reduction in bit width.
- MiniCPM4-Survey: Based on MiniCPM4-8B, takes a user query as input and autonomously generates trustworthy, long-form survey papers.
- MiniCPM4-MCP: Based on MiniCPM4-8B, takes a user query and the available MCP tools as input and autonomously calls the relevant MCP tools to satisfy the user's request.
Efficiency optimization across multiple dimensions
- 🏗️ Efficient Model Architecture: Adopts InfLLM v2, a trainable sparse attention mechanism; when processing 128K-long text, each token computes relevance with fewer than 5% of the tokens, drastically reducing the computational cost of long texts.
- 🧠 Efficient Learning Algorithms: Introduces Model Wind Tunnel 2.0, an efficient predictable-scaling method, enabling more precise search over model training configurations; applies BitCPM extreme ternary quantization, compressing model parameters to three values for a 90% reduction in bit width (see the sketch after this list); combines FP8 low-precision computation with a multi-token prediction training strategy.
- 📚 High-Quality Training Data: Builds UltraClean, a strategy for filtering and generating high-quality pre-training data, and open-sources UltraFinweb, a high-quality Chinese-English pre-training dataset; builds UltraChat v2, a high-quality supervised fine-tuning dataset covering knowledge-intensive, reasoning-intensive, instruction-following, long-text-understanding, and tool-calling data.
- ⚡ Efficient Inference System: Integrates CPM.cu, a lightweight, efficient CUDA inference framework that combines sparse attention, model quantization, and speculative sampling for efficient prefilling and decoding; supports ArkInfer, a cross-platform deployment system offering flexible cross-platform adaptation.
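As a rough, purely illustrative sketch of the ternary-quantization idea behind BitCPM (the released models are produced with quantization-aware training, not this post-hoc rounding), the helper below maps a full-precision weight matrix to {-1, 0, +1} with a single per-tensor scale:
import torch

def ternarize(weight: torch.Tensor):
    # Absmean-style ternarization: one floating-point scale per tensor,
    # every weight rounded into {-1, 0, +1}.
    scale = weight.abs().mean().clamp(min=1e-8)
    w_ternary = (weight / scale).round().clamp(-1, 1)
    return w_ternary, scale

w = torch.randn(8, 8)
w_q, s = ternarize(w)
print(w_q.unique())                 # tensor([-1., 0., 1.])
print((w - w_q * s).abs().mean())   # reconstruction error of the ternary approximation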
📦 Installation
Install CPM.cu
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install
Install the InfLLM v2 dependencies
git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install
Install SGLang
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
Install vLLM
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
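A quick sanity check after installation (a minimal sketch; it only assumes that the vLLM wheel exposes __version__ and that torch can see a GPU, which the inference examples below require):
import torch
import vllm

# Confirm the nightly wheel imports and a CUDA device is visible.
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())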
💻 Usage Examples
Inference with CPM.cu
Modify the rope_scaling field in the config.json file to enable LongRoPE:
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"original_max_position_embeddings": 32768
}
}
Run the following command to reproduce the long-context acceleration:
python3 tests/test_generate.py
Inference with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)
path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
# Users can use the chat interface directly
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)
# Users can also use the generate interface
messages = [
{"role": "user", "content": "Write an article about Artificial Intelligence."},
]
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)
model_outputs = model.generate(
model_inputs,
max_new_tokens=1024,
top_p=0.7,
temperature=0.7
)
output_token_ids = [
model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
Inference with SGLang
Launch the inference server:
python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml
Then use the chat interface:
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
response = client.chat.completions.create(
model="openbmb/MiniCPM4-8B",
messages=[
{"role": "user", "content": "Write an article about Artificial Intelligence."},
],
temperature=0.7,
max_tokens=1024,
)
print(response.choices[0].message.content)
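The same endpoint also supports streaming through the standard OpenAI client; a minimal sketch with the server and model name unchanged from above:
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)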
Inference with vLLM
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "openbmb/MiniCPM4-8B"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
llm = LLM(
model=model_name,
trust_remote_code=True,
max_num_batched_tokens=32768,
dtype="bfloat16",
gpu_memory_utilization=0.8,
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
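vLLM can also expose the model through its OpenAI-compatible server rather than the offline LLM API. A minimal sketch, assuming the server was started with something like `vllm serve openbmb/MiniCPM4-8B --trust-remote-code` on the default port 8000:
import openai

# Query the OpenAI-compatible endpoint started by `vllm serve` (default port 8000).
client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing."}],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)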
📚 Detailed Documentation
InfLLM v2 Configuration
Add a sparse_config field to the config.json file to enable InfLLM v2:
{
...,
"sparse_config": {
"kernel_size": 32,
"kernel_stride": 16,
"init_blocks": 1,
"block_size": 64,
"window_size": 2048,
"topk": 64,
"use_nope": false,
"dense_len": 8192
}
}
These parameters control the behavior of InfLLM v2:
- kernel_size (default: 32): size of the semantic kernels.
- kernel_stride (default: 16): stride between adjacent kernels.
- init_blocks (default: 1): number of initial blocks each query token attends to, ensuring attention to the beginning of the sequence.
- block_size (default: 64): block size of the key-value blocks.
- window_size (default: 2048): size of the local sliding window.
- topk (default: 64): each token computes attention only with the top-k most relevant key-value blocks.
- use_nope (default: false): whether to use the NOPE technique in block selection to improve performance.
- dense_len (default: 8192): since sparse attention brings limited benefit for short sequences, the model uses standard (dense) attention for sequences shorter than dense_len tokens and switches to sparse attention beyond that length. Set this to -1 to always use sparse attention regardless of sequence length.
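As a back-of-the-envelope check of the "fewer than 5%" claim above, the snippet below estimates how much of a 128K context a single query token attends to under the default sparse_config (a rough sketch; the kernel's exact accounting, e.g. for overlapping blocks, may differ slightly):
# Rough estimate of the attended fraction of a 128K context per query token.
seq_len     = 128 * 1024   # 131,072 tokens
block_size  = 64
topk        = 64
window_size = 2048
init_blocks = 1

attended = topk * block_size + window_size + init_blocks * block_size
print(f"attended tokens ~= {attended} ({attended / seq_len:.1%} of the context)")
# -> roughly 6,208 tokens, i.e. under 5% of a 128K context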
LongRoPE Configuration
MiniCPM4 natively supports context lengths of up to 32,768 tokens. For conversations whose total length (input plus output) significantly exceeds this limit, we recommend using RoPE scaling to handle long texts effectively. By modifying the LongRoPE factors, the model's performance has been validated on context lengths of up to 131,072 tokens.
Adjust the rope_scaling field in the config.json file:
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"original_max_position_embeddings": 32768
}
}
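After editing the file, you can confirm that transformers picks up the new field; a minimal check, assuming the modified checkpoint has been downloaded to a local directory (here hypothetically ./MiniCPM4-8B):
from transformers import AutoConfig

# Load the locally modified config and print the rope_scaling block added above.
cfg = AutoConfig.from_pretrained("./MiniCPM4-8B", trust_remote_code=True)
print(cfg.rope_scaling["rope_type"])                         # "longrope"
print(cfg.rope_scaling["original_max_position_embeddings"])  # 32768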
🔧 Technical Details
Evaluation Results
Efficiency Evaluation
On Jetson AGX Orin and RTX 4090, two typical end-side chips, MiniCPM4 processes long texts markedly faster than models of similar size, and its efficiency advantage grows as the text gets longer. On the Jetson AGX Orin platform, MiniCPM4 achieves roughly a 7x decoding speedup over Qwen3-8B.
Comprehensive Evaluation
MiniCPM4 is released in end-side versions with 8 billion and 0.5 billion parameters, each achieving best-in-class performance in its size category.
Long-Text Evaluation
MiniCPM4 is pre-trained on 32K-long texts and extended to longer contexts via YaRN. On the 128K needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance.
📄 License
This repository and the MiniCPM models are released under the Apache-2.0 license.
Statement
- As a language model, MiniCPM generates content by learning from large amounts of text.
- However, it does not possess the ability to comprehend or express personal opinions or value judgments.
- Any content generated by MiniCPM does not represent the views or positions of the model developers.
- Therefore, when using content generated by MiniCPM, users must take full responsibility for evaluating and verifying it.
Citation
If you find our work valuable, please cite our paper.
@article{minicpm4,
title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
author={MiniCPM Team},
year={2025}
}



