🚀 MiniCPM4
MiniCPM4 is an efficient large language model designed for end-side devices. Through systematic innovation across four key dimensions (model architecture, training data, training algorithms, and inference systems), it delivers extreme efficiency gains while maintaining best-in-class performance at the same scale, achieving more than a 5x generation speedup on typical end-side chips.

🚀 Quick Start
The MiniCPM4 series has been released. Choose the inference method that best fits your scenario from the options below.
✨ Key Features
A Rich Model Family
- MiniCPM4-8B: The flagship MiniCPM4 model, with 8B parameters, trained on 8T tokens.
- MiniCPM4-0.5B: The small version of MiniCPM4, with 0.5B parameters, trained on 1T tokens.
- MiniCPM4-8B-Eagle-FRSpec: Eagle head for FRSpec, accelerating speculative inference for MiniCPM4-8B.
- MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu: Eagle head trained with QAT for FRSpec, effectively combining speculation and quantization to achieve ultra acceleration for MiniCPM4-8B.
- MiniCPM4-8B-Eagle-vLLM: Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- MiniCPM4-8B-marlin-Eagle-vLLM: Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- BitCPM4-0.5B: Applies extreme ternary quantization to MiniCPM4-0.5B, compressing model parameters into ternary values and achieving a 90% reduction in bit width.
- BitCPM4-1B: Applies extreme ternary quantization to MiniCPM3-1B, compressing model parameters into ternary values and achieving a 90% reduction in bit width.
- MiniCPM4-Survey: Based on MiniCPM4-8B, it takes a user query as input and autonomously generates trustworthy, long-form survey papers.
- MiniCPM4-MCP: Based on MiniCPM4-8B, it takes a user query together with the available MCP tools as input and autonomously calls the relevant MCP tools to satisfy the user's request.
Efficiency Optimization Across Multiple Dimensions
- 🏗️ Efficient Model Architecture: Adopts InfLLM v2, a trainable sparse attention mechanism, so that when processing 128K-long text each token only computes relevance against fewer than 5% of the tokens, significantly reducing the cost of long-text processing (a rough back-of-envelope check is given right after this list).
- 🧠 Efficient Learning Algorithms: Introduces Model Wind Tunnel 2.0, an efficient and predictable scaling method that enables more precise search over model training configurations; adopts BitCPM extreme ternary quantization, compressing model parameters to ternary values for a 90% reduction in bit width; combines FP8 low-precision computation with a Multi-token Prediction training strategy.
- 📚 High-Quality Training Data: Builds UltraClean, a strategy for filtering and generating high-quality pre-training data, and open-sources UltraFinweb, a high-quality Chinese and English pre-training dataset; builds UltraChat v2, a high-quality supervised fine-tuning dataset covering knowledge-intensive, reasoning-intensive, instruction-following, long-text-understanding, and tool-calling data.
- ⚡ Efficient Inference System: Integrates CPM.cu, a lightweight and efficient CUDA inference framework that combines sparse attention, model quantization, and speculative sampling for efficient prefill and decoding; supports ArkInfer, a cross-platform deployment system with flexible cross-platform adaptation.
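As a rough sanity check of the "fewer than 5%" figure above, the sketch below plugs in the default InfLLM v2 parameters documented later in this card (block_size 64, topk 64, window_size 2048, init_blocks 1) and ignores any overlap between the local window and the selected blocks; it is an illustrative estimate, not the exact attention pattern.
```python
# Back-of-envelope estimate of how many tokens each query attends to under
# InfLLM v2's default settings, for a 128K context. Illustrative only.
context_len = 128 * 1024   # 131072 tokens
block_size = 64            # size of each key-value block
topk = 64                  # top-k key-value blocks selected per query
window_size = 2048         # local sliding window
init_blocks = 1            # initial blocks always attended

attended = topk * block_size + window_size + init_blocks * block_size
print(f"~{attended} of {context_len} tokens ({attended / context_len:.1%})")
# -> ~6208 of 131072 tokens (4.7%), consistent with the "< 5%" claim
```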
📦 Installation
Install CPM.cu
```bash
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install
```
Install the InfLLM v2 dependency
```bash
git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e .  # or python setup.py install
```
Install SGLang
```bash
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
```
Install vLLM
```bash
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
```
💻 Usage Examples
Inference with CPM.cu
Modify the `rope_scaling` field in the `config.json` file to enable LongRoPE:
```json
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"original_max_position_embeddings": 32768
    }
}
```
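If you would rather apply this change with a script, the minimal sketch below uses only the standard library; the checkpoint path is a placeholder for wherever you downloaded the model, and the `[...]` entries stand for the full 64-value lists shown above.
```python
# Minimal sketch: merge the rope_scaling block shown above into a local
# checkpoint's config.json. The path and the factor lists are placeholders.
import json
from pathlib import Path

config_path = Path("./MiniCPM4-8B/config.json")  # adjust to your local checkpoint
config = json.loads(config_path.read_text())

config["rope_scaling"] = {
    "rope_type": "longrope",
    "long_factor": [...],   # paste the full list from above
    "short_factor": [...],  # paste the full list from above
    "original_max_position_embeddings": 32768,
}

config_path.write_text(json.dumps(config, indent=2, ensure_ascii=False))
```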
Then run the following command to reproduce the long-context acceleration results:
```bash
python3 tests/test_generate.py
```
Inference with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(0)

path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# You can use the chat interface directly
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)

# Or use the generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(device)

model_outputs = model.generate(
    model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7
)
output_token_ids = [
    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```
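For interactive use you can also stream tokens as they are generated; the sketch below reuses the `model`, `tokenizer`, and `model_inputs` objects from the example above and relies on transformers' `TextStreamer`.
```python
# Optional: stream the generated text to stdout as it is produced.
from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7,
    streamer=streamer,
)
```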
Inference with SGLang
Start the inference server:
```bash
python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml
```
Then call it through the chat interface:
```python
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
)

print(response.choices[0].message.content)
```
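The same endpoint also supports streaming responses through the standard OpenAI client; a minimal sketch:
```python
# Minimal streaming sketch against the SGLang server started above.
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
stream = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[{"role": "user", "content": "Write an article about Artificial Intelligence."}],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```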
Inference with vLLM
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "openbmb/MiniCPM4-8B"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)

llm = LLM(
    model=model_name,
    trust_remote_code=True,
    max_num_batched_tokens=32768,
    dtype="bfloat16",
    gpu_memory_utilization=0.8,
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)

outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
```
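vLLM can also expose an OpenAI-compatible endpoint (for example by running `vllm serve openbmb/MiniCPM4-8B --trust-remote-code`), which can then be queried with the same client code used for SGLang above. A minimal sketch, assuming the server listens on vLLM's default port 8000:
```python
# Minimal sketch: query a vLLM OpenAI-compatible server.
# Start it first, e.g.:  vllm serve openbmb/MiniCPM4-8B --trust-remote-code
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing."}],
    temperature=0.7,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```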
📚 Documentation
InfLLM v2 Configuration
Add a `sparse_config` field to the `config.json` file to enable InfLLM v2:
```json
{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}
```
These parameters control the behavior of InfLLM v2:
- `kernel_size` (default: 32): size of the semantic kernels.
- `kernel_stride` (default: 16): stride between adjacent kernels.
- `init_blocks` (default: 1): number of initial blocks every query token attends to, ensuring attention to the beginning of the sequence.
- `block_size` (default: 64): block size of the key-value blocks.
- `window_size` (default: 2048): size of the local sliding window.
- `topk` (default: 64): each token computes attention only with the top-k most relevant key-value blocks.
- `use_nope` (default: false): whether to use the NOPE technique in block selection to improve performance.
- `dense_len` (default: 8192): since sparse attention offers limited benefit for short sequences, the model uses standard (dense) attention for sequences shorter than `dense_len` tokens and switches to sparse attention beyond that length. Set this to `-1` to always use sparse attention regardless of sequence length (see the illustrative sketch after this list).
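To make the `dense_len` switching rule concrete, here is a small illustrative sketch (not the actual InfLLM v2 implementation) of how the attention mode is chosen:
```python
# Illustrative only: how dense_len selects between dense and sparse attention.
def attention_mode(seq_len: int, dense_len: int = 8192) -> str:
    if dense_len == -1:                   # -1 means: always use sparse attention
        return "sparse"
    return "dense" if seq_len < dense_len else "sparse"

print(attention_mode(4096))       # dense  (short sequence)
print(attention_mode(131072))     # sparse (long sequence)
print(attention_mode(4096, -1))   # sparse (forced)
```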
LongRoPE Configuration
MiniCPM4 natively supports context lengths of up to 32,768 tokens. For conversations whose total length (input plus output) substantially exceeds this limit, we recommend RoPE scaling to handle long text effectively. By modifying the LongRoPE factors, the model's performance has been validated on context lengths of up to 131,072 tokens.
Adjust the `rope_scaling` field in the `config.json` file:
```json
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"original_max_position_embeddings": 32768
    }
}
```
🔧 Technical Details
Evaluation Results
Efficiency Evaluation
On Jetson AGX Orin and RTX 4090, two typical end-side chips, MiniCPM4 processes long text significantly faster than models of similar size, and its efficiency advantage grows as the text gets longer. On the Jetson AGX Orin platform, MiniCPM4 achieves roughly a 7x decoding speedup over Qwen3-8B.
Comprehensive Evaluation
MiniCPM4 is released in 8B- and 0.5B-parameter end-side versions, each achieving best-in-class performance in its size category.
Long-Text Evaluation
MiniCPM4 is pre-trained on 32K-long text and extends its context length with the YaRN technique. On the 128K needle-in-a-haystack task, MiniCPM4 performs excellently.
📄 License
This repository and the MiniCPM models are released under the Apache-2.0 license.
Statement
- As a language model, MiniCPM generates content by learning from large amounts of text.
- However, it is not capable of understanding or expressing personal opinions or value judgments.
- Any content generated by MiniCPM does not represent the views or positions of the model developers.
- Therefore, when using content generated by MiniCPM, users are solely responsible for evaluating and verifying it.
Citation
If you find our work valuable, please cite our paper.
```bibtex
@article{minicpm4,
  title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
  author={MiniCPM Team},
  year={2025}
}
```



