MiniCPM4-8B GGUF Models
MiniCPM4-8B GGUF models are highly efficient large language models designed for end-side (on-device) deployment. They deliver significant efficiency improvements across multiple dimensions, including model architecture, training data, learning algorithms, and inference systems.
Quick Start
Inference with CPM.cu
We recommend using CPM.cu for MiniCPM4 inference. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization techniques to fully leverage MiniCPM4's efficiency.
Install CPM.cu:
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install
To enable LongRoPE for long-text acceleration, modify the rope_scaling field in config.json:
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"original_max_position_embeddings": 32768
}
}
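If you would rather patch the checkpoint from a script than edit the file by hand, the following minimal sketch works under two assumptions: the model has been downloaded locally (the ./MiniCPM4-8B path is hypothetical) and the rope_scaling block above has been saved to a small JSON file (rope_scaling.json, also hypothetical):
import json

config_path = "./MiniCPM4-8B/config.json"   # hypothetical path to the local checkpoint
rope_patch_path = "rope_scaling.json"       # hypothetical file holding the JSON block above

# Load the rope_scaling object copied from this README.
with open(rope_patch_path, encoding="utf-8") as f:
    rope_scaling = json.load(f)["rope_scaling"]

# Insert it into the model's config.json to enable LongRoPE.
with open(config_path, encoding="utf-8") as f:
    config = json.load(f)
config["rope_scaling"] = rope_scaling
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)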
Run the following command to reproduce long-context acceleration:
python3 tests/test_generate.py
Inference with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)
path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
# User can directly use the chat interface
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)
# User can also use the generate interface
messages = [
{"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)
model_outputs = model.generate(
**model_inputs,
max_new_tokens=1024,
top_p=0.7,
temperature=0.7
)
output_token_ids = [
model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs['input_ids']))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
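If you prefer to see tokens as they are generated instead of waiting for generate() to return, Transformers' TextStreamer can be attached to the same call. A minimal sketch, reusing the model, tokenizer, and model_inputs defined above:
from transformers import TextStreamer

# Print decoded tokens to stdout as they are produced; skip_prompt avoids echoing the input prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7,
    streamer=streamer,
)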
Inference with SGLang
Install the forked version of SGLang:
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
Start the inference server:
python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml
Use the chat interface:
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
response = client.chat.completions.create(
model="openbmb/MiniCPM4-8B",
messages=[
{"role": "user", "content": "Write an article about Artificial Intelligence."},
],
temperature=0.7,
max_tokens=1024,
)
print(response.choices[0].message.content)
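The OpenAI-compatible endpoint also supports streaming responses; a brief sketch against the same SGLang server:
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

# Request a streamed response and print each delta as it arrives.
stream = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()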
Inference with vLLM
Install the latest version of vLLM:
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
Inference with vLLM:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "openbmb/MiniCPM4-8B"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
llm = LLM(
model=model_name,
trust_remote_code=True,
max_num_batched_tokens=32768,
dtype="bfloat16",
gpu_memory_utilization=0.8,
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
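Because llm.generate accepts a list of prompts, several requests can be batched through the engine in a single call. A short sketch reusing the llm, tokenizer, and sampling_params objects from above (the example questions are arbitrary):
# Batch several chat prompts through the same engine in one call.
questions = [
    "Please recommend 5 tourist attractions in Beijing.",
    "Write an article about Artificial Intelligence.",
]
batched_inputs = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    for q in questions
]
outputs = llm.generate(prompts=batched_inputs, sampling_params=sampling_params)
for question, output in zip(questions, outputs):
    print(f"Q: {question}\nA: {output.outputs[0].text}\n")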
Start the inference server:
vllm serve openbmb/MiniCPM4-8B
Use the chat interface:
import openai
client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="openbmb/MiniCPM4-8B",
messages=[
{"role": "user", "content": "Write an article about Artificial Intelligence."},
],
temperature=0.7,
max_tokens=1024,
extra_body=dict(add_special_tokens=True), # Ensures special tokens are added for chat template
)
print(response.choices[0].message.content)
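Multi-turn conversations follow the standard OpenAI chat format: append the assistant's reply to the message list and send the follow-up turn. A brief sketch against the same server (the follow-up question is arbitrary):
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]

first = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=messages,
    temperature=0.7,
    max_tokens=1024,
    extra_body=dict(add_special_tokens=True),
)

# Carry the assistant turn forward and ask a follow-up question.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Now summarize that article in three sentences."})

second = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=messages,
    temperature=0.7,
    max_tokens=256,
    extra_body=dict(add_special_tokens=True),
)
print(second.choices[0].message.content)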
Features
- Efficient Model Architecture: Adopts InfLLM v2, a trainable sparse attention mechanism in which each token computes relevance with fewer than 5% of tokens when processing 128K-long text, significantly reducing long-text computational overhead.
- Efficient Learning Algorithms:
  - Model Wind Tunnel 2.0: Introduces scaling prediction methods for downstream task performance, enabling more precise searches over model training configurations.
  - BitCPM: Compresses model parameters to ternary values, achieving an extreme 90% reduction in bit-width.
  - Efficient Training Engineering Optimization: Combines FP8 low-precision computation with a multi-token prediction training strategy.
- High-Quality Training Data:
  - UltraClean: Builds iterative data-cleaning strategies based on efficient data verification, open-sourcing the high-quality Chinese and English pre-training dataset [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb).
  - UltraChat v2: Constructs large-scale, high-quality supervised fine-tuning datasets covering multiple dimensions.
- Efficient Inference System:
  - CPM.cu: Integrates sparse attention, model quantization, and speculative sampling for efficient prefilling and decoding.
  - ArkInfer: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities.
Installation
Install CPM.cu
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install
Install infllmv2_cuda_impl
git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install
Install the forked version of SGLang
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
Install vLLM
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
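As an optional sanity check after installation (a small sketch, not part of the official instructions), you can confirm that whichever backend you installed is importable:
# Optional: report which inference backends are importable and their versions.
import importlib

for name in ("torch", "vllm", "sglang"):
    try:
        module = importlib.import_module(name)
        print(f"{name}: {getattr(module, '__version__', 'unknown version')}")
    except ImportError:
        print(f"{name}: not installed")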
Usage Examples
Basic Usage with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)
path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
messages = [
{"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)
model_outputs = model.generate(
**model_inputs,
max_new_tokens=1024,
top_p=0.7,
temperature=0.7
)
output_token_ids = [
model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs['input_ids']))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
Advanced Usage with vLLM
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "openbmb/MiniCPM4-8B"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
llm = LLM(
model=model_name,
trust_remote_code=True,
max_num_batched_tokens=32768,
dtype="bfloat16",
gpu_memory_utilization=0.8,
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Documentation
Model Generation Details
This model was generated using llama.cpp at commit 7f4fbe51.
Quantization Beyond the IMatrix
I've been experimenting with a new quantization approach. Standard IMatrix quantization underperforms at lower bit depths, especially for Mixture of Experts (MoE) models. I'm using the --tensor-type option in llama.cpp to manually increase the precision of important layers. See [Layer bumping with llama.cpp](https://github.com/Mungert69/GGUFModelBuilder/blob/main/model-converter/tensor_list_builder.py). This increases the model file size but significantly improves precision for a given quantization level.
MiniCPM4 Series
- [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship model with 8B parameters, trained on 8T tokens.
- [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version with 0.5B parameters, trained on 1T tokens.
- [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Accelerates speculative inference for MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Integrates speculation and quantization for ultra-fast acceleration of MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Accelerates speculative inference for MiniCPM4-8B in vLLM format.
- [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Applies extreme ternary quantization to MiniCPM4-0.5B, reducing bit-width by 90%.
- [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Applies extreme ternary quantization to MiniCPM3-1B, reducing bit-width by 90%.
- [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, generates trustworthy survey papers.
- [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, calls relevant MCP tools to meet user requirements.
What's New
- [2025.06.06] The MiniCPM4 series is released! It achieves ultimate efficiency improvements while maintaining optimal performance at the same scale, with over 5x generation acceleration on typical end-side chips. See the technical report here.
Technical Details
InfLLM v2
MiniCPM4 - 8B supports InfLLM v2
, a sparse attention mechanism for efficient long - sequence inference. It requires the infllmv2_cuda_impl library.
To enable InfLLM v2, add the sparse_config
field in config.json
:
{
...,
"sparse_config": {
"kernel_size": 32,
"kernel_stride": 16,
"init_blocks": 1,
"block_size": 64,
"window_size": 2048,
"topk": 64,
"use_nope": false,
"dense_len": 8192
}
}
These parameters control the behavior of InfLLM v2:
- kernel_size (default: 32): The size of semantic kernels.
- kernel_stride (default: 16): The stride between adjacent kernels.
- init_blocks (default: 1): The number of initial blocks that every query token attends to.
- block_size (default: 64): The block size for key-value blocks.
- window_size (default: 2048): The size of the local sliding window.
- topk (default: 64): Each token computes attention with only the top-k most relevant key-value blocks.
- use_nope (default: false): Whether to use the NOPE technique in block selection.
- dense_len (default: 8192): The model uses dense attention for sequences shorter than dense_len tokens and switches to sparse attention for longer sequences. Set this to -1 to always use sparse attention.
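As a rough, illustrative sketch only (this is not the actual infllmv2_cuda_impl kernel, and the block relevance scores are generated randomly here), the snippet below shows how dense_len, block_size, init_blocks, window_size, and topk could interact when choosing which key-value blocks a query attends to:
import torch

def select_kv_blocks(scores: torch.Tensor, seq_len: int, dense_len: int = 8192,
                     block_size: int = 64, init_blocks: int = 1,
                     window_size: int = 2048, topk: int = 64) -> torch.Tensor:
    """Toy block selection for a single query position.

    scores holds one relevance score per key-value block; how InfLLM v2 actually
    derives these scores (via the semantic kernels controlled by kernel_size and
    kernel_stride) is outside the scope of this sketch.
    """
    num_blocks = scores.numel()
    # Below dense_len tokens, the model simply falls back to dense attention.
    if dense_len >= 0 and seq_len < dense_len:
        return torch.arange(num_blocks)
    keep = set(range(min(init_blocks, num_blocks)))  # always attend to the initial blocks
    local_blocks = (window_size + block_size - 1) // block_size
    keep.update(range(max(0, num_blocks - local_blocks), num_blocks))  # local sliding window
    keep.update(torch.topk(scores, k=min(topk, num_blocks)).indices.tolist())  # top-k relevant blocks
    return torch.tensor(sorted(keep))

# Example: a 128K-token context split into 64-token key-value blocks.
seq_len = 131072
num_blocks = seq_len // 64
selected = select_kv_blocks(torch.rand(num_blocks), seq_len)
print(f"attending to {len(selected)} of {num_blocks} blocks "
      f"({100 * len(selected) / num_blocks:.1f}%)")
With the default settings this keeps only a few percent of the blocks, in line with the "fewer than 5% of tokens" figure quoted in the Features section.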
LongRoPE
MiniCPM4 natively supports context lengths of up to 32,768 tokens. For long conversations, modify the rope_scaling
field in config.json
to apply the LongRoPE factor:
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"original_max_position_embeddings": 32768
}
}
License
This project is licensed under the Apache-2.0 license.

GitHub Repo | Technical Report
Click here to get info on choosing the right GGUF model format
