MiniCPM4-8B GGUF Models
MiniCPM4-8B GGUF models are highly efficient large language models designed for end-side (on-device) deployment. They deliver significant efficiency improvements across multiple dimensions, including model architecture, training data, learning algorithms, and inference systems.
Quick Start
Inference with CPM.cu
We recommend using CPM.cu for MiniCPM4 inference. CPM.cu is a CUDA inference framework developed by OpenBMB that integrates efficient sparse attention, speculative sampling, and quantization techniques to fully leverage MiniCPM4's efficiency.
Install CPM.cu:
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install
To enable LongRoPE for long-text acceleration, modify the rope_scaling field in config.json:
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"original_max_position_embeddings": 32768
}
}
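If you would rather patch the checkpoint from a script than edit the file by hand, the following minimal sketch works under two assumptions: the model has been downloaded locally (the ./MiniCPM4-8B path is hypothetical) and the rope_scaling block above has been saved to a small JSON file (rope_scaling.json, also hypothetical):
import json

config_path = "./MiniCPM4-8B/config.json"   # hypothetical path to the local checkpoint
rope_patch_path = "rope_scaling.json"       # hypothetical file holding the JSON block above

# Load the rope_scaling object copied from this README.
with open(rope_patch_path, encoding="utf-8") as f:
    rope_scaling = json.load(f)["rope_scaling"]

# Insert it into the model's config.json to enable LongRoPE.
with open(config_path, encoding="utf-8") as f:
    config = json.load(f)
config["rope_scaling"] = rope_scaling
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2, ensure_ascii=False)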
Run the following command to reproduce long-context acceleration:
python3 tests/test_generate.py
Inference with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)
path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
# User can directly use the chat interface
# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(responds)
# User can also use the generate interface
messages = [
{"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)
model_outputs = model.generate(
**model_inputs,
max_new_tokens=1024,
top_p=0.7,
temperature=0.7
)
output_token_ids = [
model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs['input_ids']))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
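If you prefer to see tokens as they are generated instead of waiting for generate() to return, Transformers' TextStreamer can be attached to the same call. A minimal sketch, reusing the model, tokenizer, and model_inputs defined above:
from transformers import TextStreamer

# Print decoded tokens to stdout as they are produced; skip_prompt avoids echoing the input prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=1024,
    top_p=0.7,
    temperature=0.7,
    streamer=streamer,
)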
Inference with SGLang
Install the forked version of SGLang:
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
Start the inference server:
python -m sglang.launch_server --model openbmb/MiniCPM4-8B --trust-remote-code --port 30000 --chat-template chatml
Use the chat interface:
import openai
client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")
response = client.chat.completions.create(
model="openbmb/MiniCPM4-8B",
messages=[
{"role": "user", "content": "Write an article about Artificial Intelligence."},
],
temperature=0.7,
max_tokens=1024,
)
print(response.choices[0].message.content)
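The OpenAI-compatible endpoint also supports streaming responses; a brief sketch against the same SGLang server:
import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

# Request a streamed response and print each delta as it arrives.
stream = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=[
        {"role": "user", "content": "Write an article about Artificial Intelligence."},
    ],
    temperature=0.7,
    max_tokens=1024,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()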
Inference with vLLM
Install the latest version of vLLM:
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
Inference with vLLM:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "openbmb/MiniCPM4-8B"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
llm = LLM(
model=model_name,
trust_remote_code=True,
max_num_batched_tokens=32768,
dtype="bfloat16",
gpu_memory_utilization=0.8,
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
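Because llm.generate accepts a list of prompts, several requests can be batched through the engine in a single call. A short sketch reusing the llm, tokenizer, and sampling_params objects from above (the example questions are arbitrary):
# Batch several chat prompts through the same engine in one call.
questions = [
    "Please recommend 5 tourist attractions in Beijing.",
    "Write an article about Artificial Intelligence.",
]
batched_inputs = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    for q in questions
]
outputs = llm.generate(prompts=batched_inputs, sampling_params=sampling_params)
for question, output in zip(questions, outputs):
    print(f"Q: {question}\nA: {output.outputs[0].text}\n")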
Start the inference server:
vllm serve openbmb/MiniCPM4-8B
Use the chat interface:
import openai
client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="openbmb/MiniCPM4-8B",
messages=[
{"role": "user", "content": "Write an article about Artificial Intelligence."},
],
temperature=0.7,
max_tokens=1024,
extra_body=dict(add_special_tokens=True), # Ensures special tokens are added for chat template
)
print(response.choices[0].message.content)
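Multi-turn conversations follow the standard OpenAI chat format: append the assistant's reply to the message list and send the follow-up turn. A brief sketch against the same server (the follow-up question is arbitrary):
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "user", "content": "Write an article about Artificial Intelligence."}]

first = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=messages,
    temperature=0.7,
    max_tokens=1024,
    extra_body=dict(add_special_tokens=True),
)

# Carry the assistant turn forward and ask a follow-up question.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Now summarize that article in three sentences."})

second = client.chat.completions.create(
    model="openbmb/MiniCPM4-8B",
    messages=messages,
    temperature=0.7,
    max_tokens=256,
    extra_body=dict(add_special_tokens=True),
)
print(second.choices[0].message.content)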
Features
- Efficient Model Architecture: Adopts InfLLM v2, a trainable sparse attention mechanism in which each token computes relevance with fewer than 5% of tokens when processing 128K-long text, significantly reducing long-text computational overhead.
- Efficient Learning Algorithms:
  - Model Wind Tunnel 2.0: Introduces scaling prediction methods for downstream task performance, enabling more precise searches over model training configurations.
  - BitCPM: Compresses model parameters to ternary values, achieving an extreme 90% reduction in bit-width.
  - Efficient Training Engineering Optimization: Combines FP8 low-precision computation with a multi-token prediction training strategy.
- High-Quality Training Data:
  - UltraClean: Builds iterative data-cleaning strategies based on efficient data verification, open-sourcing the high-quality Chinese and English pre-training dataset [Ultra-FineWeb](https://huggingface.co/datasets/openbmb/Ultra-FineWeb).
  - UltraChat v2: Constructs large-scale, high-quality supervised fine-tuning datasets covering multiple dimensions.
- Efficient Inference System:
  - CPM.cu: Integrates sparse attention, model quantization, and speculative sampling for efficient prefilling and decoding.
  - ArkInfer: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities.
Installation
Install CPM.cu
git clone https://github.com/OpenBMB/cpm.cu.git --recursive
cd cpm.cu
python3 setup.py install
Install infllmv2_cuda_impl
git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e . # or python setup.py install
Install the forked version of SGLang
git clone -b openbmb https://github.com/OpenBMB/sglang.git
cd sglang
pip install --upgrade pip
pip install -e "python[all]"
Install vLLM
pip install -U vllm \
--pre \
--extra-index-url https://wheels.vllm.ai/nightly
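As an optional sanity check after installation (a small sketch, not part of the official instructions), you can confirm that whichever backend you installed is importable:
# Optional: report which inference backends are importable and their versions.
import importlib

for name in ("torch", "vllm", "sglang"):
    try:
        module = importlib.import_module(name)
        print(f"{name}: {getattr(module, '__version__', 'unknown version')}")
    except ImportError:
        print(f"{name}: not installed")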
Usage Examples
Basic Usage with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(0)
path = 'openbmb/MiniCPM4-8B'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
messages = [
{"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)
model_outputs = model.generate(
**model_inputs,
max_new_tokens=1024,
top_p=0.7,
temperature=0.7
)
output_token_ids = [
model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs['input_ids']))
]
responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
Advanced Usage with vLLM
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_name = "openbmb/MiniCPM4-8B"
prompt = [{"role": "user", "content": "Please recommend 5 tourist attractions in Beijing. "}]
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
input_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
llm = LLM(
model=model_name,
trust_remote_code=True,
max_num_batched_tokens=32768,
dtype="bfloat16",
gpu_memory_utilization=0.8,
)
sampling_params = SamplingParams(top_p=0.7, temperature=0.7, max_tokens=1024, repetition_penalty=1.02)
outputs = llm.generate(prompts=input_text, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Documentation
Model Generation Details
This model was generated using llama.cpp at commit 7f4fbe51.
Quantization Beyond the IMatrix
I've been experimenting with a new quantization approach. Standard IMatrix quantization underperforms at lower bit depths, especially for Mixture of Experts (MoE) models. I'm using the --tensor-type option in llama.cpp to manually increase the precision of important layers. See [Layer bumping with llama.cpp](https://github.com/Mungert69/GGUFModelBuilder/blob/main/model-converter/tensor_list_builder.py). This increases the model file size but significantly improves precision for a given quantization level.
MiniCPM4 Series
- [MiniCPM4-8B](https://huggingface.co/openbmb/MiniCPM4-8B): The flagship model with 8B parameters, trained on 8T tokens.
- [MiniCPM4-0.5B](https://huggingface.co/openbmb/MiniCPM4-0.5B): The small version with 0.5B parameters, trained on 1T tokens.
- [MiniCPM4-8B-Eagle-FRSpec](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec): Accelerates speculative inference for MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-FRSpec-QAT-cpmcu): Integrates speculation and quantization for ultra-fast acceleration of MiniCPM4-8B.
- [MiniCPM4-8B-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-Eagle-vLLM): Accelerates speculative inference for MiniCPM4-8B in vLLM format.
- [MiniCPM4-8B-marlin-Eagle-vLLM](https://huggingface.co/openbmb/MiniCPM4-8B-marlin-Eagle-vLLM): Quantized Eagle head in vLLM format, accelerating speculative inference for MiniCPM4-8B.
- [BitCPM4-0.5B](https://huggingface.co/openbmb/BitCPM4-0.5B): Applies extreme ternary quantization to MiniCPM4-0.5B, reducing bit-width by 90%.
- [BitCPM4-1B](https://huggingface.co/openbmb/BitCPM4-1B): Applies extreme ternary quantization to MiniCPM3-1B, reducing bit-width by 90%.
- [MiniCPM4-Survey](https://huggingface.co/openbmb/MiniCPM4-Survey): Based on MiniCPM4-8B, generates trustworthy survey papers.
- [MiniCPM4-MCP](https://huggingface.co/openbmb/MiniCPM4-MCP): Based on MiniCPM4-8B, calls relevant MCP tools to meet user requirements.
What's New
- [2025.06.06] The MiniCPM4 series is released! It achieves ultimate efficiency improvements while maintaining optimal performance at the same scale, with over 5x generation acceleration on typical end-side chips. See the technical report here.
Technical Details
InfLLM v2
MiniCPM4 - 8B supports InfLLM v2
, a sparse attention mechanism for efficient long - sequence inference. It requires the infllmv2_cuda_impl library.
To enable InfLLM v2, add the sparse_config
field in config.json
:
{
...,
"sparse_config": {
"kernel_size": 32,
"kernel_stride": 16,
"init_blocks": 1,
"block_size": 64,
"window_size": 2048,
"topk": 64,
"use_nope": false,
"dense_len": 8192
}
}
These parameters control the behavior of InfLLM v2:
- kernel_size (default: 32): The size of semantic kernels.
- kernel_stride (default: 16): The stride between adjacent kernels.
- init_blocks (default: 1): The number of initial blocks that every query token attends to.
- block_size (default: 64): The block size for key-value blocks.
- window_size (default: 2048): The size of the local sliding window.
- topk (default: 64): Each token computes attention with only the top-k most relevant key-value blocks.
- use_nope (default: false): Whether to use the NOPE technique in block selection.
- dense_len (default: 8192): The model uses dense attention for sequences shorter than dense_len tokens and switches to sparse attention for longer sequences. Set this to -1 to always use sparse attention.
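As a rough, illustrative sketch only (this is not the actual infllmv2_cuda_impl kernel, and the block relevance scores are generated randomly here), the snippet below shows how dense_len, block_size, init_blocks, window_size, and topk could interact when choosing which key-value blocks a query attends to:
import torch

def select_kv_blocks(scores: torch.Tensor, seq_len: int, dense_len: int = 8192,
                     block_size: int = 64, init_blocks: int = 1,
                     window_size: int = 2048, topk: int = 64) -> torch.Tensor:
    """Toy block selection for a single query position.

    scores holds one relevance score per key-value block; how InfLLM v2 actually
    derives these scores (via the semantic kernels controlled by kernel_size and
    kernel_stride) is outside the scope of this sketch.
    """
    num_blocks = scores.numel()
    # Below dense_len tokens, the model simply falls back to dense attention.
    if dense_len >= 0 and seq_len < dense_len:
        return torch.arange(num_blocks)
    keep = set(range(min(init_blocks, num_blocks)))  # always attend to the initial blocks
    local_blocks = (window_size + block_size - 1) // block_size
    keep.update(range(max(0, num_blocks - local_blocks), num_blocks))  # local sliding window
    keep.update(torch.topk(scores, k=min(topk, num_blocks)).indices.tolist())  # top-k relevant blocks
    return torch.tensor(sorted(keep))

# Example: a 128K-token context split into 64-token key-value blocks.
seq_len = 131072
num_blocks = seq_len // 64
selected = select_kv_blocks(torch.rand(num_blocks), seq_len)
print(f"attending to {len(selected)} of {num_blocks} blocks "
      f"({100 * len(selected) / num_blocks:.1f}%)")
With the default settings this keeps only a few percent of the blocks, in line with the "fewer than 5% of tokens" figure quoted in the Features section.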
LongRoPE
MiniCPM4 natively supports context lengths of up to 32,768 tokens. For long conversations, modify the rope_scaling
field in config.json
to apply the LongRoPE factor:
{
...,
"rope_scaling": {
"rope_type": "longrope",
"long_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"short_factor": [0.9977997200264581, 1.014658295992452, 1.0349680404997148, 1.059429246056193, 1.0888815016813513, 1.1243301355211495, 1.166977103606075, 1.2182568066927284, 1.2798772354275727, 1.3538666751582975, 1.4426259039919596, 1.5489853358570191, 1.6762658237220625, 1.8283407612492941, 2.0096956085876183, 2.225478927469756, 2.481536379650452, 2.784415934557119, 3.1413289096347365, 3.560047844772632, 4.048719380066383, 4.752651957515948, 5.590913044973868, 6.584005926629993, 7.7532214876576155, 9.119754865903639, 10.704443927019176, 12.524994176518703, 14.59739595363613, 16.93214476166354, 19.53823297353041, 22.417131025031697, 25.568260840911098, 28.991144156566317, 32.68408069090375, 36.65174474170465, 40.90396065611201, 45.4664008671033, 50.37147343433591, 55.6804490772103, 61.470816952306556, 67.8622707390618, 75.00516023410414, 83.11898235973767, 92.50044360202462, 103.57086856690864, 116.9492274587385, 118.16074567836519, 119.18497548708795, 120.04810876261652, 120.77352815196981, 121.38182790207875, 121.89094985353891, 122.31638758099915, 122.6714244963338, 122.9673822552567, 123.21386397019609, 123.41898278254268, 123.58957065488238, 123.73136519024158, 123.84917421274221, 123.94701903496814, 124.02825801299717, 124.09569231686116],
"original_max_position_embeddings": 32768
}
}
License
This project is licensed under the Apache-2.0 license.

GitHub Repo | Technical Report
Click here to get info on choosing the right GGUF model format
