🚀 ik_llama.cpp imatrix MLA quantizations of DeepSeek-V3-0324
This quant collection REQUIRES the ik_llama.cpp fork to support the advanced non-linear state-of-the-art quants and Multi-Head Latent Attention (MLA). Do NOT download these big files and expect them to run on mainline llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
These quants provide best-in-class perplexity for the given memory footprint. MLA support allows R1 and V3 to run with 32k+ context length in under 24GB of GPU VRAM while offloading the MoE layers to RAM.
These quants are designed for CPU + GPU systems with 24-48GB VRAM, as well as CPU-only rigs that use dynamic quant repacking for maximum memory throughput. If you have more VRAM, I suggest a different quant that optimizes GPU offload of at least some of the routed expert layers.
You can quickly try out ik_llama.cpp with your existing quants, since it computes the MLA tensors and repacks quantized data on the fly at startup (provided you have enough RAM + VRAM to hold the entire model). Once you see the difference, come back for the bigger quants offered here.
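A rough sketch of that quick test, with a placeholder model path; the `-mla 2 -fa` and `--run-time-repack` flags are the same ones used in the full examples below:
# hypothetical path: any DeepSeek GGUF you already have on disk
./build/bin/llama-server \
    --model /path/to/your/existing/DeepSeek-quant.gguf \
    -mla 2 -fa \
    --run-time-repack \
    --ctx-size 8192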
✨ Key Features
- Low perplexity: best-in-class perplexity for the given memory footprint.
- Long context support: MLA support enables 32k+ context length in under 24GB of GPU VRAM.
- Multi-platform fit: designed for CPU + GPU systems with 24-48GB VRAM as well as CPU-only rigs.
📦 Installation
This project depends on the ik_llama.cpp fork; make sure you have it built and installed.
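A minimal build sketch, assuming the usual llama.cpp-style CMake flow (flag names can change between versions, so check the ik_llama.cpp README for the current ones):
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
# -DGGML_CUDA=ON assumes an NVIDIA GPU; drop it for a CPU-only build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j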
💻 Usage Examples
Basic usage
ik_llama.cpp API server (GPU + CPU)
# Fits 32k context in under 24GB VRAM
# Optionally add `-ser 6,1` for more speed at minimal quality cost
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
--model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
--alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
--ctx-size 32768 \
-ctk q8_0 \
-mla 2 -fa \
-amb 512 \
-fmoe \
--temp 0.3 \
--min-p 0.05 \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
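Once the server is up, a quick smoke test; this assumes the usual llama.cpp-compatible OpenAI-style endpoint at the host/port configured above:
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello!"}], "temperature": 0.3}'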
ik_llama.cpp API server (CPU only)
# The goal right now is to get as much RAM bandwidth in a single NUMA node as possible, e.g.:
# BIOS `NPS0` on AMD Epyc, or a single-socket Intel Xeon with BIOS `SNC=Disable`
# Tune `--threads` for token generation and `--threads-batch` for prompt processing (prefill)
# Note `--run-time-repack` pre-allocates enough RAM for the model weights instead of loading via mmap() from disk
# Note there is discussion of tuning explicit and transparent huge pages in the [git repo](https://github.com/ikawrakow/ik_llama.cpp/pull/278#issuecomment-2746381515)
numactl -N 0 -m 0 \
./build/bin/llama-server \
--model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
--alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
--run-time-repack \
--ctx-size 65536 \
-ctk q8_0 \
-mla 3 -fa \
-amb 512 \
-fmoe \
--temp 0.3 \
--min-p 0.05 \
--parallel 1 \
--threads 88 \
--threads-batch 128 \
--numa numactl \
--host 127.0.0.1 \
--port 8080
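Before pinning threads and memory with numactl, it helps to confirm how cores and RAM are laid out across nodes; a small sketch using standard Linux tools:
numactl --hardware    # list NUMA nodes with their CPUs and free RAM
lscpu | grep -i numa  # quick summary of node count and CPU mapping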
📚 Documentation
Quant Collection
So far these are my best recipes, offering the lowest perplexity per GiB of model, suitable for a wide variety of CPU + GPU or CPU-only rigs.
IQ4_K_R4 4.936 BPW
Special mix of IQ5_K_R4/IQ4_K_R4 routed experts with all other layers full q8_0, for CPU + GPU offload or for CPU-only rigs using --run-time-repack for max speed. Great for big rigs with 384GB+ RAM and a 24GB+ GPU.
IQ2_K_R4 2.889 BPW
Special mix of IQ3_K_R4/IQ2_K_R4 routed experts with all other layers full q8_0, for CPU + GPU offload or for CPU-only rigs using --run-time-repack for max speed. Great for CPU + GPU "gaming rig" type builds, e.g. a 9950X with 96GB RAM + 3090TI 24GB VRAM + Gen 5 NVMe SSD.
Custom Mixes
If you have over 48GB of VRAM across multiple GPUs, consider custom -ot expressions to tailor a quant to your exact hardware, optimizing model size and performance. If you have less VRAM, a custom quant could trim the non-routed-expert layers, or stretch to 64k+ context in 24GB VRAM. There is also an offline repacking tool if you want to run CPU-only with mmap() enabled.
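Purely as an illustrative sketch (not one of the published recipes), a multi-GPU layout could pin a few early routed-expert layers to a hypothetical second GPU `CUDA1` and leave the rest on CPU; the more specific `-ot` rule is listed before the catch-all so it matches first:
./build/bin/llama-server \
    --model /path/to/custom-quant.gguf \
    -mla 2 -fa -fmoe \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6)\.ffn_.*_exps=CUDA1" \
    -ot "exps=CPU" \
    --ctx-size 32768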
Quant Comparisons
These may be the best quants available for V3-0324 in this size class!
ubergarm makes no compromises on token embeddings, attention, dense layers, or shared experts. This is possible because ik_llama.cpp's MLA implementation saves so much GPU VRAM that 32k context fits in under 24GB VRAM. In addition, these quants use a fresh high-quality importance matrix covering a variety of coding samples and multiple written languages. The routed expert layers also use the state-of-the-art CPU-optimized IQx_K_R4 non-linear quants, likely achieving the best perplexity per GiB. Both IQ2_K_R4 and IQ4_K_R4 are designed to offload ~17.33GiB of weights to GPU VRAM, leaving the remaining VRAM for context.
bartowski uses full token embedding quality, but lower quants for attention, dense layers, and shared experts. He does use a high-quality importance matrix, and his perplexity is within the measurement error margin of this recipe. Update: also check out bartowski's new custom "V2" recipe with improved perplexity at the same size! The table below shows his original "V1" quant.
unsloth compromises on token embeddings, uses medium-quality attention and dense layers, and has no importance matrix.
mradermacher's model card sidebar doesn't render, so the exact recipe is harder to compare. Their team was kind enough to run some commands and provide their recipe details, included here.
Comparison Details
Detailed comparison of ~Q2 class quants
| | ubergarm/DeepSeek-V3-0324-IQ2_K_R4 | bartowski/DeepSeek-V3-0324-Q2_K_L | unsloth/DeepSeek-V3-0324-UD-Q2_K_XL | mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K |
|---|---|---|---|---|
| Overview | | "V1" | | |
| split.tensors.count | 1147 | 1025 | 1025 | |
| token_embd.weight | Q8_0 | Q8_0 | Q4_K | IQ3_S |
| output.weight | Q5_K | | | |
| File size (GiB) | 227 | 228 | 231 | |
| Multi-Head Latent Attention | | | | |
| blk.*.attn_kv_b.weight | Q8_0 | n/a | n/a | n/a |
| blk.*.attn_k_b.weight | Q8_0 | n/a | n/a | n/a |
| blk.*.attn_v_b.weight | Q8_0 | n/a | n/a | n/a |
| Dense Layers | | | | |
| blk.[0-2].attn_kv_a_mqa.weight | Q8_0 | Q2_K | Q6_K | IQ2_XS |
| blk.[0-2].attn_kv_a_norm.weight | F32 | F32 | F32 | F32 |
| blk.[0-2].attn_kv_b.weight | Q8_0 | Q2_K | Q6_K | IQ2_XS |
| blk.[0-2].attn_norm.weight | F32 | F32 | F32 | F32 |
| blk.[0-2].attn_q_a.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[0-2].attn_q_a_norm.weight | F32 | F32 | F32 | F32 |
| blk.[0-2].attn_q_b.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[0-2].ffn_down.weight | Q8_0 | Q3_K | Q6_K | IQ3_S |
| blk.[0-2].ffn_gate.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[0-2].ffn_norm.weight | F32 | F32 | F32 | F32 |
| blk.[0-2].ffn_up.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[0-2].attn_output.weight | Q8_0 | Q3_K | Q4_K | IQ3_S |
| Shared & Routed MoE Layers | | | | |
| blk.[3-60].attn_kv_a_mqa.weight | Q8_0 | Q2_K | Q6_K | IQ2_XS |
| blk.[3-60].attn_kv_a_norm.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].attn_kv_b.weight | Q8_0 | Q2_K | Q6_K | IQ2_XS |
| blk.[3-60].attn_norm.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].attn_q_a.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[3-60].attn_q_a_norm.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].attn_q_b.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[3-60].exp_probs_b.bias | F32 | F32 | F32 | F32 |
| blk.[3-60].ffn_down_exps.weight | IQ3_K_R4 | Q3_K | Q3_K | IQ3_S |
| blk.[3-60].ffn_down_shexp.weight | Q8_0 | Q3_K | Q6_K | IQ3_S |
| blk.[3-60].ffn_gate_exps.weight | IQ2_K_R4 | Q2_K | Q2_K | IQ2_XS |
| blk.[3-60].ffn_gate_inp.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].ffn_gate_shexp.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[3-60].ffn_norm.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].ffn_up_exps.weight | IQ2_K_R4 | Q2_K | Q2_K | IQ2_XS |
| blk.[3-60].ffn_up_shexp.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[3-60].attn_output.weight | Q8_0 | Q3_K | Q4_K | IQ3_S |
| Importance Matrix & Perplexity | | | | |
| imatrix.dataset | calibration_data_v5_rc.txt | calibration_datav3.txt | none | imatrix-training-full-3 |
| Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | 3.9012 (V1) | ? | ? |
For reference, Q8_0 scores PPL = 3.3482 +/- 0.01847 on the same wiki.test.raw file.
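For readers comparing these numbers, perplexity is just the exponentiated average negative log-likelihood over the test tokens (the standard definition, not specific to this repo):
$$ \mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right) $$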
Importance Matrix
Importance matrix details
# Run CPU-only on a single socket of a dual Intel Xeon 6980P
numactl -N 0 -m 0 \
./build/bin/llama-imatrix \
--verbosity 1 \
-m /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf \
-f calibration_data_v5_rc.txt \
-o DeepSeek-V3-0324.imatrix \
--ctx-size 512 \
--numa numactl \
--threads 128
.
.
.
compute_imatrix: computing over 213 chunks with batch_size 512
compute_imatrix: 41.77 seconds per pass - ETA 2 hours 28.28 minutes
[1]60.9029,[2]10.8011,[3]5.8709,[4]3.7872,[5]2.9688,[6]2.5088,[7]2.2214,[8]2.0224,[9]1.9110,
save_imatrix: entry ' blk.60.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.60.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.60.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: stored collected data after 10 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
.
.
.
llama_print_timings:        load time =   42726.11 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 7125661.28 ms / 109056 tokens (   65.34 ms per token,    15.30 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 7201368.59 ms / 109057 tokens
Final estimate: PPL = 3.4755 +/- 0.03305
Quantization Recipe
Recipe details
#!/usr/bin/env bash
custom="
# Token embedding (GPU)
# note: because of the tensor dimensions, it can't be a repacked type
token_embd\.weight=q8_0
# Output tensors (GPU)
output\.weight=q8_0
output_norm\.weight=q8_0
# First 3 dense layers (0-2) (GPU)
blk\.[0-2]\..*=q8_0
# All attention, weight, and bias tensors for MoE layers (3-60) (GPU)
# note: attn_k_b.weight can't be k-, i-, or iqk-quantized because its row size is 128
blk\.[3-9]\.attn_.*=q8_0
blk\.[1-5][0-9]\.attn_.*=q8_0
blk\.60\.attn_.*=q8_0
blk\.[3-9]\.ffn_norm\.weight=q8_0
blk\.[1-5][0-9]\.ffn_norm\.weight=q8_0
blk\.60\.ffn_norm\.weight=q8_0
blk\.[3-9]\.exp_probs_b\.bias=q8_0
blk\.[1-5][0-9]\.exp_probs_b\.bias=q8_0
blk\.60\.exp_probs_b\.bias=q8_0
# Shared experts (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=q8_0
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=q8_0
blk\.60\.ffn_down_shexp\.weight=q8_0
blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0
blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed experts (3-60) (CPU)
# note: conventional wisdom says earlier layers deserve higher quality
blk\.[3-9]\.ffn_down_exps\.weight=iq3_k_r4
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_k_r4
blk\.60\.ffn_down_exps\.weight=iq3_k_r4
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_k_r4
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_k_r4
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_k_r4
"
# strip the comment lines and join the remaining rules into one comma-separated string
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--imatrix /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix \
--token-embedding-type q8_0 \
--output-tensor-type q8_0 \
--custom-q "$custom" \
/mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf \
/mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
IQ2_K_R4 \
24
Perplexity
Perplexity logs
$ CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-perplexity \
--model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
-ctk q8_0 \
-mla 2 -fa \
-amb 512 \
-fmoe \
--ctx-size 512 \
--ubatch-size 512 \
-f wiki.test.raw \
--seed 1337 \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--threads 24
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
main: build = 3614 (b9c25fe7)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: seed = 1337
llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
llama_model_loader: - kv 3: general.version str = V3-0324
llama_model_loader: - kv 4: general.basename str = DeepSeek
llama_model_loader: - kv 5: general.size_label str = 256x21B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 16: general.file_type u32 = 338
llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 612 tensors
llama_model_loader: - type iq2_k_r4: 116 tensors
llama_model_loader: - type iq3_k_r4: 58 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = IQ2_K_R4 - 2.375 bpw
llm_load_print_meta: model params = 672.050 B
llm_load_print_meta: model size = 226.003 GiB (2.889 BPW)
llm_load_print_meta: repeating layers = 224.169 GiB (2.873 BPW, 670.196 B parameters)
llm_load_print_meta: general.name = DeepSeek V3 0324
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
llm_load_tensors: ggml ctx size = 0.93 MiB
Tensor blk.3.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.3.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.4.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.5.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.6.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.7.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.8.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.9.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overridden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 228404.85 MiB
llm_load_tensors: CPU buffer size = 938.98 MiB
llm_load_tensors: CUDA0 buffer size = 17744.02 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 2
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: CUDA0 KV buffer size = 72.94 MiB
llama_new_context_with_model: KV self size = 72.91 MiB, c^KV (q8_0): 72.91 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 1.97 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 503.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 162.01 MiB
llama_new_context_with_model: graph nodes = 3548
llama_new_context_with_model: graph splits = 118
system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 617.314 ms
perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 20.43 seconds per pass - ETA 47.75 minutes
[1]2.7687,[2]3.5402,[3]2.5152,[4]2.1223,[5]1.9022,[6]1.7765,[7]1.6869,[8]1.6282,[9]1.5856,[10]1.5431,[11]1.5379,[12]1.5781,[13]1.5947,[14]1.7232,[15]1.8539,[16]1.9054,[17]2.0733,[18]2.1998,[19]2.1545,[20]2.1438,[21]2.2433,[22]2.214



