🚀 ik_llama.cpp imatrix MLA quantizations of DeepSeek-V3-0324
This quant collection REQUIRES the ik_llama.cpp fork to support the advanced non-linear state-of-the-art quants and Multi-Head Latent Attention (MLA). Do NOT download these big files and expect them to run on mainline llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
These quants provide best-in-class perplexity for the given memory footprint. MLA support allows R1 and V3 to run with 32k+ context length in under 24GB of GPU VRAM while offloading the MoE layers to RAM.
These quants are designed for CPU + GPU systems with 24-48GB VRAM, as well as CPU-only rigs that use dynamic quant repacking for maximum memory throughput. If you have more VRAM, I suggest a different quant that optimizes GPU offload of at least some of the routed expert layers.
You can quickly try out ik_llama.cpp with your existing quants, since it computes the MLA tensors and repacks quantized data on the fly at startup (provided you have enough RAM + VRAM to hold the entire model). Once you see the difference, come back for the bigger quants offered here.
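A rough sketch of that quick test, with a placeholder model path; the `-mla 2 -fa` and `--run-time-repack` flags are the same ones used in the full examples below:
# hypothetical path: any DeepSeek GGUF you already have on disk
./build/bin/llama-server \
    --model /path/to/your/existing/DeepSeek-quant.gguf \
    -mla 2 -fa \
    --run-time-repack \
    --ctx-size 8192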
✨ Key Features
- Low perplexity: best-in-class perplexity for the given memory footprint.
- Long context support: MLA support enables 32k+ context length in under 24GB of GPU VRAM.
- Multi-platform fit: designed for CPU + GPU systems with 24-48GB VRAM as well as CPU-only rigs.
📦 Installation
This project depends on the ik_llama.cpp fork; make sure you have it built and installed.
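A minimal build sketch, assuming the usual llama.cpp-style CMake flow (flag names can change between versions, so check the ik_llama.cpp README for the current ones):
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
# -DGGML_CUDA=ON assumes an NVIDIA GPU; drop it for a CPU-only build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j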
💻 Usage Examples
Basic usage
ik_llama.cpp API server (GPU + CPU)
# Fits 32k context in under 24GB VRAM
# Optionally add `-ser 6,1` for more speed at minimal quality cost
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-server \
--model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
--alias ubergarm/DeepSeek-R1-V3-0324-IQ2_K_R4 \
--ctx-size 32768 \
-ctk q8_0 \
-mla 2 -fa \
-amb 512 \
-fmoe \
--temp 0.3 \
--min-p 0.05 \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--parallel 1 \
--threads 16 \
--host 127.0.0.1 \
--port 8080
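Once the server is up, a quick smoke test; this assumes the usual llama.cpp-compatible OpenAI-style endpoint at the host/port configured above:
curl http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello!"}], "temperature": 0.3}'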
ik_llama.cpp API server (CPU only)
# The goal right now is to get as much RAM bandwidth in a single NUMA node as possible, e.g.:
# BIOS `NPS0` on AMD Epyc, or a single-socket Intel Xeon with BIOS `SNC=Disable`
# Tune `--threads` for token generation and `--threads-batch` for prompt processing (prefill)
# Note `--run-time-repack` pre-allocates enough RAM for the model weights instead of loading via mmap() from disk
# Note there is discussion of tuning explicit and transparent huge pages in the [git repo](https://github.com/ikawrakow/ik_llama.cpp/pull/278#issuecomment-2746381515)
numactl -N 0 -m 0 \
./build/bin/llama-server \
--model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4.gguf \
--alias ubergarm/DeepSeek-V3-0324-IQ4_K_R4 \
--run-time-repack \
--ctx-size 65536 \
-ctk q8_0 \
-mla 3 -fa \
-amb 512 \
-fmoe \
--temp 0.3 \
--min-p 0.05 \
--parallel 1 \
--threads 88 \
--threads-batch 128 \
--numa numactl \
--host 127.0.0.1 \
--port 8080
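Before pinning threads and memory with numactl, it helps to confirm how cores and RAM are laid out across nodes; a small sketch using standard Linux tools:
numactl --hardware    # list NUMA nodes with their CPUs and free RAM
lscpu | grep -i numa  # quick summary of node count and CPU mapping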
📚 Documentation
Quant Collection
So far these are my best recipes, offering the lowest perplexity per GiB of model, suitable for a wide variety of CPU + GPU or CPU-only rigs.
IQ4_K_R4 4.936 BPW
Special mix of IQ5_K_R4/IQ4_K_R4 routed experts with all other layers full q8_0, for CPU + GPU offload or for CPU-only rigs using --run-time-repack for max speed. Great for big rigs with 384GB+ RAM and a 24GB+ GPU.
IQ2_K_R4 2.889 BPW
Special mix of IQ3_K_R4/IQ2_K_R4 routed experts with all other layers full q8_0, for CPU + GPU offload or for CPU-only rigs using --run-time-repack for max speed. Great for CPU + GPU "gaming rig" type builds, e.g. a 9950X with 96GB RAM + 3090TI 24GB VRAM + Gen 5 NVMe SSD.
Custom Mixes
If you have over 48GB of VRAM across multiple GPUs, consider custom -ot expressions to tailor a quant to your exact hardware, optimizing model size and performance. If you have less VRAM, a custom quant could trim the non-routed-expert layers, or stretch to 64k+ context in 24GB VRAM. There is also an offline repacking tool if you want to run CPU-only with mmap() enabled.
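Purely as an illustrative sketch (not one of the published recipes), a multi-GPU layout could pin a few early routed-expert layers to a hypothetical second GPU `CUDA1` and leave the rest on CPU; the more specific `-ot` rule is listed before the catch-all so it matches first:
./build/bin/llama-server \
    --model /path/to/custom-quant.gguf \
    -mla 2 -fa -fmoe \
    --n-gpu-layers 63 \
    -ot "blk\.(3|4|5|6)\.ffn_.*_exps=CUDA1" \
    -ot "exps=CPU" \
    --ctx-size 32768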
Quant Comparisons
These may be the best quants available for V3-0324 in this size class!
ubergarm makes no compromises on token embeddings, attention, dense layers, or shared experts. This is possible because ik_llama.cpp's MLA implementation saves so much GPU VRAM that 32k context fits in under 24GB VRAM. In addition, these quants use a fresh high-quality importance matrix covering a variety of coding samples and multiple written languages. The routed expert layers also use the state-of-the-art CPU-optimized IQx_K_R4 non-linear quants, likely achieving the best perplexity per GiB. Both IQ2_K_R4 and IQ4_K_R4 are designed to offload ~17.33GiB of weights to GPU VRAM, leaving the remaining VRAM for context.
bartowski uses full token embedding quality, but lower quants for attention, dense layers, and shared experts. He does use a high-quality importance matrix, and his perplexity is within the measurement error margin of this recipe. Update: also check out bartowski's new custom "V2" recipe with improved perplexity at the same size! The table below shows his original "V1" quant.
unsloth compromises on token embeddings, uses medium-quality attention and dense layers, and has no importance matrix.
mradermacher's model card sidebar doesn't render, so the exact recipe is harder to compare. Their team was kind enough to run some commands and provide their recipe details, included here.
Comparison Details
Detailed comparison of ~Q2 class quants
| | ubergarm/DeepSeek-V3-0324-IQ2_K_R4 | bartowski/DeepSeek-V3-0324-Q2_K_L | unsloth/DeepSeek-V3-0324-UD-Q2_K_XL | mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K |
|---|---|---|---|---|
| Overview | | "V1" | | |
| split.tensors.count | 1147 | 1025 | 1025 | |
| token_embd.weight | Q8_0 | Q8_0 | Q4_K | IQ3_S |
| output.weight | Q5_K | | | |
| File size (GiB) | 227 | 228 | 231 | |
| Multi-Head Latent Attention | | | | |
| blk.*.attn_kv_b.weight | Q8_0 | n/a | n/a | n/a |
| blk.*.attn_k_b.weight | Q8_0 | n/a | n/a | n/a |
| blk.*.attn_v_b.weight | Q8_0 | n/a | n/a | n/a |
| Dense Layers | | | | |
| blk.[0-2].attn_kv_a_mqa.weight | Q8_0 | Q2_K | Q6_K | IQ2_XS |
| blk.[0-2].attn_kv_a_norm.weight | F32 | F32 | F32 | F32 |
| blk.[0-2].attn_kv_b.weight | Q8_0 | Q2_K | Q6_K | IQ2_XS |
| blk.[0-2].attn_norm.weight | F32 | F32 | F32 | F32 |
| blk.[0-2].attn_q_a.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[0-2].attn_q_a_norm.weight | F32 | F32 | F32 | F32 |
| blk.[0-2].attn_q_b.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[0-2].ffn_down.weight | Q8_0 | Q3_K | Q6_K | IQ3_S |
| blk.[0-2].ffn_gate.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[0-2].ffn_norm.weight | F32 | F32 | F32 | F32 |
| blk.[0-2].ffn_up.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[0-2].attn_output.weight | Q8_0 | Q3_K | Q4_K | IQ3_S |
| Shared & Routed MoE Layers | | | | |
| blk.[3-60].attn_kv_a_mqa.weight | Q8_0 | Q2_K | Q6_K | IQ2_XS |
| blk.[3-60].attn_kv_a_norm.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].attn_kv_b.weight | Q8_0 | Q2_K | Q6_K | IQ2_XS |
| blk.[3-60].attn_norm.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].attn_q_a.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[3-60].attn_q_a_norm.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].attn_q_b.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[3-60].exp_probs_b.bias | F32 | F32 | F32 | F32 |
| blk.[3-60].ffn_down_exps.weight | IQ3_K_R4 | Q3_K | Q3_K | IQ3_S |
| blk.[3-60].ffn_down_shexp.weight | Q8_0 | Q3_K | Q6_K | IQ3_S |
| blk.[3-60].ffn_gate_exps.weight | IQ2_K_R4 | Q2_K | Q2_K | IQ2_XS |
| blk.[3-60].ffn_gate_inp.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].ffn_gate_shexp.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[3-60].ffn_norm.weight | F32 | F32 | F32 | F32 |
| blk.[3-60].ffn_up_exps.weight | IQ2_K_R4 | Q2_K | Q2_K | IQ2_XS |
| blk.[3-60].ffn_up_shexp.weight | Q8_0 | Q2_K | Q4_K | IQ2_XS |
| blk.[3-60].attn_output.weight | Q8_0 | Q3_K | Q4_K | IQ3_S |
| Importance Matrix & Perplexity | | | | |
| imatrix.dataset | calibration_data_v5_rc.txt | calibration_datav3.txt | none | imatrix-training-full-3 |
| Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | 3.9012 (V1) | ? | ? |
For reference, Q8_0 scores PPL = 3.3482 +/- 0.01847 on the same wiki.test.raw file.
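For readers comparing these numbers, perplexity is just the exponentiated average negative log-likelihood over the test tokens (the standard definition, not specific to this repo):
$$ \mathrm{PPL} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(x_i \mid x_{<i}) \right) $$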
Importance Matrix
Importance matrix details
# Run CPU-only on a single socket of a dual Intel Xeon 6980P
numactl -N 0 -m 0 \
./build/bin/llama-imatrix \
--verbosity 1 \
-m /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf \
-f calibration_data_v5_rc.txt \
-o DeepSeek-V3-0324.imatrix \
--ctx-size 512 \
--numa numactl \
--threads 128
.
.
.
compute_imatrix: computing over 213 chunks with batch_size 512
compute_imatrix: 41.77 seconds per pass - ETA 2 hours 28.28 minutes
[1]60.9029,[2]10.8011,[3]5.8709,[4]3.7872,[5]2.9688,[6]2.5088,[7]2.2214,[8]2.0224,[9]1.9110,
save_imatrix: entry ' blk.60.ffn_down_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.60.ffn_gate_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: entry ' blk.60.ffn_up_exps.weight' has partial data (99.61%) 1 out of 256 experts are missing data Storing **but be aware**
save_imatrix: stored collected data after 10 chunks in /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/imatrix-ubergarm-DeepSeek-V3-0324-ik_llamacpp-2089147a.dat
.
.
.
llama_print_timings:        load time =   42726.11 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 7125661.28 ms / 109056 tokens (   65.34 ms per token,    15.30 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 7201368.59 ms / 109057 tokens
Final estimate: PPL = 3.4755 +/- 0.03305
Quantization Recipe
Recipe details
#!/usr/bin/env bash
custom="
# Token embedding (GPU)
# note: because of the tensor dimensions, it can't be a repacked type
token_embd\.weight=q8_0
# Output tensors (GPU)
output\.weight=q8_0
output_norm\.weight=q8_0
# First 3 dense layers (0-2) (GPU)
blk\.[0-2]\..*=q8_0
# All attention, weight, and bias tensors for MoE layers (3-60) (GPU)
# note: attn_k_b.weight can't be k-, i-, or iqk-quantized because its row size is 128
blk\.[3-9]\.attn_.*=q8_0
blk\.[1-5][0-9]\.attn_.*=q8_0
blk\.60\.attn_.*=q8_0
blk\.[3-9]\.ffn_norm\.weight=q8_0
blk\.[1-5][0-9]\.ffn_norm\.weight=q8_0
blk\.60\.ffn_norm\.weight=q8_0
blk\.[3-9]\.exp_probs_b\.bias=q8_0
blk\.[1-5][0-9]\.exp_probs_b\.bias=q8_0
blk\.60\.exp_probs_b\.bias=q8_0
# Shared experts (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=q8_0
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=q8_0
blk\.60\.ffn_down_shexp\.weight=q8_0
blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=q8_0
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=q8_0
blk\.60\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed experts (3-60) (CPU)
# note: conventional wisdom says earlier layers deserve higher quality
blk\.[3-9]\.ffn_down_exps\.weight=iq3_k_r4
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_k_r4
blk\.60\.ffn_down_exps\.weight=iq3_k_r4
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_k_r4
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_k_r4
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_k_r4
"
# strip the comment lines and join the remaining rules into one comma-separated string
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--imatrix /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324.imatrix \
--token-embedding-type q8_0 \
--output-tensor-type q8_0 \
--custom-q "$custom" \
/mnt/raid/models/deepseek-ai/DeepSeek-V3-0324-bf16-GGUF/DeepSeek-256x21B-V3-0324-BF16-00001-of-00030.gguf \
/mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
IQ2_K_R4 \
24
Perplexity
Perplexity logs
$ CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-perplexity \
--model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
-ctk q8_0 \
-mla 2 -fa \
-amb 512 \
-fmoe \
--ctx-size 512 \
--ubatch-size 512 \
-f wiki.test.raw \
--seed 1337 \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--threads 24
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
main: build = 3614 (b9c25fe7)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: seed = 1337
llama_model_loader: loaded meta data with 50 key-value pairs and 1147 tensors from /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek V3 0324
llama_model_loader: - kv 3: general.version str = V3-0324
llama_model_loader: - kv 4: general.basename str = DeepSeek
llama_model_loader: - kv 5: general.size_label str = 256x21B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: deepseek2.block_count u32 = 61
llama_model_loader: - kv 8: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 9: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 10: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 11: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 12: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 13: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 14: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 16: general.file_type u32 = 338
llama_model_loader: - kv 17: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 18: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 19: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 20: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 21: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 22: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 23: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 24: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 25: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 26: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 27: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 28: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 29: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 31: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 32: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 33: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 34: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 35: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 36: tokenizer.ggml.tokens arr[str,129280] = ["
llama_model_loader: - kv 37: tokenizer.ggml.token_type arr[i32,129280] = [3
llama_model_loader: - kv 38: tokenizer.ggml.merges arr[str,127741] = ["
llama_model_loader: - kv 39: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 42: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 43: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 44: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 45: general.quantization_version u32 = 2
llama_model_loader: - kv 46: quantize.imatrix.file str = /mnt/raid/models/ubergarm/DeepSeek-V3...
llama_model_loader: - kv 47: quantize.imatrix.dataset str = calibration_data_v5_rc.txt
llama_model_loader: - kv 48: quantize.imatrix.entries_count i32 = 720
llama_model_loader: - kv 49: quantize.imatrix.chunks_count i32 = 213
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 612 tensors
llama_model_loader: - type iq2_k_r4: 116 tensors
llama_model_loader: - type iq3_k_r4: 58 tensors
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = deepseek2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 129280
llm_load_print_meta: n_merges = 127741
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 163840
llm_load_print_meta: n_embd = 7168
llm_load_print_meta: n_layer = 61
llm_load_print_meta: n_head = 128
llm_load_print_meta: n_head_kv = 128
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 192
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 24576
llm_load_print_meta: n_embd_v_gqa = 16384
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 18432
llm_load_print_meta: n_expert = 256
llm_load_print_meta: n_expert_used = 8
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = yarn
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 671B
llm_load_print_meta: model ftype = IQ2_K_R4 - 2.375 bpw
llm_load_print_meta: model params = 672.050 B
llm_load_print_meta: model size = 226.003 GiB (2.889 BPW)
llm_load_print_meta: repeating layers = 224.169 GiB (2.873 BPW, 670.196 B parameters)
llm_load_print_meta: general.name = DeepSeek V3 0324
llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token = 131 'Ä'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead = 3
llm_load_print_meta: n_lora_q = 1536
llm_load_print_meta: n_lora_kv = 512
llm_load_print_meta: n_ff_exp = 2048
llm_load_print_meta: n_expert_shared = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm = 1
llm_load_print_meta: expert_gating_func = sigmoid
llm_load_print_meta: rope_yarn_log_mul = 0.1000
llm_load_tensors: ggml ctx size = 0.93 MiB
Tensor blk.3.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.3.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.3.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.4.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.4.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.4.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.5.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.5.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.5.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.6.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.6.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.6.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.7.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.7.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.7.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.8.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.8.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.8.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.9.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.9.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.9.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.10.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.10.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.10.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.11.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.11.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.11.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.12.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.12.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.12.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.13.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.13.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.13.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.14.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.14.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.14.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.15.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.15.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.15.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.16.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.16.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.16.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.17.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.17.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.17.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.18.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.18.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.18.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.19.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.19.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.19.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.20.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.20.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.20.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.21.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.21.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.21.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.22.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.22.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.22.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.23.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.23.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.23.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.24.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.24.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.24.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.25.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.25.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.25.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.26.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.26.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.26.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.27.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.27.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.27.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.28.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.28.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.28.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.29.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.29.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.29.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.30.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.30.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.30.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.31.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.31.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.31.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.32.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.32.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.32.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.33.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.33.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.33.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.34.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.34.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.34.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.35.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.35.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.35.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.36.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.36.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.36.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.37.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.37.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.37.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.38.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.38.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.38.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.39.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.39.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.39.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.40.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.40.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.40.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.41.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.41.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.41.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.42.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.42.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.42.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.43.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.43.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.43.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.44.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.44.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.44.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.45.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.45.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.45.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.46.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.46.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.46.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.47.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.47.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.47.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.48.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.48.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.48.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.49.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.49.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.49.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.50.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.50.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.50.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.51.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.51.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.51.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.52.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.52.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.52.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.53.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.53.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.53.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.54.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.54.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.54.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.55.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.55.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.55.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.56.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.56.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.56.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.57.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.57.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.57.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.58.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.58.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.58.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.59.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.59.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.59.ffn_up_exps.weight buffer type overridden to CPU
Tensor blk.60.ffn_gate_exps.weight buffer type overridden to CPU
Tensor blk.60.ffn_down_exps.weight buffer type overridden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overridden to CPU
llm_load_tensors: offloading 61 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors: CPU buffer size = 228404.85 MiB
llm_load_tensors: CPU buffer size = 938.98 MiB
llm_load_tensors: CUDA0 buffer size = 17744.02 MiB
....................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 2
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: layer 0: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 1: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 2: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 3: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 4: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 5: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 6: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 7: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 8: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 9: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 10: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 11: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 12: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 13: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 14: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 15: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 16: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 17: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 18: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 19: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 20: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 21: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 22: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 23: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 24: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 25: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 26: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 27: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 28: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 29: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 30: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 31: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 32: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 33: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 34: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 35: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 36: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 37: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 38: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 39: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 40: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 41: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 42: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 43: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 44: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 45: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 46: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 47: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 48: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 49: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 50: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 51: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 52: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 53: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 54: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 55: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 56: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 57: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 58: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 59: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: layer 60: n_embd_head_qk_rope = 64, kv_lora_rank = 512
llama_kv_cache_init: CUDA0 KV buffer size = 72.94 MiB
llama_new_context_with_model: KV self size = 72.91 MiB, c^KV (q8_0): 72.91 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 1.97 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 503.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 162.01 MiB
llama_new_context_with_model: graph nodes = 3548
llama_new_context_with_model: graph splits = 118
system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 617.314 ms
perplexity: calculating perplexity over 561 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 20.43 seconds per pass - ETA 47.75 minutes
[1]2.7687,[2]3.5402,[3]2.5152,[4]2.1223,[5]1.9022,[6]1.7765,[7]1.6869,[8]1.6282,[9]1.5856,[10]1.5431,[11]1.5379,[12]1.5781,[13]1.5947,[14]1.7232,[15]1.8539,[16]1.9054,[17]2.0733,[18]2.1998,[19]2.1545,[20]2.1438,[21]2.2433,[22]2.214



