OLMo-2-0325-32B-Instruct-GGUF open-source model: an instruction-tuned assistant for memory-constrained environments

Olmo 2 0325 32B Instruct GGUF

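Developed by Mungert

An instruction-tuned model based on OLMo-2-0325-32B-DPO, quantized with the IQ-DynamicGate ultra-low-bit technique and optimized for memory-constrained environments.

Large language model, English. License: Apache-2.0. Tags: #ultra-low-bit-quantization #precision-adaptive #edge-device-inference

Downloads: 15.57k

Released: 4/2/2025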

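Model overview

This 32B-parameter large language model is instruction-tuned and supports text generation tasks. Its IQ-DynamicGate quantization is designed to retain comparatively high quality at ultra-low 1-2 bit precision.

Model features

IQ-DynamicGate ultra-low-bit quantization

A 1-2 bit quantization technique that uses a precision-adaptive strategy to reduce error propagation while keeping memory use extremely low.

Layer-wise quantization strategy

Different layers of the model receive different quantization schemes, and critical components keep higher precision, balancing quality against efficiency.

Multiple format support

Available in formats ranging from BF16 to IQ3_XS, to suit different hardware and performance needs.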

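Model capabilities

Text generation

Instruction following

Low-memory inference

Use cases

Deployment in resource-constrained environments

Edge-device inference

Run a large language model on edge devices with limited memory

With DynamicGate, IQ1_M quantization shows 43.9% lower perplexity than standard IQ1_M (measured on Llama-3-8B)

CPU inference optimization

Run the model efficiently on CPUs without GPU acceleration

The Q4_K quantization suits CPU inference with limited memory

Research applications

Ultra-low-bit quantization research

Study how 1-2 bit quantization affects model quality

With DynamicGate, IQ2_S quantization shows 36.9% lower perplexity than standard IQ2_S (measured on Llama-3-8B)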

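🚀 OLMo-2-0325-32B-Instruct GGUF model

OLMo-2-0325-32B-Instruct GGUF is a Transformer-based text generation model. Its ultra-low-bit quantization substantially reduces memory use while retaining a reasonable level of accuracy, which makes it suitable for a range of hardware environments and application scenarios.

🚀 Quick start

OLMo 2 will be supported in the next release of Transformers. Until then, install from the main branch: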

pip install --upgrade git+https://github.com/huggingface/transformers.git

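✨ Key features

Ultra-low-bit quantization

Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), and its effectiveness has been validated with benchmarks on Llama-3-8B. The method applies layer-specific strategies to preserve model accuracy while keeping memory use extremely low.

Multiple format support

Several model formats are provided, including BF16, F16, and a number of quantized formats (Q4_K, Q6_K, Q8_0, and others), so you can choose according to your hardware capabilities and memory limits.

Intermediate checkpoints

To support research on reinforcement-learning fine-tuning, we have released intermediate checkpoints from the model's RLVR training. Model weights were saved every 20 training steps.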

💻 Usage examples

Loading the model

from transformers import AutoModelForCausalLM

olmo_model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0325-32B-Instruct")

Chat template

<|user|>
How are you doing?
<|assistant|>
I'm just a computer program, so I don't have feelings, but I'm functioning as expected. How can I assist you today?<|endoftext|>
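
The turn layout shown above can be reproduced with a few lines of string formatting. This is only an illustrative sketch: `format_chat` is a hypothetical helper, not part of Transformers, and it covers just the user and assistant turns that appear in the example. In practice, the tokenizer's `apply_chat_template` method is the safer way to build prompts.

```python
def format_chat(messages, add_generation_prompt=False):
    """Render messages in the <|user|> / <|assistant|> layout shown above.

    Hypothetical helper for illustration; only "user" and "assistant"
    roles are handled because those are the only ones the example shows.
    """
    parts = []
    for message in messages:
        role, content = message["role"], message["content"]
        if role == "user":
            parts.append("<|user|>\n" + content)
        elif role == "assistant":
            # Completed assistant turns end with the end-of-text marker.
            parts.append("<|assistant|>\n" + content + "<|endoftext|>")
        else:
            raise ValueError(f"unsupported role: {role!r}")
    text = "\n".join(parts)
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        text += "\n<|assistant|>\n"
    return text


prompt = format_chat(
    [{"role": "user", "content": "How are you doing?"}],
    add_generation_prompt=True,
)
print(prompt)
```

Passing `add_generation_prompt=True` ends the string with an open assistant turn, which is the form you would send to the model for completion.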

System prompt

In Ai2 demos, we use the following system prompt by default:

You are OLMo 2, a helpful and harmless AI Assistant built by the Allen Institute for AI.

Loading an intermediate checkpoint

olmo_model = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-0325-32B-Instruct", revision="step_200")

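📚 Detailed documentation

Quantization performance comparison (Llama-3-8B)

Quantization	Standard perplexity	DynamicGate perplexity	Perplexity change	Standard size	DG size	Size change	Standard speed	DG speed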
IQ2_XXS	11.30	9.84	-12.9%	2.5G	2.6G	+0.1G	234s	246s
IQ2_XS	11.72	11.63	-0.8%	2.7G	2.8G	+0.1G	242s	246s
IQ2_S	14.31	9.02	-36.9%	2.7G	2.9G	+0.2G	238s	244s
IQ1_M	27.46	15.41	-43.9%	2.2G	2.5G	+0.3G	206s	212s
IQ1_S	53.07	32.00	-39.7%	2.1G	2.4G	+0.3G	184s	209s
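
The perplexity-change column can be recomputed from the two perplexity columns as a relative difference. The short sketch below does that for the values in the table; the function name `ppl_change_percent` is made up for this illustration.

```python
# (standard perplexity, DynamicGate perplexity), copied from the table above.
PERPLEXITY = {
    "IQ2_XXS": (11.30, 9.84),
    "IQ2_XS": (11.72, 11.63),
    "IQ2_S": (14.31, 9.02),
    "IQ1_M": (27.46, 15.41),
    "IQ1_S": (53.07, 32.00),
}


def ppl_change_percent(standard, dynamic_gate):
    """Relative perplexity change in percent; negative means lower (better)."""
    return (dynamic_gate - standard) / standard * 100.0


for name, (standard, dynamic_gate) in PERPLEXITY.items():
    print(f"{name}: {ppl_change_percent(standard, dynamic_gate):+.1f}%")
```

The recomputed values agree with the table to within 0.1 percentage points. The last digit can differ slightly (for example IQ2_S), which is consistent with the table truncating rather than rounding.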

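Choosing a model format

Model format	Accuracy	Memory use	Device requirements	Best use case
BF16	Highest	High	GPU/CPU with BF16 support	High-speed inference with reduced memory
F16	High	High	Devices with FP16 support	GPU inference when BF16 is unavailable
Q4_K	Medium-low	Low	CPU or low-VRAM devices	Memory-constrained environments
Q6_K	Medium	Moderate	CPUs with more memory	Better accuracy among quantized models
Q8_0	High	Moderate	CPU or GPU with sufficient VRAM	Best accuracy among quantized models
IQ3_XS	Very low	Very low	Ultra-low-memory devices	Extreme memory efficiency at low accuracy
Q4_0	Low	Low	ARM or low-memory devices	llama.cpp can optimize for ARM devices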
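
To put rough numbers on the "memory use" column, weight storage can be estimated as parameters times bits per weight. The sketch below is a back-of-the-envelope calculation under stated assumptions: the bits-per-weight figures are nominal values for these GGUF types as commonly documented for llama.cpp, the parameter count is taken as a round 32 billion, and the KV cache, activations, and mixed-precision embedding/output layers are ignored. Actual file sizes will differ.

```python
# Nominal bits per weight for each format (assumed, approximate values).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_K": 4.5,
    "Q4_0": 4.5,
    "IQ3_XS": 3.3,
}

N_PARAMS = 32e9  # "32B" from the model overview, treated as a round number


def estimate_weight_gb(fmt, n_params=N_PARAMS):
    """Approximate weight storage in GB (10^9 bytes) for a given format."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:7s} ~{estimate_weight_gb(fmt):5.1f} GB")
```

Treat the output as an order-of-magnitude guide when picking a format for a given amount of RAM or VRAM, not as a guarantee that a file will fit.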

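Included files and details

OLMo-2-0325-32B-Instruct-bf16.gguf: weights stored in BF16. Suitable for devices with BF16 acceleration, and useful as a source for requantizing the model into other formats.
OLMo-2-0325-32B-Instruct-f16.gguf: weights stored in F16. Suitable for devices that support FP16, especially when BF16 is not available.
OLMo-2-0325-32B-Instruct-bf16-q8_0.gguf: output and embedding layers kept in BF16, all other layers quantized to Q8_0. Suitable for devices that support BF16.
OLMo-2-0325-32B-Instruct-f16-q8_0.gguf: output and embedding layers kept in F16, all other layers quantized to Q8_0.
OLMo-2-0325-32B-Instruct-q4_k.gguf: output and embedding layers quantized to Q8_0, all other layers quantized to Q4_K. Suitable for CPU inference with limited memory.
OLMo-2-0325-32B-Instruct-q4_k_s.gguf: the smallest Q4_K variant, trading accuracy for lower memory use. Suitable for very low-memory setups.
OLMo-2-0325-32B-Instruct-q6_k.gguf: output and embedding layers quantized to Q8_0, all other layers quantized to Q6_K.
OLMo-2-0325-32B-Instruct-q8_0.gguf: fully Q8-quantized model with higher accuracy, at the cost of more memory.
OLMo-2-0325-32B-Instruct-iq3_xs.gguf: IQ3_XS quantization, optimized for extreme memory efficiency. Suitable for ultra-low-memory devices.
OLMo-2-0325-32B-Instruct-iq3_m.gguf: IQ3_M quantization, which uses a medium block size for better accuracy. Suitable for low-memory devices.
OLMo-2-0325-32B-Instruct-q4_0.gguf: pure Q4_0 quantization, optimized for ARM devices and suitable for low-memory environments. IQ4_NL is recommended for better accuracy.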

Performance comparison

Model	Average	AlpacaEval 2 LC	BBH	DROP	GSM8k	IFEval	MATH	MMLU	Safety	PopQA	TruthQA
Closed API models
GPT-3.5 Turbo 0125	59.6	38.7	66.6	70.2	74.3	66.9	41.2	70.2	69.1	45.0	62.9
GPT 4o Mini 2024-07-18	65.7	49.7	65.9	36.3	83.0	83.5	67.9	82.2	84.9	39.0	64.8
Open-weight models
Mistral-Nemo-Instruct-2407	50.9	45.8	54.6	23.6	81.4	64.5	31.9	70.0	52.7	26.9	57.7
Ministral-8B-Instruct	52.1	31.4	56.2	56.2	80.0	56.4	40.0	68.5	56.2	20.2	55.5
Gemma-2-27b-it	61.3	49.0	72.7	67.5	80.7	63.2	35.1	70.7	75.9	33.9	64.6
Qwen2.5-32B	66.5	39.1	82.3	48.3	87.5	82.4	77.9	84.7	82.4	26.1	70.6
Mistral-Small-24B	67.6	43.2	80.1	78.5	87.2	77.3	65.9	83.7	66.5	24.4	68.1
Llama-3.1-70B	70.0	32.9	83.0	77.0	94.5	88.0	56.2	85.2	76.4	46.5	66.8
Llama-3.3-70B	73.0	36.5	85.8	78.0	93.6	90.8	71.8	85.9	70.4	48.2	66.1
Gemma-3-27b-it	-	63.4	83.7	69.2	91.1	-	-	81.8	-	30.9	-
Fully open models
OLMo-2-7B-1124-Instruct	55.7	31.0	48.5	58.9	85.2	75.6	31.3	63.9	81.2	24.6	56.3
OLMo-2-13B-1124-Instruct	61.4	37.5	58.4	72.1	87.4	80.4	39.7	68.6	77.5	28.8	63.9
OLMo-2-32B-0325-SFT	61.7	16.9	69.7	77.2	78.4	72.4	35.9	76.1	93.8	35.4	61.3
OLMo-2-32B-0325-DPO	68.8	44.1	70.2	77.5	85.7	83.8	46.8	78.0	91.9	36.4	73.5
OLMo-2-32B-0325-Instruct	68.8	42.8	70.6	78.0	87.6	85.6	49.7	77.3	85.9	37.5	73.2
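
The average column appears to be the unweighted mean of the ten benchmark columns. That is an inference from the numbers rather than something stated here, and the sketch below checks it for the three OLMo 2 32B rows using values copied from the table.

```python
# (reported average, ten benchmark scores in table order), copied from the table above.
ROWS = {
    "OLMo-2-32B-0325-SFT": (61.7, [16.9, 69.7, 77.2, 78.4, 72.4, 35.9, 76.1, 93.8, 35.4, 61.3]),
    "OLMo-2-32B-0325-DPO": (68.8, [44.1, 70.2, 77.5, 85.7, 83.8, 46.8, 78.0, 91.9, 36.4, 73.5]),
    "OLMo-2-32B-0325-Instruct": (68.8, [42.8, 70.6, 78.0, 87.6, 85.6, 49.7, 77.3, 85.9, 37.5, 73.2]),
}


def mean(scores):
    """Unweighted arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


for model, (reported, scores) in ROWS.items():
    print(f"{model}: computed {mean(scores):.2f}, reported {reported}")
```

Under this reading, the DPO and Instruct checkpoints tie on the average to one decimal place even though their individual benchmark scores differ.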

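Learning curves

[Figures: training curves; training curves (over time); core evaluation score curves; other evaluation score curves]

Reproduction command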

# clone and check out commit
git clone https://github.com/allenai/open-instruct.git
# this should be the correct commit, the main thing is to have the vllm monkey patch for
# 32b olmo https://github.com/allenai/open-instruct/blob/894ffa236319bc6c26c346240a7e4ee04ba0bd31/open_instruct/vllm_utils2.py#L37-L59
git checkout a51dc98525eec01de6e8a24c071f42dce407d738
uv sync
uv sync --extra compile

# note that you may need 5 8xH100 nodes for the training.
# so please setup ray properly, e.g., https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md#llama-31-tulu-3-70b-reproduction
python open_instruct/grpo_vllm_thread_ray_gtrl.py \
    --exp_name 0310_olmo2_32b_grpo_12818 \
    --beta 0.01 \
    --local_mini_batch_size 32 \
    --number_samples_per_prompt 16 \
    --output_dir output \
    --local_rollout_batch_size 4 \
    --kl_estimator kl3 \
    --learning_rate 5e-7 \
    --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0 \
    --dataset_mixer_list_splits train \
    --dataset_mixer_eval_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 16 \
    --dataset_mixer_eval_list_splits train \
    --max_token_length 2048 \
    --max_prompt_token_length 2048 \
    --response_length 2048 \
    --model_name_or_path allenai/OLMo-2-0325-32B-DPO \
    --non_stop_penalty \
    --stop_token eos \
    --temperature 1.0 \
    --ground_truths_key ground_truth \
    --chat_template_name tulu \
    --sft_messages_key messages \
    --eval_max_length 4096 \
    --total_episodes 10000000 \
    --penalty_reward_value 0.0 \
    --deepspeed_stage 3 \
    --no_gather_whole_model \
    --per_device_train_batch_size 2 \
    --local_rollout_forward_batch_size 2 \
    --actor_num_gpus_per_node 8 8 8 4 \
    --num_epochs 1 \
    --vllm_tensor_parallel_size 1 \
    --vllm_num_engines 12 \
    --lr_scheduler_type constant \
    --apply_verifiable_reward true \
    --seed 1 \
    --num_evals 30 \
    --save_freq 20 \
    --reward_model_multiplier 0.0 \
    --no_try_launch_beaker_eval_jobs \
    --try_launch_beaker_eval_jobs_on_weka \
    --gradient_checkpointing \
    --with_tracking

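🔧 Technical details

Quantization method

Dynamic precision allocation: the first and last 25% of layers use IQ4_XS (selected layers), and the middle 50% use IQ2_XXS/IQ3_S (for efficiency).
Critical component protection: the embedding and output layers use Q5_K, which reduces error propagation by 38% compared with standard 1-2 bit quantization.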
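
The two rules above can be pictured as a per-layer type plan. The following is a minimal sketch of that idea, not the actual IQ-DynamicGate implementation: the function name, the dictionary keys, and the choice of IQ2_XXS for every middle layer are assumptions made for illustration, since the card does not say which layers are "selected" or when IQ3_S is used instead.

```python
def plan_layer_quant(n_layers, edge_fraction=0.25,
                     edge_type="IQ4_XS", middle_type="IQ2_XXS",
                     protected_type="Q5_K"):
    """Assign a quantization type to each transformer layer (illustrative only).

    The first and last `edge_fraction` of layers get `edge_type`, the
    remaining middle layers get `middle_type`, and the embedding and
    output layers get `protected_type`.
    """
    n_edge = int(n_layers * edge_fraction)
    layers = []
    for index in range(n_layers):
        is_edge = index < n_edge or index >= n_layers - n_edge
        layers.append(edge_type if is_edge else middle_type)
    return {
        "token_embedding": protected_type,  # hypothetical key name
        "output": protected_type,           # hypothetical key name
        "layers": layers,
    }


# Llama-3-8B, the model used for the benchmarks, has 32 transformer layers.
plan = plan_layer_quant(32)
print(plan["layers"][:8], plan["layers"][8:24], plan["layers"][24:], sep="\n")
```

With 32 layers, this yields 8 higher-precision layers at each end and 16 low-bit layers in the middle.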

Model training

The model was trained on five 8xH100 nodes, and an intermediate checkpoint was saved every 20 training steps.
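
The five-node figure is consistent with the flags in the reproduction command. The sketch below adds up the GPU budget under two assumptions that the card does not state explicitly: each vLLM engine occupies `vllm_tensor_parallel_size` GPUs, and training actors and vLLM engines do not share GPUs.

```python
import math

# Values copied from the reproduction command.
actor_num_gpus_per_node = [8, 8, 8, 4]  # --actor_num_gpus_per_node
vllm_num_engines = 12                   # --vllm_num_engines
vllm_tensor_parallel_size = 1           # --vllm_tensor_parallel_size
gpus_per_node = 8                       # 8xH100 nodes


def total_gpus(actor_gpus, num_engines, tensor_parallel_size):
    """GPUs for training actors plus GPUs for vLLM rollout engines."""
    return sum(actor_gpus) + num_engines * tensor_parallel_size


needed = total_gpus(actor_num_gpus_per_node, vllm_num_engines, vllm_tensor_parallel_size)
nodes = math.ceil(needed / gpus_per_node)
print(f"{needed} GPUs -> {nodes} nodes of {gpus_per_node} GPUs")
```

With these flags, 28 actor GPUs plus 12 engine GPUs fill exactly five 8-GPU nodes.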

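📄 License

OLMo 2 is licensed under Apache 2.0 and is intended for research and educational use. For more information, see our Responsible Use Guidelines. The model was fine-tuned on a dataset that includes outputs generated by third-party models, which are subject to additional terms: the Gemma Terms of Use.

Citation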

@article{olmo20242olmo2furious,
      title={2 OLMo 2 Furious}, 
      author={Team OLMo and Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Michal Guerquin and Hamish Ivison and Pang Wei Koh and Jiacheng Liu and Saumya Malik and William Merrill and Lester James V. Miranda and Jacob Morrison and Tyler Murray and Crystal Nam and Valentina Pyatkin and Aman Rangapur and Michael Schmitz and Sam Skjonsberg and David Wadden and Christopher Wilhelm and Michael Wilson and Luke Zettlemoyer and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
      year={2024},
      eprint={2501.00656},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.00656}, 
}

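Testing the models

If you find these models useful, please click "Like"! Help me test my AI-powered network monitor assistant with quantum-ready security checks: 👉 Free Network Monitor

💬 How to test:

Click the chat icon at the bottom right of any page.
Choose an AI assistant type:
- TurboLLM (GPT-4-mini)
- FreeLLM (open source)
- TestLLM (experimental, CPU only)

What I'm testing

I'm exploring the limits of small open-source models for AI network monitoring, specifically:

Function calling against live network services.
How small a model can be while still handling:
- Automated Nmap scans.
- Quantum-readiness checks.
- Metasploit integration.

🟡 TestLLM, the current experimental model (llama.cpp on 6 CPU threads):

✅ Zero-configuration setup.
⏳ 30-second load time (inference is slow, but there are no API costs).
🔧 Help wanted! If you are interested in edge-device AI, let's collaborate!

Other assistants

🟢 TurboLLM uses gpt-4-mini for:

Real-time network diagnostics.
Automated penetration testing (Nmap/Metasploit).
🔑 Get more tokens by downloading our free network monitor agent.

🔵 HugLLM, open-source models (about 8B parameters):

2x more tokens than TurboLLM.
AI-powered log analysis.
🌐 Runs on the Hugging Face Inference API.

💡 Example AI commands to test:

"Give me info on my website's SSL certificate"
"Check if my server is using quantum safe encryption for communication"
"Run a quick Nmap vulnerability test"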

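Model information

Attribute	Details
Model type	A model trained on a mix of publicly available, synthetic, and human-created datasets
Language	Primarily English
License	Apache 2.0
Fine-tuned from	allenai/OLMo-2-0325-32B-DPO
Project page	https://allenai.org/olmo
Repositories	Core repo (training, inference, fine-tuning, etc.): https://github.com/allenai/OLMo-core; evaluation code: https://github.com/allenai/olmes; further fine-tuning code: https://github.com/allenai/open-instruct
Paper	https://arxiv.org/abs/2501.00656
Demo	https://playground.allenai.org/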