Granite 3.2-8B-Instruct (open source): a free instruction-tuned language model for efficient inference

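Granite 3.2 8b Instruct GGUF

GGUF build published by Mungert

An 8B-parameter instruction-tuned language model from the IBM Granite family, packaged as GGUF with IQ-DynamicGate ultra-low-bit quantization for efficient inference.

Large language model. License: Apache-2.0. Tags: #ultra-low-bit-quantization #precision-adaptive #edge-device-inference

Downloads: 1,048

Released: 3/19/2025

Model Overview

A mid-sized language model from the IBM Granite family, instruction-tuned for text generation. The GGUF builds use IQ-DynamicGate quantization, which aims to preserve accuracy at 1-2 bit precision.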

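Model Features

IQ-DynamicGate quantization

A precision-adaptive quantization method for 1-2 bit models that uses a layer-specific strategy to preserve accuracy while staying memory-efficient.

Mixed-precision allocation

The first 25% and last 25% of layers use IQ4_XS, the middle 50% use IQ2_XXS/IQ3_S, and key components (embedding and output layers) are protected with Q5_K.

Efficient inference

Optimized for CPUs and low-VRAM devices, with multiple quantized variants to fit different hardware.

Model Capabilities

Text generation

Instruction following

Low-resource inference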

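Use Cases

Edge computing

AI assistants on mobile devices

Deploy an assistant on memory-constrained mobile devices.

In the author's Llama-3-8B benchmark, the IQ1_M variant reduces perplexity by 43.9%.

Research and development

Ultra-low-bit quantization research

A test bed for 1-2 bit quantization techniques.

In the same benchmark, the IQ2_S variant reduces perplexity by 36.9% while adding only 0.2 GB.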

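🚀 Granite-3.2-8B-Instruct GGUF Models

Granite-3.2-8B-Instruct is an 8-billion-parameter, long-context AI model fine-tuned for thinking capabilities. It is built on Granite-3.1-8B-Instruct and trained on a mix of permissively licensed open-source datasets and internally generated synthetic data designed for reasoning tasks. The model's thinking capability can be switched on or off, so it is applied only when required.

Model Information

Attribute	Details
Developers	Granite Team, IBM
Website	Granite Docs
Release date	February 26, 2025
License	Apache 2.0
Supported languages	English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, and Chinese. Users may fine-tune the model for languages beyond these 12.
Intended use	The model is designed to handle general instruction-following tasks and can be integrated into AI assistants across various domains, including business applications.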

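🚀 Quick Start

Install dependencies

pip install torch torchvision torchaudio
pip install accelerate
pip install transformers

Code example (loads the original ibm-granite/granite-3.2-8b-instruct checkpoint with Transformers, not a GGUF file)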

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_path = "ibm-granite/granite-3.2-8b-instruct"
device = "cuda"  # requires a CUDA-capable GPU

# Load the model in bfloat16 and place it on the target device.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map=device,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

conv = [{"role": "user", "content": "You have 10 liters of a 30% acid solution. How many liters of a 70% acid solution must be added to achieve a 50% acid mixture?"}]

# thinking=True is forwarded to the chat template and enables the reasoning trace;
# set it to False for a direct answer.
inputs = tokenizer.apply_chat_template(
    conv,
    return_tensors="pt",
    thinking=True,
    return_dict=True,
    add_generation_prompt=True,
).to(device)

set_seed(42)  # fix the random seed for reproducibility
output = model.generate(
    **inputs,
    max_new_tokens=8192,
)

# Decode only the newly generated tokens, skipping the prompt.
prediction = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(prediction)

Example output

thinking=True

Here is my thought process:

1. We know we start with 10 liters of a 30% acid solution, so there's 3 liters of pure acid in it (10 liters * 0.3).
2. We need to determine how much of the 70% acid solution to add such that the final mixture is 50% acid.
3. Let's say x liters of the 70% acid solution are added. This will contribute 0.7x liters of pure acid.
4. The total volume of the new mixture will be 10 + x liters, and the total amount of pure acid will be 3 + 0.7x liters.
5. We need to set up an equation where (3 + 0.7x) / (10 + x) = 0.5 (since we want a 50% acid mixture).
6. Solving this equation for x will give us the volume of 70% solution needed.

Here is my response:

To solve this, we need to set up an equation based on the concentration of acid before and after mixing. 

1. **Initial Acid Content**: 
   - We start with 10 liters of a 30% acid solution. 
   - Pure acid in this = 10 * 0.3 = 3 liters.

2. **Variable for Addition**:
   - Let x be the liters of 70% acid solution to be added.
   - Pure acid in x liters of 70% solution = 0.7x liters.

3. **Final Mixture Concentration**:
   - Total volume of mixture = 10 + x liters.
   - Total pure acid in the mixture = 3 (from initial solution) + 0.7x (from added solution).

4. **Setting up the equation for 50% acid concentration**:
   - We require that the ratio of pure acid to total volume is 0.5 (50%).
   - Therefore, the equation is:

   \[
   \frac{3 + 0.7x}{10 + x} = 0.5
   \]

5. **Solving the equation**:

   Multiply both sides by (10 + x) to clear the fraction:

   \[
   3 + 0.7x = 0.5(10 + x)
   \]

   Expanding the right side:

   \[
   3 + 0.7x = 5 + 0.5x
   \]

   Subtract 0.5x from both sides:

   \[
   3 + 0.2x = 5
   \]

   Subtract 3 from both sides:

   \[
   0.2x = 2
   \]

   Divide by 0.2:

   \[
   x = 10
   \]

So, 10 liters of the 70% acid solution must be added to the initial 10 liters of 30% solution to get a 50% acid mixture. 

Let's verify: 

- Initial acid: 3 liters
- Acid from added solution: 0.7 * 10 = 7 liters
- Total acid: 3 + 7 = 10 liters
- Total volume: 10 + 10 = 20 liters
- Acid percentage: \(\frac{10}{20} = 0.5 = 50\%\)

This confirms our solution. 

Final answer: 10 liters of 70% acid solution should be added.

thinking=False

Let's denote the amount of 70% acid solution to be added as x liters.

We start with 10 liters of a 30% acid solution, which means there are 0.3 * 10 = 3 liters of pure acid in it.

We want to end up with a 50% acid solution after adding x liters of 70% acid solution. The total volume of the mixture will be 10 + x liters, and we want half of this volume to be pure acid.

So, the amount of pure acid in the final mixture should be 0.5 * (10 + x) liters.

The amount of pure acid in the final mixture comes from two sources: the original 30% solution and the added 70% solution. So, we can set up the following equation:

3 (from the original solution) + 0.7x (from the added solution) = 0.5 * (10 + x)

Now, let's solve for x:

3 + 0.7x = 5 + 0.5x
0.7x - 0.5x = 5 - 3
0.2x = 2
x = 2 / 0.2
x = 10

So, you need to add 10 liters of a 70% acid solution to the 10 liters of a 30% acid solution to get a 50% acid mixture.
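
In the thinking=True output above, the reasoning and the answer are separated by the lines "Here is my thought process:" and "Here is my response:". The following is a minimal sketch for splitting the two parts, assuming those markers appear verbatim in the generated text. The helper name `split_thinking` is hypothetical and not part of the model's or the library's API.

```python
THOUGHT_MARKER = "Here is my thought process:"
RESPONSE_MARKER = "Here is my response:"


def split_thinking(text: str) -> tuple[str, str]:
    """Return (thought, response). The thought is empty when the markers are absent."""
    t = text.find(THOUGHT_MARKER)
    r = text.find(RESPONSE_MARKER)
    # Fall back to treating the whole text as the response
    # (this is what a thinking=False output looks like).
    if t == -1 or r == -1 or r < t:
        return "", text.strip()
    thought = text[t + len(THOUGHT_MARKER):r].strip()
    response = text[r + len(RESPONSE_MARKER):].strip()
    return thought, response


sample = (
    "Here is my thought process:\n\n"
    "1. Set up (3 + 0.7x) / (10 + x) = 0.5.\n\n"
    "Here is my response:\n\n"
    "x = 10"
)
thought, response = split_thinking(sample)
print(response)
```

A thinking=False output contains neither marker, so the helper returns it unchanged as the response.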

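✨ Key Features

Ultra-Low-Bit Quantization with IQ-DynamicGate (1-2 bit)

Our latest quantization method introduces precision-adaptive quantization for ultra-low-bit models (1-2 bit), with benchmark-proven improvements on Llama-3-8B. The approach uses layer-specific strategies to preserve accuracy while maintaining extreme memory efficiency.

Benchmark context

All tests were conducted on Llama-3-8B-Instruct (not on Granite-3.2-8B-Instruct), using:

Standard perplexity evaluation pipeline
2048-token context window
The same prompt set across all quantizations

Method

Dynamic precision allocation:
- First/last 25% of layers → IQ4_XS (selected layers)
- Middle 50% → IQ2_XXS/IQ3_S (increases efficiency)
Critical component protection:
- Embedding/output layers use Q5_K
- Reduces error propagation by 38% compared with standard 1-2 bit quantization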
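
The allocation rule above can be sketched as a small function that maps each transformer block to a quantization type. This is an illustration only, not the author's implementation: the function name, the rounding rule, the choice of IQ2_XXS over IQ3_S for the middle blocks, and the component keys are all assumptions, and it ignores the "selected layers" refinement.

```python
def assign_quant_types(n_layers: int, edge_fraction: float = 0.25) -> list[str]:
    """Assign a quantization type name to every transformer block index."""
    edge = round(n_layers * edge_fraction)  # rounding rule is an assumption
    plan = []
    for i in range(n_layers):
        if i < edge or i >= n_layers - edge:
            plan.append("IQ4_XS")   # first/last 25%: higher precision
        else:
            plan.append("IQ2_XXS")  # middle 50%: lowest precision
    return plan


# Components protected outside the per-block plan (keys are illustrative).
PROTECTED = {"embedding": "Q5_K", "output": "Q5_K"}

plan = assign_quant_types(32)  # Llama-3-8B has 32 transformer blocks
print(plan.count("IQ4_XS"), plan.count("IQ2_XXS"))
```

For a 32-block model this puts blocks 0-7 and 24-31 at IQ4_XS and blocks 8-23 at IQ2_XXS.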

Quantization performance comparison (Llama-3-8B)

Quantization	Standard PPL	DynamicGate PPL	Δ PPL	Standard size	DG size	Δ size	Standard speed	DG speed
IQ2_XXS	11.30	9.84	-12.9%	2.5G	2.6G	+0.1G	234s	246s
IQ2_XS	11.72	11.63	-0.8%	2.7G	2.8G	+0.1G	242s	246s
IQ2_S	14.31	9.02	-36.9%	2.7G	2.9G	+0.2G	238s	244s
IQ1_M	27.46	15.41	-43.9%	2.2G	2.5G	+0.3G	206s	212s
IQ1_S	53.07	32.00	-39.7%	2.1G	2.4G	+0.3G	184s	209s

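Key:

PPL = perplexity (lower is better)
Δ PPL = percentage change in perplexity from standard to DynamicGate quantization
Speed = inference time (CPU AVX2, 2048-token context)
Size differences reflect the overhead of mixed quantization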
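
The perplexity-change column can be reproduced from the two perplexity columns as (DynamicGate − standard) / standard. A short sketch using the table's values; the variable and function names are illustrative:

```python
# (standard PPL, DynamicGate PPL) taken from the table above
results = {
    "IQ2_XXS": (11.30, 9.84),
    "IQ2_XS": (11.72, 11.63),
    "IQ2_S": (14.31, 9.02),
    "IQ1_M": (27.46, 15.41),
    "IQ1_S": (53.07, 32.00),
}


def ppl_change_pct(standard: float, dynamic_gate: float) -> float:
    """Percentage change in perplexity; negative means DynamicGate is better."""
    return (dynamic_gate - standard) / standard * 100.0


for name, (std, dg) in results.items():
    print(f"{name}: {ppl_change_pct(std, dg):+.2f}%")
```

The computed values agree with the table to within 0.1 percentage points. For IQ2_S the formula gives about -36.97%, which the table reports as -36.9%.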

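Key improvements:

🔥 IQ1_M shows a 43.9% perplexity reduction (27.46 → 15.41)
🚀 IQ2_S cuts perplexity by 36.9% while adding only 0.2 GB
⚡ IQ1_S achieves 39.7% lower perplexity despite 1-bit quantization

Trade-offs:

All variants grow modestly in size (0.1-0.3 GB)
Inference is slightly slower: about 2-5% for four of the five variants and about 14% for IQ1_S (184s → 209s)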
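
To check these trade-offs against the table, the relative overhead in run time and file size can be computed per variant. A minimal sketch using the table's figures; the names are illustrative:

```python
# (standard, DynamicGate) pairs from the table: seconds and gigabytes
timings = {"IQ2_XXS": (234, 246), "IQ2_XS": (242, 246), "IQ2_S": (238, 244),
           "IQ1_M": (206, 212), "IQ1_S": (184, 209)}
sizes = {"IQ2_XXS": (2.5, 2.6), "IQ2_XS": (2.7, 2.8), "IQ2_S": (2.7, 2.9),
         "IQ1_M": (2.2, 2.5), "IQ1_S": (2.1, 2.4)}


def overhead_pct(standard: float, dynamic_gate: float) -> float:
    """Relative increase of the DynamicGate figure over the standard one."""
    return (dynamic_gate - standard) / standard * 100.0


for name in timings:
    slowdown = overhead_pct(*timings[name])
    growth = overhead_pct(*sizes[name])
    print(f"{name}: time +{slowdown:.1f}%, size +{growth:.1f}%")

slowest = max(timings, key=lambda n: overhead_pct(*timings[n]))
```

With these figures the slowdown ranges from under 2% (IQ2_XS) to roughly 14% (IQ1_S), so the 1-bit variants pay the largest relative cost in both time and size.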

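When to use these models

📌 Fitting the model into GPU VRAM

✔ Memory-constrained deployments

✔ CPU and edge devices where 1-2 bit errors can be tolerated

✔ Research into ultra-low-bit quantization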

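📚 Documentation

Choosing the Right Model Format

Selecting the correct model format depends on your hardware capabilities and memory constraints.

BF16 (Brain Float 16) – use if BF16 acceleration is available

A 16-bit floating-point format designed for faster computation while retaining good precision.
Provides a dynamic range similar to FP32 with lower memory usage.
Recommended if your hardware supports BF16 acceleration (check your device's specs).
Suited to high-performance inference with a smaller memory footprint than FP32.

📌 Use BF16 if:
✔ Your hardware has native BF16 support (e.g., newer GPUs, TPUs).
✔ You want higher precision while saving memory.
✔ You plan to requantize the model into another format.

📌 Avoid BF16 if:
❌ Your hardware does not support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.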

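F16 (Float 16) – more widely supported than BF16

A 16-bit floating-point format with high precision but a narrower dynamic range than BF16.
Works on most devices with FP16 acceleration (including many GPUs and some CPUs).
Has more mantissa bits than BF16 (10 versus 7), so individual values are represented more finely, but very large or very small values can overflow or underflow; generally sufficient for inference.

📌 Use F16 if:
✔ Your hardware supports FP16 but not BF16.
✔ You need a balance between speed, memory usage, and accuracy.
✔ You are running on a GPU or another device optimized for FP16 computation.

📌 Avoid F16 if:
❌ Your device lacks native FP16 support (it may run slower than expected).
❌ You are memory-constrained.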
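
The range-versus-precision difference between the two 16-bit formats can be seen with the standard library alone. FP16 has 5 exponent bits and 10 mantissa bits; BF16 has 8 exponent bits and 7 mantissa bits. The sketch below is illustrative: `to_bf16` truncates a float32 to its top 16 bits, a simplification (real hardware typically rounds to nearest), and both helper names are hypothetical.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; return infinity if it is out of range."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Convert x to float32, then keep only the top 16 bits (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Precision: FP16 resolves 1.001 more finely than BF16.
print(to_fp16(1.001), to_bf16(1.001))
# Range: 70000 exceeds the FP16 maximum (65504) but fits in BF16.
print(to_fp16(70000.0), to_bf16(70000.0))
```

Near 1.0 the FP16 step is 2^-10 while the BF16 step is 2^-7, which is why BF16 collapses 1.001 to 1.0; conversely, BF16 keeps 70000 finite (as 69632 under truncation) where FP16 overflows.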

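Quantized models (Q4_K, Q6_K, Q8, etc.) – for CPU and low-VRAM inference

Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

Lower-bit models (Q4_K) → best for minimal memory usage; may have lower precision.
Higher-bit models (Q6_K, Q8_0) → better accuracy; require more memory.

📌 Use quantized models if:
✔ You are running inference on a CPU and need an optimized model.
✔ Your device has too little VRAM to load a full-precision model.
✔ You want to reduce the memory footprint while keeping reasonable accuracy.

📌 Avoid quantized models if:
❌ You need maximum accuracy (full-precision models are better for this).
❌ Your hardware has enough VRAM for a higher-precision format (BF16/F16).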
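
A rough way to compare formats is to estimate the weight size as parameters × bits per weight / 8. The sketch below assumes a round 8 billion parameters and uses bits-per-weight figures derived from the llama.cpp block layouts as I understand them (for example, a Q8_0 block stores 32 weights in 34 bytes). Treat the results as approximations: real files mix tensor types, and the KV cache and runtime buffers need additional memory.

```python
# Approximate bits per weight for each format (assumed, see note above).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.5625,
    "Q4_K": 4.5,
    "Q4_0": 4.5,
}


def estimate_weight_gb(n_params: float, fmt: str) -> float:
    """Estimated size of the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


N_PARAMS = 8e9  # nominal parameter count; the exact figure differs slightly
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: ~{estimate_weight_gb(N_PARAMS, fmt):.1f} GB")
```

Under these assumptions the weights shrink from about 16 GB at BF16/F16 to about 8.5 GB at Q8_0 and about 4.5 GB at Q4_K.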

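Very low-bit quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)

These models are optimized for extreme memory efficiency, making them suitable for low-power devices or large-scale deployments where memory is the critical constraint.

IQ3_XS: ultra-low-bit quantization (3-bit) with extreme memory efficiency.
- Use case: ultra-low-memory devices where even Q4_K is too large.
- Trade-off: lower accuracy than higher-bit quantizations.
IQ3_S: small block size for maximum memory efficiency.
- Use case: low-memory devices where IQ3_XS is too aggressive.
IQ3_M: medium block size for better accuracy than IQ3_S.
- Use case: low-memory devices where IQ3_S is too limiting.
Q4_K: 4-bit quantization with block-wise optimization for better accuracy.
- Use case: low-memory devices where Q6_K is too large.
Q4_0: pure 4-bit quantization, optimized for ARM devices.
- Use case: ARM-based devices or low-memory environments.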

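Summary table: model format selection

Model format	Precision	Memory usage	Device requirements	Best use case
BF16	Highest	High	BF16-capable GPU/CPU	High-speed inference with reduced memory
F16	High	High	FP16-capable devices	GPU inference when BF16 is unavailable
Q4_K	Medium-low	Low	CPU or low-VRAM devices	Memory-constrained environments
Q6_K	Medium	Moderate	CPUs with more memory	Better accuracy among quantized models
Q8_0	High	Moderate	CPU or GPU with enough VRAM	Best accuracy among quantized models
IQ3_XS	Very low	Very low	Ultra-low-memory devices	Extreme memory efficiency at low accuracy
Q4_0	Low	Low	ARM or low-memory devices	llama.cpp can optimize for ARM devices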
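
The table's guidance can be turned into a simple decision helper. This is a sketch only: the function name and the memory thresholds are assumptions chosen for an 8B model, not values from the author, and should be adjusted to the actual file sizes and context length in use.

```python
def pick_format(has_bf16: bool, has_fp16: bool, free_mem_gb: float,
                arm_cpu: bool = False) -> str:
    """Suggest a model format from hardware support and available memory.

    All thresholds below are illustrative assumptions for an 8B model.
    """
    if free_mem_gb >= 18:
        if has_bf16:
            return "BF16"
        if has_fp16:
            return "F16"
    if free_mem_gb >= 10:
        return "Q8_0"     # best accuracy among the quantized options
    if free_mem_gb >= 8:
        return "Q6_K"
    if free_mem_gb >= 6:
        return "Q4_0" if arm_cpu else "Q4_K"
    return "IQ3_XS"       # extreme memory efficiency, lowest accuracy


print(pick_format(has_bf16=True, has_fp16=True, free_mem_gb=24))
print(pick_format(has_bf16=False, has_fp16=False, free_mem_gb=7, arm_cpu=True))
```

The ordering mirrors the table: prefer 16-bit formats when both hardware support and memory allow, then step down through Q8_0, Q6_K, and the 4-bit formats before falling back to IQ3_XS.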

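Included files and details

`granite-3.2-8b-instruct-bf16.gguf`

Model weights stored in BF16.
Use this file if you want to requantize the model into a different format.
Best choice if your device supports BF16 acceleration.

`granite-3.2-8b-instruct-f16.gguf`

Model weights stored in F16.
Use this file if your device supports FP16, especially when BF16 is unavailable.

`granite-3.2-8b-instruct-bf16-q8_0.gguf`

Output and embedding layers remain in BF16.
All other layers are quantized to Q8_0.
Use this file if your device supports BF16 and you want a quantized version.

`granite-3.2-8b-instruct-f16-q8_0.gguf`

Output and embedding layers remain in F16.
All other layers are quantized to Q8_0.

`granite-3.2-8b-instruct-q4_k.gguf`

Output and embedding layers are quantized to Q8_0.
All other layers are quantized to Q4_K.
Suitable for CPU inference with limited memory.

`granite-3.2-8b-instruct-q4_k_s.gguf`

The smallest Q4_K variant; uses less memory at the cost of accuracy.
Best for very low-memory setups.

`granite-3.2-8b-instruct-q6_k.gguf`

Output and embedding layers are quantized to Q8_0.
All other layers are quantized to Q6_K.

`granite-3.2-8b-instruct-q8_0.gguf`

Fully Q8-quantized model for better accuracy.
Requires more memory but offers higher precision.

`granite-3.2-8b-instruct-iq3_xs.gguf`

IQ3_XS quantization, optimized for extreme memory efficiency.
Best for ultra-low-memory devices.

`granite-3.2-8b-instruct-iq3_m.gguf`

IQ3_M quantization, with a medium block size for better accuracy.
Suitable for low-memory devices.

`granite-3.2-8b-instruct-q4_0.gguf`

Pure Q4_0 quantization, optimized for ARM devices.
Best for low-memory environments.
For better accuracy, prefer IQ4_NL.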
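
One possible way to run one of these files is through the llama-cpp-python bindings, which are not mentioned in the original text and are an assumption here (installed with `pip install llama-cpp-python`). The sketch below builds a file name from the naming pattern in the list above and sends a single chat message. The helper names, context size, thread count, and token limit are illustrative choices.

```python
import os


def gguf_filename(variant: str) -> str:
    """Build a file name following the naming pattern of the list above."""
    return f"granite-3.2-8b-instruct-{variant}.gguf"


def run_chat(model_path: str, prompt: str, n_threads: int = 6) -> str:
    """Load a GGUF file and return the model's reply to one user message."""
    # Imported lazily so gguf_filename works without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(model_path=model_path, n_ctx=2048, n_threads=n_threads)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return result["choices"][0]["message"]["content"]


path = gguf_filename("q4_k")
if os.path.exists(path):  # only runs when the file has been downloaded
    print(run_chat(path, "Summarize what GGUF quantization trades off."))
```

The file must be downloaded to the working directory first; the guard at the end skips inference when it is absent.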

🔧 Technical Details

Evaluation results

Model	ArenaHard	Alpaca-Eval-2	MMLU	PopQA	TruthfulQA	BigBenchHard	DROP	GSM8K	HumanEval	HumanEval+	IFEval	AttaQ
Llama-3.1-8B-Instruct	36.43	27.22	69.15	28.79	52.79	72.66	61.48	83.24	85.32	80.15	79.10	83.43
DeepSeek-R1-Distill-Llama-8B	17.17	21.85	45.80	13.25	47.43	65.71	44.46	72.18	67.54	62.91	66.50	42.87
Qwen-2.5-7B-Instruct	25.44	30.34	74.30	18.12	63.06	70.40	54.71	84.46	93.35	89.91	74.90	81.90
DeepSeek-R1-Distill-Qwen-7B	10.36	15.35	50.72	9.94	47.14	65.04	42.76	78.47	79.89	78.43	59.10	42.45
Granite-3.1-8B-Instruct	37.58	30.34	66.77	28.7	65.84	68.55	50.78	79.15	89.63	85.79	73.20	85.73
Granite-3.1-2B-Instruct	23.3	27.17	57.11	20.55	59.79	54.46	18.68	67.55	79.45	75.26	63.59	84.7
Granite-3.2-2B-Instruct	24.86	34.51	57.18	20.56	59.8	52.27	21.12	67.02	80.13	73.39	61.55	83.23
Granite-3.2-8B-Instruct	55.25	61.19	66.79	28.04	66.92	64.77	50.95	81.65	89.35	85.72	74.31	85.42

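Training data

Overall, our training data consists of two key sources: (1) publicly available datasets with permissive licenses and (2) internally generated synthetic data targeted at enhancing reasoning capabilities.

Infrastructure

We trained Granite-3.2-8B-Instruct on Blue Vela, IBM's supercomputing cluster, which is equipped with NVIDIA H100 GPUs. The cluster provides a scalable and efficient infrastructure for training our models across thousands of GPUs.

Ethical considerations and limitations

Granite-3.2-8B-Instruct builds on Granite-3.1-8B-Instruct, using both permissively licensed open-source data and select proprietary data for enhanced performance. Because it inherits its foundation from the previous model, all ethical considerations and limitations that apply to Granite-3.1-8B-Instruct remain applicable.

📄 License

This project is licensed under Apache 2.0.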

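Other Information

Testing the AI network monitoring assistant

If you find these models useful, please help test my AI-powered network monitoring assistant with quantum-ready security checks: 👉 Free Network Monitor

How to test

Click the chat icon (bottom right of any page)
Choose an AI assistant type:
- TurboLLM (GPT-4-mini)
- FreeLLM (open source)
- TestLLM (experimental, CPU-only)

What I'm testing

I'm pushing the limits of small open-source models for AI network monitoring, specifically:

Function calling against live network services
How small a model can get while still handling:
- Automated Nmap scans
- Quantum-readiness checks
- Metasploit integration

TestLLM, the experimental model (llama.cpp on 6 CPU threads)

✅ Zero-configuration setup
⏳ 30s load time (slow inference, but no API costs)
🔧 Help wanted! If you're interested in edge-device AI, let's collaborate!

Other assistants

🟢 TurboLLM – uses gpt-4-mini for:
- Real-time network diagnostics
- Automated penetration testing (Nmap/Metasploit)
- 🔑 Get more tokens by downloading our free network monitoring agent
🔵 HugLLM – open-source models (≈8B parameters):
- 2x more tokens than TurboLLM
- AI-powered log analysis
- 🌐 Runs on the Hugging Face Inference API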