Llama-3.1-Nemotron-Nano-8B-v1-GGUF open-source reasoning model - enhanced chat and task-execution capabilities

Llama 3.1 Nemotron Nano 8B V1 GGUF

GGUF quantization published by unsloth (the underlying model is developed by NVIDIA)

Llama-3.1-Nemotron-Nano-8B-v1 is a reasoning model based on Meta Llama-3.1-8B-Instruct, post-trained to improve reasoning, human chat preferences, and task execution.

Large language model

Transformers

Language: English
License: Other
Tags: #ReinforcementLearningOptimization #128KLongContextReasoning #MathAndCodeEnhancement

Downloads: 22.18k

Published: 5/11/2025

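Model overview

This is a large language model (LLM) that offers a good trade-off between accuracy and efficiency. It supports a 128K context length and is intended for English and programming languages.

Model features

Enhanced reasoning

A multi-phase post-training process, including supervised fine-tuning and reinforcement learning, substantially improves math, code, and reasoning capabilities.

Efficient inference

Runs on a single RTX GPU, which makes it suitable for local deployment while balancing compute efficiency against model accuracy.

Long-context support

Supports a context length of 128K tokens, suitable for long documents and complex tasks.

Dual-mode reasoning

Supports "reasoning on" and "reasoning off" modes to suit different scenarios.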

Model capabilities

Text generation

Math reasoning

Code generation

Instruction following

Chat

Tool calling

RAG system support

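Use cases

AI agent systems

Intelligent chatbots

Build AI assistants that understand complex instructions and hold natural conversations.

Scores 8.1 on MT-Bench (reasoning on).

Education

Math problem solving

Solve complex math problems and provide step-by-step explanations.

Reaches 95.4% pass@1 on MATH500 (reasoning on).

Software development

Code generation and assistance

Generate working code from a description, or help with debugging.

Reaches 84.6% pass@1 on MBPP 0-shot.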

🚀 Llama-3.1-Nemotron-Nano-8B-v1

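Llama-3.1-Nemotron-Nano-8B-v1 is a large language model that strikes a good balance between reasoning capability and efficiency, and it suits a wide range of AI applications.

Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quantization methods.

🚀 Quick start

You can try this model through the preview API at this link: Llama-3.1-Nemotron-Nano-8B-v1.

The snippets below use the Hugging Face Transformers library. The reasoning mode (on/off) is controlled through the system prompt, as each example shows. The code requires version 4.44.2 or later of the transformers package.

Basic usage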

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token_id = tokenizer.eos_token_id

# Sampling settings recommended for reasoning-on mode
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    **model_kwargs
)

# Reasoning mode can be "on" or "off"
thinking = "on"

# The system prompt selects the reasoning mode
print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))

Advanced usage

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-8B-v1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
# Reuse the end-of-sequence token as the padding token
tokenizer.pad_token_id = tokenizer.eos_token_id

# Reasoning mode can be "on" or "off"
thinking = "off"

# Greedy decoding (do_sample=False) is recommended for reasoning-off mode
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=32768,
    do_sample=False,
    **model_kwargs
)

# The pre-filled assistant turn contains an empty <think> block
print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}, {"role":"assistant", "content":"<think>\n</think>"}]))
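
The two snippets above differ mainly in the message list they pass to the pipeline. As a minimal sketch, the helper below builds that list for either mode. The name `build_messages` is hypothetical and not part of the model card or of Transformers. The pre-filled empty `<think>` turn simply mirrors the advanced example.

```python
from typing import Dict, List


def build_messages(user_prompt: str, thinking: bool) -> List[Dict[str, str]]:
    """Build a chat message list for the requested reasoning mode.

    Hypothetical helper: it only reproduces the message shapes used in the
    snippets above and does not call the model.
    """
    mode = "on" if thinking else "off"
    messages = [
        # The system prompt is what switches the reasoning mode.
        {"role": "system", "content": f"detailed thinking {mode}"},
        {"role": "user", "content": user_prompt},
    ]
    if not thinking:
        # Mirror the advanced example: pre-fill an empty reasoning block.
        messages.append({"role": "assistant", "content": "<think>\n</think>"})
    return messages


if __name__ == "__main__":
    for message in build_messages("Solve x*(sin(x)+2)=0", thinking=False):
        print(message)
```

The returned list can be passed directly as the argument to `pipeline(...)` in either snippet.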

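✨ Key features

Strong reasoning: a multi-phase post-training process improves both reasoning and non-reasoning capabilities, targeting reasoning, human chat preferences, and tasks such as RAG and tool calling.
Balance of accuracy and efficiency: built from Llama 3.1 8B Instruct with improved accuracy, the model fits on a single RTX GPU and can be used locally.
Long-context support: supports a context length of 128K tokens.
Multilingual support: intended for English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported.

📦 Installation

The source documentation gives no specific installation steps. Follow the Hugging Face Transformers examples in the Quick start section, and make sure the transformers package is version 4.44.2 or later.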

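📚 Detailed documentation

Model overview

Llama-3.1-Nemotron-Nano-8B-v1 is a large language model (LLM) derived from Meta Llama-3.1-8B-Instruct (the reference model). It is a reasoning model that has been post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling.

The model offers a good trade-off between accuracy and efficiency. It is built from Llama 3.1 8B Instruct with improved accuracy, fits on a single RTX GPU for local use, and supports a context length of 128K tokens.

The model went through a multi-phase post-training process to strengthen both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for math, code, reasoning, and tool calling, followed by multiple reinforcement learning (RL) stages that use the REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for chat and instruction following. The final model checkpoint is obtained by merging the final SFT and Online RPO checkpoints. Improved using Qwen.

This model is part of the Llama Nemotron collection. You can find the other model in this family here: Llama-3.3-Nemotron-Super-49B-v1.

This model is ready for commercial use.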

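License/Terms of use

Governing terms: your use of this model is governed by the NVIDIA Open Model License.
Additional information: Llama 3.1 Community License Agreement. Built with Llama.

Model developer: NVIDIA

Model dates: trained between August 2024 and March 2025

Data freshness: the pretraining data has a cutoff of 2023, per Meta Llama 3.1 8B.

Use case

Intended for developers designing AI agent systems, chatbots, RAG systems, and other AI-powered applications, as well as for typical instruction-following tasks. It balances model accuracy and compute efficiency (the model fits on a single RTX GPU and can be used locally).

Release date

March 18, 2025

References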

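Model architecture

Property	Details
Architecture type	Dense decoder-only Transformer model
Network architecture	Llama 3.1 8B Instruct

Intended use

Llama-3.1-Nemotron-Nano-8B-v1 is a general-purpose reasoning and chat model intended for English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported.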

Input

Property	Details
Input type	Text
Input format	String
Input parameters	One-dimensional (1D)
Other input-related properties	Context length up to 131,072 tokens
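
As a rough sketch of the arithmetic behind that limit: 131,072 tokens is 128 × 1024, and in a decoder-only model the prompt and the generated tokens share the same window. The helper names below are hypothetical, and the 32,768-token generation cap is taken from the quick-start snippets rather than being a requirement of the model.

```python
MAX_CONTEXT_TOKENS = 131_072  # context length from the table above (128 * 1024)
MAX_NEW_TOKENS = 32_768       # max_new_tokens value used in the quick-start snippets


def prompt_budget(max_context: int = MAX_CONTEXT_TOKENS,
                  max_new_tokens: int = MAX_NEW_TOKENS) -> int:
    """Return how many prompt tokens remain once generation space is reserved."""
    if max_new_tokens > max_context:
        raise ValueError("max_new_tokens cannot exceed the context length")
    return max_context - max_new_tokens


def fits(prompt_tokens: int) -> bool:
    """Check whether a prompt of the given token count leaves room to generate."""
    return prompt_tokens <= prompt_budget()


if __name__ == "__main__":
    print(prompt_budget())  # tokens left for the prompt under these assumptions
```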

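Output

Property	Details
Output type	Text
Output format	String
Output parameters	One-dimensional (1D)
Other output-related properties	Context length up to 131,072 tokens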

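Model version

1.0 (March 18, 2025)

Software integration

Property	Details
Runtime engine	NeMo 24.12
Recommended hardware microarchitecture compatibility	NVIDIA Hopper, NVIDIA Ampere

Inference

Property	Details
Inference engine	Transformers
Test hardware	BF16: 1x RTX 50 Series GPU, 1x RTX 40 Series GPU, 1x RTX 30 Series GPU, 1x H100-80GB GPU, 1x A100-80GB GPU
Preferred/supported operating system	Linux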

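Training datasets

The post-training pipeline used a wide variety of training data, including manually annotated data and synthetic data.

The data for the multi-stage post-training phases that improve code, math, and reasoning is a compilation of SFT and RL data. It supports improvements to the math, code, general reasoning, and instruction-following capabilities of the original Llama instruct model.

Prompts were either sourced from public and open corpora or generated synthetically. Responses were generated synthetically by a variety of models. Some prompts include responses for both reasoning on and reasoning off modes, to train the model to distinguish between the two.

Data collection for training datasets: hybrid (automated, human, synthetic)

Data labeling for training datasets: N/A

Evaluation datasets

We used the datasets listed below to evaluate Llama-3.1-Nemotron-Nano-8B-v1.

Data collection for evaluation datasets: hybrid (human, synthetic)

Data labeling for evaluation datasets: hybrid (human, synthetic, automatic)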

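Evaluation results

These results cover both "reasoning on" and "reasoning off" modes. We recommend temperature 0.6 and top_p 0.95 for "reasoning on" mode, and greedy decoding for "reasoning off" mode. All evaluations use a 32k sequence length. Each benchmark is run up to 16 times and the scores are averaged for accuracy.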
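
The recommended decoding settings and the run averaging can be sketched as follows. This is an illustration under stated assumptions, not the evaluation harness used for the reported numbers. The helper names are hypothetical, `do_sample=True` is added on the assumption that sampling has to be enabled explicitly, and `max_new_tokens=32768` is borrowed from the quick-start snippets.

```python
from statistics import mean
from typing import Any, Dict, Sequence


def generation_kwargs(thinking: bool, max_new_tokens: int = 32768) -> Dict[str, Any]:
    """Return decoding settings for the given reasoning mode (hypothetical helper)."""
    if thinking:
        # Reasoning on: sample with the recommended temperature and top_p.
        return {
            "do_sample": True,  # assumption: sampling is enabled explicitly
            "temperature": 0.6,
            "top_p": 0.95,
            "max_new_tokens": max_new_tokens,
        }
    # Reasoning off: greedy decoding.
    return {"do_sample": False, "max_new_tokens": max_new_tokens}


def averaged_score(run_scores: Sequence[float], max_runs: int = 16) -> float:
    """Average per-run benchmark scores, using at most max_runs runs."""
    if not run_scores:
        raise ValueError("at least one run score is required")
    return mean(run_scores[:max_runs])


if __name__ == "__main__":
    print(generation_kwargs(thinking=True))
    print(averaged_score([0.5, 1.0]))
```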

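⚠️ Important note

Where applicable, a prompt template is provided. When running a benchmark, make sure you parse the output in the format the prompt requests, so that you can reproduce the results below.

MT-Bench

Reasoning mode	Score
Reasoning off	7.9
Reasoning on	8.1

MATH500

Reasoning mode	pass@1
Reasoning off	36.6%
Reasoning on	95.4%

User prompt template: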

"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.\nQuestion: {question}"
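
A minimal sketch of filling this template and reading the answer back out of `\boxed{}` is shown below. The function names are hypothetical and this is not the parser used for the reported scores. Note that `str.format` cannot be used on the template as written, because the literal `{}` inside `\boxed{}` would be treated as a replacement field.

```python
from typing import Optional

MATH_TEMPLATE = (
    "Below is a math question. I want you to reason through the steps and then "
    "give a final answer. Your final answer should be in \\boxed{}.\n"
    "Question: {question}"
)


def build_math_prompt(question: str) -> str:
    """Fill the template by direct substitution of the {question} placeholder."""
    return MATH_TEMPLATE.replace("{question}", question)


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


if __name__ == "__main__":
    print(build_math_prompt("What is 1 + 1?"))
    print(extract_boxed("Therefore the answer is \\boxed{\\frac{1}{2}}."))
```

Tracking brace depth, rather than using a simple regular expression, keeps nested LaTeX such as `\frac{1}{2}` intact.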

AIME25

Reasoning mode	pass@1
Reasoning off	0%
Reasoning on	47.1%

User prompt template:

"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.\nQuestion: {question}"

GPQA-D

Reasoning mode	pass@1
Reasoning off	39.4%
Reasoning on	54.1%

User prompt template:

"What is the correct answer to this question: {question}\nChoices:\nA. {option_A}\nB. {option_B}\nC. {option_C}\nD. {option_D}\nLet's think step by step, and put the final answer (should be a single letter A, B, C, or D) into a \boxed{}"
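
For the multiple-choice template, the four options and the question can be filled in a single pass, and the chosen letter read back from `\boxed{}`. The sketch below uses hypothetical helper names and is only an illustration of the template, not the scoring code behind the table.

```python
import re
from typing import Optional, Sequence

GPQA_TEMPLATE = (
    "What is the correct answer to this question: {question}\n"
    "Choices:\nA. {option_A}\nB. {option_B}\nC. {option_C}\nD. {option_D}\n"
    "Let's think step by step, and put the final answer (should be a single "
    "letter A, B, C, or D) into a \\boxed{}"
)


def build_gpqa_prompt(question: str, options: Sequence[str]) -> str:
    """Fill the template in one pass so inserted text is never re-substituted."""
    if len(options) != 4:
        raise ValueError("exactly four options are required")
    fields = {"question": question}
    for letter, option in zip("ABCD", options):
        fields[f"option_{letter}"] = option
    # Only the named placeholders match; the literal {} in \boxed{} is untouched.
    return re.sub(r"\{(question|option_[A-D])\}",
                  lambda m: fields[m.group(1)], GPQA_TEMPLATE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last boxed letter A-D in the response, or None."""
    matches = re.findall(r"\\boxed\{\s*([A-D])\s*\}", response)
    return matches[-1] if matches else None


if __name__ == "__main__":
    print(build_gpqa_prompt("Which colour is the sky?", ["red", "blue", "green", "black"]))
    print(extract_choice("Reasoning... so the answer is \\boxed{B}"))
```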

IFEval Average

Reasoning mode	Strict:Prompt	Strict:Instruction
Reasoning off	74.7%	82.1%
Reasoning on	71.9%	79.3%

BFCL v2 Live

Reasoning mode	Score
Reasoning off	63.9%
Reasoning on	63.6%

User prompt template:

<AVAILABLE_TOOLS>{functions}</AVAILABLE_TOOLS>

{user_prompt}

MBPP 0-shot

Reasoning mode	pass@1
Reasoning off	66.1%
Reasoning on	84.6%

User prompt template:

You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
Here is the given problem and test examples:
{prompt}
Please use the python programming language to solve this problem.
Please make sure that your code includes the functions from the test samples and that the input and output formats of these functions match the test samples.
Please return all completed codes in one code block.
This code block should be in the following format:
```python
# Your codes here
```


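### Ethical considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure the model meets the requirements of the relevant industry and use case, and to address unforeseen product misuse.

For more detailed information on ethical considerations for this model, see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety & Security](safety.md), and [Privacy](privacy.md) subcards.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).

### Citation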

@misc{bercovich2025llamanemotronefficientreasoningmodels,
  title={Llama-Nemotron: Efficient Reasoning Models},
  author={Akhiad Bercovich and Itay Levy and Izik Golan and Mohammad Dabbah and Ran El-Yaniv and Omri Puny and Ido Galil and Zach Moshe and Tomer Ronen and Najeeb Nabwani and Ido Shahaf and Oren Tropp and Ehud Karpas and Ran Zilberstein and Jiaqi Zeng and Soumye Singhal and Alexander Bukharin and Yian Zhang and Tugrul Konuk and Gerald Shen and Ameya Sunil Mahabaleshwarkar and Bilal Kartal and Yoshi Suhara and Olivier Delalleau and Zijia Chen and Zhilin Wang and David Mosallanezhad and Adi Renduchintala and Haifeng Qian and Dima Rekesh and Fei Jia and Somshubra Majumdar and Vahid Noroozi and Wasi Uddin Ahmad and Sean Narenthiran and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Igor Gitman and Ivan Moshkov and Wei Du and Shubham Toshniwal and George Armstrong and Branislav Kisacanin and Matvei Novikov and Daria Gitman and Evelina Bakhturina and Jane Polak Scowcroft and John Kamalu and Dan Su and Kezhi Kong and Markus Kliegl and Rabeeh Karimi and Ying Lin and Sanjeev Satheesh and Jupinder Parmar and Pritam Gundecha and Brandon Norick and Joseph Jennings and Shrimai Prabhumoye and Syeda Nahida Akter and Mostofa Patwary and Abhinav Khattar and Deepak Narayanan and Roger Waleffe and Jimmy Zhang and Bor-Yiing Su and Guyue Huang and Terry Kong and Parth Chadha and Sahil Jain and Christine Harvey and Elad Segal and Jining Huang and Sergey Kashirsky and Robert McQueen and Izzy Putterman and George Lam and Arun Venkatesan and Sherry Wu and Vinh Nguyen and Manoj Kilaru and Andrew Wang and Anna Warno and Abhilash Somasamudramath and Sandip Bhaskar and Maka Dong and Nave Assaf and Shahar Mor and Omer Ullman Argov and Scot Junkin and Oleksandr Romanenko and Pedro Larroy and Monika Katariya and Marco Rovinelli and Viji Balas and Nicholas Edelman and Anahita Bhiwandiwalla and Muthu Subramaniam and Smita Ithape and Karthik Ramamoorthy and Yuting Wu and Suguna Varshini Velury and Omri Almog and Joyjit Daw and Denys Fridman and Erick Galinkin and Michael Evans and Katherine Luna and Leon Derczynski and Nikki Pope and Eileen Long and Seth Schneider and Guillermo Siman and Tomasz Grzegorzek and Pablo Ribalta and Monika Katariya and Joey Conway and Trisha Saar and Ann Guan and Krzysztof Pawelec and Shyamala Prayaga and Oleksii Kuchaiev and Boris Ginsburg and Oluwatobi Olabiyi and Kari Briski and Jonathan Cohen and Bryan Catanzaro and Jonah Alben and Yonatan Geifman and Eric Chung and Chris Alexiuk},
  year={2025},
  eprint={2505.00949},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.00949},
}


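## 🔧 Technical details
The model went through a multi-phase post-training process consisting of a supervised fine-tuning stage and reinforcement learning stages. Supervised fine-tuning targets math, code, reasoning, and tool calling. The reinforcement learning stages use the REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for chat and instruction following. The final model checkpoint is obtained by merging the final SFT and Online RPO checkpoints. Improved using Qwen.

## 📄 License
Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). For additional information, see the [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/).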