Llama-3.1-Nemotron-Nano-4B-v1.1 Open-Source Language Model - Runs Locally on a Single GPU, with More Efficient Reasoning and Task Execution

Llama 3.1 Nemotron Nano 4B V1.1

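Published by unsloth (model developed by NVIDIA)

Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model compressed from Llama 3.1 8B. It is optimized for reasoning and task-execution efficiency and fits on a single RTX GPU for local use.

Large Language Model

Transformers

Language: English | License: Other | Tags: #reasoning-optimization #tool-calling #single-GPU-deployment

Downloads: 219

Released: 5/21/2025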

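Model Overview

The model goes through a multi-stage post-training process that strengthens both its reasoning and non-reasoning capabilities: supervised fine-tuning for math, code, reasoning, and tool calling, followed by reinforcement learning for chat and instruction following.

Model Features

Efficient inference

Optimized with LLM compression techniques to fit on a single RTX GPU and run locally.

Multi-stage training

Combines supervised fine-tuning (SFT) and reinforcement learning (RL) to improve performance on math, code, reasoning, and chat tasks.

Long-context support

Supports a context length of up to 131,072 tokens, which suits long-document tasks.

Tool-calling support

Includes a tool-call parser that supports dynamic tool selection and execution.

Model Capabilities

Text generation

Math reasoning

Code generation

Tool calling

Multilingual support

Instruction following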

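Use Cases

AI agent systems

Chatbots

Builds efficient conversational systems with natural-language interaction.

Scores 8.0 on the MT-Bench benchmark (reasoning on).

RAG systems

Supports retrieval-augmented generation for knowledge-intensive applications.

Education

Math problem solving

Solves complex math problems such as equations and proofs.

Reaches 96.2% pass@1 on the MATH500 benchmark (reasoning on).

Developer tools

Code generation

Generates executable Python code from natural-language descriptions.

Reaches 85.8% pass@1 on the MBPP 0-shot benchmark (reasoning on).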

🚀 Llama-3.1-Nemotron-Nano-4B-v1.1

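Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quantization methods. This model is a large language model that strikes a good balance between accuracy and efficiency and suits a wide range of AI applications.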

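🚀 Quick Start

Quick start and usage recommendations

Reasoning mode (on/off) is controlled through the system prompt, which must be set as shown in the examples below. All instructions should go in the user prompt.
For reasoning-on mode, we recommend a temperature of 0.6 and a Top P of 0.95.
For reasoning-off mode, we recommend greedy decoding.
For each benchmark that requires a specific template, we provide the list of prompts used for evaluation.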
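
The recommendations above can be collected into one small helper. This is a minimal sketch: `build_request` is a hypothetical name, not part of transformers or the model repository, and only the system prompt string, the temperature of 0.6, the Top P of 0.95, and greedy decoding for reasoning-off come from the text. Setting `do_sample=True` explicitly for reasoning-on is an assumption.

```python
from typing import Any


def build_request(user_prompt: str, thinking: bool) -> dict[str, Any]:
    """Return chat messages plus generation kwargs for one reasoning mode.

    Hypothetical helper: it only packages the settings recommended above.
    """
    mode = "on" if thinking else "off"
    messages = [
        # The system prompt carries nothing but the reasoning switch.
        {"role": "system", "content": f"detailed thinking {mode}"},
        # All actual instructions go in the user turn.
        {"role": "user", "content": user_prompt},
    ]
    if thinking:
        # Reasoning on: sample with the recommended temperature and Top P.
        # do_sample=True is made explicit here (an assumption).
        generation_kwargs = {"do_sample": True, "temperature": 0.6, "top_p": 0.95}
    else:
        # Reasoning off: greedy decoding, so sampling is disabled.
        generation_kwargs = {"do_sample": False}
    return {"messages": messages, "generation_kwargs": generation_kwargs}
```

The returned `messages` can be passed to a `text-generation` pipeline, with `generation_kwargs` unpacked as keyword arguments of the same call.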

Our code requires version 4.44.2 or higher of the transformers package.
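
One way to check this requirement before loading the model is sketched below. `release_tuple` and `meets_minimum` are hypothetical helpers written for illustration; they compare only the numeric release components and ignore pre-release suffixes such as `rc1`, which is a simplification.

```python
import re

MIN_TRANSFORMERS = "4.44.2"  # minimum version stated above


def release_tuple(version: str) -> tuple[int, ...]:
    """Turn '4.45.0.dev0' into (4, 45, 0), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True when the installed release is at least the required minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        installed = version("transformers")
    except PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old"
        print(f"transformers {installed}: {status}")
```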

Example Code

Example: reasoning on

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = transformers.pipeline(
   "text-generation",
   model=model_id,
   tokenizer=tokenizer,
   max_new_tokens=32768,
   temperature=0.6,
   top_p=0.95,
   **model_kwargs
)

# Thinking can be "on" or "off"
thinking = "on"

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))

Example: reasoning off

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = transformers.pipeline(
   "text-generation",
   model=model_id,
   tokenizer=tokenizer,
   max_new_tokens=32768,
   do_sample=False,
   **model_kwargs
)

# Thinking can be "on" or "off"
thinking = "off"

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))

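Example: preventing the model from reasoning

Pre-filling the assistant response with an empty <think> block, as the last message does here, stops the model from reasoning before it answers.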

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Thinking can be "on" or "off"
thinking = "off"

pipeline = transformers.pipeline(
   "text-generation",
   model=model_id,
   tokenizer=tokenizer,
   max_new_tokens=32768,
   do_sample=False,
   **model_kwargs
)

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}, {"role":"assistant", "content":"<think>\n</think>"}]))

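Running a vLLM server with tool-calling support

Llama-3.1-Nemotron-Nano-4B-v1.1 supports tool calling. The HF repository hosts a tool-call parser and a Jinja chat template that can be used to launch a vLLM server.

Example: launching the vLLM server with Docker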

#!/bin/bash

CWD=$(pwd)
PORT=5000
git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1
docker run -it --rm \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -p ${PORT}:${PORT} \
    -v ${CWD}:${CWD} \
    vllm/vllm-openai:v0.6.6 \
    --model $CWD/Llama-3.1-Nemotron-Nano-4B-v1.1 \
    --trust-remote-code \
    --seed 1 \
    --host "0.0.0.0" \
    --port $PORT \
    --served-model-name "Llama-Nemotron-Nano-4B-v1.1" \
    --tensor-parallel-size 1 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --enable-auto-tool-choice \
    --tool-parser-plugin "${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py" \
    --tool-call-parser "llama_nemotron_json" \
    --chat-template "${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja"

Example: launching the vLLM server from a virtual environment

$ git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1

$ conda create -n vllm python=3.12 -y
$ conda activate vllm
$ pip install vllm  # install step assumed; the original listing omits it

$ python -m vllm.entrypoints.openai.api_server \
  --model Llama-3.1-Nemotron-Nano-4B-v1.1 \
  --trust-remote-code \
  --seed 1 \
  --host "0.0.0.0" \
  --port 5000 \
  --served-model-name "Llama-Nemotron-Nano-4B-v1.1" \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-parser-plugin "Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py" \
  --tool-call-parser "llama_nemotron_json" \
  --chat-template "Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja"

Example: calling the vLLM server with tool calling

>>> from openai import OpenAI
>>> client = OpenAI(
        base_url="http://0.0.0.0:5000/v1",
        api_key="dummy",
    )

>>> completion = client.chat.completions.create(
      model="Llama-Nemotron-Nano-4B-v1.1",
      messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "My bill is $100. What will be the amount for 18% tip?"},
      ],
      tools=[
        {"type": "function", "function": {"name": "calculate_tip", "parameters": {"type": "object", "properties": {"bill_total": {"type": "integer", "description": "The total amount of the bill"}, "tip_percentage": {"type": "integer", "description": "The percentage of tip to be applied"}}, "required": ["bill_total", "tip_percentage"]}}},
        {"type": "function", "function": {"name": "convert_currency", "parameters": {"type": "object", "properties": {"amount": {"type": "integer", "description": "The amount to be converted"}, "from_currency": {"type": "string", "description": "The currency code to convert from"}, "to_currency": {"type": "string", "description": "The currency code to convert to"}}, "required": ["from_currency", "amount", "to_currency"]}}},
      ],
    )

>>> completion.choices[0].message.content
'<think>\nOkay, let\'s see. The user has a bill of $100 and wants to know the amount of a 18% tip. So, I need to calculate the tip amount. The available tools include calculate_tip, which requires bill_total and tip_percentage. The parameters are both integers. The bill_total is 100, and the tip percentage is 18. So, the function should multiply 100 by 18% and return 18.0. But wait, maybe the user wants the total including the tip? The question says "the amount for 18% tip," which could be interpreted as the tip amount itself. Since the function is called calculate_tip, it\'s likely that it\'s designed to compute the tip, not the total. So, using calculate_tip with bill_total=100 and tip_percentage=18 should give the correct result. The other function, convert_currency, isn\'t relevant here. So, I should call calculate_tip with those values.\n</think>\n\n'

>>> completion.choices[0].message.tool_calls
[ChatCompletionMessageToolCall(id='chatcmpl-tool-2972d86817344edc9c1e0f9cd398e999', function=Function(arguments='{"bill_total": 100, "tip_percentage": 18}', name='calculate_tip'), type='function')]
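
The server only returns the tool call; running it is left to the client. The following is a minimal sketch of that last step. The `TOOLS` registry, `run_tool_call`, and the body of `calculate_tip` are illustrative assumptions, since the request above declares only the tool's name and parameters.

```python
import json
from typing import Any, Callable


def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Local stand-in for the `calculate_tip` tool declared in the request."""
    return bill_total * tip_percentage / 100


# Hypothetical registry mapping tool names to local Python callables.
TOOLS: dict[str, Callable[..., Any]] = {"calculate_tip": calculate_tip}


def run_tool_call(name: str, arguments: str) -> Any:
    """Decode the JSON argument string and invoke the matching local function."""
    if name not in TOOLS:
        raise KeyError(f"model requested an unknown tool: {name}")
    kwargs = json.loads(arguments)
    return TOOLS[name](**kwargs)


# Values copied from the tool call printed above.
result = run_tool_call("calculate_tip", '{"bill_total": 100, "tip_percentage": 18}')
print(result)  # prints 18.0
```

With a live response, the same function would take `call.function.name` and `call.function.arguments` for each `call` in `completion.choices[0].message.tool_calls`.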

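✨ Key Features

High performance: Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quantization methods.
Balanced accuracy and efficiency: The model strikes a good balance between accuracy and efficiency; it fits on a single RTX GPU and can be used locally.
Multilingual support: Supports English and coding languages, as well as other non-English languages such as German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Reasoning mode control: Reasoning mode (on/off) can be switched through the system prompt.
Tool-calling support: Supports tool calling for more complex tasks.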
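
With reasoning on, the response text carries the model's reasoning inside a `<think>...</think>` block ahead of the answer, as in the sample response in the quick start. A minimal sketch for separating the two follows; `split_reasoning` is a hypothetical helper, and it assumes the response contains at most one such block.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a response that may contain a <think> block."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # Reasoning-off responses may have no block at all.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the block is treated as the user-facing answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer
```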

📚 Detailed Documentation

Model Overview

[Figure: Accuracy Comparison Plot]

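Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) derived from nvidia/Llama-3.1-Minitron-4B-Width-Base, which was created from Llama 3.1 8B using our LLM compression technique and offers improvements in model accuracy and efficiency. It is a reasoning model that has been post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling.

The model offers a good tradeoff between accuracy and efficiency. It fits on a single RTX GPU, can be used locally, and supports a context length of 128K.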
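
A 128K context works out to 128 × 1,024 = 131,072 tokens, the figure quoted elsewhere on this page. The sketch below shows how that budget might be checked before a request. It assumes that prompt tokens and generated tokens share the same window, which is the usual behavior for decoder-only models; both function names are hypothetical.

```python
MAX_CONTEXT_TOKENS = 128 * 1024  # 128K context = 131,072 tokens


def fits_in_context(prompt_tokens: int, max_new_tokens: int,
                    limit: int = MAX_CONTEXT_TOKENS) -> bool:
    """True when the prompt plus the generation budget stays inside the window."""
    return prompt_tokens + max_new_tokens <= limit


def max_prompt_tokens(max_new_tokens: int, limit: int = MAX_CONTEXT_TOKENS) -> int:
    """Largest prompt that still leaves room for `max_new_tokens` generated tokens."""
    return max(limit - max_new_tokens, 0)


# With the max_new_tokens=32768 used in the example code on this page:
print(max_prompt_tokens(32768))  # prints 98304
```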

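The model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for math, code, reasoning, and tool calling, as well as multiple reinforcement learning (RL) stages using the Reward-aware Preference Optimization (RPO) algorithm for chat and instruction following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints.

This model is part of the Llama Nemotron collection.

The model is ready for commercial use.

License/Terms of Use

Governing terms: Your use of this model is governed by the NVIDIA Open Model License. Additional information: Llama 3.1 Community License Agreement. Built with Llama.

Model developer: NVIDIA

Model training dates: August 2024 to May 2025

Data freshness: The pretraining data has a cutoff of June 2023.

Use Cases

Intended for developers designing AI agent systems, chatbots, RAG systems, and other AI-powered applications, and also suitable for typical instruction-following tasks. The model balances accuracy and compute efficiency (it fits on a single RTX GPU and can be used locally).

Release Date

May 20, 2025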

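Model Architecture

Property	Details
Architecture type	Dense decoder-only Transformer model
Network architecture	Llama 3.1 Minitron Width 4B Base

Intended Use

Llama-3.1-Nemotron-Nano-4B-v1.1 is a general-purpose reasoning and chat model intended for use in English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported.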

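Input

Property	Details
Input type	Text
Input format	String
Input parameters	One-dimensional (1D)
Other properties related to input	Context length of up to 131,072 tokens

Output

Property	Details
Output type	Text
Output format	String
Output parameters	One-dimensional (1D)
Other properties related to output	Context length of up to 131,072 tokens

Model Version

1.1 (May 20, 2025)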

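Software Integration

Property	Details
Runtime engine	NeMo 24.12
Recommended hardware microarchitecture compatibility	NVIDIA Hopper, NVIDIA Ampere

Inference

Property	Details
Inference engine	Transformers
Test hardware	BF16: 1x RTX 50 Series GPU, 1x RTX 40 Series GPU, 1x RTX 30 Series GPU, 1x H100-80GB GPU, 1x A100-80GB GPU
Preferred/supported operating system(s)	Linux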

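Training Datasets

The post-training pipeline uses a large amount of training data, including manually annotated data and synthetic data.

The data for the multi-stage post-training phases targeting code, math, and reasoning is a compilation of SFT and RL data that supports improvements in the math, code, general reasoning, and instruction-following capabilities of the original Llama instruct model.

Prompts were sourced from public and open corpora or synthetically generated. Responses were synthetically generated by a variety of models. Some prompts contain responses for both reasoning-on and reasoning-off modes, to train the model to distinguish between the two.

Property	Details
Data collection for the training datasets	Hybrid: automated, human, synthetic
Data labeling for the training datasets	N/A

Evaluation Datasets

We used the following datasets to evaluate Llama-3.1-Nemotron-Nano-4B-v1.1.

Property	Details
Data collection for the evaluation datasets	Hybrid: human/synthetic
Data labeling for the evaluation datasets	Hybrid: human/synthetic/automatic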

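Evaluation Results

These results cover both "reasoning on" and "reasoning off" modes. We recommend a temperature of 0.6 and a Top P of 0.95 for "reasoning on" mode, and greedy decoding for "reasoning off" mode. All evaluations were run with a sequence length of 32k. We run each benchmark up to 16 times and average the scores for greater accuracy.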
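
The text does not spell out how the repeated runs are averaged. One plausible reading, sketched here as an assumption rather than the authors' exact procedure, is to compute the single-attempt accuracy of each run and then take the mean across runs. `pass_at_1` is a hypothetical name and the sample data is made up for illustration.

```python
from statistics import mean


def pass_at_1(run_results: list[list[bool]]) -> float:
    """Average single-attempt accuracy over repeated benchmark runs.

    run_results[i][j] is True when run i answered problem j correctly.
    """
    if not run_results:
        raise ValueError("need at least one run")
    # Accuracy of each run, then the mean across runs.
    per_run_accuracy = [mean(1.0 if ok else 0.0 for ok in run) for run in run_results]
    return mean(per_run_accuracy)


# Two illustrative runs over four problems, each with three correct answers.
runs = [
    [True, False, True, True],
    [True, True, False, True],
]
print(pass_at_1(runs))  # prints 0.75
```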

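⚠️ Important note

Where applicable, a prompt template is provided. When running a benchmark, make sure to parse the output in the format the provided prompt asks for in order to reproduce the results below.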

MT-Bench

Reasoning mode	Score
Reasoning off	7.4
Reasoning on	8.0

MATH500

Reasoning mode	pass@1
Reasoning off	71.8%
Reasoning on	96.2%

User prompt template:

"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.\nQuestion: {question}"

AIME25

Reasoning mode	pass@1
Reasoning off	13.3%
Reasoning on	46.3%

User prompt template:

"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.\nQuestion: {question}"

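The math template used for MATH500 and AIME25 contains literal braces in `\boxed{}`, so filling it with `str.format` would fail. The sketch below substitutes only the `{question}` placeholder and then pulls the final boxed answer out of a response. Both helper names are hypothetical, and reading the `\n` in the template as a newline is an assumption.

```python
from typing import Optional

MATH_TEMPLATE = (
    "Below is a math question. I want you to reason through the steps and then "
    "give a final answer. Your final answer should be in \\boxed{}.\n"
    "Question: {question}"
)


def build_math_prompt(question: str) -> str:
    # str.format would trip over the literal braces in \boxed{}, so only the
    # {question} placeholder is substituted.
    return MATH_TEMPLATE.replace("{question}", question)


def extract_boxed(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in the response, if any."""
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    index = start + len(marker)
    depth = 1  # brace depth, so nested braces such as \frac{1}{2} survive
    content = []
    while index < len(response):
        char = response[index]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return "".join(content)
        content.append(char)
        index += 1
    return None  # unbalanced braces
```

Taking the last boxed expression is a design choice: reasoning traces often box intermediate values before the final answer.
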
GPQA-D

Reasoning mode	pass@1
Reasoning off	33.8%
Reasoning on	55.1%

User prompt template:

"What is the correct answer to this question: {question}\nChoices:\nA. {option_A}\nB. {option_B}\nC. {option_C}\nD. {option_D}\nLet's think step by step, and put the final answer (should be a single letter A, B, C, or D) into a \boxed{}"

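A similar sketch for the GPQA-D template fills in the question and the four options, then reads the chosen letter back from the response. The helper names are hypothetical, and the regular expression assumes the model follows the instruction to box a single letter.

```python
import re
from typing import Optional

GPQA_TEMPLATE = (
    "What is the correct answer to this question: {question}\n"
    "Choices:\n"
    "A. {option_A}\nB. {option_B}\nC. {option_C}\nD. {option_D}\n"
    "Let's think step by step, and put the final answer "
    "(should be a single letter A, B, C, or D) into a \\boxed{}"
)


def build_gpqa_prompt(question: str, options: list[str]) -> str:
    """Fill the template; plain replacement avoids the braces in \\boxed{}."""
    if len(options) != 4:
        raise ValueError("expected exactly four options")
    prompt = GPQA_TEMPLATE.replace("{question}", question)
    for letter, option in zip("ABCD", options):
        prompt = prompt.replace("{option_" + letter + "}", option)
    return prompt


def extract_choice(response: str) -> Optional[str]:
    """Return the last boxed letter A-D, or None when none is found."""
    matches = re.findall(r"\\boxed\{\s*([ABCD])\s*\}", response)
    return matches[-1] if matches else None
```
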
IFEval

Reasoning mode	Strict: prompt	Strict: instruction
Reasoning off	70.1%	78.5%
Reasoning on	75.5%	82.6%

BFCL v2 Live

Reasoning mode	Score
Reasoning off	63.6%
Reasoning on	67.9%

User prompt template:

<AVAILABLE_TOOLS>{functions}</AVAILABLE_TOOLS>

{user_prompt}
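
The template does not say how `{functions}` should be serialized. The sketch below assumes a JSON array of function definitions, which is only one plausible choice; `build_bfcl_prompt` is a hypothetical helper.

```python
import json


def build_bfcl_prompt(functions: list[dict], user_prompt: str) -> str:
    """Wrap the tool list in <AVAILABLE_TOOLS> tags, then append the user prompt."""
    # Assumption: the tool list is serialized as a JSON array.
    serialized = json.dumps(functions)
    return f"<AVAILABLE_TOOLS>{serialized}</AVAILABLE_TOOLS>\n\n{user_prompt}"


example = build_bfcl_prompt([{"name": "calculate_tip"}], "What is an 18% tip on $100?")
print(example)
```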

MBPP 0-shot

Reasoning mode	pass@1
Reasoning off	61.9%
Reasoning on	85.8%

User prompt template:

You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
Here is the given problem and test examples:
{prompt}
Please use the python programming language to solve this problem.
Please make sure that your code includes the functions from the test samples and that the input and output formats of these functions match the test samples.
Please return all completed codes in one code block.
This code block should be in the following format:
```python
# Your codes here
```
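
Because the template asks for a single fenced Python block, scoring a response means extracting that block first. A minimal sketch follows; `extract_python_block` is a hypothetical helper, and falling back to the last block when several appear is an assumption.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example
CODE_BLOCK = re.compile(FENCE + r"python[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_block(response: str) -> Optional[str]:
    """Return the body of the last fenced python block, or None if there is none."""
    blocks = CODE_BLOCK.findall(response)
    return blocks[-1].strip() if blocks else None


sample = "Here is the solution:\n" + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
print(extract_python_block(sample))
```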