Llama-3.1-Nemotron-Nano-4B-v1.1 Open-Source Language Model - Runs Locally on a Single GPU, with More Efficient Reasoning and Task Execution

Llama 3.1 Nemotron Nano 4B V1.1

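Distributed by unsloth (original model developed by NVIDIA)

Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model compressed from Llama 3.1 8B. It is optimized for reasoning and efficient task execution and can run locally on a single RTX GPU.

Large language model

Transformers

Language: English
License: Other
Tags: #reasoning-optimization #tool-calling #single-GPU-deployment

Downloads: 219

Released: 5/21/2025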

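Model Summary

The model goes through a multi-stage post-training process that strengthens both its reasoning and non-reasoning capabilities: supervised fine-tuning for math, code, reasoning, and tool calling, followed by reinforcement learning for chat and instruction following.

Model Features

Efficient inference

Optimized with LLM compression so that it fits on a single RTX GPU and can run locally.

Multi-stage training

Combines supervised fine-tuning (SFT) and reinforcement learning (RL) to improve performance on math, code, reasoning, and chat tasks.

Long-context support

Supports a context length of up to 131,072 tokens, which suits long-document tasks.

Tool-calling support

Ships with a tool-call parser and supports dynamic tool selection and execution.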

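Model Capabilities

Text generation

Mathematical reasoning

Code generation

Tool calling

Multilingual support

Instruction following

Use Cases

AI agent systems

Chatbots

Builds efficient conversational systems with natural-language interaction.

Scores 8.0 on MT-Bench (reasoning on).

RAG systems

Supports retrieval-augmented generation for knowledge-intensive applications.

Education

Math problem solving

Solves complex math problems such as equation solving and proofs.

Reaches 96.2% pass@1 on MATH500 (reasoning on).

Developer tools

Code generation

Generates executable Python code from natural-language descriptions.

Reaches 85.8% pass@1 on MBPP 0-shot (reasoning on).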

🚀 Llama-3.1-Nemotron-Nano-4B-v1.1

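Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quantization methods. The model itself is a large language model that balances accuracy against efficiency and suits a wide range of AI applications.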

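🚀 Quick Start

Quick start and usage recommendations

The reasoning mode (on/off) is controlled through the system prompt, which must be set exactly as in the examples below. All instructions should go in the user prompt.
For reasoning-on mode, the recommended settings are temperature 0.6 and Top P 0.95.
For reasoning-off mode, greedy decoding is recommended.
For each benchmark that needs a specific template, the prompt used for evaluation is provided.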
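
The decoding recommendations above can be captured in a small helper. This is a minimal sketch: `generation_kwargs` is a hypothetical name, and `do_sample=True` is added here on the assumption that sampling must be enabled explicitly for `temperature` and `top_p` to take effect.

```python
def generation_kwargs(thinking: str) -> dict:
    """Return decoding settings for the given reasoning mode ("on" or "off")."""
    if thinking == "on":
        # Recommended for reasoning-on: temperature 0.6, top-p 0.95.
        # do_sample=True is an explicit assumption so that sampling is active.
        return {"do_sample": True, "temperature": 0.6, "top_p": 0.95}
    if thinking == "off":
        # Recommended for reasoning-off: greedy decoding.
        return {"do_sample": False}
    raise ValueError(f'thinking must be "on" or "off", got {thinking!r}')


print(generation_kwargs("on"))
print(generation_kwargs("off"))
```

The returned dictionary can be unpacked into a generation call in place of hard-coded sampling arguments.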

The code requires version 4.44.2 or later of the transformers package.
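
A quick way to guard against an older install is to compare version strings before loading the model. The sketch below is dependency-free; `parse_version` and `meets_minimum` are hypothetical helpers, and pre-release suffixes such as `.dev0` or `rc1` are simply ignored, which is a simplification.

```python
import re

MIN_TRANSFORMERS = (4, 44, 2)


def parse_version(version: str) -> tuple:
    """Turn '4.44.2' or '4.45.0.dev0' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. "dev0"
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: tuple = MIN_TRANSFORMERS) -> bool:
    """True when the installed version is at least the required minimum."""
    return parse_version(installed)[:3] >= minimum


print(meets_minimum("4.44.2"))
print(meets_minimum("4.43.0"))
```

In a real script the argument would be `transformers.__version__`.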

Example code

"Reasoning on" example

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = transformers.pipeline(
   "text-generation",
   model=model_id,
   tokenizer=tokenizer,
   max_new_tokens=32768,
   temperature=0.6,
   top_p=0.95,
   **model_kwargs
)

# Thinking can be "on" or "off"
thinking = "on"

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))

"Reasoning off" example

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = transformers.pipeline(
   "text-generation",
   model=model_id,
   tokenizer=tokenizer,
   max_new_tokens=32768,
   do_sample=False,
   **model_kwargs
)

# Thinking can be "on" or "off"
thinking = "off"

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}]))

Example: preventing the model from reasoning by pre-filling an empty think block

import torch
import transformers

model_id = "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "auto"}
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

# Thinking can be "on" or "off"
thinking = "off"

pipeline = transformers.pipeline(
   "text-generation",
   model=model_id,
   tokenizer=tokenizer,
   max_new_tokens=32768,
   do_sample=False,
   **model_kwargs
)

print(pipeline([{"role": "system", "content": f"detailed thinking {thinking}"}, {"role": "user", "content": "Solve x*(sin(x)+2)=0"}, {"role":"assistant", "content":"<think>\n</think>"}]))
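
The message lists in the three examples above differ only in the system prompt and in whether an assistant turn is pre-filled with an empty think block. A minimal sketch of a helper that builds them follows; `build_messages` and its parameter names are hypothetical.

```python
def build_messages(user_prompt: str, thinking: str = "on",
                   prefill_empty_think: bool = False) -> list:
    """Build the chat messages for one request.

    thinking: "on" or "off", sent as the system prompt "detailed thinking on/off".
    prefill_empty_think: append an assistant turn holding an empty think block,
    as in the last example above.
    """
    if thinking not in ("on", "off"):
        raise ValueError(f'thinking must be "on" or "off", got {thinking!r}')
    messages = [
        {"role": "system", "content": f"detailed thinking {thinking}"},
        {"role": "user", "content": user_prompt},
    ]
    if prefill_empty_think:
        messages.append({"role": "assistant", "content": "<think>\n</think>"})
    return messages


for message in build_messages("Solve x*(sin(x)+2)=0", "off", prefill_empty_think=True):
    print(message)
```

The resulting list can be passed to `pipeline(...)` exactly like the inline lists in the examples.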

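Running a vLLM server with tool-calling support

Llama-3.1-Nemotron-Nano-4B-v1.1 supports tool calling. The HF repository hosts a tool-call parser and a Jinja chat template that can be used to launch a vLLM server.

Example: launching the vLLM server with Docker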

#!/bin/bash

CWD=$(pwd)
PORT=5000
git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1
docker run -it --rm \
    --runtime=nvidia \
    --gpus all \
    --shm-size=16GB \
    -p ${PORT}:${PORT} \
    -v ${CWD}:${CWD} \
    vllm/vllm-openai:v0.6.6 \
    --model $CWD/Llama-3.1-Nemotron-Nano-4B-v1.1 \
    --trust-remote-code \
    --seed 1 \
    --host "0.0.0.0" \
    --port $PORT \
    --served-model-name "Llama-Nemotron-Nano-4B-v1.1" \
    --tensor-parallel-size 1 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95 \
    --enforce-eager \
    --enable-auto-tool-choice \
    --tool-parser-plugin "${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py" \
    --tool-call-parser "llama_nemotron_json" \
    --chat-template "${CWD}/Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja"

Example: launching the vLLM server from a virtual environment

$ git clone https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1

$ conda create -n vllm python=3.12 -y
$ conda activate vllm
$ pip install vllm==0.6.6  # same vLLM version as the Docker image above

$ python -m vllm.entrypoints.openai.api_server \
  --model Llama-3.1-Nemotron-Nano-4B-v1.1 \
  --trust-remote-code \
  --seed 1 \
  --host "0.0.0.0" \
  --port 5000 \
  --served-model-name "Llama-Nemotron-Nano-4B-v1.1" \
  --tensor-parallel-size 1 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-parser-plugin "Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_toolcall_parser.py" \
  --tool-call-parser "llama_nemotron_json" \
  --chat-template "Llama-3.1-Nemotron-Nano-4B-v1.1/llama_nemotron_nano_generic_tool_calling.jinja"
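
Either launch command exposes an OpenAI-compatible HTTP API on port 5000. As a dependency-free sketch, a chat request can be built with the standard library alone; `build_chat_request` is a hypothetical helper, and the request is only constructed here, not sent, because sending requires a running server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, messages: list, tools=None):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    body = {"model": model, "messages": messages}
    if tools:
        body["tools"] = tools
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": "Bearer dummy"},
        method="POST",
    )


request = build_chat_request(
    "http://0.0.0.0:5000/v1",
    "Llama-Nemotron-Nano-4B-v1.1",  # must match --served-model-name
    [
        {"role": "system", "content": "detailed thinking off"},
        {"role": "user", "content": "Hello"},
    ],
)
print(request.full_url)

# With the server running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```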

Example: calling the vLLM server with tool-calling support

>>> from openai import OpenAI
>>> client = OpenAI(
        base_url="http://0.0.0.0:5000/v1",
        api_key="dummy",
    )

>>> completion = client.chat.completions.create(
      model="Llama-Nemotron-Nano-4B-v1.1",
      messages=[
        {"role": "system", "content": "detailed thinking on"},
        {"role": "user", "content": "My bill is $100. What will be the amount for 18% tip?"},
      ],
      tools=[
        {"type": "function", "function": {"name": "calculate_tip", "parameters": {"type": "object", "properties": {"bill_total": {"type": "integer", "description": "The total amount of the bill"}, "tip_percentage": {"type": "integer", "description": "The percentage of tip to be applied"}}, "required": ["bill_total", "tip_percentage"]}}},
        {"type": "function", "function": {"name": "convert_currency", "parameters": {"type": "object", "properties": {"amount": {"type": "integer", "description": "The amount to be converted"}, "from_currency": {"type": "string", "description": "The currency code to convert from"}, "to_currency": {"type": "string", "description": "The currency code to convert to"}}, "required": ["from_currency", "amount", "to_currency"]}}},
      ],
    )

>>> completion.choices[0].message.content
'<think>\nOkay, let\'s see. The user has a bill of $100 and wants to know the amount of a 18% tip. So, I need to calculate the tip amount. The available tools include calculate_tip, which requires bill_total and tip_percentage. The parameters are both integers. The bill_total is 100, and the tip percentage is 18. So, the function should multiply 100 by 18% and return 18.0. But wait, maybe the user wants the total including the tip? The question says "the amount for 18% tip," which could be interpreted as the tip amount itself. Since the function is called calculate_tip, it\'s likely that it\'s designed to compute the tip, not the total. So, using calculate_tip with bill_total=100 and tip_percentage=18 should give the correct result. The other function, convert_currency, isn\'t relevant here. So, I should call calculate_tip with those values.\n</think>\n\n'

>>> completion.choices[0].message.tool_calls
[ChatCompletionMessageToolCall(id='chatcmpl-tool-2972d86817344edc9c1e0f9cd398e999', function=Function(arguments='{"bill_total": 100, "tip_percentage": 18}', name='calculate_tip'), type='function')]
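
The response above carries two things a client has to handle: the reasoning wrapped in `<think>` tags inside `content`, and a tool call whose `arguments` field is a JSON string. The sketch below separates the reasoning from the answer and dispatches the call to a local function. `split_reasoning`, `run_tool_call`, and the `calculate_tip` implementation are all hypothetical; the source only declares the tool's schema, not its body.

```python
import json


def split_reasoning(content: str) -> tuple:
    """Split '<think>...</think>answer' into (reasoning, answer)."""
    start = content.find("<think>")
    end = content.find("</think>")
    if start == -1 or end == -1 or end < start:
        return "", content.strip()
    reasoning = content[start + len("<think>"):end].strip()
    answer = content[end + len("</think>"):].strip()
    return reasoning, answer


def calculate_tip(bill_total: int, tip_percentage: int) -> float:
    """Hypothetical local implementation of the declared tool."""
    return bill_total * tip_percentage / 100


TOOLS = {"calculate_tip": calculate_tip}


def run_tool_call(name: str, arguments: str):
    """Decode the JSON argument string and call the matching local function."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**json.loads(arguments))


result = run_tool_call("calculate_tip", '{"bill_total": 100, "tip_percentage": 18}')
print(result)  # 18.0
```

With the OpenAI client, each entry of `completion.choices[0].message.tool_calls` would be passed as `run_tool_call(call.function.name, call.function.arguments)`.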

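✨ Key Features

High performance: Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quantization methods.
Balanced accuracy and efficiency: the model balances accuracy against efficiency, fits on a single RTX GPU, and can be used locally.
Multilingual support: supports English and a range of coding languages, plus other non-English languages such as German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
Reasoning mode control: the reasoning mode (on/off) is controlled through the system prompt.
Tool-calling support: supports tool calling for more complex tasks.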

📚 Detailed Documentation

Model Overview

[Figure: Accuracy Comparison Plot]

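Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) derived from nvidia/Llama-3.1-Minitron-4B-Width-Base. That base model was created from Llama 3.1 8B with NVIDIA's LLM compression technique, and offers improvements in model accuracy and efficiency. Llama-3.1-Nemotron-Nano-4B-v1.1 is a reasoning model post-trained for reasoning, human chat preferences, and tasks such as RAG and tool calling.

The model balances accuracy against efficiency, fits on a single RTX GPU, can be used locally, and supports a context length of 128K (131,072 tokens).

The model went through a multi-stage post-training process to strengthen both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for math, code, reasoning, and tool calling, as well as multiple reinforcement learning (RL) stages using the Reward-aware Preference Optimization (RPO) algorithm for chat and instruction following. The final model checkpoint is obtained by merging the final SFT and RPO checkpoints.

The model is part of the Llama Nemotron series.

The model is ready for commercial use.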

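License/Terms of Use

Governing terms: your use of this model is governed by the NVIDIA Open Model License. Additional information: Llama 3.1 Community License Agreement. Built with Llama.

Model developer: NVIDIA

Model training dates: August 2024 to May 2025

Data freshness: the pretraining data has a cutoff of June 2023.

Use Case

Intended for developers designing AI agent systems, chatbots, RAG systems, and other AI-powered applications, and also suitable for typical instruction-following tasks. The model balances accuracy against compute efficiency (it fits on a single RTX GPU and can be used locally).

Release Date

May 20, 2025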

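Model Architecture

Property	Details
Architecture type	Dense decoder-only Transformer model
Network architecture	Llama 3.1 Minitron Width 4B Base

Intended Use

Llama-3.1-Nemotron-Nano-4B-v1.1 is a general-purpose reasoning and chat model intended for use in English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported.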

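Input

Property	Details
Input type	Text
Input format	String
Input parameters	One-dimensional (1D)
Other input properties	Context length of up to 131,072 tokens

Output

Property	Details
Output type	Text
Output format	String
Output parameters	One-dimensional (1D)
Other output properties	Context length of up to 131,072 tokens

Model Version

1.1 (May 20, 2025)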

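Software Integration

Property	Details
Runtime engine	NeMo 24.12
Recommended hardware microarchitecture compatibility	NVIDIA Hopper, NVIDIA Ampere

Inference

Property	Details
Inference engine	Transformers
Test hardware	BF16: 1x RTX 50 Series GPU, 1x RTX 40 Series GPU, 1x RTX 30 Series GPU, 1x H100-80GB GPU, 1x A100-80GB GPU
Preferred/supported operating system	Linux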

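Training Datasets

The post-training pipeline uses a large and varied body of training data, including manually annotated data and synthetic data.

The data for the multi-stage post-training phases that target code, math, and reasoning is a compilation of SFT and RL data that supports improving the math, code, general reasoning, and instruction-following capabilities of the original Llama instruct model.

Prompts are either sourced from public, open corpora or generated synthetically. Responses are generated synthetically by a variety of models. Some prompts carry responses for both reasoning-on and reasoning-off modes, which trains the model to distinguish between the two.

Property	Details
Data collection for the training datasets	Hybrid: automated, human, synthetic
Data labeling for the training datasets	N/A

Evaluation Datasets

We used the datasets listed below to evaluate Llama-3.1-Nemotron-Nano-4B-v1.1.

Property	Details
Data collection for the evaluation datasets	Hybrid: human/synthetic
Data labeling for the evaluation datasets	Hybrid: human/synthetic/automatic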

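Evaluation Results

These results cover both "reasoning on" and "reasoning off" modes. We recommend temperature 0.6 and Top P 0.95 for "reasoning on" mode and greedy decoding for "reasoning off" mode. All evaluations use a sequence length of 32k. Each benchmark is run up to 16 times and the scores are averaged for a more reliable estimate.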
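
The averaging described above can be sketched as follows. The source does not spell out the exact procedure, so this assumes a plain mean of per-run accuracy; `pass_at_1` and the toy data are hypothetical.

```python
from statistics import mean


def pass_at_1(run_results: list) -> float:
    """Average pass@1 over repeated runs.

    run_results[i][j] is True when problem j was solved in run i.
    """
    per_run = [sum(run) / len(run) for run in run_results]
    return mean(per_run)


runs = [
    [True, True, False, True],   # run 1: 3 of 4 solved
    [True, False, False, True],  # run 2: 2 of 4 solved
]
print(pass_at_1(runs))  # 0.625
```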

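⚠️ Important

Where applicable, a prompt template is provided. When running a benchmark, make sure to parse the output in the format requested by the provided prompt in order to reproduce the results below.

MT-Bench

Reasoning mode	Score
Reasoning off	7.4
Reasoning on	8.0

MATH500

Reasoning mode	pass@1
Reasoning off	71.8%
Reasoning on	96.2%

User prompt template: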

"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.\nQuestion: {question}"

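Reproducing the MATH500 number means filling in the template above and reading the final answer out of `\boxed{}`. The sketch below shows one way to do both; `render_math_prompt` and `extract_boxed` are hypothetical helpers, and taking the last `\boxed{...}` in the response is an assumption about how answers are parsed.

```python
MATH_TEMPLATE = (
    "Below is a math question. I want you to reason through the steps and then "
    "give a final answer. Your final answer should be in \\boxed{}.\n"
    "Question: {question}"
)


def render_math_prompt(question: str) -> str:
    # str.format would treat the "{}" in \boxed{} as a placeholder,
    # so the question is substituted with a plain replace instead.
    return MATH_TEMPLATE.replace("{question}", question)


def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...}, or None if there is none."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 0  # nesting level of braces inside the box
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            if depth == 0:
                return text[begin:i]
            depth -= 1
    return None  # unbalanced braces


print(render_math_prompt("Solve x*(sin(x)+2)=0"))
print(extract_boxed("Since sin(x)+2 > 0, the only root is \\boxed{0}"))  # 0
```
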
AIME25

Reasoning mode	pass@1
Reasoning off	13.3%
Reasoning on	46.3%

User prompt template:

"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.\nQuestion: {question}"

GPQA-D

Reasoning mode	pass@1
Reasoning off	33.8%
Reasoning on	55.1%

User prompt template:

"What is the correct answer to this question: {question}\nChoices:\nA. {option_A}\nB. {option_B}\nC. {option_C}\nD. {option_D}\nLet's think step by step, and put the final answer (should be a single letter A, B, C, or D) into a \boxed{}"

IFEval

Reasoning mode	Strict (prompt-level)	Strict (instruction-level)
Reasoning off	70.1%	78.5%
Reasoning on	75.5%	82.6%
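
The two IFEval columns aggregate the same per-instruction checks in different ways: the prompt-level score counts a prompt only when every instruction attached to it is followed, while the instruction-level score counts each instruction on its own. A minimal sketch of that aggregation, with a hypothetical function name and toy data (the instruction checkers themselves are out of scope):

```python
def ifeval_accuracy(results: list) -> tuple:
    """results[i] holds one bool per instruction attached to prompt i.

    Returns (prompt_level, instruction_level) accuracy.
    """
    prompt_level = sum(all(checks) for checks in results) / len(results)
    flat = [ok for checks in results for ok in checks]
    instruction_level = sum(flat) / len(flat)
    return prompt_level, instruction_level


results = [
    [True, True],         # both instructions followed: the prompt counts
    [True, False, True],  # one instruction missed: the prompt does not count
    [True],
]
prompt_level, instruction_level = ifeval_accuracy(results)
print(prompt_level, instruction_level)
```

Here two of three prompts pass in full, while five of six individual instructions are followed, which is why the instruction-level score is never lower than the prompt-level score.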

BFCL v2 Live

Reasoning mode	Score
Reasoning off	63.6%
Reasoning on	67.9%

User prompt template:

<AVAILABLE_TOOLS>{functions}</AVAILABLE_TOOLS>

{user_prompt}
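
A sketch of filling in this template follows. The source does not say how `{functions}` is rendered, so serializing the tool definitions as JSON is an assumption, and `render_tool_prompt` is a hypothetical helper.

```python
import json

BFCL_TEMPLATE = "<AVAILABLE_TOOLS>{functions}</AVAILABLE_TOOLS>\n\n{user_prompt}"


def render_tool_prompt(functions: list, user_prompt: str) -> str:
    # Assumption: tool definitions are embedded as a JSON array.
    return BFCL_TEMPLATE.format(
        functions=json.dumps(functions), user_prompt=user_prompt
    )


tools = [{"name": "calculate_tip", "parameters": {"type": "object"}}]
prompt = render_tool_prompt(tools, "My bill is $100. What is an 18% tip?")
print(prompt)
```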

MBPP 0-shot

Reasoning mode	pass@1
Reasoning off	61.9%
Reasoning on	85.8%

User prompt template:

You are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.

@@ Instruction
Here is the given problem and test examples:
{prompt}
Please use the python programming language to solve this problem.
Please make sure that your code includes the functions from the test samples and that the input and output formats of these functions match the test samples.
Please return all completed codes in one code block.
This code block should be in the following format:
```python
# Your codes here
```
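
Because the template asks for all code in a single fenced python block, scoring a response starts with pulling that block out. The sketch below is one way to do it; `extract_python_block` is a hypothetical helper, and taking the last block when several are present is an assumption.

```python
import re

# The fence marker is built indirectly so this example contains no literal fence.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"python\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_python_block(response: str):
    """Return the contents of the last python code block, or None if absent."""
    blocks = CODE_BLOCK.findall(response)
    return blocks[-1].strip() if blocks else None


response = (
    "Here is the solution:\n"
    + FENCE + "python\n"
    + "def add(a, b):\n    return a + b\n"
    + FENCE
)
print(extract_python_block(response))
```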