Llama-3_3-Nemotron-Super-49B-v1-FP8开源大模型 - 优化推理与对话，支持长上下文任务

Llama 3 3 Nemotron Super 49B V1 FP8

Developed by nvidia

Llama-3.3-Nemotron-Super-49B-v1-FP8 是一个大型语言模型，基于 Meta Llama-3.3-70B-Instruct 衍生而来，经过优化以增强推理能力、对话偏好和任务执行能力，支持 128K 令牌的上下文长度。

大型语言模型

Transformers

EnglishOpen Source License:Other #128K长上下文推理 #NAS高效架构 #多模态任务优化

Downloads 81

Release Time : 5/13/2025

Model Overview

该模型通过神经架构搜索（NAS）方法优化了精度和效率的平衡，适用于 AI 代理系统、聊天机器人、RAG 系统等应用。

Model Features

高效推理

通过神经架构搜索（NAS）方法优化模型结构，实现精度和效率的平衡，适合高负载环境下的单 GPU 部署。

多阶段训练

经过监督微调和强化学习（RL）阶段，增强数学、代码、推理和对话能力。

长上下文支持

支持 128K 令牌的上下文长度，适合处理复杂任务和大规模数据。

Model Capabilities

文本生成

推理任务

代码生成

数学问题求解

多语言支持

Use Cases

AI 代理系统

聊天机器人

用于构建高性能的对话系统，支持多轮对话和复杂指令。

在 IFEval 基准测试中达到 86.70 的严格指令分数。

教育

数学问题求解

用于解答复杂的数学问题，支持逐步推理和答案生成。

在 MATH500 基准测试中达到 95.6 的 pass@1 分数。

编程辅助

代码生成

生成符合描述的 Python 程序，并通过测试用例。

在 LiveCodeBench 基准测试中达到 41.22 的分数。

🚀 Llama-3.3-Nemotron-Super-49B-v1-FP8

Llama-3.3-Nemotron-Super-49B-v1-FP8是基于Meta Llama-3.3-70B-Instruct衍生的大语言模型，经过多阶段后训练，在推理和非推理能力上表现出色，支持128K上下文长度，在准确性和效率间取得了良好平衡。

🚀 快速开始

推理模式控制

推理模式（开启/关闭）通过系统提示控制，系统提示必须按以下示例设置，所有指令应包含在用户提示中。

参数设置建议

推理开启模式：建议将温度设置为 0.6，Top P 设置为 0.95。
推理关闭模式：建议使用贪婪解码。

试用链接

你可以通过以下链接使用预览 API 试用该模型：Llama-3_3-Nemotron-Super-49B-v1。

使用 vLLM 服务示例

pip install vllm==0.8.3

python3 -m vllm.entrypoints.openai.api_server \
  --model "nvidia/Llama-3_3-Nemotron-Super-49B-v1-FP8" \
  --trust-remote-code \
  --seed=1 \
  --host="0.0.0.0" \
  --port=5000 \
  --served-model-name "nvidia/Llama-3_3-Nemotron-Super-49B-v1-FP8" \
  --tensor-parallel-size=8 \
  --max-model-len=32768 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --quantization=modelopt

✨ 主要特性

模型优势

准确性与效率平衡：通过新颖的神经架构搜索（NAS）方法，大幅减少模型内存占用，可在单个 GPU（H200）上处理高负载工作，同时能在准确性和效率之间进行权衡选择。
多阶段后训练：经过多阶段后训练，增强了推理和非推理能力，包括监督微调阶段和多个强化学习阶段。

支持语言

支持英语和编码语言，同时也支持其他非英语语言，如德语、法语、意大利语、葡萄牙语、印地语、西班牙语和泰语。

上下文长度

支持长达 128K 标记的上下文长度。

📦 安装指南

运行引擎

使用 Transformers 作为运行引擎。

硬件兼容性

推荐使用 NVIDIA Hopper 和 NVIDIA Ampere 硬件微架构。

💻 使用示例

基础用法

在推理模式控制方面，需按如下方式设置系统提示：

# 系统提示设置示例
{系统提示内容}

所有指令应包含在用户提示中：

# 用户提示示例
{用户提示内容}

高级用法

在推理开启模式下，设置温度为 0.6，Top P 为 0.95：

# 推理开启模式参数设置示例
parameters = {
    "temperature": 0.6,
    "top_p": 0.95
}

在推理关闭模式下，使用贪婪解码：

# 推理关闭模式贪婪解码示例
parameters = {
    "decoding_method": "greedy"
}

📚 详细文档

模型概述

Llama-3.3-Nemotron-Super-49B-v1-FP8 是基于 Meta Llama-3.3-70B-Instruct 的大语言模型，经过后训练用于推理、满足人类聊天偏好和处理特定任务，如 RAG 和工具调用。该模型支持 128K 标记的上下文长度。

模型架构

架构类型：密集解码器 Transformer 模型。
网络架构：基于 Meta 的 Llama 3.3 70B Instruct，使用神经架构搜索（NAS）算法，产生非标准和非重复的块。
- 跳过注意力：在某些块中，完全跳过注意力或用单个线性层替换。
- 可变 FFN：FFN 层中的扩展/压缩比在不同块之间不同。

训练过程

模型经过块级蒸馏和知识蒸馏（KD）过程，KD 步骤使用了 400 亿标记，包含 FineWeb、Buzz-V1.2 和 Dolma 三个数据集的混合。

预期用途

适用于英语和编码语言的通用推理和聊天场景。

输入输出

输入：文本，字符串格式，一维参数，上下文长度可达 131,072 标记。
输出：文本，字符串格式，一维参数，上下文长度可达 131,072 标记。

模型版本

1.0（2025 年 3 月 18 日）

软件集成

运行引擎：Transformers
首选操作系统：Linux
推荐硬件兼容性：NVIDIA Hopper、NVIDIA Ampere

评估数据集

使用混合方式（人工/合成）收集和标记评估数据集。

评估结果

包含推理开启和关闭两种模式的评估结果，所有评估均在 32k 序列长度下进行，运行基准测试最多 16 次并取平均分数以提高准确性。

伦理考虑

NVIDIA 认为可信 AI 是共同责任，开发者应确保模型符合相关行业和用例要求，避免产品滥用。详细的伦理考虑信息可查看模型卡的相关子卡。

引用信息

@misc{bercovich2025llamanemotronefficientreasoningmodels,
      title={Llama-Nemotron: Efficient Reasoning Models}, 
      author={Akhiad Bercovich and Itay Levy and Izik Golan and Mohammad Dabbah and Ran El-Yaniv and Omri Puny and Ido Galil and Zach Moshe and Tomer Ronen and Najeeb Nabwani and Ido Shahaf and Oren Tropp and Ehud Karpas and Ran Zilberstein and Jiaqi Zeng and Soumye Singhal and Alexander Bukharin and Yian Zhang and Tugrul Konuk and Gerald Shen and Ameya Sunil Mahabaleshwarkar and Bilal Kartal and Yoshi Suhara and Olivier Delalleau and Zijia Chen and Zhilin Wang and David Mosallanezhad and Adi Renduchintala and Haifeng Qian and Dima Rekesh and Fei Jia and Somshubra Majumdar and Vahid Noroozi and Wasi Uddin Ahmad and Sean Narenthiran and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Igor Gitman and Ivan Moshkov and Wei Du and Shubham Toshniwal and George Armstrong and Branislav Kisacanin and Matvei Novikov and Daria Gitman and Evelina Bakhturina and Jane Polak Scowcroft and John Kamalu and Dan Su and Kezhi Kong and Markus Kliegl and Rabeeh Karimi and Ying Lin and Sanjeev Satheesh and Jupinder Parmar and Pritam Gundecha and Brandon Norick and Joseph Jennings and Shrimai Prabhumoye and Syeda Nahida Akter and Mostofa Patwary and Abhinav Khattar and Deepak Narayanan and Roger Waleffe and Jimmy Zhang and Bor-Yiing Su and Guyue Huang and Terry Kong and Parth Chadha and Sahil Jain and Christine Harvey and Elad Segal and Jining Huang and Sergey Kashirsky and Robert McQueen and Izzy Putterman and George Lam and Arun Venkatesan and Sherry Wu and Vinh Nguyen and Manoj Kilaru and Andrew Wang and Anna Warno and Abhilash Somasamudramath and Sandip Bhaskar and Maka Dong and Nave Assaf and Shahar Mor and Omer Ullman Argov and Scot Junkin and Oleksandr Romanenko and Pedro Larroy and Monika Katariya and Marco Rovinelli and Viji Balas and Nicholas Edelman and Anahita Bhiwandiwalla and Muthu Subramaniam and Smita Ithape and Karthik Ramamoorthy and Yuting Wu and Suguna Varshini Velury and Omri Almog and Joyjit Daw and Denys Fridman and Erick Galinkin and Michael Evans and Katherine Luna and Leon Derczynski and Nikki Pope and Eileen Long and Seth Schneider and Guillermo Siman and Tomasz Grzegorzek and Pablo Ribalta and Monika Katariya and Joey Conway and Trisha Saar and Ann Guan and Krzysztof Pawelec and Shyamala Prayaga and Oleksii Kuchaiev and Boris Ginsburg and Oluwatobi Olabiyi and Kari Briski and Jonathan Cohen and Bryan Catanzaro and Jonah Alben and Yonatan Geifman and Eric Chung and Chris Alexiuk},
      year={2025},
      eprint={2505.00949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.00949}, 
}

🔧 技术细节

神经架构搜索（NAS）

通过 NAS 算法选择满足所需吞吐量和内存要求的模型，同时最小化质量下降。

知识蒸馏（KD）

专注于英语单轮和多轮聊天用例，使用 400 亿标记的混合数据集进行知识蒸馏。

训练数据集

知识蒸馏阶段：使用 FineWeb、Buzz-V1.2 和 Dolma 三个数据集。
多阶段后训练：使用 SFT 和 RL 数据，提高数学、代码、推理和指令跟随能力。

评估数据集

使用混合方式（人工/合成）收集和标记评估数据集。

📄 许可证

本模型的使用受 NVIDIA 开放模型许可证约束。更多信息请参考 Llama 3.3 社区许可协议。

模型开发者

NVIDIA

模型训练时间

2024 年 11 月至 2025 年 2 月

数据新鲜度

预训练数据截止到 2023 年（根据 Meta Llama 3.3 70B）

用例

适用于设计 AI 代理系统、聊天机器人、RAG 系统和其他 AI 应用程序的开发者，也适用于典型的指令跟随任务。

发布日期

2025 年 3 月 18 日

参考资料

模型家族

可在以下链接找到 Llama Nemotron 系列的其他模型：

评估结果表格

评估数据集	推理模式	pass@1
Arena Hard	推理关闭	88.6
BFCL v2	推理关闭	72.10
BFCL v2	推理开启	71.70
MATH500	推理开启	95.6
AIME25	推理开启	53.96
GPQA	推理开启	64.77

用户提示模板

MATH500 和 AIME25

"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \boxed{}.\nQuestion: {question}"

GPQA

"What is the correct answer to this question: {question}\nChoices:\nA. {option_A}\nB. {option_B}\nC. {option_C}\nD. {option_D}\nLet's think step by step, and put the final answer (should be a single letter A, B, C, or D) into a \boxed{}"

LiveCodeBench（无起始代码）

"You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests.

Question: {prompt}

Read the inputs from stdin solve the problem and write the answer to stdout (do not directly test on the sample inputs). Enclose your code within delimiters as follows. Ensure that when the python program runs, it reads the inputs, runs the algorithm and writes output to STDOUT.
```python
# YOUR CODE HERE


#### LiveCodeBench（有起始代码）
```plaintext
You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests.

Question: {prompt}

You will use the following starter code to write the solution to the problem and enclose your code within delimiters.
```python
{starter_code}


### 伦理考虑
请参考模型卡的 [可解释性](./EXPLAINABILITY.md)、[偏差](./BIAS.md)、[安全与保障](./SAFETY_and_SECURITY.md) 和 [隐私](./PRIVACY.md) 子卡获取更多详细信息。

### 安全报告
请 [在此](https://www.nvidia.com/en-us/support/submit-security-vulnerability/) 报告安全漏洞或 NVIDIA AI 相关问题。