Llama 3.1 Nemotron 8B UltraLong 1M Instruct
A large language model designed for processing ultra-long text sequences (available in variants with 1-million-, 2-million-, and 4-million-token context windows) while maintaining strong performance on standard benchmarks.
Release date: 3/4/2025
Model Overview
An ultra-long-context language model based on the Llama-3.1 architecture, whose long-context understanding and instruction-following capabilities are significantly enhanced through efficient continual pre-training and instruction fine-tuning.
Model Features
Ultra-Long Context Support
Supports processing ultra-long text sequences of up to 4 million tokens
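To get a feel for what a context window of this size holds, the sketch below estimates whether a document fits. The 4-characters-per-token ratio is a common rule of thumb for English text, not a property of the actual Llama-3.1 tokenizer, and the reserved output budget is an illustrative assumption.

```python
# Rough check of whether a document fits in a given context window.
# The 4-chars-per-token ratio is a heuristic for English text, not the
# model's real tokenizer; counts from an actual tokenizer will differ.

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Estimate the token count of `text` with a chars-per-token heuristic."""
    return int(len(text) / chars_per_token)

def fits_in_context(text: str, context_window: int = 1_000_000,
                    reserved_for_output: int = 4_096) -> bool:
    """True if the estimated prompt tokens leave room for generation."""
    return estimate_tokens(text) + reserved_for_output <= context_window

# By this estimate, a 1M-token window holds roughly 4 million characters,
# i.e. on the order of several thousand pages of English prose.
doc = "x" * 3_000_000  # ~750k estimated tokens
print(fits_in_context(doc))  # True
```

By the same estimate, the 4M-token variant would hold around 16 million characters in a single prompt.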
Efficient Training Solution
Combines efficient continual pre-training with instruction fine-tuning to significantly improve long-context understanding
Performance Retention
Maintains general performance while expanding the context window
Diverse Evaluation
Excels in both long-context tasks and standard benchmarks
Model Capabilities
Ultra-long text sequence processing
Instruction following
General text generation
Mathematical reasoning
Code generation
Use Cases
Long Document Processing
Legal Document Analysis
Processing and analyzing ultra-long legal contracts and documents
Accurately understands and extracts key information from lengthy documents
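A minimal sketch of how such an analysis request might be framed, using the common system/user chat-message format. The exact serving API and the wording of the instructions are assumptions for illustration, not something this card specifies.

```python
# Build a chat-style prompt asking the model to extract key information
# from a long contract. The system/user message schema is the common chat
# convention; the actual serving endpoint is an assumption.

def build_extraction_messages(contract_text: str) -> list[dict]:
    """Return chat messages for a key-information extraction request."""
    system = (
        "You are a legal analyst. From the contract below, extract the "
        "parties, effective date, termination conditions, and liability "
        "caps. Answer with one bullet per item and cite the relevant "
        "section of the contract."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": contract_text},
    ]

messages = build_extraction_messages("AGREEMENT dated January 1, 2025 ...")
print(messages[1]["role"])  # user
```

Because the full contract is passed verbatim rather than chunked, the model can resolve cross-references between distant sections, which is the main advantage of an ultra-long context window for this use case.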
Research Paper Summarization
Summarizing and extracting key information from lengthy research papers
Maintains coherent understanding of the full text
Dialogue Systems
Long Dialogue Memory
Supports memory and contextual understanding of ultra-long dialogue histories
Maintains consistent responses in extended conversations
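With a 1M-token window, dialogue history can usually be kept verbatim and only trimmed when the budget is genuinely exceeded. The sketch below shows one simple way to do that; the chars-per-token estimate is a heuristic assumption, and the class name and trimming policy are illustrative, not part of the model.

```python
# Keep a running chat history, dropping the oldest turns only when the
# estimated token count exceeds the context budget. With a 1M-token
# window this trimming rarely triggers in practice. The 4-chars-per-token
# estimate is a heuristic, not the model's real tokenizer.

class DialogueMemory:
    def __init__(self, max_tokens: int = 1_000_000,
                 chars_per_token: float = 4.0):
        self.max_tokens = max_tokens
        self.chars_per_token = chars_per_token
        self.turns: list[dict] = []

    def _estimated_tokens(self) -> int:
        chars = sum(len(t["content"]) for t in self.turns)
        return int(chars / self.chars_per_token)

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})
        # Trim oldest turns until the history fits the budget again.
        while self._estimated_tokens() > self.max_tokens and len(self.turns) > 1:
            self.turns.pop(0)

memory = DialogueMemory(max_tokens=50)  # tiny budget just to show trimming
memory.add("user", "a" * 120)           # ~30 estimated tokens
memory.add("assistant", "b" * 120)      # total ~60 > 50, oldest turn dropped
print(len(memory.turns))  # 1
```

At the default 1M-token budget the same two turns would both be retained, which is why consistency across very long conversations improves: the model sees the actual history rather than a lossy summary.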