
Llama 3.1 8B UltraLong 4M Instruct

Developed by NVIDIA
A large language model designed for processing ultra-long text sequences. The UltraLong series offers 1-million-, 2-million-, and 4-million-token context variants; this model supports up to 4 million tokens while maintaining strong performance on standard benchmarks
Downloads: 264
Release Time: 3/4/2025

Model Overview

An ultra-long context language model built on the Llama-3.1 architecture. Through a systematic recipe of efficient continued pre-training and instruction fine-tuning, it significantly improves long-context understanding and instruction-following while preserving general capability

Model Features

Ultra-long context support
Processes text sequences of up to 4 million tokens
Efficient training scheme
Combines continued pre-training with instruction fine-tuning to extend the context window while maintaining general performance
Multi-domain adaptability
Performs well on general, mathematical, and coding tasks

Model Capabilities

Ultra-long text understanding
Instruction following
Mathematical reasoning
Code generation
Multi-turn dialogue
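These capabilities are typically accessed through the standard Hugging Face `transformers` chat interface. A minimal sketch of long-document question answering, assuming the checkpoint is published as `nvidia/Llama-3.1-8B-UltraLong-4M-Instruct` on the Hugging Face Hub (the repository id and file name below are assumptions, not confirmed by this page):

```python
# Sketch: asking a question about an entire long document in one chat turn.
# The repository id is an assumption; verify it on the Hugging Face Hub.
MODEL_ID = "nvidia/Llama-3.1-8B-UltraLong-4M-Instruct"

def build_messages(document: str, question: str) -> list[dict]:
    """Pack a full long document plus a question into a single chat exchange."""
    return [
        {"role": "system",
         "content": "You answer questions about the provided document."},
        {"role": "user",
         "content": f"{document}\n\nQuestion: {question}"},
    ]

def main() -> None:
    # Heavy imports and the 8B weights are loaded only when run directly.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    # "contract.txt" is a hypothetical input file.
    messages = build_messages(open("contract.txt").read(),
                              "Who are the contracting parties?")
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0][inputs.shape[-1]:],
                           skip_special_tokens=True))

if __name__ == "__main__":
    main()
```

Because the window holds millions of tokens, the document is passed whole rather than chunked, which is what lets the model resolve long-range dependencies directly.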

Use Cases

Long document processing
Legal document analysis
Processing and analyzing ultra-long legal contracts and documents
Accurately understands long-range dependencies in documents
Academic paper summarization
Summarizing and extracting key information from lengthy academic papers
Maintains coherent understanding of the full text content
Dialogue systems
Ultra-long conversation memory
Maintaining context consistency in long conversations
Accurately tracks historical information in ultra-long conversations
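Before sending a full contract, paper collection, or conversation history to the model, it can be useful to check whether the input is likely to fit in the 4-million-token window. A rough illustrative check, assuming about 4 characters per token for English text (a common heuristic, not an exact tokenizer count):

```python
# Rough fit check for the 4M-token context window.
# The 4-characters-per-token ratio is a heuristic assumption for English text;
# use the model's actual tokenizer for an exact count.
CONTEXT_WINDOW = 4_000_000
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(text: str, reserve_for_output: int = 1_000) -> bool:
    """True if the text likely fits, leaving room for the model's reply."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_WINDOW

print(fits_in_context("x" * 1_000_000))  # ~250k estimated tokens: prints True
```

For precise budgeting, tokenize with the model's own tokenizer instead of the character heuristic.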