Llama-3.1-8B-UltraLong-1M-Instruct Open-Source Language Model for Handling Ultra-Long Texts with a 1-Million-Token Context

Llama 3.1 8B UltraLong 1M Instruct

Developed by NVIDIA

Nemotron-UltraLong-8B is a series of language models designed for processing ultra-long text sequences. The series supports context windows of up to 4 million tokens (1 million tokens for this model) while maintaining competitive performance on standard benchmarks.

Large Language Model

Transformers

English

#Ultra-long context understanding #Million-level token processing #Instruction fine-tuning optimization

Downloads 1,387

Release Time: 3/4/2025

Model Overview

An ultra-long context language model based on the Llama-3.1 architecture, enhanced through efficient continual pre-training and instruction fine-tuning to improve long-context understanding and instruction-following capabilities.

Model Features

Ultra-long context support

The series supports context windows of up to 4 million tokens for processing ultra-long text sequences; this model's maximum context window is 1 million tokens.

Efficient training approach

Combines continued pretraining with instruction fine-tuning to improve long-context understanding and instruction-following capabilities.

Performance balance

Maintains competitive performance on standard benchmarks while expanding the context window.

Model Capabilities

Ultra-long text sequence processing

Instruction following

General text generation

Mathematical reasoning

Code generation

Use Cases

Long document processing

Legal document analysis

Processing and analyzing ultra-long legal documents to extract key information.

Efficiently understands long document content and accurately extracts information.

Academic paper summarization

Summarizing and extracting key points from lengthy academic papers.

Generates accurate and comprehensive summaries.

Dialogue systems

Long-context chatbot

Building chatbots capable of remembering and referencing long conversation histories.

Provides coherent and contextually relevant responses.
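One way to reason about long conversation histories is to keep as many recent turns as fit in the context window while reserving room for the reply. The sketch below is illustrative only: trim_history and approx_token_count are hypothetical helpers rather than part of the model or of Transformers, the whitespace-based count is a crude stand-in for the model's tokenizer, and the 1,000,000-token default reflects this model's stated maximum window.

```python
def approx_token_count(text):
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def trim_history(messages, max_context_tokens=1_000_000, reserve_for_reply=256,
                 count_tokens=approx_token_count):
    """Keep the first system message (if any) plus the newest turns that fit."""
    budget = max_context_tokens - reserve_for_reply

    system = [m for m in messages if m["role"] == "system"][:1]
    turns = [m for m in messages if m["role"] != "system"]

    # The system prompt is always kept, so charge it against the budget first.
    budget -= sum(count_tokens(m["content"]) for m in system)

    kept = []
    # Walk from newest to oldest so the most recent turns survive.
    for message in reversed(turns):
        cost = count_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost

    # Restore chronological order before returning.
    return system + kept[::-1]
```

In practice, count_tokens would be replaced by a function that measures length with the model's own tokenizer, since word counts underestimate token counts.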

🚀 Nemotron-UltraLong-8B

A series of ultra-long context language models designed to process extensive text sequences while maintaining competitive performance on standard benchmarks.

🚀 Quick Start

With transformers >= 4.43.0, you can run conversational inference using the Transformers pipeline abstraction or the Auto classes with the generate() function.

Make sure to update your transformers installation via pip install --upgrade transformers.

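import transformers
import torch

model_id = "nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct"

# Build a text-generation pipeline. bfloat16 halves memory use versus float32,
# and device_map="auto" places the weights on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input: a system prompt followed by the user's message.
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
# generated_text holds the whole conversation; the last entry is the assistant's reply.
print(outputs[0]["generated_text"][-1])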
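The Quick Start text also mentions the Auto classes with generate(), which the pipeline snippet does not show. The following is a minimal sketch of that path, assuming the tokenizer ships a chat template (as Llama 3.1 instruct models do). The helper names build_messages and generate_reply are illustrative and do not come from the model card.

```python
MODEL_ID = "nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct"


def build_messages(system_prompt, user_prompt):
    """Return a chat in the role/content format used by chat templates."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def generate_reply(messages, model_id=MODEL_ID, max_new_tokens=256):
    """Hypothetical helper: load the model with the Auto classes and generate a reply.

    Downloads the weights on first use.
    """
    # Heavy imports stay inside the function so build_messages remains cheap to use.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Render the chat with the tokenizer's template and append the assistant header.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode only the newly generated tokens, not the prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0, prompt_length:], skip_special_tokens=True)
```

Calling generate_reply(build_messages("You are a pirate chatbot who always responds in pirate speak!", "Who are you?")) would mirror the pipeline example, returning the reply as a plain string.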

✨ Features

We introduce Nemotron-UltraLong-8B, a series of ultra-long context language models. The models can process extensive sequences of text (up to 1M, 2M, and 4M tokens, depending on the variant) while maintaining competitive performance on standard benchmarks. Built on Llama-3.1, UltraLong-8B uses a systematic training recipe that combines efficient continued pretraining with instruction tuning to enhance long-context understanding and instruction-following capabilities. This approach lets the models scale their context windows efficiently without sacrificing general performance.

📚 Documentation

The UltraLong Models

Model Card

Property	Details
Base model	meta-llama/Llama-3.1-8B-Instruct
Continued pretraining	The training data consists of 1B tokens sourced from a pretraining corpus using per-domain upsampling based on sample length. The model was trained for 125 iterations with a sequence length of 1M and a global batch size of 8.
Supervised fine-tuning (SFT)	1B tokens from open-source instruction datasets across the general, mathematics, and code domains. The data is subsampled from 'general_sft_stage2' in AceMath-Instruct.
Maximum context window	1M tokens
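As a sanity check on the continued-pretraining row, the 1B-token budget follows from the stated iteration count, global batch size, and sequence length. The short calculation below assumes that "1M" means 1,000,000 tokens per sequence; if it means 2^20 tokens, the total is about 5% higher.

```python
# Figures taken from the model card table.
iterations = 125
global_batch_size = 8          # sequences per iteration
sequence_length = 1_000_000    # assumption: "1M" read as 10**6 tokens

tokens_per_iteration = global_batch_size * sequence_length
total_tokens = iterations * tokens_per_iteration

print(f"tokens per iteration: {tokens_per_iteration:,}")
print(f"total tokens:         {total_tokens:,}")
```

Under that assumption each iteration consumes 8 million tokens, and 125 iterations account for the full 1B tokens.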

Evaluation Results

We evaluate Nemotron-UltraLong-8B on a diverse set of benchmarks, including long-context tasks (e.g., RULER, LV-Eval, and InfiniteBench) and standard tasks (e.g., MMLU, MATH, GSM-8K, and HumanEval). UltraLong-8B achieves superior performance on ultra-long context tasks while maintaining competitive results on standard benchmarks.

Needle in a Haystack

Long context evaluation

Standard capability evaluation

Correspondence to

Chejian Xu (chejian2@illinois.edu), Wei Ping (wping@nvidia.com)

Citation

@article{ulralong2025,
  title={From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models},
  author={Xu, Chejian and Ping, Wei and Xu, Peng and Liu, Zihan and Wang, Boxin and Shoeybi, Mohammad and Catanzaro, Bryan},
  journal={arXiv preprint},
  year={2025}
 }

📄 License

This project is licensed under the cc-by-nc-4.0 license.
