🚀 Instella-Long✨: Fully Open Language Model with Long-context Capability
AMD is thrilled to unveil Instella-Long, a long-context language model continually trained from Instella-3B-Instruct on AMD Instinct™ MI300X GPUs. To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch with long-context support. Instella-Long handles a 128K context length and achieves competitive performance, outperforming open-weight models such as Phi-3.5-mini, Gemma-3-4B, and Qwen2.5-3B on long-context benchmarks.
Training Instella with long-context extension on Instinct MI300X GPUs demonstrates the capability and scalability of AMD hardware for demanding AI training workloads, offering a viable option in the AI hardware market. In line with AMD's commitment to open source, we are sharing all model weights, detailed training configurations, datasets, and code, enabling the AI community to collaborate, replicate, and innovate, thereby accelerating progress.
✨ Features
- New Model Announcement: AMD has developed Instella-Long, a 3B long-context language model supporting 128K context length, trained on 64 Instinct MI300X GPUs.
- Fully Open and Long-context Support: To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch with long-context support. The Hugging Face model, training data, and training code are all fully open-sourced.
- Efficient Training Techniques: Backed by the AMD ROCm software stack, Instella-Long uses efficient training techniques such as sequence parallelism, FlashAttention-2, Torch Compile, and FSDP to distribute model training across 8 MI300X nodes, each with 8 GPUs (a minimal sketch of such a setup follows this list).
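For context, the sketch below shows how a Hugging Face causal LM can be wrapped with FSDP and Torch Compile in plain PyTorch. It is a minimal illustration, not the actual Instella-Long training code: the wrapping policy, precision settings, and launch setup (torchrun, one process per GPU) are assumptions.

```python
# Minimal sketch: FSDP + torch.compile for distributed training.
# Assumptions: launched with torchrun, one process per GPU; the mixed-precision
# settings below are illustrative, not the exact Instella-Long configuration.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")  # ROCm builds of PyTorch also use the "nccl" backend (RCCL)
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained(
    "amd/Instella-3B-Long-Instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Shard parameters, gradients, and optimizer state across GPUs
model = FSDP(
    model,
    device_id=local_rank,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
)

# Torch Compile for kernel fusion / graph-level optimization
model = torch.compile(model)
```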
🚀 Quick Start
Instella-Long
Instella-Long is based on the Instella model released in March. Specifically, it is continually trained from Instella-3B-Instruct and follows the same model architecture. The training of Instella-Long consists of three stages:
- Continued Pre-Training
- Training: We conduct a two-phase continued pre-training starting from Instella-3B-Instruct (4K context length).
- Phase 1: We extend the context length from 4,096 to 65,536 tokens and train the model on 20B tokens. Following the RoPE scaling law, we increase the RoPE base frequency from 10,000 to 514,640 (a short sketch of how this changes the RoPE frequencies follows the data table below).
- Phase 2: As suggested by ProLong, it is beneficial to train the model on data with a context length longer than the target context length. In this phase, we train the model on 20B tokens with a maximum context length of 262,144 (2× the target context length of 128K) and increase the RoPE base frequency to 3,691,950 according to the RoPE scaling law.
- Data: Our continued pre-training data comes from the data mix created by ProLong. We use the text data curated by ProLong and tokenize it with our tokenizer. In each phase of the continued pre-training, we train on a mix of long- and short-context data. The specific details are as follows:
Training Phase | 64K Long Data | 256K Long Data | Short Data |
---|---|---|---|
Phase 1 | Code repos (30%), Books (30%), Textbooks (3%) | - | FineWeb-Edu (10%), FineWeb (10%), StackExchange (4%), Wikipedia (5%), ArXiv (3%), OpenWebMath (5%) |
Phase 2 | Code repos (10%), Books (15%) | Code repos (20%), Books (15%), Textbooks (2%) | FineWeb-Edu (10%), FineWeb (10%), StackExchange (4%), Wikipedia (5%), ArXiv (4%), OpenWebMath (5%) |
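As a side note on the RoPE base change above: increasing the base lowers the rotation frequencies, so positions far beyond the original 4K window produce angles comparable to those seen during shorter-context training. Below is a minimal sketch of that computation assuming the standard RoPE parameterization; the base values are the ones reported above, while the head dimension is an illustrative placeholder, not necessarily the model's actual value.

```python
# Sketch: how the RoPE inverse frequencies change when the base frequency is
# increased for context extension. The base values (10,000 -> 514,640 -> 3,691,950)
# are the ones used for Instella-Long; the head dimension is a placeholder.
import torch

def rope_inv_freq(base: float, head_dim: int) -> torch.Tensor:
    # Standard RoPE: theta_i = base^(-2i / head_dim) for i = 0 .. head_dim/2 - 1
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim = 128  # assumption for illustration
for base in (10_000, 514_640, 3_691_950):
    inv_freq = rope_inv_freq(base, head_dim)
    # A larger base slows the lowest-frequency rotation, keeping angles at long
    # positions within a range comparable to shorter-context training.
    print(f"base={base:>9,d}  slowest rotation period ≈ {2 * torch.pi / inv_freq[-1]:.0f} tokens")
```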
- Supervised Finetuning (SFT)
- Training: After continued training on the long-context pre-training data, we perform supervised finetuning on long-context instruction data. We train the model on a 1B-token mixture of short- and long-context instruction data.
- Data: Similar to the continued pre-training stage, we train the model on a mixture of short- and long-context instruction data with a ratio of 4 to 6. For short-context instruction data, we use Ultrachat 200K, [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2), [Tülu-3 Instruction Following](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following), and the MMLU auxiliary train set. For long-context instruction data, we construct a synthetic long-context instruction dataset due to the lack of long-context SFT data.
- We use long documents from Books in our continued pre-training corpus. We select documents with a minimum length of 8K tokens and truncate those exceeding 128K tokens to a maximum length of 128K. Then, we use [Qwen2.5-14B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-1M) as a teacher model to synthetically generate question-answer pairs for the documents. To speed up the process, we randomly choose a sub-part of the document for QA generation, with the length of the sub-part randomly set between 2K and 8K tokens. We use the NLTK sentence tokenizer to ensure the selected sub-part contains complete sentences. The generated question and answer are appended to the end of the long document to form a complete single-round instruction-following sample (a sketch of this pipeline follows the SFT data table below).
- We also generate long-context instruction data using short documents to increase dataset diversity. We use ArXiv from our continued pre-training corpus and the DCLM subset from [Dolmino-Mix-1124](https://huggingface.co/datasets/allenai/dolmino-mix-1124). We first generate QA pairs for each short document following the same pipeline. Then, we iteratively concatenate different short documents until the concatenation reaches 128K tokens. The concatenated document can exceed 128K tokens, as we do not truncate the last document. Finally, we randomly choose one QA pair corresponding to one of the short documents and append it to the end of the concatenated document. The final data mixture for the SFT stage is as follows:
Short Data | Long Data |
---|---|
Ultrachat 200K (25%), OpenMathInstruct-2 (10%), MMLU auxiliary train set (3%), Tülu-3 Instruction Following (2%) | Books (44%), DCLM (10%), ArXiv (6%) |
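The synthetic QA construction described above can be sketched as follows, assuming a generic `teacher_generate` callable that stands in for querying Qwen2.5-14B-Instruct-1M; the prompt wording and helper names are hypothetical, while the 2K–8K sub-part range and NLTK sentence splitting follow the description above.

```python
# Sketch of the synthetic long-context SFT pipeline described above:
# pick a 2K-8K-token sub-part made of complete sentences, ask a teacher model
# for a QA pair about it, and append the QA to the end of the full document.
# Prompt wording and helper names are hypothetical.
import random
import nltk  # requires: nltk.download("punkt")

def sample_subpart(document: str, tokenizer, min_tokens=2048, max_tokens=8192) -> str:
    # Split into sentences so the sub-part contains only complete sentences
    sentences = nltk.sent_tokenize(document)
    target = random.randint(min_tokens, max_tokens)
    start = random.randrange(len(sentences))
    picked, n_tokens = [], 0
    for sent in sentences[start:]:
        picked.append(sent)
        n_tokens += len(tokenizer.encode(sent, add_special_tokens=False))
        if n_tokens >= target:
            break  # simplification: may end short if the document runs out
    return " ".join(picked)

def build_sft_sample(document: str, tokenizer, teacher_generate) -> dict:
    subpart = sample_subpart(document, tokenizer)
    # teacher_generate is a stand-in for prompting the teacher model
    question, answer = teacher_generate(
        f"Write one question and its answer about the following text:\n\n{subpart}"
    )
    # Single-round sample: full long document + generated question, answer as target
    return {
        "messages": [
            {"role": "user", "content": f"{document}\n\n{question}"},
            {"role": "assistant", "content": answer},
        ]
    }
```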
- Direct Preference Optimization (DPO)
- Training: In the final training stage, we perform human preference alignment using Direct Preference Optimization (DPO). We use the same DPO training recipe and data as Instella-3B-Instruct. Unlike the previous training stages, the DPO stage trains only on short data with a maximum context length of 2K. Consistent with the findings of other open-weight models, we observe that performing DPO only on short data still improves the model's performance on long-context tasks (the standard DPO objective is sketched after this list).
- Data: We use the [OLMo-2-1124-7B-Preference-Mix](https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix) dataset as our DPO data, which contains 0.76B tokens.
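For reference, the sketch below shows the standard DPO objective (Rafailov et al., 2023) that this stage optimizes; it is a generic formulation, not the exact Instella training code, and the `beta` value is an illustrative assumption.

```python
# Sketch of the standard DPO loss used for preference alignment.
# The log-probabilities are summed over response tokens; beta is the usual
# DPO temperature (the value here is an illustrative assumption).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward margins of the policy relative to the frozen reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```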
Sequence Parallelism
To enable training with extremely long inputs, we implement sequence parallelism based on DeepSpeed Ulysses. During the attention computation, the sequence parallelism distributes the attention heads across GPUs so that each GPU attends over the full sequence for a subset of heads. This requires less GPU communication than Ring-Attention. We use four GPUs as a sequence-parallelism group for the Phase 2 continued pre-training and SFT because of the long inputs.
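The sketch below illustrates the Ulysses-style redistribution: each rank starts with a slice of the sequence for all heads, and an all-to-all regroups the data so that each rank holds the full sequence for a subset of heads before attention (the inverse exchange happens afterwards). Shapes and the helper name are illustrative assumptions, not the actual Instella implementation.

```python
# Sketch of DeepSpeed-Ulysses-style sequence parallelism: before attention,
# an all-to-all turns "local sequence slice x all heads" into
# "full sequence x local head slice", so each GPU runs full-sequence attention
# on a subset of heads. Shapes and names are illustrative.
import torch
import torch.distributed as dist

def seq_to_head_parallel(x: torch.Tensor, sp_group) -> torch.Tensor:
    # x: [batch, local_seq, num_heads, head_dim], sharded over the sequence dim;
    # num_heads must be divisible by the sequence-parallel group size.
    sp_size = dist.get_world_size(group=sp_group)
    b, local_seq, num_heads, head_dim = x.shape
    # Split heads into sp_size chunks; each chunk is destined for a different rank
    x = x.reshape(b, local_seq, sp_size, num_heads // sp_size, head_dim)
    x = x.permute(2, 0, 1, 3, 4).contiguous()      # [sp, b, local_seq, heads/sp, head_dim]
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=sp_group)  # exchange sequence slices <-> head slices
    # Concatenate the gathered sequence slices in rank order: [b, sp*local_seq, heads/sp, head_dim]
    return out.permute(1, 0, 2, 3, 4).reshape(b, sp_size * local_seq, num_heads // sp_size, head_dim)
```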
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released checkpoint (trust_remote_code is required for the custom architecture)
checkpoint = "amd/Instella-3B-Long-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", trust_remote_code=True)

# Build a chat-formatted prompt and generate a response
prompt = [{"role": "user", "content": "What are the benefits of open-source AI research?"}]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)

tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)

print(tokenizer.decode(tokens[0], skip_special_tokens=False))
```
📚 Documentation
Results
- Long-context Evaluation: We evaluate long-context performance on [HELMET](https://princeton-nlp.github.io/HELMET/), a recent and comprehensive long-context evaluation benchmark covering diverse categories. HELMET shows better consistency with human perception than previous long-context benchmarks.
- Performance Comparison: Instella-3B-Long-Instruct outperforms open-weight models such as Phi-3.5-mini-instruct, Gemma-3-4B-it, Qwen2.5-3B-Instruct, and MiniCPM-2B-128k on most tasks of the HELMET benchmark.
- Side-by-Side Comparison: We compare Instella-3B-Long-Instruct with Qwen2.5-3B-Instruct (whose maximum context length is 32K) at 8K, 16K, and 32K context lengths. Instella-3B-Long-Instruct outperforms Qwen2.5-3B-Instruct by 2.75% on average.
Models | Size | Training Tokens (from scratch) | Natural Questions (RAG) | TriviaQA (RAG) | HotpotQA (RAG) | InfiniteBench QA | InfiniteBench MC | NarrativeQA | NIAH (multi-value needles) | Average |
---|---|---|---|---|---|---|---|---|---|---|
**Open Weight Models** | | | | | | | | | | |
Llama-3.2-3B-Instruct | 3.21B | ~9T | 51.8 | 86.2 | 56.4 | 38.7 | 56.0 | 26.0 | 99.2 | 59.19 |
Phi-3.5-mini-instruct | 3.82B | - | 41.2 | 78.6 | 48.6 | 24.0 | 55.0 | 27.7 | 87.0 | 51.73 |
gemma-3-4b-it | 4.3B | ~4T | 47.2 | 76.8 | 45.2 | 21.0 | 49.0 | 20.7 | 74.0 | 47.70 |
Qwen2.5-3B-Instruct | 3.09B | ~18T | 34.6 | 65.8 | 41.8 | 14.7 | 35.0 | 21.0 | 80.4 | 41.90 |
MiniCPM-2B-128k | 2.4B | ~1T | 28.4 | 61.6 | 30.8 | 3.7 | 22.0 | 3.3 | 46.6 | 28.06 |
**Fully Open Models** | | | | | | | | | | |
Instella-3B-Long-Instruct | 3.11B | ~4T | 43.6 | 73.0 | 51.6 | 30.7 | 54.0 | 32.3 | 84.0 | 52.74 |
Table 1: Long-context evaluation on the HELMET benchmark. The NIAH and RAG tasks are evaluated at five context lengths (8K, 16K, 32K, 64K, and 128K), and the reported number is the average across the five context lengths. InfiniteBench QA, InfiniteBench MC, and NarrativeQA are evaluated at a 128K context length. The InfiniteBench tasks use HELMET's reimplementation.
Model | NIAH 8K | NIAH 16K | NIAH 32K | Natural Questions (RAG) 8K | Natural Questions (RAG) 16K | Natural Questions (RAG) 32K | TriviaQA (RAG) 8K | TriviaQA (RAG) 16K | TriviaQA (RAG) 32K | HotpotQA (RAG) 8K | HotpotQA (RAG) 16K | HotpotQA (RAG) 32K | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Instella-3B-Long-Instruct | 98 | 95 | 87 | 53 | 49 | 46 | 79 | 73 | 75 | 59 | 59 | 51 | 68.67 |
Qwen2.5-3B-Instruct | 95 | 94 | 95 | 48 | 42 | 39 | 77 | 78 | 74 | 51 | 50 | 48 | 65.92 |
Table 2: Comparison with Qwen2.5-3B-Instruct at 8K, 16K, and 32K context lengths. NIAH refers to the multi-value needles task.
Evaluation Metric: We use substring exact match (SubEM) for the RAG tasks (Natural Questions, TriviaQA, and HotpotQA), recall for NIAH, and exact match for InfiniteBench MC. For InfiniteBench QA and NarrativeQA, where the answers are open-ended, we use gpt-4o-mini to judge the answers against the ground truth using the prompt and metric provided by HELMET.
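For clarity, substring exact match simply checks whether any gold answer appears verbatim in the (normalized) model output. A minimal sketch is below; the normalization shown is a common simple choice and an assumption rather than HELMET's exact implementation.

```python
# Sketch of substring exact match (SubEM), as used for the RAG tasks.
# The normalization here is a simple common choice, not necessarily HELMET's exact one.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def sub_em(prediction: str, gold_answers: list[str]) -> float:
    pred = normalize(prediction)
    return float(any(normalize(ans) in pred for ans in gold_answers))

# Example: counts as correct because the gold answer is contained in the prediction
print(sub_em("The capital of France is Paris.", ["Paris"]))  # 1.0
```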
Models | MMLU | IFEval | MT-Bench | TruthfulQA | Toxigen (↓) | CrowS-Pairs |
---|---|---|---|---|---|---|
Instella-3B-Instruct | 58.90 | 71.35 | 7.23 | 55.47 | 57.02 | 58.86 |
Instella-3B-Long-Instruct | 57.44 | 68.76 | 6.83 | 55.52 | 42.34 | 60.05 |
Table 3: Short-context benchmark comparison with Instella-3B-Instruct.
Short-context Results: We observe performance drops on some short-context benchmarks compared to Instella-3B-Instruct. Interestingly, TruthfulQA remains stable, while CrowS-Pairs shows a slight improvement, indicating potential gains on certain responsible-AI metrics. The reduction in Toxigen (57.02 → 42.34, lower is better) suggests improved toxicity avoidance in the long-context variant. We hypothesize that these results reflect a trade-off between optimizing for longer context lengths and retaining short-context performance, which may be more pronounced at the 3B parameter scale than in larger models.
Training Data
Stage | Dataset | License |
---|---|---|
Continued Pre-Training - Phase 1 | https://huggingface.co/datasets/amd/Instella-Long/tree/main/pretrain-phase-1 | ResearchRAIL |
Continued Pre-Training - Phase 2 | https://huggingface.co/datasets/amd/Instella-Long/tree/main/pretrain-phase-2 | ResearchRAIL |
SFT | https://huggingface.co/datasets/amd/Instella-Long/tree/main/sft | ResearchRAIL |
DPO | [https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix](https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix) | ODC-BY-1.0 |
⚠️ Important Note
Further information regarding the training datasets, including applicable licensing terms and use restrictions, can be found at the linked source location.
🔧 Technical Details
The release of the Instella-Long model is a significant step forward in advancing open-source AI and showcases the capabilities of AMD hardware for language model training. To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch to support long contexts, while achieving performance competitive with open-weight models.
By fully open-sourcing the Instella-Long model, including weights, training configurations, datasets, and code, we aim to foster innovation and collaboration within the AI community. We believe that transparency, reproducibility, and community participation are crucial for the advancement of AI technology.
📄 License
The license information can be found here. The license type is "other".

