🚀 Instella-Long✨: Fully Open Language Model with Long-context Capability
AMD is thrilled to unveil Instella-Long, a long-context language model continually trained from Instella-3B-Instruct on AMD Instinct™ MI300X GPUs. To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch with long-context support. Instella-Long handles a 128K context length and achieves competitive performance, outperforming open-weight models such as Phi-3.5-mini, Gemma-3-4B, and Qwen2.5-3B on long-context benchmarks.
Training Instella with long-context extension on Instinct MI300X GPUs demonstrates the capability and scalability of AMD hardware for demanding AI training workloads, offering a viable option in the AI hardware market. In line with AMD's commitment to open source, we are sharing all model weights, detailed training configurations, datasets, and code, enabling the AI community to collaborate, replicate, and innovate, thereby accelerating progress.
✨ Features
- New Model Announcement: AMD has developed Instella-Long, a 3B long-context language model supporting 128K context length, trained on 64 Instinct MI300X GPUs.
- Fully Open and Long-context Support: To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch with long-context support. The Hugging Face model, training data, and training code are all fully open-sourced.
- Efficient Training Techniques: Backed by the AMD ROCm software stack, Instella-Long uses efficient training techniques such as sequence parallelism, FlashAttention-2, Torch Compile, and FSDP to distribute model training across 8 MI300X nodes, each with 8 GPUs (a minimal sketch of such a setup follows this list).
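For context, the sketch below shows how a Hugging Face causal LM can be wrapped with FSDP and Torch Compile in plain PyTorch. It is a minimal illustration, not the actual Instella-Long training code: the wrapping policy, precision settings, and launch setup (torchrun, one process per GPU) are assumptions.

```python
# Minimal sketch: FSDP + torch.compile for distributed training.
# Assumptions: launched with torchrun, one process per GPU; the mixed-precision
# settings below are illustrative, not the exact Instella-Long configuration.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")  # ROCm builds of PyTorch also use the "nccl" backend (RCCL)
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = AutoModelForCausalLM.from_pretrained(
    "amd/Instella-3B-Long-Instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Shard parameters, gradients, and optimizer state across GPUs
model = FSDP(
    model,
    device_id=local_rank,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16, reduce_dtype=torch.bfloat16),
)

# Torch Compile for kernel fusion / graph-level optimization
model = torch.compile(model)
```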
🚀 Quick Start
Instella-Long
Instella-Long is based on the Instella model released in March. Specifically, it is continually trained from Instella-3B-Instruct and follows the same model architecture. The training of Instella-Long consists of three stages:
- Continued Pre-Training
- Training: We conduct a two-phase continued pre-training starting from Instella-3B-Instruct (4K context length).
- Phase 1: We extend the context length from 4,096 to 65,536 tokens and train the model on 20B tokens. Following the RoPE scaling law, we increase the RoPE base frequency from 10,000 to 514,640 (a short sketch of how this changes the RoPE frequencies follows the data table below).
- Phase 2: As suggested by ProLong, it is beneficial to train the model on data with a context length longer than the target context length. In this phase, we train the model on 20B tokens with a maximum context length of 262,144 (2× the target context length of 128K) and increase the RoPE base frequency to 3,691,950 according to the RoPE scaling law.
- Data: Our continued pre-training data comes from the data mix created by ProLong. We use the text data curated by ProLong and tokenize it with our tokenizer. In each phase of the continued pre-training, we train on a mix of long- and short-context data. The specific details are as follows:
Training Phase | 64K Long Data | 256K Long Data | Short Data |
---|---|---|---|
Phase 1 | Code repos (30%), Books (30%), Textbooks (3%) | - | FineWeb-Edu (10%), FineWeb (10%), StackExchange (4%), Wikipedia (5%), ArXiv (3%), OpenWebMath (5%) |
Phase 2 | Code repos (10%), Books (15%) | Code repos (20%), Books (15%), Textbooks (2%) | FineWeb-Edu (10%), FineWeb (10%), StackExchange (4%), Wikipedia (5%), ArXiv (4%), OpenWebMath (5%) |
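As a side note on the RoPE base change above: increasing the base lowers the rotation frequencies, so positions far beyond the original 4K window produce angles comparable to those seen during shorter-context training. Below is a minimal sketch of that computation assuming the standard RoPE parameterization; the base values are the ones reported above, while the head dimension is an illustrative placeholder, not necessarily the model's actual value.

```python
# Sketch: how the RoPE inverse frequencies change when the base frequency is
# increased for context extension. The base values (10,000 -> 514,640 -> 3,691,950)
# are the ones used for Instella-Long; the head dimension is a placeholder.
import torch

def rope_inv_freq(base: float, head_dim: int) -> torch.Tensor:
    # Standard RoPE: theta_i = base^(-2i / head_dim) for i = 0 .. head_dim/2 - 1
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim = 128  # assumption for illustration
for base in (10_000, 514_640, 3_691_950):
    inv_freq = rope_inv_freq(base, head_dim)
    # A larger base slows the lowest-frequency rotation, keeping angles at long
    # positions within a range comparable to shorter-context training.
    print(f"base={base:>9,d}  slowest rotation period ≈ {2 * torch.pi / inv_freq[-1]:.0f} tokens")
```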
- Supervised Finetuning (SFT)
- Training: After continued training on the long-context pre-training data, we perform supervised finetuning on long-context instruction data. We train the model on a 1B-token mixture of short- and long-context instruction data.
- Data: Similar to the continued pre-training stage, we train the model on a mixture of short- and long-context instruction data with a ratio of 4 to 6. For short-context instruction data, we use Ultrachat 200K, [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2), [Tülu-3 Instruction Following](https://huggingface.co/datasets/allenai/tulu-3-sft-personas-instruction-following), and the MMLU auxiliary train set. For long-context instruction data, we construct a synthetic long-context instruction dataset due to the lack of long-context SFT data.
- We use long documents from Books in our continued pre-training corpus. We select documents with a minimum length of 8K tokens and truncate those exceeding 128K tokens to a maximum length of 128K. Then, we use [Qwen2.5-14B-Instruct-1M](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-1M) as a teacher model to synthetically generate question-answer pairs for the documents. To speed up the process, we randomly choose a sub-part of the document for QA generation, with the length of the sub-part randomly set between 2K and 8K tokens. We use the NLTK sentence tokenizer to ensure the selected sub-part contains complete sentences. The generated question and answer are appended to the end of the long document to form a complete single-round instruction-following sample (a sketch of this pipeline follows the SFT data table below).
- We also generate long-context instruction data using short documents to increase dataset diversity. We use ArXiv from our continued pre-training corpus and the DCLM subset from [Dolmino-Mix-1124](https://huggingface.co/datasets/allenai/dolmino-mix-1124). We first generate QA pairs for each short document following the same pipeline. Then, we iteratively concatenate different short documents until the concatenation reaches 128K tokens. The concatenated document can exceed 128K tokens, as we do not truncate the last document. Finally, we randomly choose one QA pair corresponding to one of the short documents and append it to the end of the concatenated document. The final data mixture for the SFT stage is as follows:
Short Data | Long Data |
---|---|
Ultrachat 200K (25%), OpenMathInstruct-2 (10%), MMLU auxiliary train set (3%), Tülu-3 Instruction Following (2%) | Books (44%), DCLM (10%), ArXiv (6%) |
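The synthetic QA construction described above can be sketched as follows, assuming a generic `teacher_generate` callable that stands in for querying Qwen2.5-14B-Instruct-1M; the prompt wording and helper names are hypothetical, while the 2K–8K sub-part range and NLTK sentence splitting follow the description above.

```python
# Sketch of the synthetic long-context SFT pipeline described above:
# pick a 2K-8K-token sub-part made of complete sentences, ask a teacher model
# for a QA pair about it, and append the QA to the end of the full document.
# Prompt wording and helper names are hypothetical.
import random
import nltk  # requires: nltk.download("punkt")

def sample_subpart(document: str, tokenizer, min_tokens=2048, max_tokens=8192) -> str:
    # Split into sentences so the sub-part contains only complete sentences
    sentences = nltk.sent_tokenize(document)
    target = random.randint(min_tokens, max_tokens)
    start = random.randrange(len(sentences))
    picked, n_tokens = [], 0
    for sent in sentences[start:]:
        picked.append(sent)
        n_tokens += len(tokenizer.encode(sent, add_special_tokens=False))
        if n_tokens >= target:
            break  # simplification: may end short if the document runs out
    return " ".join(picked)

def build_sft_sample(document: str, tokenizer, teacher_generate) -> dict:
    subpart = sample_subpart(document, tokenizer)
    # teacher_generate is a stand-in for prompting the teacher model
    question, answer = teacher_generate(
        f"Write one question and its answer about the following text:\n\n{subpart}"
    )
    # Single-round sample: full long document + generated question, answer as target
    return {
        "messages": [
            {"role": "user", "content": f"{document}\n\n{question}"},
            {"role": "assistant", "content": answer},
        ]
    }
```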
- Direct Preference Optimization (DPO)
- Training: In the final training stage, we perform human preference alignment using Direct Preference Optimization (DPO). We use the same DPO training recipe and data as Instella-3B-Instruct. Unlike the previous training stages, the DPO stage trains only on short data with a maximum context length of 2K. Consistent with the findings of other open-weight models, we observe that performing DPO only on short data still improves the model's performance on long-context tasks (the standard DPO objective is sketched after this list).
- Data: We use the [OLMo-2-1124-7B-Preference-Mix](https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix) dataset as our DPO data, which contains 0.76B tokens.
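For reference, the sketch below shows the standard DPO objective (Rafailov et al., 2023) that this stage optimizes; it is a generic formulation, not the exact Instella training code, and the `beta` value is an illustrative assumption.

```python
# Sketch of the standard DPO loss used for preference alignment.
# The log-probabilities are summed over response tokens; beta is the usual
# DPO temperature (the value here is an illustrative assumption).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward margins of the policy relative to the frozen reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```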
Sequence Parallelism
To enable training with extremely long inputs, we implement sequence parallelism based on DeepSpeed Ulysses. During the attention computation, the sequence parallelism distributes the attention heads across GPUs so that each GPU attends over the full sequence for a subset of heads. This requires less GPU communication than Ring-Attention. We use four GPUs as a sequence-parallelism group for the Phase 2 continued pre-training and SFT because of the long inputs.
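The sketch below illustrates the Ulysses-style redistribution: each rank starts with a slice of the sequence for all heads, and an all-to-all regroups the data so that each rank holds the full sequence for a subset of heads before attention (the inverse exchange happens afterwards). Shapes and the helper name are illustrative assumptions, not the actual Instella implementation.

```python
# Sketch of DeepSpeed-Ulysses-style sequence parallelism: before attention,
# an all-to-all turns "local sequence slice x all heads" into
# "full sequence x local head slice", so each GPU runs full-sequence attention
# on a subset of heads. Shapes and names are illustrative.
import torch
import torch.distributed as dist

def seq_to_head_parallel(x: torch.Tensor, sp_group) -> torch.Tensor:
    # x: [batch, local_seq, num_heads, head_dim], sharded over the sequence dim;
    # num_heads must be divisible by the sequence-parallel group size.
    sp_size = dist.get_world_size(group=sp_group)
    b, local_seq, num_heads, head_dim = x.shape
    # Split heads into sp_size chunks; each chunk is destined for a different rank
    x = x.reshape(b, local_seq, sp_size, num_heads // sp_size, head_dim)
    x = x.permute(2, 0, 1, 3, 4).contiguous()      # [sp, b, local_seq, heads/sp, head_dim]
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=sp_group)  # exchange sequence slices <-> head slices
    # Concatenate the gathered sequence slices in rank order: [b, sp*local_seq, heads/sp, head_dim]
    return out.permute(1, 0, 2, 3, 4).reshape(b, sp_size * local_seq, num_heads // sp_size, head_dim)
```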
💻 Usage Examples
Basic Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the released checkpoint (trust_remote_code is required for the custom architecture)
checkpoint = "amd/Instella-3B-Long-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", trust_remote_code=True)

# Build a chat-formatted prompt and generate a response
prompt = [{"role": "user", "content": "What are the benefits of open-source AI research?"}]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)

tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)

print(tokenizer.decode(tokens[0], skip_special_tokens=False))
```
📚 Documentation
Results
- Long-context Evaluation: We evaluate long-context performance on [HELMET](https://princeton-nlp.github.io/HELMET/), a recent and comprehensive long-context evaluation benchmark covering diverse categories. HELMET shows better consistency with human perception than previous long-context benchmarks.
- Performance Comparison: Instella-3B-Long-Instruct outperforms open-weight models such as Phi-3.5-mini-instruct, Gemma-3-4B-it, Qwen2.5-3B-Instruct, and MiniCPM-2B-128k on most tasks of the HELMET benchmark.
- Side-by-Side Comparison: We compare Instella-3B-Long-Instruct with Qwen2.5-3B-Instruct (whose maximum context length is 32K) at 8K, 16K, and 32K context lengths. Instella-3B-Long-Instruct outperforms Qwen2.5-3B-Instruct by 2.75% on average.
Models | Size | Training Tokens (from scratch) | Natural Questions (RAG) | TriviaQA (RAG) | HotpotQA (RAG) | InfiniteBench QA | InfiniteBench MC | NarrativeQA | NIAH (multi-value needles) | Average |
---|---|---|---|---|---|---|---|---|---|---|
**Open Weight Models** | | | | | | | | | | |
Llama-3.2-3B-Instruct | 3.21B | ~9T | 51.8 | 86.2 | 56.4 | 38.7 | 56.0 | 26.0 | 99.2 | 59.19 |
Phi-3.5-mini-instruct | 3.82B | - | 41.2 | 78.6 | 48.6 | 24.0 | 55.0 | 27.7 | 87.0 | 51.73 |
gemma-3-4b-it | 4.3B | ~4T | 47.2 | 76.8 | 45.2 | 21.0 | 49.0 | 20.7 | 74.0 | 47.70 |
Qwen2.5-3B-Instruct | 3.09B | ~18T | 34.6 | 65.8 | 41.8 | 14.7 | 35.0 | 21.0 | 80.4 | 41.90 |
MiniCPM-2B-128k | 2.4B | ~1T | 28.4 | 61.6 | 30.8 | 3.7 | 22.0 | 3.3 | 46.6 | 28.06 |
**Fully Open Models** | | | | | | | | | | |
Instella-3B-Long-Instruct | 3.11B | ~4T | 43.6 | 73.0 | 51.6 | 30.7 | 54.0 | 32.3 | 84.0 | 52.74 |
Table 1: Long-context evaluation on the HELMET benchmark. The NIAH and RAG tasks are evaluated at five context lengths (8K, 16K, 32K, 64K, and 128K), and the reported number is the average across the five context lengths. InfiniteBench QA, InfiniteBench MC, and NarrativeQA are evaluated at a 128K context length. The InfiniteBench tasks use HELMET's reimplementation.
Model | NIAH 8K | NIAH 16K | NIAH 32K | Natural Questions (RAG) 8K | Natural Questions (RAG) 16K | Natural Questions (RAG) 32K | TriviaQA (RAG) 8K | TriviaQA (RAG) 16K | TriviaQA (RAG) 32K | HotpotQA (RAG) 8K | HotpotQA (RAG) 16K | HotpotQA (RAG) 32K | Average |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Instella-3B-Long-Instruct | 98 | 95 | 87 | 53 | 49 | 46 | 79 | 73 | 75 | 59 | 59 | 51 | 68.67 |
Qwen2.5-3B-Instruct | 95 | 94 | 95 | 48 | 42 | 39 | 77 | 78 | 74 | 51 | 50 | 48 | 65.92 |
Table 2: Comparison with Qwen2.5-3B-Instruct at 8K, 16K, and 32K context lengths. NIAH refers to the multi-value needles task.
Evaluation Metric: We use substring exact match (SubEM) for the RAG tasks (Natural Questions, TriviaQA, and HotpotQA), recall for NIAH, and exact match for InfiniteBench MC. For InfiniteBench QA and NarrativeQA, where the answers are open-ended, we use gpt-4o-mini to judge the answers against the ground truth using the prompt and metric provided by HELMET.
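For clarity, substring exact match simply checks whether any gold answer appears verbatim in the (normalized) model output. A minimal sketch is below; the normalization shown is a common simple choice and an assumption rather than HELMET's exact implementation.

```python
# Sketch of substring exact match (SubEM), as used for the RAG tasks.
# The normalization here is a simple common choice, not necessarily HELMET's exact one.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def sub_em(prediction: str, gold_answers: list[str]) -> float:
    pred = normalize(prediction)
    return float(any(normalize(ans) in pred for ans in gold_answers))

# Example: counts as correct because the gold answer is contained in the prediction
print(sub_em("The capital of France is Paris.", ["Paris"]))  # 1.0
```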
Models | MMLU | IFEval | MT-Bench | TruthfulQA | Toxigen (↓) | CrowS-Pairs |
---|---|---|---|---|---|---|
Instella-3B-Instruct | 58.90 | 71.35 | 7.23 | 55.47 | 57.02 | 58.86 |
Instella-3B-Long-Instruct | 57.44 | 68.76 | 6.83 | 55.52 | 42.34 | 60.05 |
Table 3: Short-context benchmark comparison with Instella-3B-Instruct.
Short-context Results: We observe performance drops on some short-context benchmarks compared to Instella-3B-Instruct. Interestingly, TruthfulQA remains stable, while CrowS-Pairs shows a slight improvement, indicating potential gains on certain responsible-AI metrics. The reduction in Toxigen (57.02 → 42.34, lower is better) suggests improved toxicity avoidance in the long-context variant. We hypothesize that these results reflect a trade-off between optimizing for longer context lengths and retaining short-context performance, which may be more pronounced at the 3B parameter scale than in larger models.
Training Data
Stage | Dataset | License |
---|---|---|
Continued Pre-Training - Phase 1 | https://huggingface.co/datasets/amd/Instella-Long/tree/main/pretrain-phase-1 | ResearchRAIL |
Continued Pre-Training - Phase 2 | https://huggingface.co/datasets/amd/Instella-Long/tree/main/pretrain-phase-2 | ResearchRAIL |
SFT | https://huggingface.co/datasets/amd/Instella-Long/tree/main/sft | ResearchRAIL |
DPO | [https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix](https://huggingface.co/datasets/allenai/olmo-2-1124-7b-preference-mix) | ODC-BY-1.0 |
⚠️ Important Note
Further information regarding the training datasets, including applicable licensing terms and use restrictions, can be found at the linked source location.
🔧 Technical Details
The release of the Instella-Long model is a significant step forward in advancing open-source AI and showcases the capabilities of AMD hardware for language model training. To our knowledge, Instella-Long makes the Instella series the first fully open language model trained from scratch to support long contexts, while achieving performance competitive with open-weight models.
By fully open-sourcing the Instella-Long model, including weights, training configurations, datasets, and code, we aim to foster innovation and collaboration within the AI community. We believe that transparency, reproducibility, and community participation are crucial for the advancement of AI technology.
📄 License
The license information can be found here. The license type is "other".

