🚀 NorMistral-11b-warm
NorMistral-11b-warm is a large Norwegian language model. Initialized from Mistral-Nemo-Base-2407, it is continually pretrained on 250 billion subword tokens. The data mix includes Scandinavian, Sámi, English, and code data. It is introduced in the paper Small Languages, Big Models: A Study of Continual Training on Languages of Norway and is part of the NORA.LLM family developed by the Language Technology Group (LTG) at the University of Oslo.
Disclaimer: This model is pretrained on raw textual data. It is not finetuned to follow instructions and can generate harmful completions; it is intended primarily for research purposes.
🚀 Quick Start
The NorMistral-11b-warm model offers both causal language generation and bidirectional masked language modeling capabilities. You can use it for various natural language processing tasks, such as translation and text completion.
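For a quick first test of text completion, here is a minimal sketch using the transformers text-generation pipeline. The prompt and generation settings are illustrative assumptions rather than recommendations from the model authors, and `device_map="auto"` additionally requires the accelerate package:

```python
# Minimal text-completion sketch via the transformers pipeline API.
# The prompt and generation parameters below are illustrative assumptions.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="norallm/normistral-11b-warm",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)
print(generator("Oslo er hovedstaden i", max_new_tokens=20, do_sample=False)[0]["generated_text"])
```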
✨ Features
- Multilingual Pretraining: Trained on a diverse dataset including Norwegian, Sámi, and other Scandinavian languages, as well as English and code data.
- Hybrid Training: Utilizes a combination of causal and masked training objectives, enabling bidirectional text processing.
- Efficient Tokenizer: A custom tokenizer trained for the target languages; it produces markedly shorter token sequences for Norwegian and Sámi text, which translates into faster inference than with the base model's tokenizer.
- Flexible Usage: Can be used as a causal generative model or as a bidirectional encoder model, and can be finetuned for downstream tasks in the same way as BERT-style encoders (see the classification sketch at the end of the usage examples).
📦 Installation
To use NorMistral-11b-warm, you need the `transformers` library and PyTorch. You can install both with pip:

```bash
pip install transformers torch
```
💻 Usage Examples
Basic Usage
Causal Language Model for Translation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b-warm").cuda().eval()

# Zero-shot English-to-Bokmål translation prompt
prompt = """Engelsk: {0}
Bokmål:"""

# Stop generation at any token whose decoded form contains a newline
eos_token_ids = [
    token_id
    for token_id in range(tokenizer.vocab_size)
    if '\n' in tokenizer.decode([token_id])
]

@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=eos_token_ids
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

generate("I'm excited to try this new Norwegian language model!")
```
Memory-Efficient Loading
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")

# 8-bit quantized loading (requires the bitsandbytes and accelerate packages)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map='auto',
    load_in_8bit=True,
    torch_dtype=torch.bfloat16
)

# ... or 4-bit quantized loading
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map='auto',
    load_in_4bit=True,
    torch_dtype=torch.bfloat16
)
```
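On recent transformers releases, the `load_in_8bit` / `load_in_4bit` arguments are deprecated in favor of an explicit quantization config. A minimal 4-bit sketch, assuming a recent transformers version with bitsandbytes and accelerate installed:

```python
# Equivalent 4-bit loading via an explicit BitsAndBytesConfig
# (assumes a recent transformers release with bitsandbytes and accelerate installed).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map="auto",
    quantization_config=bnb_config,
)
```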
Bidirectional Masked Language Modeling
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
    "norallm/normistral-11b-warm"
)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm"
).cuda().eval()

text = "En søt lundefugl flyr over de<mask>norske fjorder."
input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()

# An all-zero additive 4D attention mask makes the attention fully bidirectional
attention_mask = torch.zeros(input_ids.size(0), 1, input_ids.size(1), input_ids.size(1), device=input_ids.device)

output_logits = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_dict=True
).logits
predictions = output_logits[0, :, :].argmax(dim=-1)

# The model predicts the *next* token at every position, so the predictions are shifted by one
print(f"{tokenizer.decode(input_ids[0, 1:])} -> {tokenizer.decode(predictions[:-1])}")
```
📚 Documentation
Pretraining Corpus
The model is pretrained on a combination of publicly available data and a custom web crawl for Sámi. The total training corpus consists of 250 billion tokens from the following sources:
- Norwegian Text: A collection from the National Library of Norway, including parts of the Norwegian Colossal Corpus (NCC), CulturaX, and HPLT corpus v1.2.
- Northern Sámi Texts: Sourced from Glot500, the SIKOR North Saami free corpus, and a custom web crawl (ltg/saami-web).
- Additional Languages: Danish, Swedish, Icelandic, Faroese from CulturaX and Glot500, high-quality English from FineWeb-edu, and programming code from The Stack v2.
Tokenizer
The model uses a custom tokenizer trained for the target languages. Here are the subword-to-word split ratios across different languages:
| Tokenizer | Vocabulary size | Bokmål | Nynorsk | Sámi | Danish | Swedish |
|---|---|---|---|---|---|---|
| Mistral-Nemo-Base-2407 | 131,072 | 1.79 | 1.87 | 2.63 | 1.82 | 2.00 |
| NorMistral-11b-warm | 51,200 | 1.22 | 1.28 | 1.82 | 1.33 | 1.39 |
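You can reproduce a rough version of this comparison yourself. The sketch below measures the subword-to-word ratio on a single sample sentence of our own choosing; note that downloading the Mistral-Nemo-Base-2407 tokenizer from the Hugging Face Hub may require accepting its license.

```python
# Rough subword-to-word ratio comparison on one sample Bokmål sentence.
from transformers import AutoTokenizer

sample = "Regjeringen la fram forslaget til statsbudsjett for neste år."
for name in ["norallm/normistral-11b-warm", "mistralai/Mistral-Nemo-Base-2407"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    n_subwords = len(tokenizer(sample, add_special_tokens=False).input_ids)
    n_words = len(sample.split())
    print(f"{name}: {n_subwords / n_words:.2f} subwords per word")
```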
Evaluation
More details about the evaluation setup and the new Norwegian benchmarks will be described in upcoming papers.
Model Details
| Property | Details |
|---|---|
| Model Developers | Language Technology Group (LTG) at the University of Oslo, in collaboration with NORA.LLM |
| Architecture | Mistral architecture based on an improved Llama design: pre-normalization, SwiGLU activation, rotary positional embeddings, grouped-query attention; 40 transformer layers, hidden dimension 5,120, intermediate dimension 14,336, 32 query heads and 8 key & value heads (head dimension 128), vocabulary of 51,200 tokens, and 11.4 billion total parameters |
| Training Details | Training tokens: 250 billion; batch size: 1,024 × 4,096 tokens; training steps: 60,000; peak learning rate: 1e-4; warm-up steps: 1,000; learning rate decay steps: 10,000; optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8); weight decay: 0.1; training precision: bfloat16; hardware: 256 AMD MI250X GPUs (128 GB); training time: 8.5 days; theoretical computation: 2.0e22 FLOP; model FLOPs utilization (MFU): 38% |
| Unique Features | Hybrid masked-causal training (90% causal LM, 10% masked next-token prediction); usable both as a causal generative model and as a bidirectional encoder; three-stage continual pretraining (tokenizer optimization, embedding weight realignment, full model training) |
| Base Model | Initialized from Mistral-Nemo-Base-2407 |
| License | Apache 2.0 |
🔧 Technical Details
- Hybrid Training: The model uses a combination of causal and masked training objectives, allowing it to process text bidirectionally.
- Three-Stage Continual Pretraining: The model undergoes tokenizer optimization, embedding weight realignment, and full model training during pretraining.
- Efficient Tokenizer: The custom tokenizer is trained specifically for the target languages, producing shorter token sequences and therefore faster inference than the base model's tokenizer.
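To make the hybrid objective more concrete, here is a toy sketch of how a masked next-token batch could be constructed. This is our own illustration of the idea; the masking rate and other details are assumptions, not the released training code.

```python
# Toy illustration of masked next-token prediction (our own assumption of the idea,
# not the actual training code): a fraction of the input tokens is replaced by <mask>,
# while the targets remain the ordinary next-token labels.
import torch

def masked_next_token_batch(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    labels = input_ids.clone()  # standard next-token targets (shifted inside the model)
    corrupted = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float32) < mask_prob  # assumed masking rate
    corrupted[mask] = mask_token_id  # hide the selected tokens in the input
    return corrupted, labels
```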
📄 License
We release the model weights under the Apache 2.0 license, which places no additional constraints on their use. Note, however, that we do not own the data in the training collection.
Citation
```bibtex
@misc{samuel2025smalllanguagesbigmodels,
    title={Small Languages, Big Models: A Study of Continual Training on Languages of Norway},
    author={David Samuel and Vladislav Mikhailov and Erik Velldal and Lilja Øvrelid and Lucas Georges Gabriel Charpentier and Andrey Kutuzov and Stephan Oepen},
    year={2025},
    eprint={2412.06484},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2412.06484},
}
```
Contact
Please write a community message or contact David Samuel (davisamu@ifi.uio.no) if you have any questions about this model.