🚀 NorMistral-11b-warm
NorMistral-11b-warm is a large Norwegian language model. Initialized from Mistral-Nemo-Base-2407, it is continually pretrained on 250 billion subword tokens. The data mix includes Scandinavian, Sámi, English, and code data. It is introduced in the paper Small Languages, Big Models: A Study of Continual Training on Languages of Norway and is part of the NORA.LLM family developed by the Language Technology Group (LTG) at the University of Oslo.
Disclaimer: This model is pretrained on raw textual data. It is not finetuned to follow instructions and can generate harmful completions; it is intended primarily for research purposes.
🚀 Quick Start
The NorMistral-11b-warm model offers both causal language generation and bidirectional masked language modeling capabilities. You can use it for various natural language processing tasks, such as translation and text completion.
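For a quick first test of text completion, here is a minimal sketch using the transformers text-generation pipeline. The prompt and generation settings are illustrative assumptions rather than recommendations from the model authors, and `device_map="auto"` additionally requires the accelerate package:

```python
# Minimal text-completion sketch via the transformers pipeline API.
# The prompt and generation parameters below are illustrative assumptions.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="norallm/normistral-11b-warm",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package
)
print(generator("Oslo er hovedstaden i", max_new_tokens=20, do_sample=False)[0]["generated_text"])
```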
✨ Features
- Multilingual Pretraining: Trained on a diverse dataset including Norwegian, Sámi, and other Scandinavian languages, as well as English and code data.
- Hybrid Training: Utilizes a combination of causal and masked training objectives, enabling bidirectional text processing.
- Efficient Tokenizer: A custom tokenizer trained for the target languages; it produces markedly shorter token sequences for Norwegian and Sámi text, which translates into faster inference than with the base model's tokenizer.
- Flexible Usage: Can be used as a causal generative model or as a bidirectional encoder model, and can be finetuned for downstream tasks in the same way as BERT-style encoders (see the classification sketch at the end of the usage examples).
📦 Installation
To use NorMistral-11b-warm, you need the `transformers` library and PyTorch. You can install both with pip:

```bash
pip install transformers torch
```
💻 Usage Examples
Basic Usage
Causal Language Model for Translation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")
model = AutoModelForCausalLM.from_pretrained("norallm/normistral-11b-warm").cuda().eval()

# Zero-shot English-to-Bokmål translation prompt
prompt = """Engelsk: {0}
Bokmål:"""

# Stop generation at any token whose decoded form contains a newline
eos_token_ids = [
    token_id
    for token_id in range(tokenizer.vocab_size)
    if '\n' in tokenizer.decode([token_id])
]

@torch.no_grad()
def generate(text):
    text = prompt.format(text)
    input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()
    prediction = model.generate(
        input_ids,
        max_new_tokens=64,
        do_sample=False,
        eos_token_id=eos_token_ids
    )
    return tokenizer.decode(prediction[0, input_ids.size(1):]).strip()

generate("I'm excited to try this new Norwegian language model!")
```
Memory-Efficient Loading
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-11b-warm")

# 8-bit quantized loading (requires the bitsandbytes and accelerate packages)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map='auto',
    load_in_8bit=True,
    torch_dtype=torch.bfloat16
)

# ... or 4-bit quantized loading
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map='auto',
    load_in_4bit=True,
    torch_dtype=torch.bfloat16
)
```
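On recent transformers releases, the `load_in_8bit` / `load_in_4bit` arguments are deprecated in favor of an explicit quantization config. A minimal 4-bit sketch, assuming a recent transformers version with bitsandbytes and accelerate installed:

```python
# Equivalent 4-bit loading via an explicit BitsAndBytesConfig
# (assumes a recent transformers release with bitsandbytes and accelerate installed).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm",
    device_map="auto",
    quantization_config=bnb_config,
)
```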
Bidirectional Masked Language Modeling
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(
    "norallm/normistral-11b-warm"
)
model = AutoModelForCausalLM.from_pretrained(
    "norallm/normistral-11b-warm"
).cuda().eval()

text = "En søt lundefugl flyr over de<mask>norske fjorder."
input_ids = tokenizer(text, return_tensors='pt').input_ids.cuda()

# An all-zero additive 4D attention mask makes the attention fully bidirectional
attention_mask = torch.zeros(input_ids.size(0), 1, input_ids.size(1), input_ids.size(1), device=input_ids.device)

output_logits = model(
    input_ids=input_ids,
    attention_mask=attention_mask,
    return_dict=True
).logits
predictions = output_logits[0, :, :].argmax(dim=-1)

# The model predicts the *next* token at every position, so the predictions are shifted by one
print(f"{tokenizer.decode(input_ids[0, 1:])} -> {tokenizer.decode(predictions[:-1])}")
```
📚 Documentation
Pretraining Corpus
The model is pretrained on a combination of publicly available data and a custom web crawl for Sámi. The total training corpus consists of 250 billion tokens from the following sources:
- Norwegian Text: A collection from the National Library of Norway, including parts of the Norwegian Colossal Corpus (NCC), CulturaX, and HPLT corpus v1.2.
- Northern Sámi Texts: Sourced from Glot500, the SIKOR North Saami free corpus, and a custom web crawl (ltg/saami-web).
- Additional Languages: Danish, Swedish, Icelandic, Faroese from CulturaX and Glot500, high-quality English from FineWeb-edu, and programming code from The Stack v2.
Tokenizer
The model uses a custom tokenizer trained for the target languages. Here are the subword-to-word split ratios across different languages:
| Tokenizer | Vocabulary size | Bokmål | Nynorsk | Sámi | Danish | Swedish |
|---|---|---|---|---|---|---|
| Mistral-Nemo-Base-2407 | 131,072 | 1.79 | 1.87 | 2.63 | 1.82 | 2.00 |
| NorMistral-11b-warm | 51,200 | 1.22 | 1.28 | 1.82 | 1.33 | 1.39 |
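You can reproduce a rough version of this comparison yourself. The sketch below measures the subword-to-word ratio on a single sample sentence of our own choosing; note that downloading the Mistral-Nemo-Base-2407 tokenizer from the Hugging Face Hub may require accepting its license.

```python
# Rough subword-to-word ratio comparison on one sample Bokmål sentence.
from transformers import AutoTokenizer

sample = "Regjeringen la fram forslaget til statsbudsjett for neste år."
for name in ["norallm/normistral-11b-warm", "mistralai/Mistral-Nemo-Base-2407"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    n_subwords = len(tokenizer(sample, add_special_tokens=False).input_ids)
    n_words = len(sample.split())
    print(f"{name}: {n_subwords / n_words:.2f} subwords per word")
```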
Evaluation
More details about the evaluation setup and the new Norwegian benchmarks will be described in upcoming papers.
Model Details
| Property | Details |
|---|---|
| Model Developers | Language Technology Group (LTG) at the University of Oslo, in collaboration with NORA.LLM |
| Architecture | Mistral architecture based on an improved Llama design: pre-normalization, SwiGLU activation, rotary positional embeddings, grouped-query attention; 40 transformer layers, hidden dimension 5,120, intermediate dimension 14,336, 32 query heads and 8 key & value heads (head dimension 128), vocabulary of 51,200 tokens, and 11.4 billion total parameters |
| Training Details | Training tokens: 250 billion; batch size: 1,024 × 4,096 tokens; training steps: 60,000; peak learning rate: 1e-4; warm-up steps: 1,000; learning rate decay steps: 10,000; optimizer: AdamW (β₁=0.9, β₂=0.95, ε=1e-8); weight decay: 0.1; training precision: bfloat16; hardware: 256 AMD MI250X GPUs (128 GB); training time: 8.5 days; theoretical computation: 2.0e22 FLOP; model FLOPs utilization (MFU): 38% |
| Unique Features | Hybrid masked-causal training (90% causal LM, 10% masked next-token prediction); usable both as a causal generative model and as a bidirectional encoder; three-stage continual pretraining (tokenizer optimization, embedding weight realignment, full model training) |
| Base Model | Initialized from Mistral-Nemo-Base-2407 |
| License | Apache 2.0 |
🔧 Technical Details
- Hybrid Training: The model uses a combination of causal and masked training objectives, allowing it to process text bidirectionally.
- Three-Stage Continual Pretraining: The model undergoes tokenizer optimization, embedding weight realignment, and full model training during pretraining.
- Efficient Tokenizer: The custom tokenizer is trained specifically for the target languages, producing shorter token sequences and therefore faster inference than the base model's tokenizer.
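To make the hybrid objective more concrete, here is a toy sketch of how a masked next-token batch could be constructed. This is our own illustration of the idea; the masking rate and other details are assumptions, not the released training code.

```python
# Toy illustration of masked next-token prediction (our own assumption of the idea,
# not the actual training code): a fraction of the input tokens is replaced by <mask>,
# while the targets remain the ordinary next-token labels.
import torch

def masked_next_token_batch(input_ids: torch.Tensor, mask_token_id: int, mask_prob: float = 0.15):
    labels = input_ids.clone()  # standard next-token targets (shifted inside the model)
    corrupted = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float32) < mask_prob  # assumed masking rate
    corrupted[mask] = mask_token_id  # hide the selected tokens in the input
    return corrupted, labels
```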
📄 License
We release the model weights under the Apache 2.0 license, which places no additional constraints on their use. Note, however, that we do not own the data in the training collection.
Citation
```bibtex
@misc{samuel2025smalllanguagesbigmodels,
    title={Small Languages, Big Models: A Study of Continual Training on Languages of Norway},
    author={David Samuel and Vladislav Mikhailov and Erik Velldal and Lilja Øvrelid and Lucas Georges Gabriel Charpentier and Andrey Kutuzov and Stephan Oepen},
    year={2025},
    eprint={2412.06484},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2412.06484},
}
```
Contact
Please write a community message or contact David Samuel (davisamu@ifi.uio.no) if you have any questions about this model.