🚀 Model Card for EuroLLM-9B-Instruct
This model card provides details about EuroLLM-9B-Instruct, a multilingual large language model. You can also explore its pre-trained counterpart, EuroLLM-9B.
🚀 Quick Start
To run the EuroLLM-9B-Instruct model, you can use the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "You are EuroLLM --- an AI assistant specialized in European languages that provides safe, educational and helpful answers.",
    },
    {
        "role": "user",
        "content": "What is the capital of Portugal? How would you describe it?",
    },
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
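If you have a GPU available, you will likely want to load the weights in bfloat16 (the precision used during training, see Model Details below) and let transformers place them on the device automatically. A minimal sketch, assuming a CUDA-capable machine and the accelerate package installed for device_map="auto":

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load in bfloat16 (the training precision) and let transformers pick the device(s).
# device_map="auto" requires the `accelerate` package to be installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "What is the capital of Portugal? How would you describe it?"},
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```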
✨ Features
- Multilingual Capability: Capable of understanding and generating text in all European Union languages as well as some additional relevant languages, including Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Arabic, Catalan, Chinese, Galician, Hindi, Japanese, Korean, Norwegian, Russian, Turkish, and Ukrainian.
- Instruction Tuned: EuroLLM-9B-Instruct was further instruction tuned on EuroBlocks, an instruction-tuning dataset with a focus on general instruction-following and machine translation.
📦 Installation
EuroLLM-9B-Instruct is distributed as a standard Hugging Face checkpoint, so no model-specific installation is required: a recent version of the transformers library with a PyTorch backend (for example, pip install transformers torch) is enough to run the snippets in this card. The GPU loading sketch above additionally requires the accelerate package.
💻 Usage Examples
Basic Usage
The basic chat example is identical to the Quick Start snippet above, so it is not repeated here; a machine-translation variant is sketched below.
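Machine Translation
Since EuroBlocks places particular emphasis on machine translation, a natural use of the model is translating between the supported languages through the same chat template. A minimal sketch, reusing the default load from the Quick Start; the prompt wording here is illustrative, not a format prescribed by the model card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Illustrative translation prompt; the exact wording is not prescribed anywhere.
messages = [
    {
        "role": "system",
        "content": "You are EuroLLM --- an AI assistant specialized in European languages "
                   "that provides safe, educational and helpful answers.",
    },
    {
        "role": "user",
        "content": "Translate the following English sentence into Portuguese:\n"
                   "The European Union has 24 official languages.",
    },
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens to get just the translation.
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```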
📚 Documentation
Model Details
The EuroLLM project aims to create a suite of LLMs capable of understanding and generating text in all European Union languages and some additional relevant languages. EuroLLM-9B is a 9B parameter model trained on 4 trillion tokens from various data sources such as web data, parallel data, and high-quality datasets. EuroLLM-9B-Instruct was further instruction tuned on EuroBlocks.
Model Description
EuroLLM uses a standard, dense Transformer architecture with the following features (the sketch after this list shows how to read these settings back from the released configuration):
- Grouped Query Attention (GQA): With 8 key-value heads to increase inference speed while maintaining downstream performance.
- Pre-layer Normalization: Using RMSNorm for improved training stability and faster computation.
- SwiGLU Activation Function: Shown to lead to good results on downstream tasks.
- Rotary Positional Embeddings (RoPE): Used in every layer to achieve good performance and allow context length extension.
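These architectural choices are reflected in the released model configuration. A minimal inspection sketch, assuming the checkpoint exposes a Llama-style config; the field names (num_key_value_heads, rope_theta, etc.) are standard for that config class rather than taken from this card, and the expected values come from the hyper-parameter table below:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("utter-project/EuroLLM-9B-Instruct")

# Field names below assume a Llama-style configuration class.
print("layers:          ", config.num_hidden_layers)              # expected 42
print("hidden size:     ", config.hidden_size)                    # expected 4,096
print("FFN hidden size: ", config.intermediate_size)              # expected 12,288
print("attention heads: ", config.num_attention_heads)            # expected 32
print("KV heads (GQA):  ", config.num_key_value_heads)            # expected 8
print("RoPE theta:      ", getattr(config, "rope_theta", None))   # expected 10,000
print("tied embeddings: ", config.tie_word_embeddings)            # expected False
```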
The model was pre-trained on 400 Nvidia H100 GPUs of the MareNostrum 5 supercomputer with a constant batch size of 2,800 sequences (approximately 12 million tokens), the Adam optimizer, and BF16 precision.
Here is a summary of the model hyper-parameters:
| Property | Details |
|---|---|
| Sequence Length | 4,096 |
| Number of Layers | 42 |
| Embedding Size | 4,096 |
| FFN Hidden Size | 12,288 |
| Number of Heads | 32 |
| Number of KV Heads (GQA) | 8 |
| Activation Function | SwiGLU |
| Position Encodings | RoPE (Θ = 10,000) |
| Layer Norm | RMSNorm |
| Tied Embeddings | No |
| Embedding Parameters | 0.524B |
| LM Head Parameters | 0.524B |
| Non-embedding Parameters | 8.105B |
| Total Parameters | 9.154B |
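The parameter counts in the table can be roughly reproduced from the other hyper-parameters. A back-of-the-envelope sketch, assuming a vocabulary of about 128,000 tokens (the vocabulary size is not listed in the table; it is inferred here from 0.524B embedding parameters divided by the embedding size) and a Llama-style layout with separate gate/up/down FFN projections and two RMSNorms per layer; the small residual gap to the reported 8.105B / 9.154B comes from rounding and counting conventions:

```python
# Back-of-the-envelope parameter count from the hyper-parameters above.
# The vocabulary size is an inference (0.524e9 / 4096 ≈ 128,000), not a value from the table.
vocab_size = 128_000
hidden = 4_096
ffn_hidden = 12_288
layers = 42
heads = 32
kv_heads = 8
head_dim = hidden // heads            # 128
kv_dim = kv_heads * head_dim          # 1,024

attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q and O projections + K and V (GQA)
ffn = 3 * hidden * ffn_hidden                      # SwiGLU: gate, up and down projections
norms = 2 * hidden                                 # two RMSNorms per layer
per_layer = attn + ffn + norms

non_embedding = layers * per_layer + hidden        # + final RMSNorm
embedding = vocab_size * hidden                    # input embeddings
lm_head = vocab_size * hidden                      # untied output head

print(f"non-embedding: {non_embedding / 1e9:.3f}B")                              # ≈ 8.10B (table: 8.105B)
print(f"embedding:     {embedding / 1e9:.3f}B")                                  # ≈ 0.524B
print(f"total:         {(non_embedding + embedding + lm_head) / 1e9:.3f}B")      # ≈ 9.15B (table: 9.154B)
```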
Results
EU Languages
Table 1: Comparison of open-weight LLMs on multilingual benchmarks. The Borda count corresponds to the average ranking of the models (see Colombo et al., 2022). For ARC-Challenge, HellaSwag, and MMLU we use the Okapi datasets (Lai et al., 2023), which include 11 languages. For MMLU-Pro and MUSR we translate the English version with Tower (Alves et al., 2024) into 6 EU languages.
* As there are no public versions of the pre-trained models, we evaluated them using the post-trained versions.
The results in Table 1 highlight EuroLLM-9B's superior performance on multilingual tasks compared to other European-developed models (as shown by the Borda count of 1.0), as well as its strong competitiveness with non-European models, achieving results comparable to Gemma-2-9B and outperforming the rest on most benchmarks.
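As the caption of Table 1 notes, the Borda count is simply each model's ranking averaged over the benchmarks (lower is better). A minimal sketch of that aggregation with purely illustrative scores and generic model names, not values from the actual tables:

```python
# Average-rank (Borda count) aggregation over benchmarks, lower is better.
# Scores below are purely illustrative placeholders, NOT values from Table 1 or Table 2.
scores = {
    "model_a": {"bench_1": 61.0, "bench_2": 70.2, "bench_3": 55.4},
    "model_b": {"bench_1": 58.3, "bench_2": 72.1, "bench_3": 51.0},
    "model_c": {"bench_1": 63.5, "bench_2": 69.0, "bench_3": 54.8},
}
benchmarks = ["bench_1", "bench_2", "bench_3"]

ranks = {model: [] for model in scores}
for bench in benchmarks:
    # Rank models on this benchmark (1 = best, i.e. highest score).
    ordered = sorted(scores, key=lambda m: scores[m][bench], reverse=True)
    for rank, model in enumerate(ordered, start=1):
        ranks[model].append(rank)

# Borda count = average rank across benchmarks.
for model, r in ranks.items():
    print(model, sum(r) / len(r))
```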
English

Table 2: Comparison of open-weight LLMs on English general benchmarks.
* As there are no public versions of the pre-trained models, we evaluated them using the post-trained versions.
The results in Table 2 demonstrate EuroLLM's strong performance on English tasks, surpassing most European-developed models and matching the performance of Mistral-7B (obtaining the same Borda count).
🔧 Technical Details
EuroLLM-9B uses a standard, dense Transformer architecture with grouped query attention, pre-layer RMSNorm, the SwiGLU activation function, and rotary positional embeddings in every layer. It was pre-trained on 4 trillion tokens on the MareNostrum 5 supercomputer, using a constant batch size of approximately 12 million tokens, the Adam optimizer, and BF16 precision.
📄 License
The model is licensed under the Apache License 2.0.
⚠️ Important Note
EuroLLM-9B has not been aligned to human preferences, so the model may generate problematic outputs (e.g., hallucinations, harmful content, or false statements).