🚀 Model Card for Teuken-7B-instruct-research-v0.4
Teuken-7B-instruct-research-v0.4 is an instruction-tuned, multilingual large language model (LLM) with 7 billion parameters. It was pre-trained on 4 trillion tokens within the research project OpenGPT-X. The base model, Teuken-7B-base-v0.4, is available on request: 📧 contact@opengpt-x.de.
🚀 Quick Start
Prerequisites
The model requires a few libraries that can be installed in your Python environment:

```bash
python -m pip install numpy torch huggingface_hub transformers sentencepiece
```
Usage Example
After installation, here is an example of how to use the model. Because this model is instruction-tuned, it must be used with the provided prompt template; using it without the template is not intended and not recommended. The prompt template is defined as follows:
user="Hi!"
lang_code = "DE"
system_messages={
"EN": "A chat between a human and an artificial intelligence assistant."
" The assistant gives helpful and polite answers to the human's questions.",
"DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
" Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
}
prompt = f"System: {system_messages[lang_code]}\nUser: {user}\nAssistant:"
The prompt template is also directly integrated into the tokenizer and can be used as follows:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "openGPT-X/Teuken-7B-instruct-research-v0.4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
)

messages = [{"role": "User", "content": "Hallo"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template="DE",
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
prediction = model.generate(
    prompt_ids.to(model.device),
    max_length=512,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    num_return_sequences=1,
)
prediction_text = tokenizer.decode(prediction[0].tolist())
print(prediction_text)
```
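Note that decoding `prediction[0]` returns the full sequence, including the prompt and chat-template tokens. If only the newly generated answer is needed, the prompt tokens can be sliced off first; a small sketch:

```python
# Sketch: decode only the tokens generated after the prompt.
answer_ids = prediction[0][prompt_ids.shape[-1]:]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```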
Usage with vLLM Server
Starting the vLLM Server:
```bash
vllm serve openGPT-X/Teuken-7B-instruct-research-v0.4 --trust-remote-code
```
Use the Chat API with vLLM and pass the language of the chat template in the extra body:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
completion = client.chat.completions.create(
    model="openGPT-X/Teuken-7B-instruct-research-v0.4",
    messages=[{"role": "User", "content": "Hallo"}],
    extra_body={"chat_template": "DE"},
)
print(f"Assistant: {completion}")
```
The default language of the chat template can also be set when starting the vLLM server. To do this, create a new file named `lang` with the content `DE` and start the vLLM server as follows:

```bash
vllm serve openGPT-X/Teuken-7B-instruct-research-v0.4 --trust-remote-code --chat-template lang
```
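With the default chat template set this way, requests no longer need to pass the language via `extra_body`; a minimal sketch, reusing the client from the example above:

```python
# Sketch: with the server-side default chat template set to "DE",
# the language no longer needs to be passed per request.
completion = client.chat.completions.create(
    model="openGPT-X/Teuken-7B-instruct-research-v0.4",
    messages=[{"role": "User", "content": "Hallo"}],
)
print(f"Assistant: {completion.choices[0].message.content}")
```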
Usage with vLLM Offline Batched Inference
```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.01, max_tokens=1024, stop=["</s>"])
llm = LLM(model="openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True, dtype="bfloat16")
outputs = llm.chat(
    messages=[{"role": "User", "content": "Hallo"}],
    sampling_params=sampling_params,
    chat_template="DE",
)
print(f"Prompt: {outputs[0].prompt}")
print(f"Assistant: {outputs[0].outputs[0].text}")
```
✨ Features
- Multilingual Support: Teuken-7B-instruct-research-v0.4 covers all 24 official EU languages and therefore delivers more stable results across these languages and reflects European values in its answers better than English-centric models. It is specialized for multilingual tasks; a short usage sketch follows this list.
- Research Use: Since the underlying base model is trained on all 24 EU languages, Teuken-7B-instruct-research-v0.4 is also intended for research use in these 24 languages.
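As a small illustration of the multilingual focus, the generation call from the Quick Start can be switched to the English system prompt by selecting the corresponding chat template. A minimal sketch, assuming `model` and `tokenizer` are loaded as in the Quick Start example (availability of further EU language codes depends on the shipped chat templates):

```python
# Sketch: same generation call as in the Quick Start, but with the English chat template.
messages = [{"role": "User", "content": "What is the capital of Latvia?"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template="EN",  # "DE" and "EN" are shown in this card; other codes are not guaranteed
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
prediction = model.generate(
    prompt_ids.to(model.device),
    max_new_tokens=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)
print(tokenizer.decode(prediction[0], skip_special_tokens=True))
```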
📦 Installation
The installation steps are as follows:
```bash
python -m pip install numpy torch huggingface_hub transformers sentencepiece
```
📚 Documentation
Model Description
- Developed by: Fraunhofer, Forschungszentrum Jülich, TU Dresden, DFKI
- Funded by: German Federal Ministry for Economic Affairs and Climate Action (BMWK) in the context of the OpenGPT-X project
- Model type: Transformer-based decoder-only model
- Language(s) (NLP): bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
- Shared by: OpenGPT-X
Uses
This model is specialized for multilingual tasks and is intended for research use in the 24 official EU languages.
Disclaimer: Toxic Content
This Large Language Model (LLM) may generate content that is inappropriate, offensive, or harmful. While the dataset has been filtered to minimize such outputs, the model may still produce text that is biased or toxic due to the large scale and diverse nature of the data.
Out-of-Scope Use
The model is not intended for use in math and coding tasks.
Bias, Risks, and Limitations
Teuken-7B-instruct-research-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4 (the base model is available on request: 📧 contact@opengpt-x.de) and is not completely free from biases and hallucinations.
Training Details
Pre-Training Data
Teuken-7B-instruct-research-v0.4 was pre-trained on 4 trillion tokens of data from publicly available sources. The pretraining data has a cutoff of September 2023.
Instruction-Tuning Data
For the dataset composition, we used a selection of English and German datasets from which we sampled our final dataset with an equal split between German and English, as shown in the following tables; an illustrative sampling sketch follows the tables.
English
| Dataset | Sample Count |
|---|---|
| anon8231489123/ShareGPT_Vicuna_unfiltered | 37.6K |
| MBZUAI/Bactrian-X | 26.9K |
| Open-Orca/OpenOrca | 26.9K |
| WizardLM/WizardLM_evol_instruct_70k | 26.9K |
| WizardLM/WizardLM_evol_instruct_V2_196k | 26.8K |
| sahil2801/CodeAlpaca-20k | 12.1K |
| lmsys/lmsys-chat-1m | 11.2K |
| HuggingFaceH4/ultrachat_200k | 7.0K |
| total | 175.5K |
German
| Dataset | Sample Count |
|---|---|
| MBZUAI/Bactrian-X DE | 63.7K |
| FreedomIntelligence/evol-instruct-deutsch | 55.9K |
| FreedomIntelligence/alpaca-gpt4-deutsch | 47.5K |
| FreedomIntelligence/sharegpt-deutsch | 5.8K |
| LeoLM/German_Songs | 943 |
| LeoLM/German_Poems | 378 |
| bjoernp/ultrachat_de | 909 |
| total | 175.13K |
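The following is an illustrative sketch of such an equal-split composition (not the actual OpenGPT-X data pipeline); the dataset identifiers come from the tables above, while split names and sample handling are assumptions:

```python
from datasets import load_dataset

# Illustrative only: sample a fixed number of examples per source dataset so that
# the English and German portions end up roughly equal in size.
# Split names ("train") and availability are assumptions.
english_plan = {
    "Open-Orca/OpenOrca": 26_900,
    "WizardLM/WizardLM_evol_instruct_70k": 26_900,
}
german_plan = {
    "FreedomIntelligence/evol-instruct-deutsch": 55_900,
    "FreedomIntelligence/alpaca-gpt4-deutsch": 47_500,
}

def sample_plan(plan, seed=42):
    parts = []
    for name, n in plan.items():
        ds = load_dataset(name, split="train").shuffle(seed=seed)
        parts.append(ds.select(range(min(n, len(ds)))))
    return parts  # schemas differ per dataset, so the parts are kept separate here

english_parts = sample_plan(english_plan)
german_parts = sample_plan(german_plan)
print(sum(len(p) for p in english_parts), sum(len(p) for p in german_parts))
```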
Training Procedure
Teuken-7B-instruct-research-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4. More information on the pre-training is available in our model preprint "Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs".
Training Hyperparameters
- Training regime: bf16 mixed precision
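As a generic illustration of this regime (not the actual OpenGPT-X training code), bf16 mixed precision keeps the master weights in fp32 while running the forward pass in bfloat16 via autocast:

```python
import torch

# Generic bf16 mixed-precision training step (illustrative only).
model = torch.nn.Linear(4096, 4096).cuda()                  # fp32 master weights
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr as in the table below

x = torch.randn(8, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()   # forward computed in bf16
loss.backward()                     # gradients accumulate in fp32
optimizer.step()
optimizer.zero_grad()
```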
Evaluation
Results on multilingual benchmarks for 21 European languages with instruction-tuned models
| Model | Avg. | EU21-ARC | EU21-HeSw | EU21-TQA | EU21-MMLU |
|---|---|---|---|---|---|
| Meta-Llama-3.1-8B-Instruct | .563 | .563 | .579 | .532 | .576 |
| Mistral-7B-Instruct-v0.3 | .527 | .530 | .538 | .548 | .491 |
| Salamandra-7B-Instruct | .543 | .595 | .637 | .482 | .459 |
| Aya-23-8B | .485 | .475 | .535 | .476 | .455 |
| Occiglot-7B-eu5-Instruct | .475 | .484 | .519 | .471 | .428 |
| Pharia-1-LLM-7B-C-A | .417 | .396 | .438 | .469 | .366 |
| Bloomz-7B1 | .358 | .316 | .354 | .461 | .302 |
| Teuken-7B-instruct-research-v0.4 | .543 | .581 | .624 | .543 | .425 |
More information regarding the quality of our translated benchmarks is available in our evaluation preprint "Towards Multilingual LLM Evaluation for European Languages". Further evaluation results for Teuken-7B-instruct-research-v0.4 are available in our model preprint "Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs". The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation, and MMLU. Results can also be seen in the European LLM Leaderboard.
🔧 Technical Details
Model Architecture and Objective
| Property | Details |
|---|---|
| Training Objective | CLM |
| Activation Function | SwiGLU |
| Seq Length | 4096 |
| Position Embeddings | Rotary |
| Num Layers | 32 |
| Hidden Size | 4096 |
| FFN Hidden Size | 13440 |
| Num Attention Heads | 32 |
| Head Dim | 128 |
| Group Query Attention | yes |
| Num Query Groups | 2 |
| Normalization | RMSNorm |
| Learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Disable bias in linear | yes |
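The values in this table can be cross-checked against the published configuration. A minimal sketch, assuming the remote-code configuration exposes common attribute names (the exact field names may differ):

```python
from transformers import AutoConfig

# Sketch: inspect the published configuration to cross-check the table above.
# Attribute names are assumptions; the custom configuration may use different names.
config = AutoConfig.from_pretrained(
    "openGPT-X/Teuken-7B-instruct-research-v0.4",
    trust_remote_code=True,
)
for field in ("num_hidden_layers", "hidden_size", "num_attention_heads", "max_position_embeddings"):
    print(field, getattr(config, field, "n/a"))
```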
📄 License
The model is released under a custom license ("other").

