🚀 Model Card for Teuken-7B-instruct-research-v0.4
Teuken-7B-instruct-research-v0.4 is an instruction-tuned, multilingual large language model (LLM) with 7 billion parameters. It was pre-trained on 4 trillion tokens within the research project OpenGPT-X. The base model, Teuken-7B-base-v0.4, is available on request: 📧 contact@opengpt-x.de.
🚀 Quick Start
Prerequisites
The model requires a few libraries that can be installed in your Python environment:

```bash
python -m pip install numpy torch huggingface_hub transformers sentencepiece
```
Usage Example
After installation, here is an example of how to use the model. Because this model is instruction-tuned, it must be used with the provided prompt template; using it without the template is not intended and not recommended. The prompt template is defined as follows:
user="Hi!"
lang_code = "DE"
system_messages={
"EN": "A chat between a human and an artificial intelligence assistant."
" The assistant gives helpful and polite answers to the human's questions.",
"DE": "Ein Gespräch zwischen einem Menschen und einem Assistenten mit künstlicher Intelligenz."
" Der Assistent gibt hilfreiche und höfliche Antworten auf die Fragen des Menschen.",
}
prompt = f"System: {system_messages[lang_code]}\nUser: {user}\nAssistant:"
The prompt template is also directly integrated into the tokenizer and can be used as follows:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "openGPT-X/Teuken-7B-instruct-research-v0.4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
model = model.to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=False,
    trust_remote_code=True,
)

messages = [{"role": "User", "content": "Hallo"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template="DE",
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
prediction = model.generate(
    prompt_ids.to(model.device),
    max_length=512,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
    num_return_sequences=1,
)
prediction_text = tokenizer.decode(prediction[0].tolist())
print(prediction_text)
```
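Note that decoding `prediction[0]` returns the full sequence, including the prompt and chat-template tokens. If only the newly generated answer is needed, the prompt tokens can be sliced off first; a small sketch:

```python
# Sketch: decode only the tokens generated after the prompt.
answer_ids = prediction[0][prompt_ids.shape[-1]:]
print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```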
Usage with vLLM Server
Starting the vLLM Server:
```bash
vllm serve openGPT-X/Teuken-7B-instruct-research-v0.4 --trust-remote-code
```
Use the Chat API with vLLM and pass the language of the chat template in the extra body:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
completion = client.chat.completions.create(
    model="openGPT-X/Teuken-7B-instruct-research-v0.4",
    messages=[{"role": "User", "content": "Hallo"}],
    extra_body={"chat_template": "DE"},
)
print(f"Assistant: {completion}")
```
The default language of the chat template can also be set when starting the vLLM server. To do this, create a new file named `lang` with the content `DE` and start the vLLM server as follows:

```bash
vllm serve openGPT-X/Teuken-7B-instruct-research-v0.4 --trust-remote-code --chat-template lang
```
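With the default chat template set this way, requests no longer need to pass the language via `extra_body`; a minimal sketch, reusing the client from the example above:

```python
# Sketch: with the server-side default chat template set to "DE",
# the language no longer needs to be passed per request.
completion = client.chat.completions.create(
    model="openGPT-X/Teuken-7B-instruct-research-v0.4",
    messages=[{"role": "User", "content": "Hallo"}],
)
print(f"Assistant: {completion.choices[0].message.content}")
```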
Usage with vLLM Offline Batched Inference
```python
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.01, max_tokens=1024, stop=["</s>"])
llm = LLM(model="openGPT-X/Teuken-7B-instruct-research-v0.4", trust_remote_code=True, dtype="bfloat16")
outputs = llm.chat(
    messages=[{"role": "User", "content": "Hallo"}],
    sampling_params=sampling_params,
    chat_template="DE",
)
print(f"Prompt: {outputs[0].prompt}")
print(f"Assistant: {outputs[0].outputs[0].text}")
```
✨ Features
- Multilingual Support: Teuken-7B-instruct-research-v0.4 covers all 24 official EU languages and therefore delivers more stable results across these languages and reflects European values in its answers better than English-centric models. It is specialized for multilingual tasks; a short usage sketch follows this list.
- Research Use: Since the underlying base model is trained on all 24 EU languages, Teuken-7B-instruct-research-v0.4 is also intended for research use in these 24 languages.
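As a small illustration of the multilingual focus, the generation call from the Quick Start can be switched to the English system prompt by selecting the corresponding chat template. A minimal sketch, assuming `model` and `tokenizer` are loaded as in the Quick Start example (availability of further EU language codes depends on the shipped chat templates):

```python
# Sketch: same generation call as in the Quick Start, but with the English chat template.
messages = [{"role": "User", "content": "What is the capital of Latvia?"}]
prompt_ids = tokenizer.apply_chat_template(
    messages,
    chat_template="EN",  # "DE" and "EN" are shown in this card; other codes are not guaranteed
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
prediction = model.generate(
    prompt_ids.to(model.device),
    max_new_tokens=128,
    do_sample=True,
    top_p=0.95,
    temperature=0.7,
)
print(tokenizer.decode(prediction[0], skip_special_tokens=True))
```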
📦 Installation
The installation steps are as follows:
```bash
python -m pip install numpy torch huggingface_hub transformers sentencepiece
```
📚 Documentation
Model Description
- Developed by: Fraunhofer, Forschungszentrum Jülich, TU Dresden, DFKI
- Funded by: German Federal Ministry for Economic Affairs and Climate Action (BMWK) in the context of the OpenGPT-X project
- Model type: Transformer-based decoder-only model
- Language(s) (NLP): bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv
- Shared by: OpenGPT-X
Uses
This model is specialized for multilingual tasks and is intended for research use in the 24 official EU languages.
Disclaimer: Toxic Content
This Large Language Model (LLM) may generate content that is inappropriate, offensive, or harmful. While the dataset has been filtered to minimize such outputs, the model may still produce text that is biased or toxic due to the large scale and diverse nature of the data.
Out-of-Scope Use
The model is not intended for use in math and coding tasks.
Bias, Risks, and Limitations
Teuken-7B-instruct-research-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4 (the base model is available on request: 📧 contact@opengpt-x.de) and is not completely free from biases and hallucinations.
Training Details
Pre-Training Data
Teuken-7B-instruct-research-v0.4 was pre-trained on 4 trillion tokens of data from publicly available sources. The pretraining data has a cutoff of September 2023.
Instruction-Tuning Data
For the dataset composition, we used a selection of English and German datasets from which we sampled our final dataset with an equal split between German and English, as shown in the following tables; an illustrative sampling sketch follows the tables.
English
| Dataset | Sample Count |
|---|---|
| anon8231489123/ShareGPT_Vicuna_unfiltered | 37.6K |
| MBZUAI/Bactrian-X | 26.9K |
| Open-Orca/OpenOrca | 26.9K |
| WizardLM/WizardLM_evol_instruct_70k | 26.9K |
| WizardLM/WizardLM_evol_instruct_V2_196k | 26.8K |
| sahil2801/CodeAlpaca-20k | 12.1K |
| lmsys/lmsys-chat-1m | 11.2K |
| HuggingFaceH4/ultrachat_200k | 7.0K |
| total | 175.5K |
German
| Dataset | Sample Count |
|---|---|
| MBZUAI/Bactrian-X DE | 63.7K |
| FreedomIntelligence/evol-instruct-deutsch | 55.9K |
| FreedomIntelligence/alpaca-gpt4-deutsch | 47.5K |
| FreedomIntelligence/sharegpt-deutsch | 5.8K |
| LeoLM/German_Songs | 943 |
| LeoLM/German_Poems | 378 |
| bjoernp/ultrachat_de | 909 |
| total | 175.13K |
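The following is an illustrative sketch of such an equal-split composition (not the actual OpenGPT-X data pipeline); the dataset identifiers come from the tables above, while split names and sample handling are assumptions:

```python
from datasets import load_dataset

# Illustrative only: sample a fixed number of examples per source dataset so that
# the English and German portions end up roughly equal in size.
# Split names ("train") and availability are assumptions.
english_plan = {
    "Open-Orca/OpenOrca": 26_900,
    "WizardLM/WizardLM_evol_instruct_70k": 26_900,
}
german_plan = {
    "FreedomIntelligence/evol-instruct-deutsch": 55_900,
    "FreedomIntelligence/alpaca-gpt4-deutsch": 47_500,
}

def sample_plan(plan, seed=42):
    parts = []
    for name, n in plan.items():
        ds = load_dataset(name, split="train").shuffle(seed=seed)
        parts.append(ds.select(range(min(n, len(ds)))))
    return parts  # schemas differ per dataset, so the parts are kept separate here

english_parts = sample_plan(english_plan)
german_parts = sample_plan(german_plan)
print(sum(len(p) for p in english_parts), sum(len(p) for p in german_parts))
```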
Training Procedure
Teuken-7B-instruct-research-v0.4 is an instruction-tuned version of Teuken-7B-base-v0.4. More information on the pre-training is available in our model preprint "Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs".
Training Hyperparameters
- Training regime: bf16 mixed precision
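As a generic illustration of this regime (not the actual OpenGPT-X training code), bf16 mixed precision keeps the master weights in fp32 while running the forward pass in bfloat16 via autocast:

```python
import torch

# Generic bf16 mixed-precision training step (illustrative only).
model = torch.nn.Linear(4096, 4096).cuda()                  # fp32 master weights
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # lr as in the table below

x = torch.randn(8, 4096, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()   # forward computed in bf16
loss.backward()                     # gradients accumulate in fp32
optimizer.step()
optimizer.zero_grad()
```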
Evaluation
Results on multilingual benchmarks for 21 European languages with instruction-tuned models
| Model | Avg. | EU21-ARC | EU21-HeSw | EU21-TQA | EU21-MMLU |
|---|---|---|---|---|---|
| Meta-Llama-3.1-8B-Instruct | .563 | .563 | .579 | .532 | .576 |
| Mistral-7B-Instruct-v0.3 | .527 | .530 | .538 | .548 | .491 |
| Salamandra-7B-Instruct | .543 | .595 | .637 | .482 | .459 |
| Aya-23-8B | .485 | .475 | .535 | .476 | .455 |
| Occiglot-7B-eu5-Instruct | .475 | .484 | .519 | .471 | .428 |
| Pharia-1-LLM-7B-C-A | .417 | .396 | .438 | .469 | .366 |
| Bloomz-7B1 | .358 | .316 | .354 | .461 | .302 |
| Teuken-7B-instruct-research-v0.4 | .543 | .581 | .624 | .543 | .425 |
More information regarding the quality of our translated benchmarks is available in our evaluation preprint "Towards Multilingual LLM Evaluation for European Languages". Further evaluation results for Teuken-7B-instruct-research-v0.4 are available in our model preprint "Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs". The model was evaluated in 21 languages on ARC, GSM8K, HellaSwag, TruthfulQA, Translation, and MMLU. Results can also be seen in the European LLM Leaderboard.
🔧 Technical Details
Model Architecture and Objective
| Property | Details |
|---|---|
| Training Objective | CLM |
| Activation Function | SwiGLU |
| Seq Length | 4096 |
| Position Embeddings | Rotary |
| Num Layers | 32 |
| Hidden Size | 4096 |
| FFN Hidden Size | 13440 |
| Num Attention Heads | 32 |
| Head Dim | 128 |
| Group Query Attention | yes |
| Num Query Groups | 2 |
| Normalization | RMSNorm |
| Learning rate | 3e-4 |
| Min learning rate | 3e-5 |
| Disable bias in linear | yes |
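The values in this table can be cross-checked against the published configuration. A minimal sketch, assuming the remote-code configuration exposes common attribute names (the exact field names may differ):

```python
from transformers import AutoConfig

# Sketch: inspect the published configuration to cross-check the table above.
# Attribute names are assumptions; the custom configuration may use different names.
config = AutoConfig.from_pretrained(
    "openGPT-X/Teuken-7B-instruct-research-v0.4",
    trust_remote_code=True,
)
for field in ("num_hidden_layers", "hidden_size", "num_attention_heads", "max_position_embeddings"):
    print(field, getattr(config, field, "n/a"))
```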
📄 License
The model is released under a custom license ("other").

