GEITje 7B ultra
A conversational model for Dutch, aligned through AI feedback.
This model is a fine-tuned version of BramVanroy/GEITje-7B-ultra-sft on a synthetic Dutch DPO dataset of around 56M tokens, generated with gpt-4-turbo and Rijgersberg/GEITje-7B-chat.
Usage Tip
Looking for the fast GGUF version? You can find it, and how to use it with ollama, here.
Quick Start
The following shows how to use the model in one-off and interactive conversation scenarios:
Basic Usage
```python
from transformers import pipeline, Conversation

# Load the model in 8-bit to reduce memory usage
chatbot = pipeline("conversational", model="BramVanroy/GEITje-7B-ultra", model_kwargs={"load_in_8bit": True}, device_map="auto")

start_messages = [
    {"role": "system", "content": "Je bent een grappige chatbot die Bert heet. Je maakt vaak mopjes."},
    {"role": "user", "content": "Hallo, ik ben Bram. Ik wil vanavond graag een film kijken. Heb je enkele suggesties?"}
]
conversation = Conversation(start_messages)
conversation = chatbot(conversation)
response = conversation.messages[-1]["content"]
print(response)
```
Advanced Usage
```python
from transformers import pipeline, Conversation

# Load the model in 8-bit with flash attention 2 for faster generation
chatbot = pipeline("conversational", model="BramVanroy/GEITje-7B-ultra", model_kwargs={"load_in_8bit": True, "attn_implementation": "flash_attention_2"}, device_map="auto")

# Outer loop: start a new conversation with a fresh system message
while (system_message := input("System message ('q' to quit): ")) != "q":
    start_messages = [
        {"role": "system", "content": system_message},
    ]
    conversation = Conversation(start_messages)
    # Inner loop: keep chatting until the user resets the conversation
    while (user_input := input("User ('r' to reset): ")) != "r":
        conversation.add_user_input(user_input)
        conversation = chatbot(conversation)
        response = conversation.messages[-1]["content"]
        print("Assistant:", response)
```
⨠Features
- Conversational Capability: A conversational model for Dutch, aligned through AI feedback.
- Strong Base Architecture: Ultimately based on Mistral and aligned with AI feedback via DPO.
Documentation
Citation
If you use GEITje 7B Ultra (SFT) or any of its derivatives or quantizations, please cite the following paper:
```bibtex
@misc{vanroy2024geitje7bultraconversational,
    title={GEITje 7B Ultra: A Conversational Model for Dutch},
    author={Bram Vanroy},
    year={2024},
    eprint={2412.04092},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2412.04092},
}
```
Intended uses & limitations
Important Note
Although the model has been aligned with gpt-4-turbo output, which has strong content filters, the model could still generate incorrect, misleading, and potentially even offensive content. Use at your own risk.
Important Note
Because the model was trained on synthetic data created with OpenAI/Azure services, this model cannot be used for commercial purposes.
Training and evaluation data
The training data consists of a synthetic dataset based on UltraFeedback binarized, created with gpt-4-turbo and GEITje chat. Each prompt, translated from the original dataset, was given to both models, which each generated an answer. The gpt-4-turbo answer was then always selected as the "chosen" response that DPO optimises for. While this is not completely fair, the author did not have the budget to actually have gpt-4 rate both replies. Furthermore, while GEITje chat is an impressive model, it still seems to lag behind gpt-4-turbo in the testing that the author has done.
In total, the dataset consists of 56,137,090 training tokens (prompt + chosen + rejected combined) and a test set of 6,178,969 tokens (11.00%).
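To make the pairing concrete, the sketch below shows how a single preference record could be assembled from the two models' answers, with the gpt-4-turbo reply always marked as chosen. This is not the author's actual generation script; the helper name and example strings are hypothetical.

```python
# Hypothetical sketch of how one preference record could be assembled.
# The gpt-4-turbo answer is always treated as "chosen" and the
# GEITje chat answer as "rejected", as described above.
def build_preference_pair(prompt: str, gpt4_answer: str, geitje_answer: str) -> dict:
    return {
        "prompt": prompt,
        "chosen": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": gpt4_answer},
        ],
        "rejected": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": geitje_answer},
        ],
    }

record = build_preference_pair(
    "Geef drie tips om energie te besparen.",  # example prompt (hypothetical)
    "1. Isoleer je woning ...",                # gpt-4-turbo reply -> chosen
    "Zet de verwarming lager ...",             # GEITje chat reply -> rejected
)
```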
Training procedure
The great alignment handbook was used for training, with a custom SLURM script for compatibility with the cluster. The model was trained in full, without LoRA or other adapters.
It was trained in bfloat16 with flash attention 2 on two nodes of four A100 80GB GPUs each, for around 11 hours. The author thanks the Flemish Supercomputer for the compute.
For conversational usage, the model relies on the Zephyr chat template, which supports system messages. A small portion of the *-sft training data contained system messages, so the model is assumed to handle system messages at least to some extent.
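As a minimal sketch, assuming the tokenizer ships the Zephyr chat template as stated above, the template can also be applied explicitly when not using the conversational pipeline:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-7B-ultra")
messages = [
    {"role": "system", "content": "Je bent een behulpzame assistent."},
    {"role": "user", "content": "Wat is de hoofdstad van Nederland?"},
]
# Render the chat template as a string and append the assistant turn marker,
# so the result can be tokenized and fed to model.generate directly.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```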
In earlier iterations, using the alignment handbook's defaults (beta = 0.01) led to poor results (hallucinations of random tokens). After investigation, it seems that such a low beta does not work well for this dataset as it gives the model too much room to deviate from its initial base model. After a hyperparameter search and manual analysis of the resulting metrics, the current model was selected as the best one, with a beta of 0.1.
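For intuition on the role of beta, the snippet below is an illustrative per-batch DPO loss (not the alignment handbook's exact implementation): beta scales the policy-versus-reference log-probability margin, so a very small beta only weakly ties the policy to its reference model, which matches the drift described above.

```python
import torch.nn.functional as F

# Illustrative DPO loss, showing how beta scales the preference margin.
# All inputs are summed log-probabilities of the completions (1-D tensors).
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Smaller beta -> flatter loss surface -> less pressure to stay close
    # to the reference model; here beta=0.1 worked better than 0.01.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```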
Training hyperparameters
The following hyperparameters were used during training:
| Property | Details |
| --- | --- |
| learning_rate | 5e-07 |
| train_batch_size | 4 |
| eval_batch_size | 4 |
| seed | 42 |
| distributed_type | multi-GPU |
| num_devices | 8 |
| gradient_accumulation_steps | 4 |
| total_train_batch_size | 128 |
| total_eval_batch_size | 32 |
| optimizer | Adam with betas=(0.9,0.999) and epsilon=1e-08 |
| lr_scheduler_type | cosine |
| lr_scheduler_warmup_ratio | 0.1 |
| num_epochs | 1.0 |
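For reference, a roughly equivalent TRL configuration might look like the sketch below. The actual run used the alignment handbook's recipes, not this exact config, and `output_dir` is a placeholder.

```python
from trl import DPOConfig

# Approximate mapping of the hyperparameters above onto a TRL DPOConfig;
# this is a sketch, not the configuration actually used for training.
config = DPOConfig(
    output_dir="geitje-7b-ultra-dpo",  # placeholder path
    beta=0.1,                          # see the beta discussion above
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    seed=42,
)
# `config` would then be passed as `args` to trl.DPOTrainer together with the
# SFT model, a reference copy, and the preference dataset.
```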
Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.03 | 0.22 | 100 | 0.0260 | -0.9740 | -9.8635 | 0.9913 | 8.8895 | -524.8940 | -508.1891 | -3.0753 | -3.0315 |
| 0.0184 | 0.44 | 200 | 0.0164 | -1.7162 | -12.4772 | 0.9926 | 10.7610 | -551.0317 | -515.6115 | -3.0349 | -2.9873 |
| 0.0121 | 0.66 | 300 | 0.0142 | -2.0575 | -13.6818 | 0.9938 | 11.6244 | -563.0778 | -519.0242 | -3.0325 | -2.9835 |
| 0.0198 | 0.88 | 400 | 0.0139 | -2.1431 | -13.8857 | 0.9950 | 11.7426 | -565.1163 | -519.8801 | -3.0293 | -2.9801 |
Open LLM Leaderboard Evaluation Results
Results for the English Open LLM Leaderboard. For results specific to Dutch, check out ScandEval.
Detailed results can be found here.
| Metric | Value |
| --- | --- |
| Avg. | 10.91 |
| IFEval (0-Shot) | 37.23 |
| BBH (3-Shot) | 12.88 |
| MATH Lvl 5 (4-Shot) | 0.91 |
| GPQA (0-shot) | 1.68 |
| MuSR (0-shot) | 1.52 |
| MMLU-PRO (5-shot) | 11.24 |
License
This model is licensed under cc-by-nc-4.0.