GEITje 7B ultra
A conversational model for Dutch, aligned through AI feedback.
This model is a fine-tuned version of BramVanroy/GEITje-7B-ultra-sft on a synthetic Dutch DPO dataset of around 56M tokens, generated with gpt-4-turbo and Rijgersberg/GEITje-7B-chat.
Usage Tip
Looking for the fast GGUF version? You can find it, and how to use it with ollama, here.
Quick Start
The following shows how to use the model in one-off and interactive conversation scenarios:
Basic Usage
```python
from transformers import pipeline, Conversation

# Load the model in 8-bit to reduce memory usage
chatbot = pipeline("conversational", model="BramVanroy/GEITje-7B-ultra", model_kwargs={"load_in_8bit": True}, device_map="auto")

start_messages = [
    {"role": "system", "content": "Je bent een grappige chatbot die Bert heet. Je maakt vaak mopjes."},
    {"role": "user", "content": "Hallo, ik ben Bram. Ik wil vanavond graag een film kijken. Heb je enkele suggesties?"}
]
conversation = Conversation(start_messages)
conversation = chatbot(conversation)
response = conversation.messages[-1]["content"]
print(response)
```
Advanced Usage
```python
from transformers import pipeline, Conversation

# Load the model in 8-bit with flash attention 2 for faster generation
chatbot = pipeline("conversational", model="BramVanroy/GEITje-7B-ultra", model_kwargs={"load_in_8bit": True, "attn_implementation": "flash_attention_2"}, device_map="auto")

# Outer loop: start a new conversation with a fresh system message
while (system_message := input("System message ('q' to quit): ")) != "q":
    start_messages = [
        {"role": "system", "content": system_message},
    ]
    conversation = Conversation(start_messages)
    # Inner loop: keep chatting until the user resets the conversation
    while (user_input := input("User ('r' to reset): ")) != "r":
        conversation.add_user_input(user_input)
        conversation = chatbot(conversation)
        response = conversation.messages[-1]["content"]
        print("Assistant:", response)
```
⨠Features
- Conversational Capability: A conversational model for Dutch, aligned through AI feedback.
- Strong Base Architecture: Ultimately based on Mistral and aligned with AI feedback via DPO.
Documentation
Citation
If you use GEITje 7B Ultra (SFT) or any of its derivatives or quantizations, please cite the following paper:
```bibtex
@misc{vanroy2024geitje7bultraconversational,
    title={GEITje 7B Ultra: A Conversational Model for Dutch},
    author={Bram Vanroy},
    year={2024},
    eprint={2412.04092},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2412.04092},
}
```
Intended uses & limitations
Important Note
Although the model has been aligned with gpt-4-turbo output, which has strong content filters, the model could still generate incorrect, misleading, and potentially even offensive content. Use at your own risk.
Important Note
Because the model was trained on synthetic data created with OpenAI/Azure services, this model cannot be used for commercial purposes.
Training and evaluation data
The training data consists of a synthetic dataset based on UltraFeedback binarized, created with gpt-4-turbo and GEITje chat. Each prompt, translated from the original dataset, was given to both models, which each generated an answer. The gpt-4-turbo answer was then always selected as the "chosen" response that DPO optimises for. While this is not completely fair, the author did not have the budget to actually have gpt-4 rate both replies. Furthermore, while GEITje chat is an impressive model, it still seems to lag behind gpt-4-turbo in the testing that the author has done.
In total, the dataset consists of 56,137,090 training tokens (prompt + chosen + rejected combined) and a test set of 6,178,969 tokens (11.00%).
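To make the pairing concrete, the sketch below shows how a single preference record could be assembled from the two models' answers, with the gpt-4-turbo reply always marked as chosen. This is not the author's actual generation script; the helper name and example strings are hypothetical.

```python
# Hypothetical sketch of how one preference record could be assembled.
# The gpt-4-turbo answer is always treated as "chosen" and the
# GEITje chat answer as "rejected", as described above.
def build_preference_pair(prompt: str, gpt4_answer: str, geitje_answer: str) -> dict:
    return {
        "prompt": prompt,
        "chosen": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": gpt4_answer},
        ],
        "rejected": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": geitje_answer},
        ],
    }

record = build_preference_pair(
    "Geef drie tips om energie te besparen.",  # example prompt (hypothetical)
    "1. Isoleer je woning ...",                # gpt-4-turbo reply -> chosen
    "Zet de verwarming lager ...",             # GEITje chat reply -> rejected
)
```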
Training procedure
The great alignment handbook was used for training, with a custom SLURM script for compatibility with the cluster. The model was trained in full, without LoRA or other adapters.
It was trained in bfloat16 with flash attention 2 on two nodes of four A100 80GB GPUs each, for around 11 hours. The author thanks the Flemish Supercomputer for the compute.
For conversational usage, the model relies on the Zephyr chat template, which supports system messages. A small portion of the *-sft training data contained system messages, so the model is assumed to handle system messages at least to some extent.
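As a minimal sketch, assuming the tokenizer ships the Zephyr chat template as stated above, the template can also be applied explicitly when not using the conversational pipeline:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BramVanroy/GEITje-7B-ultra")
messages = [
    {"role": "system", "content": "Je bent een behulpzame assistent."},
    {"role": "user", "content": "Wat is de hoofdstad van Nederland?"},
]
# Render the chat template as a string and append the assistant turn marker,
# so the result can be tokenized and fed to model.generate directly.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```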
In earlier iterations, using the alignment handbook's defaults (beta = 0.01) led to poor results (hallucinations of random tokens). After investigation, it seems that such a low beta does not work well for this dataset as it gives the model too much room to deviate from its initial base model. After a hyperparameter search and manual analysis of the resulting metrics, the current model was selected as the best one, with a beta of 0.1.
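For intuition on the role of beta, the snippet below is an illustrative per-batch DPO loss (not the alignment handbook's exact implementation): beta scales the policy-versus-reference log-probability margin, so a very small beta only weakly ties the policy to its reference model, which matches the drift described above.

```python
import torch.nn.functional as F

# Illustrative DPO loss, showing how beta scales the preference margin.
# All inputs are summed log-probabilities of the completions (1-D tensors).
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Smaller beta -> flatter loss surface -> less pressure to stay close
    # to the reference model; here beta=0.1 worked better than 0.01.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```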
Training hyperparameters
The following hyperparameters were used during training:
| Property | Details |
| --- | --- |
| learning_rate | 5e-07 |
| train_batch_size | 4 |
| eval_batch_size | 4 |
| seed | 42 |
| distributed_type | multi-GPU |
| num_devices | 8 |
| gradient_accumulation_steps | 4 |
| total_train_batch_size | 128 |
| total_eval_batch_size | 32 |
| optimizer | Adam with betas=(0.9,0.999) and epsilon=1e-08 |
| lr_scheduler_type | cosine |
| lr_scheduler_warmup_ratio | 0.1 |
| num_epochs | 1.0 |
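For reference, a roughly equivalent TRL configuration might look like the sketch below. The actual run used the alignment handbook's recipes, not this exact config, and `output_dir` is a placeholder.

```python
from trl import DPOConfig

# Approximate mapping of the hyperparameters above onto a TRL DPOConfig;
# this is a sketch, not the configuration actually used for training.
config = DPOConfig(
    output_dir="geitje-7b-ultra-dpo",  # placeholder path
    beta=0.1,                          # see the beta discussion above
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=1.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    seed=42,
)
# `config` would then be passed as `args` to trl.DPOTrainer together with the
# SFT model, a reference copy, and the preference dataset.
```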
Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0.03 | 0.22 | 100 | 0.0260 | -0.9740 | -9.8635 | 0.9913 | 8.8895 | -524.8940 | -508.1891 | -3.0753 | -3.0315 |
| 0.0184 | 0.44 | 200 | 0.0164 | -1.7162 | -12.4772 | 0.9926 | 10.7610 | -551.0317 | -515.6115 | -3.0349 | -2.9873 |
| 0.0121 | 0.66 | 300 | 0.0142 | -2.0575 | -13.6818 | 0.9938 | 11.6244 | -563.0778 | -519.0242 | -3.0325 | -2.9835 |
| 0.0198 | 0.88 | 400 | 0.0139 | -2.1431 | -13.8857 | 0.9950 | 11.7426 | -565.1163 | -519.8801 | -3.0293 | -2.9801 |
Open LLM Leaderboard Evaluation Results
Results for the English Open LLM Leaderboard. For results specific to Dutch, check out ScandEval.
Detailed results can be found here.
| Metric | Value |
| --- | --- |
| Avg. | 10.91 |
| IFEval (0-Shot) | 37.23 |
| BBH (3-Shot) | 12.88 |
| MATH Lvl 5 (4-Shot) | 0.91 |
| GPQA (0-shot) | 1.68 |
| MuSR (0-shot) | 1.52 |
| MMLU-PRO (5-shot) | 11.24 |
License
This model is licensed under cc-by-nc-4.0.