ChocoLlama
A Llama-2/3-based family of Dutch language models
Quick Start
We present ChocoLlama-2-7B-base, a language-adapted version of Meta's Llama-2-7b. It's fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRa. Note that this is a base model, not optimized for conversational behavior. If you need conversational capabilities, we recommend fine-tuning this model on your own Dutch data or using the instruction-finetuned version, ChocoLlama-2-7B-instruct.
Use the following code to start using the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/ChocoLlama-2-7B-base')
```
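Building on the snippet above, a minimal generation sketch (the Dutch prompt and decoding settings are illustrative, not prescribed by the model card):

```python
# Continue a Dutch prompt with the base model; sampling settings are illustrative.
inputs = tokenizer("Vlaanderen is een regio die", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```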
Features
ChocoLlama is a family of open LLMs adapted to Dutch, advancing the state-of-the-art of Dutch open LLMs in their weight class. We offer 6 variants (3 base and 3 instruction-tuned models):
- ChocoLlama-2-7B-base (link): A language-adapted Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRa.
- ChocoLlama-2-7B-instruct (link): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- ChocoLlama-2-7B-tokentrans-base (link): A language-adapted Llama-2-7b, using a Dutch RoBERTa-based tokenizer. Token embeddings are reinitialized using the algorithm by Remy et al. Fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base using LoRa.
- ChocoLlama-2-7B-tokentrans-instruct (link): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, using SFT followed by DPO.
- Llama-3-ChocoLlama-8B-base (link): A language-adapted Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base using LoRa.
- Llama-3-ChocoLlama-instruct (link): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, using SFT followed by DPO.
For benchmark results of all models, including comparisons with base models and other Dutch LLMs, refer to our paper here.
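The instruction-tuned variants are the ones intended for chat-style use. A minimal sketch, assuming the repository id follows the pattern above and the tokenizer ships a chat template (verify both on the model card of the variant you pick):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical chat-style usage of an instruction-tuned variant.
model_id = "ChocoLlama/ChocoLlama-2-7B-instruct"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Leg in één zin uit wat een taalmodel is."}]
# apply_chat_template requires the tokenizer to define a chat template.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```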
Documentation
Model Description
- Developed by: Matthieu Meeus, Anthony Rathé
- Funded by: Vlaams Supercomputer Centrum, through a grant of about 40K GPU hours (NVIDIA A100-80GB)
- Language(s): Dutch
- License: Llama-2 Community License
- Finetuned from model: Llama-2-7b-hf
Model Sources
- Repository: available on GitHub here.
- Paper: available on arXiv here (arXiv:2412.07633).
Usage Examples
Direct Use
Since this is a base model, we don't recommend direct use. Instead, we suggest:
- Fine-tuning the model for your specific use-case (see the sketch after this list).
- Using the instruction-tuned version of the model.
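As referenced above, a minimal fine-tuning sketch with the peft library; the local data file, LoRa settings, and training arguments below are illustrative placeholders, not the authors' recipe:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_id = "ChocoLlama/ChocoLlama-2-7B-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Wrap the model with LoRa adapters; these settings are illustrative.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"))

# "my_dutch_corpus.txt" is a hypothetical local file containing your own Dutch text.
dataset = load_dataset("text", data_files={"train": "my_dutch_corpus.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=1024),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="chocollama-finetuned", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1, bf16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```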
Downstream Use
As a base model, it can be easily adapted to specific use-cases requiring Dutch language understanding and generation. We expect it to be useful for domains covered in our dataset, such as analyzing and/or generating Dutch job descriptions, corporate filings, and legislation.
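For example, the base model can be prompted to continue domain text such as the opening of a Dutch job description (the prompt and settings below are purely illustrative):

```python
from transformers import pipeline

# Text-generation pipeline around the base model; prompt and settings are illustrative.
generator = pipeline("text-generation", model="ChocoLlama/ChocoLlama-2-7B-base")
print(generator("Vacature: wij zoeken een ervaren softwareontwikkelaar die",
                max_new_tokens=80, do_sample=True)[0]["generated_text"])
```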
Out-of-Scope Use
- Use-cases needing a chat-style interface: As a base model, it can't be used reliably for turn-based chat. Use the instruction-tuned version instead.
- Use-cases requiring understanding or generation of non-Dutch text: The fine-tuning dataset contains only Dutch data, so significant catastrophic forgetting may occur for English, the original training language of Llama-2.
Technical Details
Bias, Risks, and Limitations
We've included only widely used, high-quality data in our dataset, some of which was already filtered by its original creators. However, we didn't explicitly filter for biased or harmful content.
Recommendations
We recommend fine-tuning the model on your curated data to avoid undesirable outputs.
Training Details
Training Data
We collected a diverse set of Dutch natural language:
- OSCAR: The majority of our data comes from the Dutch part of [OSCAR](https://oscar-corpus.com), January 2023 version, based on Common Crawl. It has 93 GB of text (~28.6B tokens).
- Open Subtitles: We gathered Dutch movie subtitle text, focusing on unique Dutch movies or those with Dutch subtitles. This dataset has 5 GB of text (~1.54B tokens) from 214k samples.
- Project Gutenberg: We downloaded 970 full Dutch books from Project Gutenberg using a public scraper. The dataset has 0.3 GB of text (~92M tokens) and is available on [Hugging Face](https://huggingface.co/datasets/ChocoLlama/gutenberg-dutch).
- Wikipedia: Using the March 2023 Wikipedia dump, we included 2.5 GB of text (~769M tokens). Despite some overlap with OSCAR, Wikipedia's quality justifies its inclusion.
- Job Descriptions (TechWolf): A sample of 750k Dutch job descriptions collected over five years from public websites, provided by TechWolf. This dataset has 1.5 GB of text (~462M tokens).
- Staatsblad (Bizzy): A sample of 80k legal filings from Het Belgisch Staatsblad. Documents were OCR-processed, and personal data was excluded. This dataset has 1.4 GB of text (~431M tokens), collected with Bizzy's help.
- Legislation (ML6): 15k documents from Flemish legislation accessed via the [Open Data API](https://www.vlaanderen.be/vlaams-parlement/de-vlaamse-codex). This dataset has 0.2 GB of text (~62M tokens), collected with ML6's support.
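Of these sources, the Project Gutenberg subset is the one published on the Hugging Face Hub; a minimal loading sketch (the split name is an assumption, check the dataset card):

```python
from datasets import load_dataset

# Load the publicly released Dutch Project Gutenberg subset; the "train" split is assumed.
gutenberg_nl = load_dataset("ChocoLlama/gutenberg-dutch", split="train")
print(gutenberg_nl[0])
```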
Training Procedure
This model was fine-tuned using low-rank adaptation (LoRa) with trainable embeddings, with a total of 544M trainable parameters.
Training Hyperparameters
- Training regime: bf16 non-mixed precision
- Epochs: 1
- LoRa parameters:
  - R: 8
  - Alpha: 32
  - Trainable modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj, embed_tokens, lm_head
  - LoRa dropout: 0.05
- Learning Rate:
  - Scheduler: StepLR
  - Step size: 6212
  - Learning rate: 0.0003
  - Gamma: 0.85
- Other parameters:
  - Minibatch size: 16
  - Gradient accumulation steps: 8
  - Parallelization factor: 8
  - Weight decay: 0
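As a rough illustration of how these settings map onto the peft and PyTorch APIs (a reconstruction, not the authors' training script; in particular, marking embed_tokens and lm_head as fully trainable via modules_to_save is an assumption about how the trainable embeddings were implemented, and the optimizer choice is not stated in this card):

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # assumption: embeddings kept fully trainable
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the card reports 544M trainable parameters in total

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.0)  # optimizer choice is assumed
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6212, gamma=0.85)
```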
Evaluation
Quantitative evaluation
We evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results are in the table below, along with results from other prominent Dutch models.
Model | ARC | HellaSwag | MMLU | TruthfulQA | Avg. |
---|---|---|---|---|---|
Llama-3-ChocoLlama-instruct | 0.48 | 0.66 | 0.49 | 0.49 | 0.53 |
llama-3-8B-rebatch | 0.44 | 0.64 | 0.46 | 0.48 | 0.51 |
llama-3-8B-instruct | 0.47 | 0.59 | 0.47 | 0.52 | 0.51 |
llama-3-8B | 0.44 | 0.64 | 0.47 | 0.45 | 0.5 |
Reynaerde-7B-Chat | 0.44 | 0.62 | 0.39 | 0.52 | 0.49 |
Llama-3-ChocoLlama-base | 0.45 | 0.64 | 0.44 | 0.44 | 0.49 |
zephyr-7b-beta | 0.43 | 0.58 | 0.43 | 0.53 | 0.49 |
geitje-7b-ultra | 0.40 | 0.66 | 0.36 | 0.49 | 0.48 |
ChocoLlama-2-7B-tokentrans-instruct | 0.45 | 0.62 | 0.34 | 0.42 | 0.46 |
mistral-7b-v0.1 | 0.43 | 0.58 | 0.37 | 0.45 | 0.46 |
ChocoLlama-2-7B-tokentrans-base | 0.42 | 0.61 | 0.32 | 0.43 | 0.45 |
ChocoLlama-2-7B-instruct | 0.36 | 0.57 | 0.33 | 0.45 | 0.43 |
ChocoLlama-2-7B-base | 0.35 | 0.56 | 0.31 | 0.43 | 0.41 |
llama-2-7b-chat-hf | 0.36 | 0.49 | 0.33 | 0.44 | 0.41 |
llama-2-7b-hf | 0.36 | 0.51 | 0.32 | 0.41 | 0.40 |
On average, Llama-3-ChocoLlama-instruct outperforms the previous state-of-the-art on these benchmarks.
Qualitative evaluation
In our paper, we also provide an additional qualitative evaluation of all models, which we find more reliable empirically. For details, refer to the paper and our benchmark ChocoLlama-Bench.
Compute Infrastructure
All ChocoLlama models were trained on the compute cluster provided by the Flemish Supercomputer Center (VSC). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.
License
The model is under the Llama-2 Community License.
Citation
If you find this useful for your work, please cite our paper:
@article{meeus2024chocollama,
title={ChocoLlama: Lessons Learned From Teaching Llamas Dutch},
author={Meeus, Matthieu and Rath{\'e}, Anthony and Remy, Fran{\c{c}}ois and Delobelle, Pieter and Decorte, Jens-Joris and Demeester, Thomas},
journal={arXiv preprint arXiv:2412.07633},
year={2024}
}

