🚀 RobBERT-2023: Keeping Dutch Language Models Up-To-Date
RobBERT-2023 is the 2023 release of the Dutch RobBERT model. It addresses the evolving Dutch language by leveraging the 2023 version of the OSCAR dataset. With both base and large models available, it outperforms previous versions and other Dutch BERT-like models on benchmarks.
🚀 Quick Start
RobBERT-2023 is the 2023 release of the Dutch RobBERT model. It is a new version of the original pdelobelle/robbert-v2-dutch-base model, trained on the 2023 version of the OSCAR dataset. We have released a base model and an additional large model with 355M parameters (roughly three times the size of robbert-2022-base). Both models outperform robbert-v2-base and robbert-2022-base on the DUMB benchmark from GroNLP, and robbert-2023-dutch-large surpasses BERTje by +18.6 points.
✨ Features
- Up-to-date Training: Trained on the 2023 version of the OSCAR dataset to account for the evolving Dutch language.
- Large Model Available: A large model with 355M parameters is released, offering more powerful performance.
- High Benchmark Performance: Surpasses previous models and BERTje on the DUMB benchmark.
📦 Installation
No model-specific installation is required: the models load directly through the Hugging Face transformers library (with a PyTorch or TensorFlow backend), as shown in the usage examples below.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-base")
model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-base")
```
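As a quick sanity check, the loaded tokenizer and model can be run on a Dutch sentence. This is a minimal sketch: the example sentence is illustrative, and the freshly loaded classification head is randomly initialized until you fine-tune it.

```python
import torch

# Tokenize an illustrative Dutch sentence (not from the original card)
inputs = tokenizer("RobBERT is een Nederlands taalmodel.", return_tensors="pt")

# Forward pass; the classification head still needs fine-tuning before the logits are meaningful
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # (batch_size, num_labels)
```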
Advanced Usage
You can use most of HuggingFace's BERT-based notebooks for finetuning RobBERT-2023 on your own Dutch language dataset. By default, RobBERT-2023 has the masked language model head used in training, which can be used as a zero-shot way to fill masks in sentences. You can test it for free on RobBERT's hosted Inference API on Hugging Face. You can also create a new prediction head for your own task by using any of HuggingFace's [RoBERTa runners](https://huggingface.co/transformers/v2.7.0/examples.html#language-model-training) or their fine-tuning notebooks, changing the model name to DTAI-KULeuven/robbert-2023-dutch-large.
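For example, the masked language model head can be tried locally with the transformers fill-mask pipeline; the Dutch example sentence below is purely illustrative.

```python
from transformers import pipeline

# Zero-shot mask filling with the pretrained masked-language-model head
fill_mask = pipeline("fill-mask", model="DTAI-KULeuven/robbert-2023-dutch-base")

# Build the prompt with the tokenizer's own mask token (RoBERTa-style models typically use "<mask>")
sentence = f"Er staat een {fill_mask.tokenizer.mask_token} in mijn tuin."

for prediction in fill_mask(sentence):
    print(prediction["token_str"], round(prediction["score"], 3))
```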
📚 Documentation
Comparison of Available Dutch BERT models
There is a wide variety of Dutch BERT-based models available for fine-tuning on your tasks. Here's a quick summary:
| Model | Details |
|---|---|
| [DTAI-KULeuven/robbert-2023-dutch-large](https://huggingface.co/DTAI-KULeuven/robbert-2023-dutch-large) | The first Dutch large (355M parameters) model, trained on OSCAR2023 with a new tokenizer using our Tik-to-Tok method. |
| [DTAI-KULeuven/robbert-2023-dutch-base](https://huggingface.co/DTAI-KULeuven/robbert-2023-dutch-base) (this model) | A new RobBERT model on the OSCAR2023 dataset with a completely new tokenizer, helpful for tasks relying on recent words and information. |
| [DTAI-KULeuven/robbert-2022-dutch-base](https://huggingface.co/DTAI-KULeuven/robbert-2022-dutch-base) | A further pre-trained RobBERT model on the OSCAR2022 dataset, useful for tasks related to recent events. |
| [pdelobelle/robbert-v2-dutch-base](https://huggingface.co/pdelobelle/robbert-v2-dutch-base) | For years, the best-performing BERT-like model for most language tasks, trained on a large Dutch web-crawled dataset (OSCAR) using the RoBERTa architecture. |
| [DTAI-KULeuven/robbertje-1-gb-merged](https://huggingface.co/DTAI-KULeuven/robbertje-1-gb-merged) | A distilled version of RobBERT, about half the size and four times faster for inference, suitable for scalable language tasks. |
| [GroNLP/bert-base-dutch-cased](https://huggingface.co/GroNLP/bert-base-dutch-cased) | Uses the outdated basic BERT model and is trained on a smaller corpus of clean Dutch texts. |
How to Replicate Our Paper Experiments
Replicating our paper experiments is [described in detail on the RobBERT repository README](https://github.com/iPieter/RobBERT#how-to-replicate-our-paper-experiments). The pretraining depends on the model; for RobBERT-2023, it is based on our Tik-to-Tok method.
Name Origin of RobBERT
Most BERT-like models have the word BERT in their name. We queried our original RobBERT model using its masked language model with various prompts, and it consistently called itself RobBERT. The name is fitting, as it is a very Dutch name and closely resembles the name of its root architecture, RoBERTa. Since "rob" is a Dutch word for a seal, we designed the RobBERT logo with a seal dressed like Bert from Sesame Street.
🔧 Technical Details
RobBERT-2023 and RobBERT both use the RoBERTa architecture and pre-training procedure, but with a Dutch tokenizer and Dutch training data. RoBERTa is a robustly optimized version of the English BERT model, making it more powerful than the original BERT. Because it shares this architecture, RobBERT can easily be fine-tuned and used for inference with code written for RoBERTa models, and with most code for BERT models, e.g., from the HuggingFace Transformers library.
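As an illustration of that interchangeability, a minimal fine-tuning sketch with the Hugging Face Trainer could look as follows. The CSV data files, label count, and hyperparameters are placeholder assumptions for this example, not settings from the RobBERT papers.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_name = "DTAI-KULeuven/robbert-2023-dutch-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder data: any Dutch classification CSV with "text" and "label" columns works here
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Illustrative hyperparameters, not tuned values from the papers
training_args = TrainingArguments(
    output_dir="robbert-2023-finetuned",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```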
📄 License
The RobBERT models are released under the MIT license.
Credits and citation
The suite of RobBERT models was created by Pieter Delobelle, Thomas Winters, Bettina Berendt, and François Remy. If you would like to cite our paper or model, you can use the following BibTeX:
@misc{delobelle2023robbert2023conversion,
  author = {Delobelle, Pieter and Remy, François},
  title = {RobBERT-2023: Keeping Dutch Language Models Up-To-Date at a Lower Cost Thanks to Model Conversion},
  venue = {The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)},
  month = {Sep},
  year = {2023},
  url = {https://clin33.uantwerpen.be/abstract/robbert-2023-keeping-dutch-language-models-up-to-date-at-a-lower-cost-thanks-to-model-conversion/}
}
@inproceedings{delobelle2022robbert2022,
doi = {10.48550/ARXIV.2211.08192},
url = {https://arxiv.org/abs/2211.08192},
author = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
title = {RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use},
venue = {arXiv},
year = {2022},
}
@inproceedings{delobelle2020robbert,
title = "{R}ob{BERT}: a {D}utch {R}o{BERT}a-based {L}anguage {M}odel",
author = "Delobelle, Pieter and
Winters, Thomas and
Berendt, Bettina",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.292",
doi = "10.18653/v1/2020.findings-emnlp.292",
pages = "3255--3265"
}