🚀 RobBERT-2023: Keeping Dutch Language Models Up-To-Date
RobBERT-2023 is the 2023 release of the Dutch RobBERT model. It addresses the evolving Dutch language by leveraging the 2023 version of the OSCAR dataset. With both base and large models available, it outperforms previous versions and other Dutch BERT-like models on benchmarks.
🚀 Quick Start
RobBERT-2023 is the 2023 release of the Dutch RobBERT model. It is a new version of the original pdelobelle/robbert-v2-dutch-base model, trained on the 2023 version of the OSCAR dataset. We have released a base model and an additional large model with 355M parameters (roughly three times the size of robbert-2022-base). Both models outperform robbert-v2-base and robbert-2022-base on the DUMB benchmark from GroNLP, and robbert-2023-dutch-large surpasses BERTje by +18.6 points.
✨ Features
- Up-to-date Training: Trained on the 2023 version of the OSCAR dataset to account for the evolving Dutch language.
- Large Model Available: A large model with 355M parameters is released, offering more powerful performance.
- High Benchmark Performance: Surpasses previous models and BERTje on the DUMB benchmark.
📦 Installation
No model-specific installation is required: the models load directly through the Hugging Face transformers library (with a PyTorch or TensorFlow backend), as shown in the usage examples below.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-base")
model = AutoModelForSequenceClassification.from_pretrained("DTAI-KULeuven/robbert-2023-dutch-base")
```
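As a quick sanity check, the loaded tokenizer and model can be run on a Dutch sentence. This is a minimal sketch: the example sentence is illustrative, and the freshly loaded classification head is randomly initialized until you fine-tune it.

```python
import torch

# Tokenize an illustrative Dutch sentence (not from the original card)
inputs = tokenizer("RobBERT is een Nederlands taalmodel.", return_tensors="pt")

# Forward pass; the classification head still needs fine-tuning before the logits are meaningful
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.logits.shape)  # (batch_size, num_labels)
```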
Advanced Usage
You can use most of HuggingFace's BERT-based notebooks for finetuning RobBERT-2023 on your own Dutch language dataset. By default, RobBERT-2023 has the masked language model head used in training, which can be used as a zero-shot way to fill masks in sentences. You can test it for free on RobBERT's hosted Inference API on Hugging Face. You can also create a new prediction head for your own task by using any of HuggingFace's [RoBERTa runners](https://huggingface.co/transformers/v2.7.0/examples.html#language-model-training) or their fine-tuning notebooks, changing the model name to DTAI-KULeuven/robbert-2023-dutch-large.
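For example, the masked language model head can be tried locally with the transformers fill-mask pipeline; the Dutch example sentence below is purely illustrative.

```python
from transformers import pipeline

# Zero-shot mask filling with the pretrained masked-language-model head
fill_mask = pipeline("fill-mask", model="DTAI-KULeuven/robbert-2023-dutch-base")

# Build the prompt with the tokenizer's own mask token (RoBERTa-style models typically use "<mask>")
sentence = f"Er staat een {fill_mask.tokenizer.mask_token} in mijn tuin."

for prediction in fill_mask(sentence):
    print(prediction["token_str"], round(prediction["score"], 3))
```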
📚 Documentation
Comparison of Available Dutch BERT models
There is a wide variety of Dutch BERT-based models available for fine-tuning on your tasks. Here's a quick summary:
| Model | Details |
|---|---|
| [DTAI-KULeuven/robbert-2023-dutch-large](https://huggingface.co/DTAI-KULeuven/robbert-2023-dutch-large) | The first Dutch large (355M parameters) model, trained on OSCAR2023 with a new tokenizer using our Tik-to-Tok method. |
| [DTAI-KULeuven/robbert-2023-dutch-base](https://huggingface.co/DTAI-KULeuven/robbert-2023-dutch-base) (this model) | A new RobBERT model on the OSCAR2023 dataset with a completely new tokenizer, helpful for tasks relying on recent words and information. |
| [DTAI-KULeuven/robbert-2022-dutch-base](https://huggingface.co/DTAI-KULeuven/robbert-2022-dutch-base) | A further pre-trained RobBERT model on the OSCAR2022 dataset, useful for tasks related to recent events. |
| [pdelobelle/robbert-v2-dutch-base](https://huggingface.co/pdelobelle/robbert-v2-dutch-base) | For years, the best-performing BERT-like model for most language tasks, trained on a large Dutch web-crawled dataset (OSCAR) using the RoBERTa architecture. |
| [DTAI-KULeuven/robbertje-1-gb-merged](https://huggingface.co/DTAI-KULeuven/robbertje-1-gb-merged) | A distilled version of RobBERT, about half the size and four times faster for inference, suitable for scalable language tasks. |
| [GroNLP/bert-base-dutch-cased](https://huggingface.co/GroNLP/bert-base-dutch-cased) | Uses the outdated basic BERT model and is trained on a smaller corpus of clean Dutch texts. |
How to Replicate Our Paper Experiments
Replicating our paper experiments is [described in detail on the RobBERT repository README](https://github.com/iPieter/RobBERT#how-to-replicate-our-paper-experiments). The pretraining depends on the model; for RobBERT-2023, it is based on our Tik-to-Tok method.
Name Origin of RobBERT
Most BERT-like models have the word BERT in their name. We queried our original RobBERT model using its masked language model with various prompts, and it consistently called itself RobBERT. The name is fitting, as it is a very Dutch name and closely resembles the name of its root architecture, RoBERTa. Since "rob" is a Dutch word for a seal, we designed the RobBERT logo with a seal dressed like Bert from Sesame Street.
🔧 Technical Details
RobBERT-2023 and RobBERT both use the RoBERTa architecture and pre-training procedure, but with a Dutch tokenizer and Dutch training data. RoBERTa is a robustly optimized version of the English BERT model, making it more powerful than the original BERT. Because it shares this architecture, RobBERT can easily be fine-tuned and used for inference with code written for RoBERTa models, and with most code for BERT models, e.g., from the HuggingFace Transformers library.
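As an illustration of that interchangeability, a minimal fine-tuning sketch with the Hugging Face Trainer could look as follows. The CSV data files, label count, and hyperparameters are placeholder assumptions for this example, not settings from the RobBERT papers.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

model_name = "DTAI-KULeuven/robbert-2023-dutch-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder data: any Dutch classification CSV with "text" and "label" columns works here
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Illustrative hyperparameters, not tuned values from the papers
training_args = TrainingArguments(
    output_dir="robbert-2023-finetuned",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```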
📄 License
The RobBERT models are released under the MIT license.
Credits and citation
The suite of RobBERT models was created by Pieter Delobelle, Thomas Winters, Bettina Berendt, and François Remy. If you would like to cite our paper or model, you can use the following BibTeX:
@misc{delobelle2023robbert2023conversion,
  author = {Delobelle, Pieter and Remy, François},
  title = {RobBERT-2023: Keeping Dutch Language Models Up-To-Date at a Lower Cost Thanks to Model Conversion},
  venue = {The 33rd Meeting of Computational Linguistics in The Netherlands (CLIN 33)},
  month = {Sep},
  year = {2023},
  url = {https://clin33.uantwerpen.be/abstract/robbert-2023-keeping-dutch-language-models-up-to-date-at-a-lower-cost-thanks-to-model-conversion/}
}
@inproceedings{delobelle2022robbert2022,
doi = {10.48550/ARXIV.2211.08192},
url = {https://arxiv.org/abs/2211.08192},
author = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
title = {RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use},
venue = {arXiv},
year = {2022},
}
@inproceedings{delobelle2020robbert,
title = "{R}ob{BERT}: a {D}utch {R}o{BERT}a-based {L}anguage {M}odel",
author = "Delobelle, Pieter and
Winters, Thomas and
Berendt, Bettina",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.292",
doi = "10.18653/v1/2020.findings-emnlp.292",
pages = "3255--3265"
}