# t5-base-dutch
This is a Dutch T5 model, pre-trained during the Hugging Face community week to provide strong natural language processing capabilities for Dutch.
## Quick Start
This t5-base-dutch model was created by Yeb Havinga & Dat Nguyen during the Hugging Face community week, organized by Hugging Face with TPU usage sponsored by Google, for the project Pre-train T5 from scratch in Dutch.
You can also check out the fine-tuned t5-base-dutch-demo model and the demo application Netherformer 📰, which are based on this model.
- 5 Jan 2022: Model updated. Evaluation accuracy increased from 0.64 to 0.70.
- 11 Jan 2022: See also yhavinga/t5-v1.1-base-dutch-cased, with an evaluation accuracy of 0.78.
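As a quick smoke test, the checkpoint can be loaded with the transformers library. A minimal sketch, assuming the model lives at the hub id yhavinga/t5-base-dutch; since this checkpoint is only pre-trained (not fine-tuned for a downstream task), the example asks it to fill in a masked span rather than perform a task:

```python
# Minimal loading sketch (assumes the hub id "yhavinga/t5-base-dutch").
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
model = T5ForConditionalGeneration.from_pretrained("yhavinga/t5-base-dutch")

# The checkpoint is pre-trained only, so it is meant to be fine-tuned before
# real use; as a smoke test, let it fill in a masked span marked by the
# <extra_id_0> sentinel token.
text = "Amsterdam is de <extra_id_0> van Nederland."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```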
## Features
- Parameter scale: this T5 model has 223M parameters.
- Pre-training details: it was pre-trained with a masked language modeling (denoising token span corruption) objective on the mc4_nl_cleaned dataset (config full) for 1 epoch and a duration of 2d9h, with a sequence length of 512, a batch size of 128 and 527500 total steps (35B tokens). The pre-training evaluation loss and accuracy are 1.38 and 0.70. The input/target format of the span-corruption objective is sketched after this list.
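To make the objective concrete, the sketch below shows (with a hand-made example, not actual training data) how token span corruption turns a sentence into an input/target pair: masked spans are replaced by sentinel tokens in the input, and the target reconstructs the dropped spans in order.

```python
# Hand-made illustration of T5's token span corruption (denoising) objective.
original = "de kat zat op de mat in de zon"

# Two spans ("zat op" and "in de") are dropped and replaced by sentinels:
corrupted_input = "de kat <extra_id_0> de mat <extra_id_1> zon"

# The target spells out the dropped spans, each introduced by its sentinel,
# and ends with a final sentinel:
target = "<extra_id_0> zat op <extra_id_1> in de <extra_id_2>"
```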
## Documentation
### Tokenizer
The model uses a cased SentencePiece tokenizer configured with the Nmt, NFKC, and "replace multi-space to single-space" normalizers, and has 32003 tokens.
It was trained on Dutch mC4 with scripts from the Hugging Face Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling). See ./raw/main/tokenizer.json for details.
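A small sketch of inspecting the tokenizer, again assuming the hub id yhavinga/t5-base-dutch; with a fast tokenizer, the normalizer sequence can be read off the underlying tokenizers object:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
print(len(tok))  # expected: 32003 tokens, per the description above
print(tok.tokenize("Een voorbeeldzin om de tokenizer te bekijken."))

# The Nmt/NFKC/multi-space normalizers are visible on the fast tokenizer:
print(tok.backend_tokenizer.normalizer)
```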
### Dataset(s)
All models listed below are pre-trained on cleaned Dutch mC4, which is the original mC4 except that:
- Documents containing words from a selection of the Dutch and English List of Dirty, Naughty, Obscene and Otherwise Bad Words are removed.
- Sentences with fewer than 3 words are removed.
- Sentences containing a word of more than 1000 characters are removed.
- Documents with fewer than 5 sentences are removed.
- Documents containing the strings "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken" or "deze printversie" are removed.

The Dutch and English models are pre-trained on a 50/50% mix of Dutch mC4 and English C4. The translation models are fine-tuned on CCMatrix.
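For reference, a sketch of streaming the cleaned dataset with the datasets library; the hub id yhavinga/mc4_nl_cleaned and the config name "full" are assumptions based on the dataset name used above:

```python
from datasets import load_dataset

# Stream the "full" config so the (large) dataset is not downloaded whole.
# Recent datasets versions may additionally need trust_remote_code=True for
# script-based datasets.
ds = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)
print(next(iter(ds))["text"][:200])
```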
### Dutch T5 Models
Three types of [Dutch T5 models have been trained (blog)](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models). t5-base-dutch is the only model with an original T5 config. The other model types, t5-v1.1 and t5-eff, have gated-gelu instead of relu as the activation function and were trained with a dropout of 0.0, unless training would diverge (t5-v1.1-large-dutch-cased). The t5-eff models differ in their number of layers. The table below lists the configuration and pre-training details of all models.
| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
|---|---|---|---|---|---|---|---|---|---|---|---|
| type | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
| d_model | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
| d_ff | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
| num_heads | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
| d_kv | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
| num_layers | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
| num parameters | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
| feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
| dropout | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
| dataset | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
| tr. seq len | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
| batch size | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
| total steps | 527500 | 1014525 | 1210154 | 1120k/2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
| epochs | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
| duration | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d19h | 3d23h |
| optimizer | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
| lr | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
| warmup | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 5000.0 | 20000.0 | 2500.0 | 1000.0 | 5000.0 | 20000.0 |
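The architecture rows in the table (d_model, d_ff, num_heads, d_kv, num_layers, feed_forward_proj, dropout) correspond directly to fields of the transformers T5Config, so they can be verified from a checkpoint. A sketch, assuming the hub id yhavinga/t5-base-dutch:

```python
from transformers import T5Config

config = T5Config.from_pretrained("yhavinga/t5-base-dutch")
print(config.d_model, config.d_ff, config.num_heads, config.d_kv,
      config.num_layers, config.feed_forward_proj, config.dropout_rate)
# Expected per the first column of the table: 768 3072 12 64 12 relu 0.1
```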
## License
This project is licensed under the Apache 2.0 license.