# t5-base-dutch
This is a Dutch T5 model, pre-trained during the Hugging Face community week to provide strong natural language processing capabilities for Dutch.
## Quick Start
This t5-base-dutch model was created by Yeb Havinga & Dat Nguyen during the Hugging Face community week, organized by Hugging Face with TPU usage sponsored by Google, for the project Pre-train T5 from scratch in Dutch.
You can also check out the fine-tuned t5-base-dutch-demo model and the demo application Netherformer 📰, which are based on this model.
- 5 Jan 2022: Model updated. Evaluation accuracy increased from 0.64 to 0.70.
- 11 Jan 2022: See also yhavinga/t5-v1.1-base-dutch-cased, with an evaluation accuracy of 0.78.
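As a quick smoke test, the checkpoint can be loaded with the transformers library. A minimal sketch, assuming the model lives at the hub id yhavinga/t5-base-dutch; since this checkpoint is only pre-trained (not fine-tuned for a downstream task), the example asks it to fill in a masked span rather than perform a task:

```python
# Minimal loading sketch (assumes the hub id "yhavinga/t5-base-dutch").
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
model = T5ForConditionalGeneration.from_pretrained("yhavinga/t5-base-dutch")

# The checkpoint is pre-trained only, so it is meant to be fine-tuned before
# real use; as a smoke test, let it fill in a masked span marked by the
# <extra_id_0> sentinel token.
text = "Amsterdam is de <extra_id_0> van Nederland."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```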
## Features
- Parameter scale: this T5 model has 223M parameters.
- Pre-training details: it was pre-trained with a masked language modeling (denoising token span corruption) objective on the mc4_nl_cleaned dataset (config full) for 1 epoch and a duration of 2d9h, with a sequence length of 512, a batch size of 128 and 527500 total steps (35B tokens). The pre-training evaluation loss and accuracy are 1.38 and 0.70. The input/target format of the span-corruption objective is sketched after this list.
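To make the objective concrete, the sketch below shows (with a hand-made example, not actual training data) how token span corruption turns a sentence into an input/target pair: masked spans are replaced by sentinel tokens in the input, and the target reconstructs the dropped spans in order.

```python
# Hand-made illustration of T5's token span corruption (denoising) objective.
original = "de kat zat op de mat in de zon"

# Two spans ("zat op" and "in de") are dropped and replaced by sentinels:
corrupted_input = "de kat <extra_id_0> de mat <extra_id_1> zon"

# The target spells out the dropped spans, each introduced by its sentinel,
# and ends with a final sentinel:
target = "<extra_id_0> zat op <extra_id_1> in de <extra_id_2>"
```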
## Documentation
### Tokenizer
The model uses a cased SentencePiece tokenizer configured with the Nmt, NFKC, and "replace multi-space to single-space" normalizers, and has 32003 tokens.
It was trained on Dutch mC4 with scripts from the Hugging Face Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling). See ./raw/main/tokenizer.json for details.
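A small sketch of inspecting the tokenizer, again assuming the hub id yhavinga/t5-base-dutch; with a fast tokenizer, the normalizer sequence can be read off the underlying tokenizers object:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("yhavinga/t5-base-dutch")
print(len(tok))  # expected: 32003 tokens, per the description above
print(tok.tokenize("Een voorbeeldzin om de tokenizer te bekijken."))

# The Nmt/NFKC/multi-space normalizers are visible on the fast tokenizer:
print(tok.backend_tokenizer.normalizer)
```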
### Dataset(s)
All models listed below are pre-trained on cleaned Dutch mC4, which is the original mC4 except that:
- Documents containing words from a selection of the Dutch and English List of Dirty, Naughty, Obscene and Otherwise Bad Words are removed.
- Sentences with fewer than 3 words are removed.
- Sentences containing a word of more than 1000 characters are removed.
- Documents with fewer than 5 sentences are removed.
- Documents containing the strings "javascript", "lorum ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken" or "deze printversie" are removed.

The Dutch and English models are pre-trained on a 50/50% mix of Dutch mC4 and English C4. The translation models are fine-tuned on CCMatrix.
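For reference, a sketch of streaming the cleaned dataset with the datasets library; the hub id yhavinga/mc4_nl_cleaned and the config name "full" are assumptions based on the dataset name used above:

```python
from datasets import load_dataset

# Stream the "full" config so the (large) dataset is not downloaded whole.
# Recent datasets versions may additionally need trust_remote_code=True for
# script-based datasets.
ds = load_dataset("yhavinga/mc4_nl_cleaned", "full", split="train", streaming=True)
print(next(iter(ds))["text"][:200])
```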
### Dutch T5 Models
Three types of [Dutch T5 models have been trained (blog)](https://huggingface.co/spaces/yhavinga/pre-training-dutch-t5-models). t5-base-dutch is the only model with an original T5 config. The other model types, t5-v1.1 and t5-eff, have gated-gelu instead of relu as the activation function and were trained with a dropout of 0.0, unless training would diverge (t5-v1.1-large-dutch-cased). The t5-eff models differ in their number of layers. The table below lists the configuration and pre-training details of all models.
| | t5-base-dutch | t5-v1.1-base-dutch-uncased | t5-v1.1-base-dutch-cased | t5-v1.1-large-dutch-cased | t5-v1_1-base-dutch-english-cased | t5-v1_1-base-dutch-english-cased-1024 | t5-small-24L-dutch-english | t5-xl-4L-dutch-english-cased | t5-base-36L-dutch-english-cased | t5-eff-xl-8l-dutch-english-cased | t5-eff-large-8l-dutch-english-cased |
|---|---|---|---|---|---|---|---|---|---|---|---|
| type | t5 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5-v1.1 | t5 eff | t5 eff | t5 eff | t5 eff | t5 eff |
| d_model | 768 | 768 | 768 | 1024 | 768 | 768 | 512 | 2048 | 768 | 1024 | 1024 |
| d_ff | 3072 | 2048 | 2048 | 2816 | 2048 | 2048 | 1920 | 5120 | 2560 | 16384 | 4096 |
| num_heads | 12 | 12 | 12 | 16 | 12 | 12 | 8 | 32 | 12 | 32 | 16 |
| d_kv | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 64 |
| num_layers | 12 | 12 | 12 | 24 | 12 | 12 | 24 | 4 | 36 | 8 | 8 |
| num parameters | 223M | 248M | 248M | 783M | 248M | 248M | 250M | 585M | 729M | 1241M | 335M |
| feed_forward_proj | relu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu | gated-gelu |
| dropout | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 |
| dataset | mc4_nl_cleaned | mc4_nl_cleaned full | mc4_nl_cleaned full | mc4_nl_cleaned | mc4_nl_cleaned small_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl | mc4_nl_cleaned large_en_nl |
| tr. seq len | 512 | 1024 | 1024 | 512 | 512 | 1024 | 512 | 512 | 512 | 512 | 512 |
| batch size | 128 | 64 | 64 | 64 | 128 | 64 | 128 | 512 | 512 | 64 | 128 |
| total steps | 527500 | 1014525 | 1210154 | 1120k/2427498 | 2839630 | 1520k/3397024 | 851852 | 212963 | 212963 | 538k/1703705 | 851850 |
| epochs | 1 | 2 | 2 | 2 | 10 | 4 | 1 | 1 | 1 | 1 | 1 |
| duration | 2d9h | 5d5h | 6d6h | 8d13h | 11d18h | 9d1h | 4d10h | 6d1h | 17d15h | 4d19h | 3d23h |
| optimizer | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor | adafactor |
| lr | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.009 | 0.005 | 0.005 |
| warmup | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 5000.0 | 20000.0 | 2500.0 | 1000.0 | 5000.0 | 20000.0 |
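The architecture rows in the table (d_model, d_ff, num_heads, d_kv, num_layers, feed_forward_proj, dropout) correspond directly to fields of the transformers T5Config, so they can be verified from a checkpoint. A sketch, assuming the hub id yhavinga/t5-base-dutch:

```python
from transformers import T5Config

config = T5Config.from_pretrained("yhavinga/t5-base-dutch")
print(config.d_model, config.d_ff, config.num_heads, config.d_kv,
      config.num_layers, config.feed_forward_proj, config.dropout_rate)
# Expected per the first column of the table: 768 3072 12 64 12 relu 0.1
```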
## License
This project is licensed under the Apache 2.0 license.