📰 T5 v1.1 Base finetuned for CNN news summarization in Dutch 🇳🇱
This model is a fine-tuned version of [t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) on CNN Dailymail NL.
For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for the [Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer) example application!
The Rouge scores for this model are listed below.
🚀 Quick Start
This README provides a detailed introduction to the T5 v1.1 Base model fine-tuned for Dutch CNN news summarization, including the tokenizer, dataset, model details, and Rouge scores.
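As a quick start, here is a minimal sketch (not part of the original card) that loads the fine-tuned checkpoint through the standard `transformers` summarization pipeline; `max_length=96` mirrors the target length used during fine-tuning, while `min_length` is an illustrative choice.

```python
# Minimal sketch: summarize a Dutch news article with the fine-tuned
# checkpoint via the standard transformers summarization pipeline.
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="yhavinga/t5-v1.1-base-dutch-cnn-test",
)

article = "Hier komt de tekst van een Nederlands nieuwsartikel."  # placeholder input

# max_length=96 mirrors the target length used during fine-tuning;
# min_length=16 is an assumption, not a documented setting.
result = summarizer(article, max_length=96, min_length=16, do_sample=False, truncation=True)
print(result[0]["summary_text"])
```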
✨ Features
Tokenizer
- A SentencePiece tokenizer was trained from scratch for Dutch on mC4 nl cleaned, using scripts from the Huggingface Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
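The tokenizer ships with the checkpoints and loads through the usual `AutoTokenizer` API; a minimal sketch:

```python
# Sketch: load the from-scratch Dutch SentencePiece tokenizer that ships
# with the pre-trained checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-v1.1-base-dutch-cased")
print(tokenizer.tokenize("Dit is een voorbeeldzin in het Nederlands."))
```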
Dataset
All the models below are trained on the full configuration (39B tokens) of cleaned Dutch mC4. The cleaning process involves the following steps (a hypothetical code sketch follows the list):
- Removing documents that contain words from a selection of the Dutch and English [List of Dirty, Naughty, Obscene, and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words).
- Removing sentences with fewer than 3 words.
- Removing sentences containing a word of more than 1000 characters.
- Removing documents with fewer than 5 sentences.
- Removing documents that contain the strings "javascript", "lorem ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie".
Models
TL;DR: [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) is the best model.
- [yhavinga/t5-base-dutch](https://huggingface.co/yhavinga/t5-base-dutch) is a re-trained Dutch T5 base v1.0 model from the summer 2021 Flax/Jax community week. Its accuracy was improved from 0.64 to 0.70.
- The two T5 v1.1 base models are an uncased and a cased version of `t5-v1.1-base`, pre-trained from scratch on Dutch, with a tokenizer also trained from scratch. The t5 v1.1 models are slightly different from the t5 models, and the base models were trained with a dropout of 0.0. For fine-tuning, it is intended to set this back to 0.1 (see the sketch after this list).
- The large cased model is a pre-trained Dutch version of `t5-v1.1-large`. Training `t5-v1.1-large` was difficult: without dropout regularization, the training would diverge at a certain point. With dropout, training went better, but much slower than training the t5 model. At some point, convergence was too slow to continue training. The latest checkpoint, training scripts, and metrics are available for reference. For actual fine-tuning, the cased base model is probably a better choice.
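Below is a minimal sketch, not from the original card, of restoring dropout when loading a v1.1 base model for fine-tuning; `dropout_rate` is the standard `T5Config` attribute that these checkpoints set to 0.0 during pre-training.

```python
# Sketch: re-enable dropout for fine-tuning; the v1.1 base checkpoints
# were pre-trained with dropout_rate=0.0.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    "yhavinga/t5-v1.1-base-dutch-cased",
    dropout_rate=0.1,  # restore the dropout intended for fine-tuning
)
```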
| Property | Details |
|---|---|
| Model Type | T5, t5-v1.1 |
| Training Data | Cleaned Dutch mC4, CNN Dailymail NL |
| model | type | train seq len | acc | loss | batch size | epochs | steps | dropout | optim | lr | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [yhavinga/t5-base-dutch](https://huggingface.co/yhavinga/t5-base-dutch) | T5 | 512 | 0.70 | 1.38 | 128 | 1 | 528481 | 0.1 | adafactor | 5e-3 | 2d 9h |
| [yhavinga/t5-v1.1-base-dutch-uncased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-uncased) | t5-v1.1 | 1024 | 0.73 | 1.20 | 64 | 2 | 1014525 | 0.0 | adafactor | 5e-3 | 5d 5h |
| [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) | t5-v1.1 | 1024 | 0.78 | 0.96 | 64 | 2 | 1210000 | 0.0 | adafactor | 5e-3 | 6d 6h |
| [yhavinga/t5-v1.1-large-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cased) | t5-v1.1 | 512 | 0.76 | 1.07 | 64 | 1 | 1120000 | 0.1 | adafactor | 5e-3 | 8d 13h |
The cased t5-v1.1 Dutch models were fine-tuned on summarizing the CNN Daily Mail dataset.
| model | type | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [yhavinga/t5-v1.1-base-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cnn-test) | t5-v1.1 | 1024 | 96 | 34.8 | 13.6 | 25.2 | 32.1 | 79 | 6 | 64 | 26916 | 2h 40m |
| [yhavinga/t5-v1.1-large-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cnn-test) | t5-v1.1 | 1024 | 96 | 34.4 | 13.6 | 25.3 | 31.7 | 81 | 5 | 16 | 89720 | 11h |
📚 Documentation
The Rouge scores for the model `yhavinga/t5-v1.1-base-dutch-cnn-test` on the test split of `ml6team/cnn_dailymail_nl` are as follows:
| Metric | Value |
|---|---|
| ROUGE-1 | 38.5454 |
| ROUGE-2 | 15.7133 |
| ROUGE-L | 25.9162 |
| ROUGE-LSUM | 35.4489 |
| loss | 2.0727603435516357 |
| gen_len | 91.1699 |
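A sketch of recomputing ROUGE in this spirit with the `datasets` and `evaluate` libraries follows; the column names `text` and `summary` and the generation settings are assumptions, not the exact evaluation setup behind the numbers above, so check the dataset card before running.

```python
# Sketch: score the model on a slice of the ml6team/cnn_dailymail_nl
# test split. Column names and generation settings are assumptions.
import evaluate
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("ml6team/cnn_dailymail_nl", split="test[:100]")
summarizer = pipeline("summarization", model="yhavinga/t5-v1.1-base-dutch-cnn-test")
rouge = evaluate.load("rouge")

predictions = [
    out["summary_text"]
    for out in summarizer(dataset["text"], max_length=96, truncation=True)
]
print(rouge.compute(predictions=predictions, references=dataset["summary"]))
```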
📄 License
This project is licensed under the Apache-2.0 license.
🙏 Acknowledgements
This project would not have been possible without the generous compute resources provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was also crucial in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU VM and training the models:
- [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
- [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
- [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)