📰 T5 v1.1 Base finetuned for CNN news summarization in Dutch 🇳🇱
This model is a fine-tuned version of [t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) on CNN Dailymail NL.
For a demo of the Dutch CNN summarization models, head over to the Hugging Face Spaces for the [Netherformer 📰](https://huggingface.co/spaces/flax-community/netherformer) example application!
The Rouge scores for this model are listed below.
🚀 Quick Start
This README provides a detailed introduction to the T5 v1.1 Base model fine-tuned for Dutch CNN news summarization, including the tokenizer, dataset, model details, and Rouge scores.
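As a quick start, here is a minimal sketch (not part of the original card) that loads the fine-tuned checkpoint through the standard `transformers` summarization pipeline; `max_length=96` mirrors the target length used during fine-tuning, while `min_length` is an illustrative choice.

```python
# Minimal sketch: summarize a Dutch news article with the fine-tuned
# checkpoint via the standard transformers summarization pipeline.
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="yhavinga/t5-v1.1-base-dutch-cnn-test",
)

article = "Hier komt de tekst van een Nederlands nieuwsartikel."  # placeholder input

# max_length=96 mirrors the target length used during fine-tuning;
# min_length=16 is an assumption, not a documented setting.
result = summarizer(article, max_length=96, min_length=16, do_sample=False, truncation=True)
print(result[0]["summary_text"])
```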
✨ Features
Tokenizer
- A SentencePiece tokenizer was trained from scratch for Dutch on mC4 nl cleaned, using scripts from the Huggingface Transformers [Flax examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling).
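The tokenizer ships with the checkpoints and loads through the usual `AutoTokenizer` API; a minimal sketch:

```python
# Sketch: load the from-scratch Dutch SentencePiece tokenizer that ships
# with the pre-trained checkpoint.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yhavinga/t5-v1.1-base-dutch-cased")
print(tokenizer.tokenize("Dit is een voorbeeldzin in het Nederlands."))
```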
Dataset
All the models below are trained on the full configuration (39B tokens) of cleaned Dutch mC4. The cleaning process involves the following steps (a hypothetical code sketch follows the list):
- Removing documents that contain words from a selection of the Dutch and English [List of Dirty, Naughty, Obscene, and Otherwise Bad Words](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words).
- Removing sentences with fewer than 3 words.
- Removing sentences containing a word of more than 1000 characters.
- Removing documents with fewer than 5 sentences.
- Removing documents that contain the strings "javascript", "lorem ipsum", "terms of use", "privacy policy", "cookie policy", "uses cookies", "use of cookies", "use cookies", "elementen ontbreken", or "deze printversie".
Models
TL;DR: [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) is the best model.
- [yhavinga/t5-base-dutch](https://huggingface.co/yhavinga/t5-base-dutch) is a re-trained Dutch T5 base v1.0 model from the summer 2021 Flax/Jax community week. Its accuracy was improved from 0.64 to 0.70.
- The two T5 v1.1 base models are an uncased and a cased version of `t5-v1.1-base`, pre-trained from scratch on Dutch, with a tokenizer also trained from scratch. The t5 v1.1 models are slightly different from the t5 models, and the base models were trained with a dropout of 0.0. For fine-tuning, it is intended to set this back to 0.1 (see the sketch after this list).
- The large cased model is a pre-trained Dutch version of `t5-v1.1-large`. Training `t5-v1.1-large` was difficult: without dropout regularization, the training would diverge at a certain point. With dropout, training went better, but much slower than training the t5 model. At some point, convergence was too slow to continue training. The latest checkpoint, training scripts, and metrics are available for reference. For actual fine-tuning, the cased base model is probably a better choice.
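Below is a minimal sketch, not from the original card, of restoring dropout when loading a v1.1 base model for fine-tuning; `dropout_rate` is the standard `T5Config` attribute that these checkpoints set to 0.0 during pre-training.

```python
# Sketch: re-enable dropout for fine-tuning; the v1.1 base checkpoints
# were pre-trained with dropout_rate=0.0.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    "yhavinga/t5-v1.1-base-dutch-cased",
    dropout_rate=0.1,  # restore the dropout intended for fine-tuning
)
```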
| Property | Details |
|---|---|
| Model Type | T5, t5-v1.1 |
| Training Data | Cleaned Dutch mC4, CNN Dailymail NL |
| model | type | train seq len | acc | loss | batch size | epochs | steps | dropout | optim | lr | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| [yhavinga/t5-base-dutch](https://huggingface.co/yhavinga/t5-base-dutch) | T5 | 512 | 0.70 | 1.38 | 128 | 1 | 528481 | 0.1 | adafactor | 5e-3 | 2d 9h |
| [yhavinga/t5-v1.1-base-dutch-uncased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-uncased) | t5-v1.1 | 1024 | 0.73 | 1.20 | 64 | 2 | 1014525 | 0.0 | adafactor | 5e-3 | 5d 5h |
| [yhavinga/t5-v1.1-base-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cased) | t5-v1.1 | 1024 | 0.78 | 0.96 | 64 | 2 | 1210000 | 0.0 | adafactor | 5e-3 | 6d 6h |
| [yhavinga/t5-v1.1-large-dutch-cased](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cased) | t5-v1.1 | 512 | 0.76 | 1.07 | 64 | 1 | 1120000 | 0.1 | adafactor | 5e-3 | 8d 13h |
The cased t5-v1.1 Dutch models were fine-tuned on summarizing the CNN Daily Mail dataset.
| model | type | input len | target len | Rouge1 | Rouge2 | RougeL | RougeLsum | Test Gen Len | epochs | batch size | steps | duration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| [yhavinga/t5-v1.1-base-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-base-dutch-cnn-test) | t5-v1.1 | 1024 | 96 | 34.8 | 13.6 | 25.2 | 32.1 | 79 | 6 | 64 | 26916 | 2h 40m |
| [yhavinga/t5-v1.1-large-dutch-cnn-test](https://huggingface.co/yhavinga/t5-v1.1-large-dutch-cnn-test) | t5-v1.1 | 1024 | 96 | 34.4 | 13.6 | 25.3 | 31.7 | 81 | 5 | 16 | 89720 | 11h |
📚 Documentation
The Rouge scores for the model `yhavinga/t5-v1.1-base-dutch-cnn-test` on the test split of `ml6team/cnn_dailymail_nl` are as follows:
| Metric | Value |
|---|---|
| ROUGE-1 | 38.5454 |
| ROUGE-2 | 15.7133 |
| ROUGE-L | 25.9162 |
| ROUGE-LSUM | 35.4489 |
| loss | 2.0727603435516357 |
| gen_len | 91.1699 |
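A sketch of recomputing ROUGE in this spirit with the `datasets` and `evaluate` libraries follows; the column names `text` and `summary` and the generation settings are assumptions, not the exact evaluation setup behind the numbers above, so check the dataset card before running.

```python
# Sketch: score the model on a slice of the ml6team/cnn_dailymail_nl
# test split. Column names and generation settings are assumptions.
import evaluate
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("ml6team/cnn_dailymail_nl", split="test[:100]")
summarizer = pipeline("summarization", model="yhavinga/t5-v1.1-base-dutch-cnn-test")
rouge = evaluate.load("rouge")

predictions = [
    out["summary_text"]
    for out in summarizer(dataset["text"], max_length=96, truncation=True)
]
print(rouge.compute(predictions=predictions, references=dataset["summary"]))
```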
📄 License
This project is licensed under the Apache-2.0 license.
🙏 Acknowledgements
This project would not have been possible without the generous compute resources provided by Google through the TPU Research Cloud. The HuggingFace 🤗 ecosystem was also crucial in many, if not all, parts of the training. The following repositories were helpful in setting up the TPU VM and training the models:
- [Gsarti's Pretrain and Fine-tune a T5 model with Flax on GCP](https://github.com/gsarti/t5-flax-gcp)
- [HuggingFace Flax MLM examples](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
- [Flax/Jax Community week t5-base-dutch](https://huggingface.co/flax-community/t5-base-dutch)
Created by [Yeb Havinga](https://www.linkedin.com/in/yeb-havinga-86530825/)