# 🚀 German News Title Generation Model

This model generates German news headlines. Although the task is similar to summarization, headlines differ in length, structure, and language style. As a result, state-of-the-art summarization models are not necessarily a good fit for headline generation, and further fine-tuning on this task is required.
## ✨ Features

- Uses Google's [mT5-base](https://huggingface.co/google/mt5-base) as the foundation model.
- Fine-tuned on a corpus of German news articles from BR24 published between 2015 and 2021.
- Can be used like any T5-style model, including via the Hugging Face summarization pipeline.
## 📦 Installation

The original model card does not list explicit installation steps; the model loads with the standard Hugging Face `transformers` library.
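A minimal environment sketch (assumed, not taken from the model card): besides `transformers` and a backend such as PyTorch, mT5 tokenizers require the `sentencepiece` package.

```shell
# Assumed setup for running the usage examples below
pip install transformers sentencepiece torch
```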
## 💻 Usage Examples

### Basic Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "aiautomationlab/german-news-title-gen-mt5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "Als Reaktion auf die Brandserie wurde am Mittwoch bei der Kriminalpolizei Würzburg eine Ermittlungskommission eingerichtet. Ich habe den Eindruck, der Brandstifter wird dreister, kommentiert Rosalinde Schraud, die Bürgermeisterin von Estenfeld, die Brandserie. Gerade die letzten beiden Brandstiftungen seien ungewöhnlich gewesen, da sie mitten am Tag und an frequentierten Straßen stattgefunden haben.Kommt der Brandstifter aus Estenfeld?Norbert Walz ist das letzte Opfer des Brandstifters von Estenfeld. Ein Unbekannter hat am Dienstagnachmittag sein Gartenhaus angezündet.Was da in seinem Kopf herumgeht, was da passiert – das ist ja unglaublich! Das kann schon jemand aus dem Ort sein, weil sich derjenige auskennt.Norbert Walz aus Estenfeld.Dass es sich beim Brandstifter wohl um einen Bürger ihrer Gemeinde handele, will die erste Bürgermeisterin von Estenfeld, Rosalinde Schraud, nicht bestätigen: In der Bevölkerung gibt es natürlich Spekulationen, an denen ich mich aber nicht beteiligen will. Laut Schraud reagiert die Bürgerschaft mit vermehrter Aufmerksamkeit auf die Brände: Man guckt mehr in die Nachbarschaft. Aufhören wird die Brandserie wohl nicht, solange der Täter nicht gefasst wird.Es wäre nicht ungewöhnlich, dass der Täter aus der Umgebung von Estenfeld stammt. Wir bitten deshalb Zeugen, die sachdienliche Hinweise sowohl zu den Bränden geben können, sich mit unserer Kriminalpolizei in Verbindung zu setzen.Philipp Hümmer, Sprecher des Polizeipräsidiums UnterfrankenFür Hinweise, die zur Ergreifung des Täters führen, hat das Bayerische Landeskriminalamt eine Belohnung von 2.000 Euro ausgesetzt."

# The model was fine-tuned with the "summarize: " prefix, so prepend it here as well.
input_text = "summarize: " + text
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, num_beams=5)
generated_headline = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_headline)
```
### Advanced Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model_id = "aiautomationlab/german-news-title-gen-mt5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Wrap model and tokenizer in a Hugging Face summarization pipeline with beam search.
headline_generator = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
    num_beams=5,
)

text = "Als Reaktion auf die Brandserie wurde am Mittwoch bei der Kriminalpolizei Würzburg eine Ermittlungskommission eingerichtet. Ich habe den Eindruck, der Brandstifter wird dreister, kommentiert Rosalinde Schraud, die Bürgermeisterin von Estenfeld, die Brandserie. Gerade die letzten beiden Brandstiftungen seien ungewöhnlich gewesen, da sie mitten am Tag und an frequentierten Straßen stattgefunden haben.Kommt der Brandstifter aus Estenfeld?Norbert Walz ist das letzte Opfer des Brandstifters von Estenfeld. Ein Unbekannter hat am Dienstagnachmittag sein Gartenhaus angezündet.Was da in seinem Kopf herumgeht, was da passiert – das ist ja unglaublich! Das kann schon jemand aus dem Ort sein, weil sich derjenige auskennt.Norbert Walz aus Estenfeld.Dass es sich beim Brandstifter wohl um einen Bürger ihrer Gemeinde handele, will die erste Bürgermeisterin von Estenfeld, Rosalinde Schraud, nicht bestätigen: In der Bevölkerung gibt es natürlich Spekulationen, an denen ich mich aber nicht beteiligen will. Laut Schraud reagiert die Bürgerschaft mit vermehrter Aufmerksamkeit auf die Brände: Man guckt mehr in die Nachbarschaft. Aufhören wird die Brandserie wohl nicht, solange der Täter nicht gefasst wird.Es wäre nicht ungewöhnlich, dass der Täter aus der Umgebung von Estenfeld stammt. Wir bitten deshalb Zeugen, die sachdienliche Hinweise sowohl zu den Bränden geben können, sich mit unserer Kriminalpolizei in Verbindung zu setzen.Philipp Hümmer, Sprecher des Polizeipräsidiums UnterfrankenFür Hinweise, die zur Ergreifung des Täters führen, hat das Bayerische Landeskriminalamt eine Belohnung von 2.000 Euro ausgesetzt."

input_text = "summarize: " + text
generated_headline = headline_generator(input_text)[0]["summary_text"]
print(generated_headline)
```
## 📚 Documentation

### Dataset & Preprocessing
The model was fine-tuned on a corpus of news articles from BR24 published between 2015 and 2021. The texts are in German and cover various news topics such as politics, sports, and culture, with a focus on topics relevant to people in Bavaria, Germany.

In the preprocessing step, article-headline pairs meeting any of the following criteria were filtered out:

- Very short articles (the number of words in the text is less than 3 times the number of words in the headline).
- Articles whose headlines contain only words that do not appear in the text (after lemmatization and stopword removal).
- Articles whose headlines are just the name of a known text format (e.g., "Das war der Tag", a format summarizing the most important topics of the day).

Additionally, the prefix `summarize: ` was added to all articles to leverage the pretrained summarization capabilities of mT5.

After filtering, the corpus contained 89,098 article-headline pairs: 87,306 for training, 902 for validation, and 890 for testing.
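The length-based filter criterion above can be sketched as follows (a hypothetical helper for illustration, not code from the original preprocessing pipeline):

```python
def is_too_short(article: str, headline: str) -> bool:
    """Filter criterion 1: the article has fewer than 3x as many words as the headline."""
    return len(article.split()) < 3 * len(headline.split())

# A two-word article paired with a one-word headline is filtered out:
print(is_too_short("Kurzer Text", "Brandserie"))  # True (2 < 3 * 1)

# A four-word article clears the threshold for a one-word headline:
print(is_too_short("Ein etwas laengerer Artikeltext", "Brandserie"))  # False (4 >= 3 * 1)
```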
### Training
After multiple fine-tuning test runs, the final model was trained with the following parameters:

- Foundation model: mT5-base
- Input prefix: `"summarize: "`
- Num train epochs: 10
- Learning rate: 5e-5
- Warmup ratio: 0.3
- LR scheduler type: constant_with_warmup
- Per-device train batch size: 3
- Gradient accumulation steps: 2
- fp16: False

A checkpoint is stored and evaluated on the validation set every 5,000 steps. After training, the checkpoint with the best cross-entropy loss on the validation set is kept as the final model.
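Assuming training on a single device, the hyperparameters above imply the following step counts (a back-of-the-envelope sketch; these derived numbers are not reported in the original card):

```python
num_examples = 87_306          # training pairs after filtering
per_device_batch_size = 3
grad_accum_steps = 2
num_epochs = 10
warmup_ratio = 0.3

# Effective batch size per optimizer step: 3 examples x 2 accumulation steps = 6.
effective_batch_size = per_device_batch_size * grad_accum_steps

steps_per_epoch = num_examples // effective_batch_size  # 14,551
total_steps = steps_per_epoch * num_epochs              # 145,510
warmup_steps = round(total_steps * warmup_ratio)        # 43,653 warmup steps

print(effective_batch_size, steps_per_epoch, total_steps, warmup_steps)
```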
## 🔧 Technical Details

The model uses Google's [mT5-base](https://huggingface.co/google/mt5-base) as the foundation model. During preprocessing, the filtering criteria described above are applied to the article-headline pairs, and the `summarize: ` prefix is added to each article. Checkpoints are saved and evaluated at regular intervals during fine-tuning, and the best one by validation loss is kept.
## 📄 License

The model is released under the MIT license.
## Limitations

Like most state-of-the-art summarization models, this model has issues with the factuality of the generated texts [^factuality].

> ⚠️ **Important Note**
>
> It is therefore strongly advised to have a human fact-check the generated headlines.

An analysis of possible biases reproduced by the model, regardless of whether they originate from the fine-tuning data or the underlying mT5 model, is beyond the scope of this work. We assume that biases exist within the model; analyzing them is a task for future work.

Because the model was trained on news articles from 2015-2021, further biases and factual errors could emerge due to topic shifts in the news and changes in the (e.g., political) situation.
## Evaluation

### Quantitative

The model was evaluated on a held-out test set of 890 article-headline pairs. Headlines were generated using beam search with a beam width of 5.
| Property | Details |
|---|---|
| Model Type | aiautomationlab/german-news-title-gen-mt5 |
| Training Data | News articles from BR24 published between 2015 and 2021 |

| model | Rouge1 | Rouge2 | RougeL | RougeLsum |
|---|---|---|---|---|
| [T-Systems-onsite/mt5-small-sum-de-en-v2](https://huggingface.co/T-Systems-onsite/mt5-small-sum-de-en-v2) | 0.107 | 0.0297 | 0.098 | 0.098 |
| aiautomationlab/german-news-title-gen-mt5 | 0.3131 | 0.0873 | 0.1997 | 0.1997 |
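ROUGE-1 measures unigram overlap between a generated headline and the reference. A minimal sketch of the F1 variant (illustrative only; the reported scores were computed with a full ROUGE implementation, which also handles stemming and multiple variants):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 3 of 3 candidate words match, 3 of 5 reference words are covered:
print(rouge1_f1("brandserie in estenfeld", "brandserie in estenfeld haelt an"))  # 0.75
```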
For evaluating the factuality of the generated headlines, three state-of-the-art metrics for summary evaluation were used. Since these metrics are only available for English, the texts and generated headlines were translated from German to English using the [DeepL API](https://www.deepl.com/en/docs-api/) in a preprocessing step.
- **SummaC-CZ** [^summac]: yields a score between -1 and 1, representing the difference between entailment probability and contradiction probability (-1: the headline is not entailed in the text and is completely contradicted by it; 1: the headline is fully entailed in the text and not contradicted by it).
  Parameters:
  - `model_name`: [vitc](https://huggingface.co/tals/albert-xlarge-vitaminc-mnli)
- **QAFactEval** [^qafacteval]: the LERC QuIP score is used, which is reported to perform best in the corresponding paper. It yields a value between 0 and 5, representing the overlap between answers, based on the headline and on the text, to questions generated from the headline (0: no overlap; 5: perfect overlap).
  Parameters:
  - `use_lerc_quip`: True
- **DAE** (dependency arc entailment) [^dae]: yields a binary value of either 0 or 1, representing whether all dependency arcs in the headline are entailed in the text (0: at least one dependency arc is not entailed; 1: all dependency arcs are entailed).
  Parameters:
  - Model checkpoint: DAE_xsum_human_best_ckpt
  - `model_type`: model_type
  - `max_seq_length`: 512
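The SummaC-CZ score above can be illustrated by its defining arithmetic (a toy sketch; the real metric aggregates NLI entailment and contradiction probabilities over sentence pairs):

```python
def summac_cz_style_score(p_entail: float, p_contradict: float) -> float:
    """Difference between entailment and contradiction probability, in [-1, 1]."""
    return p_entail - p_contradict

# High entailment, low contradiction -> score near 1 (headline well supported).
print(summac_cz_style_score(0.9, 0.05))

# Low entailment, high contradiction -> negative score (headline contradicted).
print(summac_cz_style_score(0.1, 0.8))
```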
Each metric is calculated for all article-headline pairs in the test set, and the mean score over the test set is reported.
| model | SummacCZ | QAFactEval | DAE |
|---|---|---|---|
| [T-Systems-onsite/mt5-small-sum-de-en-v2](https://huggingface.co/T-Systems-onsite/mt5-small-sum-de-en-v2) | 0.6969 | 3.3023 | 0.8292 |
| aiautomationlab/german-news-title-gen-mt5 | 0.4419 | 1.9265 | 0.7438 |
Our model scores consistently lower than the T-Systems model. Human evaluation suggests that, to match the structure and style specific to headlines, a headline-generation model has to be more abstractive than a summarization model, which leads to more frequent hallucinations in the generated output.
### Qualitative

A qualitative evaluation by members of the BR AI + Automation Lab showed that the model succeeds in producing headlines that match the language and style of news headlines, but it also confirmed the issues with factual consistency common to state-of-the-art summarization models.
## Future Work

Future work on this model will focus on generating headlines with higher factual consistency.






