# Automatic Press Article Summarization

This model is based on the `facebook/mbart-large-50` architecture and fine-tuned on press articles from the MLSUM dataset, under the assumption that article headlines make good reference summaries.
## Quick Start

This model is designed for summarizing press articles. It is based on the pre-trained `facebook/mbart-large-50` model and has been fine-tuned on the MLSUM dataset.
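As a minimal sketch, the model can be loaded directly from the Hugging Face Hub through the high-level `pipeline` API (the full example is shown under Usage Examples below):

```python
from transformers import pipeline

# Downloads the fine-tuned checkpoint and its tokenizer from the Hub.
summarizer = pipeline("summarization", model="lincoln/mbart-mlsum-automatic-summarization")

# Replace the placeholder with a French press article.
print(summarizer("<article de presse en français>")[0]["summary_text"])
```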
## Features

- Model Architecture: Based on `facebook/mbart-large-50`.
- Training Data: Uses press articles from the MLSUM dataset.
- Task: Specialized for press article summarization.
## Installation

No specific installation steps were provided in the original model card; the usage example below only requires the Hugging Face `transformers` library (e.g. `pip install transformers`).
## Usage Examples

### Basic Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, SummarizationPipeline

model_name = 'lincoln/mbart-mlsum-automatic-summarization'

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
loaded_tokenizer = AutoTokenizer.from_pretrained(model_name)
loaded_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Wrap model and tokenizer in a summarization pipeline
nlp = SummarizationPipeline(model=loaded_model, tokenizer=loaded_tokenizer)

# Summarize a French press article (the model was fine-tuned on French MLSUM data)
nlp("""
« La veille de l'ouverture, je vais faire venir un coach pour les salariés qui reprendront le travail.
Cela va me coûter 300 euros, mais après des mois d'oisiveté obligatoire, la reprise n'est pas simple.
Certains sont au chômage partiel depuis mars 2020 », raconte Alain Fontaine, propriétaire du restaurant Le Mesturet,
dans le quartier de la Bourse, à Paris. Cette date d'ouverture, désormais, il la connaît. Emmanuel Macron a, en effet,
donné le feu vert pour un premier accueil des clients en terrasse, mercredi 19 mai. M. Fontaine imagine même faire venir un orchestre ce jour-là pour fêter l'événement.
Il lui reste toutefois à construire sa terrasse. Il pensait que les ouvriers passeraient samedi 1er mai pour l'installer, mais, finalement, le rendez-vous a été décalé.
Pour l'instant, le tas de bois est entreposé dans la salle de restaurant qui n'a plus accueilli de convives depuis le 29 octobre 2020,
quand le couperet de la fermeture administrative est tombé. M. Fontaine, président de l'Association française des maîtres restaurateurs,
ne manquera pas de concurrents prêts à profiter de ce premier temps de réouverture des bars et restaurants. Même si le couvre-feu limite le service à 21 heures.
D'autant que la Mairie de Paris vient d'annoncer le renouvellement des terrasses éphémères installées en 2020 et leur gratuité jusqu'à la fin de l'été.
""")
```
## Documentation

### Training

We tested two model architectures (T5 and BART) with input lengths of 512 or 1024 tokens. The BART model with 512-token inputs was ultimately selected. It was trained for 2 epochs (~700K articles) on a Tesla V100 (32 hours of training).
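For reference, here is a minimal sketch of this kind of fine-tuning setup with the `transformers` Trainer API. The hyperparameters, summary length cap, and language codes are assumptions, as the card does not publish the actual training script:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# MLSUM's French split exposes "text" (article) and "summary" columns.
dataset = load_dataset("mlsum", "fr")

# French as both source and target language for the multilingual mBART-50 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50", src_lang="fr_XX", tgt_lang="fr_XX")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")

def preprocess(batch):
    # 512-token inputs, matching the architecture choice described above;
    # the 64-token summary cap is an assumption.
    inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mbart-mlsum",
    num_train_epochs=2,             # matches the 2 epochs reported above
    per_device_train_batch_size=4,  # assumption; not stated in the card
    learning_rate=3e-5,             # assumption; not stated in the card
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```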
### Results

We compared our model (labeled `mbart-large-512-full` in the results graph) with two references:
- MBERT, which corresponds to the performance of the model trained by the team that created the MLSUM dataset.
- BARThez, another model trained on press articles from the OrangeSum dataset.
The novelty score of our model (see the MLSUM paper) is not yet comparable to these two references, and even further from human-produced summaries. Nevertheless, the generated summaries are generally of good quality.
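As a rough illustration of what the novelty score measures (the exact definition is given in the MLSUM paper), the sketch below computes the fraction of summary n-grams that never appear in the source article; purely extractive summaries score 0, more abstractive ones score higher:

```python
def novelty(source: str, summary: str, n: int = 1) -> float:
    """Fraction of summary n-grams that do not occur in the source article.

    An illustrative approximation; see the MLSUM paper for the exact
    definition behind the reported scores.
    """
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - ngrams(source)) / len(summary_ngrams)

# A purely extractive summary has novelty 0.0
print(novelty("le restaurant rouvre sa terrasse mercredi", "le restaurant rouvre"))  # 0.0
```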
## Technical Details

The model is fine-tuned from `facebook/mbart-large-50` on press articles from the MLSUM dataset. The BART architecture with 512-token inputs was chosen after testing different architectures and input lengths. Training ran on a Tesla V100 for 2 epochs over approximately 700K articles.
## License
This project is licensed under the MIT license.
## Citation
```bibtex
@article{scialom2020mlsum,
  title={MLSUM: The Multilingual Summarization Corpus},
  author={Thomas Scialom and Paul-Alexis Dray and Sylvain Lamprier and Benjamin Piwowarski and Jacopo Staiano},
  year={2020},
  eprint={2004.14900},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```