# Automatic Press Article Summarization

This model is based on the `facebook/mbart-large-50` architecture and fine-tuned on press articles from the MLSUM dataset, under the assumption that article headlines make good reference summaries.
## Quick Start

This model is designed for summarizing press articles. It is based on the pre-trained `facebook/mbart-large-50` model and has been fine-tuned on the MLSUM dataset.
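As a minimal sketch, the model can be loaded directly from the Hugging Face Hub through the high-level `pipeline` API (the full example is shown under Usage Examples below):

```python
from transformers import pipeline

# Downloads the fine-tuned checkpoint and its tokenizer from the Hub.
summarizer = pipeline("summarization", model="lincoln/mbart-mlsum-automatic-summarization")

# Replace the placeholder with a French press article.
print(summarizer("<article de presse en français>")[0]["summary_text"])
```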
## Features

- Model Architecture: Based on `facebook/mbart-large-50`.
- Training Data: Uses press articles from the MLSUM dataset.
- Task: Specialized for press article summarization.
## Installation

No specific installation steps were provided in the original model card; the usage example below only requires the Hugging Face `transformers` library (e.g. `pip install transformers`).
## Usage Examples

### Basic Usage
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, SummarizationPipeline

model_name = 'lincoln/mbart-mlsum-automatic-summarization'

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
loaded_tokenizer = AutoTokenizer.from_pretrained(model_name)
loaded_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Wrap model and tokenizer in a summarization pipeline
nlp = SummarizationPipeline(model=loaded_model, tokenizer=loaded_tokenizer)

# Summarize a French press article (the model was fine-tuned on French MLSUM data)
nlp("""
« La veille de l'ouverture, je vais faire venir un coach pour les salariés qui reprendront le travail.
Cela va me coûter 300 euros, mais après des mois d'oisiveté obligatoire, la reprise n'est pas simple.
Certains sont au chômage partiel depuis mars 2020 », raconte Alain Fontaine, propriétaire du restaurant Le Mesturet,
dans le quartier de la Bourse, à Paris. Cette date d'ouverture, désormais, il la connaît. Emmanuel Macron a, en effet,
donné le feu vert pour un premier accueil des clients en terrasse, mercredi 19 mai. M. Fontaine imagine même faire venir un orchestre ce jour-là pour fêter l'événement.
Il lui reste toutefois à construire sa terrasse. Il pensait que les ouvriers passeraient samedi 1er mai pour l'installer, mais, finalement, le rendez-vous a été décalé.
Pour l'instant, le tas de bois est entreposé dans la salle de restaurant qui n'a plus accueilli de convives depuis le 29 octobre 2020,
quand le couperet de la fermeture administrative est tombé. M. Fontaine, président de l'Association française des maîtres restaurateurs,
ne manquera pas de concurrents prêts à profiter de ce premier temps de réouverture des bars et restaurants. Même si le couvre-feu limite le service à 21 heures.
D'autant que la Mairie de Paris vient d'annoncer le renouvellement des terrasses éphémères installées en 2020 et leur gratuité jusqu'à la fin de l'été.
""")
```
## Documentation

### Training

We tested two model architectures (T5 and BART) with input lengths of 512 or 1024 tokens. The BART model with 512-token inputs was ultimately selected. It was trained for 2 epochs (~700K articles) on a Tesla V100 (32 hours of training).
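For reference, here is a minimal sketch of this kind of fine-tuning setup with the `transformers` Trainer API. The hyperparameters, summary length cap, and language codes are assumptions, as the card does not publish the actual training script:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# MLSUM's French split exposes "text" (article) and "summary" columns.
dataset = load_dataset("mlsum", "fr")

# French as both source and target language for the multilingual mBART-50 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50", src_lang="fr_XX", tgt_lang="fr_XX")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")

def preprocess(batch):
    # 512-token inputs, matching the architecture choice described above;
    # the 64-token summary cap is an assumption.
    inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mbart-mlsum",
    num_train_epochs=2,             # matches the 2 epochs reported above
    per_device_train_batch_size=4,  # assumption; not stated in the card
    learning_rate=3e-5,             # assumption; not stated in the card
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```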
### Results

We compared our model (labeled `mbart-large-512-full` in the results graph) with two references:
- MBERT, which corresponds to the performance of the model trained by the team that created the MLSUM dataset.
- BARThez, another model trained on press articles from the OrangeSum dataset.
The novelty score of our model (see the MLSUM paper) is not yet comparable to these two references, and even further from human-produced summaries. Nevertheless, the generated summaries are generally of good quality.
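As a rough illustration of what the novelty score measures (the exact definition is given in the MLSUM paper), the sketch below computes the fraction of summary n-grams that never appear in the source article; purely extractive summaries score 0, more abstractive ones score higher:

```python
def novelty(source: str, summary: str, n: int = 1) -> float:
    """Fraction of summary n-grams that do not occur in the source article.

    An illustrative approximation; see the MLSUM paper for the exact
    definition behind the reported scores.
    """
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    summary_ngrams = ngrams(summary)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams - ngrams(source)) / len(summary_ngrams)

# A purely extractive summary has novelty 0.0
print(novelty("le restaurant rouvre sa terrasse mercredi", "le restaurant rouvre"))  # 0.0
```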
## Technical Details

The model is fine-tuned from `facebook/mbart-large-50` on press articles from the MLSUM dataset. The BART architecture with 512-token inputs was chosen after testing different architectures and input lengths. Training ran on a Tesla V100 for 2 epochs over approximately 700K articles.
## License
This project is licensed under the MIT license.
## Citation
```bibtex
@article{scialom2020mlsum,
  title={MLSUM: The Multilingual Summarization Corpus},
  author={Thomas Scialom and Paul-Alexis Dray and Sylvain Lamprier and Benjamin Piwowarski and Jacopo Staiano},
  year={2020},
  eprint={2004.14900},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```