French RoBERTa2RoBERTa (shared) fine-tuned on MLSUM FR for summarization
This model is a French RoBERTa2RoBERTa (shared) encoder-decoder fine-tuned on the MLSUM FR dataset to summarize French news articles.
Features
- Fine-tuned on MLSUM FR: trained on the French portion of MLSUM, a collection of French news article/summary pairs, so the model is adapted to French news summarization.
- Based on RoBERTa: built on the [camembert-base](https://huggingface.co/camembert-base) checkpoint, a RoBERTa-style model pre-trained on French text.
Installation
The usage example below requires Python with the `transformers` and `torch` libraries. Install both with:
pip install transformers torch
Usage Examples
Basic Usage
import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

# Run on GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'mrm8488/camembert2camembert_shared-finetuned-french-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize the article, padding or truncating it to 512 tokens
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate the summary token IDs and decode them back to text
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Un nuage de fumée juste après l’explosion, le 1er juin 2019. Une déflagration dans une importante usine d’explosifs du centre de la Russie a fait au moins 79 blessés samedi 1er juin. L’explosion a eu lieu dans l’usine Kristall à Dzerzhinsk, une ville située à environ 400 kilomètres à l’est de Moscou, dans la région de Nijni-Novgorod. « Il y a eu une explosion technique dans l’un des ateliers, suivie d’un incendie qui s’est propagé sur une centaine de mètres carrés », a expliqué un porte-parole des services d’urgence. Des images circulant sur les réseaux sociaux montraient un énorme nuage de fumée après l’explosion. Cinq bâtiments de l’usine et près de 180 bâtiments résidentiels ont été endommagés par l’explosion, selon les autorités municipales. Une enquête pour de potentielles violations des normes de sécurité a été ouverte. Fragments de shrapnel Les blessés ont été soignés après avoir été atteints par des fragments issus de l’explosion, a précisé une porte-parole des autorités sanitaires citée par Interfax. « Nous parlons de blessures par shrapnel d’une gravité moyenne et modérée », a-t-elle précisé. Selon des représentants de Kristall, cinq personnes travaillaient dans la zone où s’est produite l’explosion. Elles ont pu être évacuées en sécurité. Les pompiers locaux ont rapporté n’avoir aucune information sur des personnes qui se trouveraient encore dans l’usine."
print(generate_summary(text))
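The example above summarizes one article with default generation settings. As a rough sketch, the helper below summarizes several articles in one batch and forwards any extra keyword arguments to `model.generate`. The name `generate_summaries` is hypothetical and not part of the original model card. It assumes `tokenizer` and `model` are the objects loaded above, and the beam-search values in the usage comment are illustrative, not settings published by the author.

```python
def generate_summaries(texts, tokenizer, model, device="cpu", **generate_kwargs):
    """Summarize a list of articles in a single batch (illustrative helper)."""
    # padding=True pads only up to the longest article in the batch
    inputs = tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Extra keyword arguments (e.g. num_beams) are passed straight to generate()
    output_ids = model.generate(
        input_ids, attention_mask=attention_mask, **generate_kwargs
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)

# Possible usage, with assumed (untuned) generation settings:
# summaries = generate_summaries([text], tokenizer, model, device,
#                                num_beams=4, no_repeat_ngram_size=3)
```

Passing the tokenizer and model as arguments keeps the helper independent of global state, so the same function can be reused with another checkpoint.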
Documentation
Model
The model is based on the [camembert-base](https://huggingface.co/camembert-base) checkpoint, a RoBERTa-style language model pre-trained on French text. The encoder and decoder are both initialized from this checkpoint and, as the "shared" in the model name indicates, share their weights.
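As an illustration of what a shared RoBERTa2RoBERTa setup looks like before fine-tuning, the sketch below warm-starts an encoder-decoder from `camembert-base` with tied weights. It is not the author's training code: the function name `build_shared_camembert2camembert` is hypothetical, and it assumes the installed `transformers` version still accepts the `tie_encoder_decoder` argument of `EncoderDecoderModel.from_encoder_decoder_pretrained`.

```python
def build_shared_camembert2camembert(checkpoint="camembert-base"):
    """Warm-start an encoder-decoder whose two halves share weights (sketch)."""
    # Imported lazily so that defining the function does not require transformers
    from transformers import AutoTokenizer, EncoderDecoderModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # tie_encoder_decoder=True ties the decoder weights to the encoder weights
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        checkpoint, checkpoint, tie_encoder_decoder=True
    )
    # generate() needs to know the start, end, and padding token IDs
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return tokenizer, model
```

A model built this way has not been trained for summarization yet; the checkpoint used in the usage example is the fine-tuned result.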
Dataset
MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community.
This model uses the French subset (MLSUM fr).
Results
| Metric | Score |
|---|---|
| Test Rouge2-mid-precision | 14.47 |
| Test Rouge2-mid-recall | 12.90 |
| Test Rouge2-mid-fmeasure | 13.30 |
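To make the three reported quantities concrete, here is a minimal sketch of ROUGE-2 for a single reference/candidate pair, based on bigram overlap. It is an assumption-laden illustration, not the evaluation script behind the table: it uses lowercased whitespace tokens with no stemming, and it scores one pair only, whereas the "mid" in the metric names most likely denotes an aggregate over the whole test set. The function names are hypothetical.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def rouge2(reference, candidate):
    """Return (precision, recall, f-measure) of bigram overlap."""
    ref = bigram_counts(reference.lower().split())
    cand = bigram_counts(candidate.lower().split())
    # Clipped overlap: each bigram counts at most as often as it occurs in both
    overlap = sum((ref & cand).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return precision, recall, fmeasure

p, r, f = rouge2("le chat dort sur le canapé", "le chat dort")
print(p, r, f)
```

In this toy pair both candidate bigrams appear in the reference, so precision is high, while recall is lower because the reference contains bigrams the candidate misses.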
License
There is no license information provided in the original README.
Important Note
The code example requires the `transformers` and `torch` libraries. Make sure both are installed before running it.
Usage Tip
Adjust the `max_length` argument passed to the tokenizer to control how many tokens of the input text the model sees. With `truncation=True`, longer inputs are cut off at that limit.
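Before choosing a `max_length`, it can help to check how many tokens an article produces. The sketch below is one way to do that: it encodes the text without truncation and reports whether it would exceed a given limit. The helper name `truncation_report` is hypothetical, and `tokenizer` is assumed to be the tokenizer loaded in the usage example.

```python
def truncation_report(text, tokenizer, max_length=512):
    """Return (token_count, would_be_truncated) for a text and a limit."""
    # Encode without truncation to get the full length, special tokens included
    token_count = len(tokenizer(text, truncation=False)["input_ids"])
    return token_count, token_count > max_length

# Possible usage with the objects from the usage example:
# n_tokens, truncated = truncation_report(text, tokenizer, max_length=512)
```

If the report shows that a large part of an article would be cut off, the summary can only reflect the portion that fits within the limit.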
Created by Manuel Romero/@mrm8488 with the support of Narrativa
Made with ♥ in Spain