Spanish RoBERTa2RoBERTa (roberta-base-bne) fine-tuned on MLSUM ES for summarization
This project fine-tunes a RoBERTa encoder-decoder model on the Spanish portion of the MLSUM dataset, providing a practical solution for Spanish news summarization.
🚀 Quick Start
To use this fine-tuned model for text summarization, follow the code example below:
import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'Narrativa/bsc_roberta2roberta_shared-spanish-finetuned-mlsum-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize the article, truncating to the encoder's 512-token limit
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate the summary token ids and decode them back to text
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
generate_summary(text)
✨ Features
- Fine-tuned on MLSUM ES: The model is specifically fine-tuned on the Spanish portion of the MLSUM dataset, which contains a large number of article/summary pairs from online newspapers, making it well suited for Spanish news summarization.
- Based on RoBERTa: Built on the RoBERTa architecture as a shared encoder-decoder, it captures rich semantic information in text and improves summary quality (see the sketch after this list).
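As a rough illustration of how such a shared RoBERTa2RoBERTa encoder-decoder can be warm-started before fine-tuning, the sketch below uses the standard transformers API. This is an assumption about the setup, not the exact training script behind this checkpoint; the tied-weights configuration is inferred from the "shared" in the model name.

```python
from transformers import EncoderDecoderModel, RobertaTokenizerFast

# Warm-start an encoder-decoder from the same RoBERTa checkpoint on both sides,
# sharing (tying) encoder and decoder weights -- the "roberta2roberta_shared" setup.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "BSC-TeMU/roberta-base-bne",
    "BSC-TeMU/roberta-base-bne",
    tie_encoder_decoder=True,
)
tokenizer = RobertaTokenizerFast.from_pretrained("BSC-TeMU/roberta-base-bne")

# Generation needs to know which tokens start, end, and pad the decoder output.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```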
📦 Installation
Installation only requires the necessary Python libraries. You can use the following command to install the transformers and torch libraries:
pip install transformers torch
💻 Usage Examples
Basic Usage
import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'Narrativa/bsc_roberta2roberta_shared-spanish-finetuned-mlsum-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize the article, truncating to the encoder's 512-token limit
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate the summary token ids and decode them back to text
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
generate_summary(text)
Advanced Usage
You can adjust the parameters of the generate method to get different summarization results, such as changing the max_length of the generated summary (a further variant using beam search is sketched after this example):
import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'Narrativa/bsc_roberta2roberta_shared-spanish-finetuned-mlsum-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Cap the generated summary at 200 tokens
    output = model.generate(input_ids, attention_mask=attention_mask, max_length=200)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
generate_summary(text)
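Beyond max_length, the generate method accepts the usual decoding parameters from the transformers library. The sketch below is illustrative only (these particular values are not the model authors' recommendation): beam search combined with an n-gram repetition constraint often yields less repetitive summaries. It reuses the model, tokenizer, and device loaded above.

```python
def generate_summary_beams(text):
    # Reuses `model`, `tokenizer`, and `device` from the example above.
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    output = model.generate(
        inputs.input_ids.to(device),
        attention_mask=inputs.attention_mask.to(device),
        max_length=200,          # upper bound on the summary length, in tokens
        num_beams=4,             # beam search instead of greedy decoding
        no_repeat_ngram_size=3,  # never repeat the same 3-gram in the summary
        early_stopping=True,     # stop once all beams have emitted an end-of-sequence token
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```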
📚 Documentation
Model
The model used in this project is [BSC-TeMU/roberta-base-bne](https://huggingface.co/BSC-TeMU/roberta-base-bne) (RoBERTa checkpoint).
Dataset
MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multilingual dataset.
You can access the Spanish part of the dataset here: MLSUM es
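If you want to inspect or reprocess the data yourself, the Spanish configuration can be loaded with the datasets library. This is a minimal sketch: "mlsum" with config "es" is the standard Hub identifier, and recent versions of datasets may additionally require trust_remote_code=True for script-based datasets.

```python
from datasets import load_dataset

# Load the Spanish configuration of MLSUM; each example contains "text" and "summary" fields.
mlsum_es = load_dataset("mlsum", "es")

print(mlsum_es)                         # train / validation / test splits
print(mlsum_es["test"][0]["summary"])   # a reference summary from the test set
```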
Results
The following table shows the evaluation results of the model on the test set:
| Metric | Score |
|--------|-------|
| Test ROUGE-2 (mid) precision | 11.42 |
| Test ROUGE-2 (mid) recall | 10.58 |
| Test ROUGE-2 (mid) F-measure | 10.69 |
| Test ROUGE-1 F-measure | 28.83 |
| Test ROUGE-L F-measure | 23.15 |
The raw metrics, computed with the HF datasets rouge metric, are as follows:
import datasets

# `results` holds the test-set predictions ("pred_summary") and the reference summaries ("summary")
rouge = datasets.load_metric("rouge")
rouge.compute(predictions=results["pred_summary"], references=results["summary"])
{'rouge1': AggregateScore(low=Score(precision=0.30393366820245, recall=0.27905239591639935, fmeasure=0.283148902808752), mid=Score(precision=0.3068521142101569, recall=0.2817252494122592, fmeasure=0.28560373425206464), high=Score(precision=0.30972608774202665, recall=0.28458152325781716, fmeasure=0.2883786700591887)),
'rougeL': AggregateScore(low=Score(precision=0.24184668819794716, recall=0.22401171380621518, fmeasure=0.22624104698839514), mid=Score(precision=0.24470388406868163, recall=0.22665793214539162, fmeasure=0.2289118878817394), high=Score(precision=0.2476594458951327, recall=0.22932683203591905, fmeasure=0.23153001570662513))}
rouge.compute(predictions=results["pred_summary"], references=results["summary"], rouge_types=["rouge2"])["rouge2"].mid
Score(precision=0.11423200347113865, recall=0.10588038944902506, fmeasure=0.1069921217219595)
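For context, a `results` object like the one scored above can be produced by running the model over the MLSUM ES test split, for example with the dataset's map method. This is a minimal sketch under the assumption that generate_summary from the usage examples is in scope; it is not necessarily the exact evaluation script behind the reported numbers.

```python
from datasets import load_dataset

test_set = load_dataset("mlsum", "es", split="test")

def add_predictions(batch):
    # Generate a summary for each article and keep the reference summary for scoring.
    batch["pred_summary"] = [generate_summary(text) for text in batch["text"]]
    return batch

# Reuses generate_summary() defined in the usage examples above.
results = test_set.map(add_predictions, batched=True, batch_size=8)
```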
📄 License
No license information is provided for this model.
Created by: Narrativa
About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI