Spanish RoBERTa2RoBERTa (roberta-base-bne) fine-tuned on MLSUM ES for summarization
This project fine-tunes a RoBERTa encoder-decoder model on the Spanish portion of the MLSUM dataset, providing a practical solution for Spanish news summarization.
🚀 Quick Start
To use this fine-tuned model for text summarization, follow the code example below:
import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'Narrativa/bsc_roberta2roberta_shared-spanish-finetuned-mlsum-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize the article, truncating to the encoder's 512-token limit
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate the summary token ids and decode them back to text
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
generate_summary(text)
✨ Features
- Fine-tuned on MLSUM ES: The model is specifically fine-tuned on the Spanish portion of the MLSUM dataset, which contains a large number of article/summary pairs from online newspapers, making it well suited for Spanish news summarization.
- Based on RoBERTa: Built on the RoBERTa architecture as a shared encoder-decoder, it captures rich semantic information in text and improves summary quality (see the sketch after this list).
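As a rough illustration of how such a shared RoBERTa2RoBERTa encoder-decoder can be warm-started before fine-tuning, the sketch below uses the standard transformers API. This is an assumption about the setup, not the exact training script behind this checkpoint; the tied-weights configuration is inferred from the "shared" in the model name.

```python
from transformers import EncoderDecoderModel, RobertaTokenizerFast

# Warm-start an encoder-decoder from the same RoBERTa checkpoint on both sides,
# sharing (tying) encoder and decoder weights -- the "roberta2roberta_shared" setup.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "BSC-TeMU/roberta-base-bne",
    "BSC-TeMU/roberta-base-bne",
    tie_encoder_decoder=True,
)
tokenizer = RobertaTokenizerFast.from_pretrained("BSC-TeMU/roberta-base-bne")

# Generation needs to know which tokens start, end, and pad the decoder output.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```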
📦 Installation
Installation only requires the necessary Python libraries. You can use the following command to install the transformers and torch libraries:
pip install transformers torch
💻 Usage Examples
Basic Usage
import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'Narrativa/bsc_roberta2roberta_shared-spanish-finetuned-mlsum-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    # Tokenize the article, truncating to the encoder's 512-token limit
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Generate the summary token ids and decode them back to text
    output = model.generate(input_ids, attention_mask=attention_mask)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
generate_summary(text)
Advanced Usage
You can adjust the parameters of the generate method to get different summarization results, such as changing the max_length of the generated summary (a further variant using beam search is sketched after this example):
import torch
from transformers import RobertaTokenizerFast, EncoderDecoderModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
ckpt = 'Narrativa/bsc_roberta2roberta_shared-spanish-finetuned-mlsum-summarization'
tokenizer = RobertaTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_pretrained(ckpt).to(device)

def generate_summary(text):
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)
    # Cap the generated summary at 200 tokens
    output = model.generate(input_ids, attention_mask=attention_mask, max_length=200)
    return tokenizer.decode(output[0], skip_special_tokens=True)

text = "Your text here..."
generate_summary(text)
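Beyond max_length, the generate method accepts the usual decoding parameters from the transformers library. The sketch below is illustrative only (these particular values are not the model authors' recommendation): beam search combined with an n-gram repetition constraint often yields less repetitive summaries. It reuses the model, tokenizer, and device loaded above.

```python
def generate_summary_beams(text):
    # Reuses `model`, `tokenizer`, and `device` from the example above.
    inputs = tokenizer([text], padding="max_length", truncation=True, max_length=512, return_tensors="pt")
    output = model.generate(
        inputs.input_ids.to(device),
        attention_mask=inputs.attention_mask.to(device),
        max_length=200,          # upper bound on the summary length, in tokens
        num_beams=4,             # beam search instead of greedy decoding
        no_repeat_ngram_size=3,  # never repeat the same 3-gram in the summary
        early_stopping=True,     # stop once all beams have emitted an end-of-sequence token
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```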
📚 Documentation
Model
The model used in this project is [BSC-TeMU/roberta-base-bne](https://huggingface.co/BSC-TeMU/roberta-base-bne) (RoBERTa checkpoint).
Dataset
MLSUM is the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages -- namely, French, German, Spanish, Russian, Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multilingual dataset.
You can access the Spanish part of the dataset here: MLSUM es
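If you want to inspect or reprocess the data yourself, the Spanish configuration can be loaded with the datasets library. This is a minimal sketch: "mlsum" with config "es" is the standard Hub identifier, and recent versions of datasets may additionally require trust_remote_code=True for script-based datasets.

```python
from datasets import load_dataset

# Load the Spanish configuration of MLSUM; each example contains "text" and "summary" fields.
mlsum_es = load_dataset("mlsum", "es")

print(mlsum_es)                         # train / validation / test splits
print(mlsum_es["test"][0]["summary"])   # a reference summary from the test set
```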
Results
The following table shows the evaluation results of the model on the test set:
| Metric | Score |
|--------|-------|
| Test ROUGE-2 (mid) precision | 11.42 |
| Test ROUGE-2 (mid) recall | 10.58 |
| Test ROUGE-2 (mid) F-measure | 10.69 |
| Test ROUGE-1 F-measure | 28.83 |
| Test ROUGE-L F-measure | 23.15 |
The raw metrics, computed with the HF datasets rouge metric, are as follows:
import datasets

# `results` holds the test-set predictions ("pred_summary") and the reference summaries ("summary")
rouge = datasets.load_metric("rouge")
rouge.compute(predictions=results["pred_summary"], references=results["summary"])
{'rouge1': AggregateScore(low=Score(precision=0.30393366820245, recall=0.27905239591639935, fmeasure=0.283148902808752), mid=Score(precision=0.3068521142101569, recall=0.2817252494122592, fmeasure=0.28560373425206464), high=Score(precision=0.30972608774202665, recall=0.28458152325781716, fmeasure=0.2883786700591887)),
'rougeL': AggregateScore(low=Score(precision=0.24184668819794716, recall=0.22401171380621518, fmeasure=0.22624104698839514), mid=Score(precision=0.24470388406868163, recall=0.22665793214539162, fmeasure=0.2289118878817394), high=Score(precision=0.2476594458951327, recall=0.22932683203591905, fmeasure=0.23153001570662513))}
rouge.compute(predictions=results["pred_summary"], references=results["summary"], rouge_types=["rouge2"])["rouge2"].mid
Score(precision=0.11423200347113865, recall=0.10588038944902506, fmeasure=0.1069921217219595)
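For context, a `results` object like the one scored above can be produced by running the model over the MLSUM ES test split, for example with the dataset's map method. This is a minimal sketch under the assumption that generate_summary from the usage examples is in scope; it is not necessarily the exact evaluation script behind the reported numbers.

```python
from datasets import load_dataset

test_set = load_dataset("mlsum", "es", split="test")

def add_predictions(batch):
    # Generate a summary for each article and keep the reference summary for scoring.
    batch["pred_summary"] = [generate_summary(text) for text in batch["text"]]
    return batch

# Reuses generate_summary() defined in the usage examples above.
results = test_set.map(add_predictions, batched=True, batch_size=8)
```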
📄 License
No license information is provided for this model.
Created by: Narrativa
About Narrativa: Natural Language Generation (NLG) | Gabriele, our machine learning-based platform, builds and deploys natural language solutions. #NLG #AI