🚀 NASca and NASes: Two Monolingual Pre-Trained Models for Abstractive Summarization in Catalan and Spanish
Most existing abstractive summarization models in the literature are tailored for English. Our work presents monolingual models for Catalan and Spanish that offer better performance than multilingual alternatives, together with a new evaluation metric for abstractivity.
🚀 Quick Start
No specific quick-start information is provided; see the Usage Examples section below.
✨ Features
- Monolingual Focus: Our models are specifically designed for Catalan and Spanish, addressing the limitations of multilingual models, especially for minority languages like Catalan.
- Enhanced Abstractivity: Through several self-supervised pre-training tasks, the abstractivity of the generated summaries is increased.
- New Evaluation Metric: We introduce a new metric called content reordering to better evaluate the abstractivity of generated summaries.
📦 Installation
No installation instructions are provided.
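A likely minimal setup, assuming the checkpoints are distributed through the Hugging Face Hub (as the usage sketch below assumes), is `pip install transformers` together with a backend such as PyTorch.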
💻 Usage Examples
No code examples are provided in the original documentation.
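As an unofficial illustration, the sketch below shows how a BART-style summarization checkpoint such as NASes is typically loaded with the Hugging Face `transformers` library. The repository identifier `ELiRF/NASES` and the generation parameters are assumptions; substitute the actual Hub name of the checkpoint (and its Catalan counterpart) if it differs.

```python
# Unofficial sketch: loading a BART-style summarization checkpoint with transformers.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "ELiRF/NASES"  # assumed Hub identifier for the Spanish model; NASca would be analogous

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "Texto completo del artículo de prensa que se desea resumir..."

# Tokenize, truncating to the encoder's maximum input length.
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")

# Generate an abstractive summary with beam search (parameters are illustrative).
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    num_beams=4,
    max_length=150,
    early_stopping=True,
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

Beam search with a small number of beams is a common default for news summarization; adjust `max_length` to the expected summary length.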
📚 Documentation
General Introduction
Most models for abstractive summarization in the literature are suitable for English but not for other languages. Multilingual models were introduced to overcome language constraints, but their performance is often lower, especially for minority languages. In this paper, we present monolingual models for Catalan and Spanish.
NASes Model
- Model Structure: News Abstractive Summarization for Spanish (NASes) is a Transformer encoder-decoder model with the same hyper-parameters as BART.
- Pre-training: It is pre-trained on a combination of self-supervised tasks (sentence permutation, text infilling, Gap Sentence Generation, and Next Segment Generation) using Spanish newspapers and Wikipedia articles (21 GB of raw text; 8.5 million documents). An illustrative sketch of two of these tasks appears after this list.
- Fine-tuning: NASes is fine-tuned for the summarization task on 1,802,919 (document, summary) pairs from the Dataset for Automatic summarization of Catalan and Spanish newspaper Articles (DACSA).
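Sentence permutation and text infilling are the denoising objectives introduced with BART, while Gap Sentence Generation follows PEGASUS. The sketch below illustrates the first two corruptions on raw text; it is a simplified illustration under stated assumptions (naive sentence splitting, fixed span lengths), not the authors' pre-processing code.

```python
import random
import re

def sentence_permutation(document: str) -> str:
    """Shuffle sentence order; the model is trained to restore the original document."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    random.shuffle(sentences)
    return " ".join(sentences)

def text_infilling(document: str, mask_token: str = "<mask>", mask_prob: float = 0.3) -> str:
    """Replace short token spans with a single mask token; the model regenerates them."""
    tokens = document.split()
    corrupted = []
    i = 0
    while i < len(tokens):
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            i += random.randint(1, 3)  # illustrative span length, not the paper's exact sampling
        else:
            corrupted.append(tokens[i])
            i += 1
    return " ".join(corrupted)

if __name__ == "__main__":
    doc = "Primera frase. Segunda frase. Tercera frase con más palabras."
    print(sentence_permutation(doc))
    print(text_infilling(doc))
```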
New Evaluation Metric
Usual evaluation metrics such as ROUGE and BERTScore cannot properly evaluate the abstractivity of generated summaries. We present a new metric, content reordering, to evaluate the rearrangement of the original content, a common characteristic of abstractive summaries.
Experimentation
We carried out an exhaustive experimental comparison of our monolingual models with two widely used multilingual models for text summarization (mBART and mT5). The results support the quality of our monolingual models, especially considering that the multilingual models were pre-trained with far more resources.
🔧 Technical Details
- Model Architecture: The models are Transformer encoder - decoder architectures.
- Pre-training Tasks: The models are pre-trained with sentence permutation, text infilling, Gap Sentence Generation, and Next Segment Generation; these tasks help increase the abstractivity of the generated summaries.
- New Metric: The content reordering metric is designed to evaluate the rearrangement of the original content in abstractive summaries.
📄 License
No license information is provided.
⚠️ Important Note
On 5 April 2022, we detected a mistake in the configuration file; as a result, the model was not generating summaries correctly and was underperforming in all scenarios. If you used the model before that date and are publishing results obtained with it, we would be glad if you re-evaluated it. We apologize for the inconvenience and thank you for your understanding.
BibTeX entry
@Article{app11219872,
AUTHOR = {Ahuir, Vicent and Hurtado, Lluís-F. and González, José Ángel and Segarra, Encarna},
TITLE = {NASca and NASes: Two Monolingual Pre-Trained Models for Abstractive Summarization in Catalan and Spanish},
JOURNAL = {Applied Sciences},
VOLUME = {11},
YEAR = {2021},
NUMBER = {21},
ARTICLE-NUMBER = {9872},
URL = {https://www.mdpi.com/2076-3417/11/21/9872},
ISSN = {2076-3417},
DOI = {10.3390/app11219872}
}