🚀 Spanish RoBERTa-large trained on BNE, fine-tuned for the CAPITEL Part-of-Speech (POS) dataset
This is a Part-of-Speech (POS) tagging model for Spanish, fine-tuned from a pre-trained RoBERTa-large model. It offers high-performance POS tagging for Spanish text.
🚀 Quick Start
Here is a simple example of how to use the roberta-large-bne-capitel-pos model for POS tagging:
```python
from transformers import pipeline
from pprint import pprint

# Load a token-classification pipeline with the fine-tuned POS model
nlp = pipeline("token-classification", model="PlanTL-GOB-ES/roberta-large-bne-capitel-pos")

example = "El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."
pos_results = nlp(example)
pprint(pos_results)
```
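Each entry in `pos_results` is typically a dictionary containing the predicted tag (`entity`), a confidence `score`, the matched sub-token (`word`), and its character offsets (`start`, `end`) in the input sentence.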
✨ Features
- High-Performance POS Tagging: The roberta-large-bne-capitel-pos model achieves an F1 score of 98.56 on the CAPITEL-POS test set, outperforming several standard multilingual and monolingual baselines.
- Spanish Language Focus: Specifically fine-tuned for the Spanish language, leveraging a large Spanish corpus for pre-training and a relevant dataset for fine-tuning.
📦 Installation
The model runs through the Hugging Face `transformers` library (`pip install transformers`), together with a deep-learning backend such as PyTorch; no model-specific installation steps are required beyond that.
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline
from pprint import pprint

# Load a token-classification pipeline with the fine-tuned POS model
nlp = pipeline("token-classification", model="PlanTL-GOB-ES/roberta-large-bne-capitel-pos")

example = "El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."
pos_results = nlp(example)
pprint(pos_results)
```
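Beyond the pipeline, the model can also be loaded directly with `AutoTokenizer` and `AutoModelForTokenClassification`. The following is a minimal sketch (not from the original card; it assumes PyTorch is installed) that maps each sub-token to its highest-scoring tag via `model.config.id2label`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "PlanTL-GOB-ES/roberta-large-bne-capitel-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

example = "El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."

# Tokenize and run a forward pass without gradient tracking
inputs = tokenizer(example, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each sub-token to the label with the highest score
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```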
📚 Documentation
Model Description
The roberta-large-bne-capitel-pos is a Part-of-Speech (POS) tagging model for the Spanish language. It is fine-tuned from the [roberta-large-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne) model, a RoBERTa-large model pre-trained on the largest Spanish corpus known to date: 570GB of clean, deduplicated text compiled from web crawls performed by the National Library of Spain (Biblioteca Nacional de España) between 2009 and 2019.
Intended Uses and Limitations
The roberta-large-bne-capitel-pos model can be used for Part-of-Speech (POS) tagging of Spanish text. However, the model is limited by its training dataset and may not generalize well to all use cases.
Limitations and Bias
At the time of submission, no measures have been taken to estimate the bias embedded in the model. Since the corpora have been collected using crawling techniques on multiple web sources, the models may be biased. The developers intend to conduct research in these areas in the future and will update this model card if completed.
Training
- Training Data: The dataset used is from the CAPITEL competition at IberLEF 2020 (sub-task 2).
- Training Procedure: The model was trained with a batch size of 16 and a learning rate of 3e-5 for 5 epochs. The best checkpoint was then selected using the downstream task metric on the corresponding development set and evaluated on the test set (an illustrative configuration sketch follows below).
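The original fine-tuning scripts are in the GitHub repository linked under Evaluation; purely as a hedged illustration (argument names come from the generic `transformers` Trainer API, not from the card), the reported hyperparameters would map onto a configuration roughly like this:

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters onto the Trainer API
# (argument names may vary slightly across transformers versions)
training_args = TrainingArguments(
    output_dir="roberta-large-bne-capitel-pos-finetuned",
    per_device_train_batch_size=16,  # batch size 16
    learning_rate=3e-5,              # learning rate 3e-5
    num_train_epochs=5,              # 5 epochs
    evaluation_strategy="epoch",     # evaluate on the development set each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the best checkpoint by the task metric
    metric_for_best_model="f1",
)
```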
Evaluation
- Variables and Metrics: This model was fine-tuned to maximize the F1 score.
- Evaluation Results:
| Model | CAPITEL-POS (F1) |
| ------ | ------ |
| roberta-large-bne-capitel-pos | 98.56 |
| roberta-base-bne-capitel-pos | 98.46 |
| BETO | 98.36 |
| mBERT | 98.39 |
| BERTIN | 98.47 |
| ELECTRA | 98.16 |
For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish).
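The card does not state exactly how the F1 score is computed for this task; as a hedged illustration only, a token-level micro-averaged F1 over aligned gold and predicted tag sequences could be obtained with scikit-learn:

```python
from sklearn.metrics import f1_score

# Illustrative only: token-level micro-averaged F1 over aligned tag sequences
gold_tags = ["DET", "NOUN", "ADP", "PROPN", "PUNCT"]  # toy example
pred_tags = ["DET", "NOUN", "ADP", "PROPN", "PUNCT"]
print(f1_score(gold_tags, pred_tags, average="micro"))  # 1.0 for this toy example
```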
Additional Information
- Author: Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
- Contact Information: For further information, send an email to <plantl-gob-es@bsc.es>
- Copyright: Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
- Licensing Information: [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- Funding: This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
- Citing Information: If you use this model, please cite the following paper:
```bibtex
@article{,
  abstract = {We want to thank the National Library of Spain for such a large effort on the data gathering and the Future of Computing Center, a Barcelona Supercomputing Center and IBM initiative (2020). This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.},
  author = {Asier Gutiérrez Fandiño and Jordi Armengol Estapé and Marc Pàmies and Joan Llop Palao and Joaquin Silveira Ocampo and Casimiro Pio Carrino and Carme Armentano Oller and Carlos Rodriguez Penagos and Aitor Gonzalez Agirre and Marta Villegas},
  doi = {10.26342/2022-68-3},
  issn = {1135-5948},
  journal = {Procesamiento del Lenguaje Natural},
  keywords = {Artificial intelligence,Benchmarking,Data processing.,MarIA,Natural language processing,Spanish language modelling,Spanish language resources,Tractament del llenguatge natural (Informàtica),Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Llenguatge natural},
  publisher = {Sociedad Española para el Procesamiento del Lenguaje Natural},
  title = {MarIA: Spanish Language Models},
  volume = {68},
  url = {https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley},
  year = {2022},
}
```
- Disclaimer: The models published in this repository are for general use and are available to third parties. These models may have bias and/or other undesirable distortions. Third-party users are responsible for mitigating the risks arising from their use and complying with applicable regulations, including those regarding the use of artificial intelligence. The owner of the models (SEDIA) and the creator (BSC) are not liable for any results from third-party use of these models.
🔧 Technical Details
The model is based on the RoBERTa architecture, a variant of the Transformer architecture. It is pre-trained on a large Spanish corpus and then fine-tuned on the CAPITEL dataset for POS tagging. Fine-tuning used a batch size of 16 and a learning rate of 3e-5 for 5 epochs.
📄 License
This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).