🚀 Spanish RoBERTa-large trained on BNE, fine-tuned for the CAPITEL Part-of-Speech (POS) dataset
This is a Part-of-Speech (POS) tagging model for Spanish, fine-tuned from a pre-trained RoBERTa-large model. It offers high-performance POS tagging for Spanish text.
🚀 Quick Start
Here is a simple example of how to use the roberta-large-bne-capitel-pos model for POS tagging:
```python
from transformers import pipeline
from pprint import pprint

# Load a token-classification pipeline with the fine-tuned POS model
nlp = pipeline("token-classification", model="PlanTL-GOB-ES/roberta-large-bne-capitel-pos")

example = "El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."
pos_results = nlp(example)
pprint(pos_results)
```
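Each entry in `pos_results` is typically a dictionary containing the predicted tag (`entity`), a confidence `score`, the matched sub-token (`word`), and its character offsets (`start`, `end`) in the input sentence.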
✨ Features
- High-Performance POS Tagging: The roberta-large-bne-capitel-pos model achieves an F1 score of 98.56 on the CAPITEL-POS test set, outperforming several standard multilingual and monolingual baselines.
- Spanish Language Focus: Specifically fine-tuned for the Spanish language, leveraging a large Spanish corpus for pre-training and a relevant dataset for fine-tuning.
📦 Installation
The model runs through the Hugging Face `transformers` library (`pip install transformers`), together with a deep-learning backend such as PyTorch; no model-specific installation steps are required beyond that.
💻 Usage Examples
Basic Usage
```python
from transformers import pipeline
from pprint import pprint

# Load a token-classification pipeline with the fine-tuned POS model
nlp = pipeline("token-classification", model="PlanTL-GOB-ES/roberta-large-bne-capitel-pos")

example = "El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."
pos_results = nlp(example)
pprint(pos_results)
```
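Beyond the pipeline, the model can also be loaded directly with `AutoTokenizer` and `AutoModelForTokenClassification`. The following is a minimal sketch (not from the original card; it assumes PyTorch is installed) that maps each sub-token to its highest-scoring tag via `model.config.id2label`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "PlanTL-GOB-ES/roberta-large-bne-capitel-pos"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

example = "El alcalde de Vigo, Abel Caballero, ha comenzado a colocar las luces de Navidad en agosto."

# Tokenize and run a forward pass without gradient tracking
inputs = tokenizer(example, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each sub-token to the label with the highest score
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(token, model.config.id2label[pred.item()])
```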
📚 Documentation
Model Description
The roberta-large-bne-capitel-pos is a Part-of-Speech (POS) tagging model for the Spanish language. It is fine-tuned from the [roberta-large-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne) model, a RoBERTa-large model pre-trained on the largest Spanish corpus known to date: 570GB of clean, deduplicated text compiled from web crawls performed by the National Library of Spain (Biblioteca Nacional de España) between 2009 and 2019.
Intended Uses and Limitations
The roberta-large-bne-capitel-pos model can be used for Part-of-Speech (POS) tagging of Spanish text. However, the model is limited by its training dataset and may not generalize well to all use cases.
Limitations and Bias
At the time of submission, no measures have been taken to estimate the bias embedded in the model. Since the corpora have been collected using crawling techniques on multiple web sources, the models may be biased. The developers intend to conduct research in these areas in the future and will update this model card if completed.
Training
- Training Data: The dataset used is from the CAPITEL competition at IberLEF 2020 (sub-task 2).
- Training Procedure: The model was trained with a batch size of 16 and a learning rate of 3e-5 for 5 epochs. The best checkpoint was then selected using the downstream task metric on the corresponding development set and evaluated on the test set (an illustrative configuration sketch follows below).
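The original fine-tuning scripts are in the GitHub repository linked under Evaluation; purely as a hedged illustration (argument names come from the generic `transformers` Trainer API, not from the card), the reported hyperparameters would map onto a configuration roughly like this:

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters onto the Trainer API
# (argument names may vary slightly across transformers versions)
training_args = TrainingArguments(
    output_dir="roberta-large-bne-capitel-pos-finetuned",
    per_device_train_batch_size=16,  # batch size 16
    learning_rate=3e-5,              # learning rate 3e-5
    num_train_epochs=5,              # 5 epochs
    evaluation_strategy="epoch",     # evaluate on the development set each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the best checkpoint by the task metric
    metric_for_best_model="f1",
)
```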
Evaluation
- Variables and Metrics: This model was fine-tuned to maximize the F1 score.
- Evaluation Results:
| Model | CAPITEL-POS (F1) |
| ------ | ------ |
| roberta-large-bne-capitel-pos | 98.56 |
| roberta-base-bne-capitel-pos | 98.46 |
| BETO | 98.36 |
| mBERT | 98.39 |
| BERTIN | 98.47 |
| ELECTRA | 98.16 |
For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/PlanTL-GOB-ES/lm-spanish).
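The card does not state exactly how the F1 score is computed for this task; as a hedged illustration only, a token-level micro-averaged F1 over aligned gold and predicted tag sequences could be obtained with scikit-learn:

```python
from sklearn.metrics import f1_score

# Illustrative only: token-level micro-averaged F1 over aligned tag sequences
gold_tags = ["DET", "NOUN", "ADP", "PROPN", "PUNCT"]  # toy example
pred_tags = ["DET", "NOUN", "ADP", "PROPN", "PUNCT"]
print(f1_score(gold_tags, pred_tags, average="micro"))  # 1.0 for this toy example
```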
Additional Information
- Author: Text Mining Unit (TeMU) at the Barcelona Supercomputing Center (bsc-temu@bsc.es)
- Contact Information: For further information, send an email to <plantl-gob-es@bsc.es>
- Copyright: Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
- Licensing Information: [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- Funding: This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
- Citing Information: If you use this model, please cite the following paper:
```bibtex
@article{,
  abstract = {We want to thank the National Library of Spain for such a large effort on the data gathering and the Future of Computing Center, a Barcelona Supercomputing Center and IBM initiative (2020). This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.},
  author = {Asier Gutiérrez Fandiño and Jordi Armengol Estapé and Marc Pàmies and Joan Llop Palao and Joaquin Silveira Ocampo and Casimiro Pio Carrino and Carme Armentano Oller and Carlos Rodriguez Penagos and Aitor Gonzalez Agirre and Marta Villegas},
  doi = {10.26342/2022-68-3},
  issn = {1135-5948},
  journal = {Procesamiento del Lenguaje Natural},
  keywords = {Artificial intelligence,Benchmarking,Data processing.,MarIA,Natural language processing,Spanish language modelling,Spanish language resources,Tractament del llenguatge natural (Informàtica),Àrees temàtiques de la UPC::Informàtica::Intel·ligència artificial::Llenguatge natural},
  publisher = {Sociedad Española para el Procesamiento del Lenguaje Natural},
  title = {MarIA: Spanish Language Models},
  volume = {68},
  url = {https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley},
  year = {2022},
}
```
- Disclaimer: The models published in this repository are for general use and are available to third parties. These models may have bias and/or other undesirable distortions. Third-party users are responsible for mitigating the risks arising from their use and complying with applicable regulations, including those regarding the use of artificial intelligence. The owner of the models (SEDIA) and the creator (BSC) are not liable for any results from third-party use of these models.
🔧 Technical Details
The model is based on the RoBERTa architecture, a variant of the Transformer architecture. It is pre-trained on a large Spanish corpus and then fine-tuned on the CAPITEL dataset for POS tagging. Fine-tuning used a batch size of 16 and a learning rate of 3e-5 for 5 epochs.
📄 License
This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).