RuPERTa: the Spanish RoBERTa
RuPERTa is a Spanish language model based on the RoBERTa architecture and trained on a large Spanish corpus, enabling it to understand and generate Spanish text effectively.
Quick Start
RuPERTa-base (uncased) is a RoBERTa model trained on an uncased version of the big Spanish corpus. RoBERTa improves on BERT's pretraining procedure: it trains the model for longer, with larger batches, on more data; removes the next sentence prediction objective; trains on longer sequences; and dynamically changes the masking pattern applied to the training data. The architecture is the same as roberta-base:

roberta.base: RoBERTa using the BERT-base architecture (125M params)
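As a quick start, the base checkpoint can be loaded with the standard transformers Auto classes. The following is a minimal sketch (assuming torch and transformers are installed) that encodes a Spanish sentence and inspects the contextual embeddings:

import torch
from transformers import AutoModel, AutoTokenizer

# Load the pretrained RuPERTa-base checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base', do_lower_case=True)
model = AutoModel.from_pretrained('mrm8488/RuPERTa-base')

# Encode a sentence and run a forward pass without tracking gradients
inputs = tokenizer("Me encanta el procesamiento del lenguaje natural.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per subword token: (batch, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)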
Features
- Trained on a Spanish corpus: RuPERTa is trained on a large Spanish corpus, making it well-suited for Spanish language tasks.
- Improved pretraining: RuPERTa inherits RoBERTa's enhanced pretraining procedure, which improves its ability to understand and generate text.
Documentation
Benchmarks
Work in progress.
Usage Examples
Basic Usage for POS and NER
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Mapping from class indices to IOB entity labels
id2label = {
    "0": "B-LOC",
    "1": "B-MISC",
    "2": "B-ORG",
    "3": "B-PER",
    "4": "I-LOC",
    "5": "I-MISC",
    "6": "I-ORG",
    "7": "I-PER",
    "8": "O"
}

tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')

text = "Julien, CEO de HF, nació en Francia."
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # batch of 1

outputs = model(input_ids)
logits = outputs[0]  # per-token label scores, shape (1, seq_len, num_labels)

# Skip the leading <s> token and print the predicted label for each word.
# Note: this simple alignment assumes each word maps to a single subword token.
for sequence_scores in logits:
    for index, token_scores in enumerate(sequence_scores):
        if 0 < index <= len(text.split(" ")):
            label = id2label[str(torch.argmax(token_scores).item())]
            print(text.split(" ")[index - 1] + ": " + label)
'''
Julien,: I-PER
CEO: O
de: O
HF,: B-ORG
nació: I-PER
en: I-PER
Francia.: I-LOC
'''
For POS, just change the id2label dictionary and the model path to mrm8488/RuPERTa-base-finetuned-pos, as shown in the sketch below.
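Alternatively, both checkpoints can be run through the built-in token-classification pipeline, which handles tokenization and label lookup for you. A minimal sketch (the exact label names come from each checkpoint's config, so they are an assumption here):

from transformers import pipeline

# Token-classification pipeline over the POS checkpoint; swap the model path
# for mrm8488/RuPERTa-base-finetuned-ner to get entity labels instead
pos_tagger = pipeline("ner", model="mrm8488/RuPERTa-base-finetuned-pos")

for prediction in pos_tagger("Julien, CEO de HF, nació en Francia."):
    print(prediction["word"], prediction["entity"])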
Advanced Usage for LM with pipelines
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for fill-mask models
model = AutoModelForMaskedLM.from_pretrained('mrm8488/RuPERTa-base')
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base', do_lower_case=True)

# Build a fill-mask pipeline and ask for completions of the <mask> slot
pipeline_fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
pipeline_fill_mask("España es un país muy <mask> en la UE")
[
  {
    "score": 0.1814306527376175,
    "sequence": "<s> españa es un país muy importante en la ue</s>",
    "token": 1560
  },
  {
    "score": 0.024842597544193268,
    "sequence": "<s> españa es un país muy fuerte en la ue</s>",
    "token": 2854
  },
  {
    "score": 0.02473250962793827,
    "sequence": "<s> españa es un país muy pequeño en la ue</s>",
    "token": 2948
  },
  {
    "score": 0.023991240188479424,
    "sequence": "<s> españa es un país muy antiguo en la ue</s>",
    "token": 5240
  },
  {
    "score": 0.0215945765376091,
    "sequence": "<s> españa es un país muy popular en la ue</s>",
    "token": 5782
  }
]
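The candidates come back sorted by score, so the top completion can be read off the first entry (note that the sequences are lowercased because the model is uncased). A small usage sketch:

predictions = pipeline_fill_mask("España es un país muy <mask> en la UE")
# Highest-scoring completion, e.g. "... muy importante en la ue"
print(predictions[0]["sequence"])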
Acknowledgments
I thank the 🤗/transformers team for answering my questions and Google for supporting me through the TensorFlow Research Cloud program.
Created by Manuel Romero/@mrm8488
Made with ♥ in Spain