RuPERTa: the Spanish RoBERTa
RuPERTa is a Spanish language model based on the RoBERTa architecture and trained on a large Spanish corpus, enabling it to understand and generate Spanish text effectively.
Quick Start
RuPERTa-base (uncased) is a RoBERTa model trained on an uncased version of the big Spanish corpus. RoBERTa improves on BERT's pretraining procedure: it trains the model for longer, with larger batches, on more data; removes the next sentence prediction objective; trains on longer sequences; and dynamically changes the masking pattern applied to the training data. The architecture is the same as roberta-base:

roberta.base: RoBERTa using the BERT-base architecture (125M params)
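As a quick start, the base checkpoint can be loaded with the standard transformers Auto classes. The following is a minimal sketch (assuming torch and transformers are installed) that encodes a Spanish sentence and inspects the contextual embeddings:

import torch
from transformers import AutoModel, AutoTokenizer

# Load the pretrained RuPERTa-base checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base', do_lower_case=True)
model = AutoModel.from_pretrained('mrm8488/RuPERTa-base')

# Encode a sentence and run a forward pass without tracking gradients
inputs = tokenizer("Me encanta el procesamiento del lenguaje natural.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per subword token: (batch, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)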
Features
- Trained on a Spanish corpus: RuPERTa is trained on a large Spanish corpus, making it well-suited for Spanish language tasks.
- Improved pretraining: RuPERTa inherits RoBERTa's enhanced pretraining procedure, which improves its ability to understand and generate text.
Documentation
Benchmarks
Work in progress.
Usage Examples
Basic Usage for POS and NER
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Mapping from class indices to IOB entity labels
id2label = {
    "0": "B-LOC",
    "1": "B-MISC",
    "2": "B-ORG",
    "3": "B-PER",
    "4": "I-LOC",
    "5": "I-MISC",
    "6": "I-ORG",
    "7": "I-PER",
    "8": "O"
}

tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')
model = AutoModelForTokenClassification.from_pretrained('mrm8488/RuPERTa-base-finetuned-ner')

text = "Julien, CEO de HF, nació en Francia."
input_ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)  # batch of 1

outputs = model(input_ids)
logits = outputs[0]  # per-token label scores, shape (1, seq_len, num_labels)

# Skip the leading <s> token and print the predicted label for each word.
# Note: this simple alignment assumes each word maps to a single subword token.
for sequence_scores in logits:
    for index, token_scores in enumerate(sequence_scores):
        if 0 < index <= len(text.split(" ")):
            label = id2label[str(torch.argmax(token_scores).item())]
            print(text.split(" ")[index - 1] + ": " + label)
'''
Julien,: I-PER
CEO: O
de: O
HF,: B-ORG
nació: I-PER
en: I-PER
Francia.: I-LOC
'''
For POS, just change the id2label dictionary and the model path to mrm8488/RuPERTa-base-finetuned-pos, as shown in the sketch below.
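Alternatively, both checkpoints can be run through the built-in token-classification pipeline, which handles tokenization and label lookup for you. A minimal sketch (the exact label names come from each checkpoint's config, so they are an assumption here):

from transformers import pipeline

# Token-classification pipeline over the POS checkpoint; swap the model path
# for mrm8488/RuPERTa-base-finetuned-ner to get entity labels instead
pos_tagger = pipeline("ner", model="mrm8488/RuPERTa-base-finetuned-pos")

for prediction in pos_tagger("Julien, CEO de HF, nació en Francia."):
    print(prediction["word"], prediction["entity"])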
Advanced Usage for LM with pipelines
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for fill-mask models
model = AutoModelForMaskedLM.from_pretrained('mrm8488/RuPERTa-base')
tokenizer = AutoTokenizer.from_pretrained('mrm8488/RuPERTa-base', do_lower_case=True)

# Build a fill-mask pipeline and ask for completions of the <mask> slot
pipeline_fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
pipeline_fill_mask("España es un país muy <mask> en la UE")
[
  {
    "score": 0.1814306527376175,
    "sequence": "<s> españa es un país muy importante en la ue</s>",
    "token": 1560
  },
  {
    "score": 0.024842597544193268,
    "sequence": "<s> españa es un país muy fuerte en la ue</s>",
    "token": 2854
  },
  {
    "score": 0.02473250962793827,
    "sequence": "<s> españa es un país muy pequeño en la ue</s>",
    "token": 2948
  },
  {
    "score": 0.023991240188479424,
    "sequence": "<s> españa es un país muy antiguo en la ue</s>",
    "token": 5240
  },
  {
    "score": 0.0215945765376091,
    "sequence": "<s> españa es un país muy popular en la ue</s>",
    "token": 5782
  }
]
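The candidates come back sorted by score, so the top completion can be read off the first entry (note that the sequences are lowercased because the model is uncased). A small usage sketch:

predictions = pipeline_fill_mask("España es un país muy <mask> en la UE")
# Highest-scoring completion, e.g. "... muy importante en la ue"
print(predictions[0]["sequence"])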
Acknowledgments
I thank the 🤗/transformers team for answering my questions and Google for supporting me through the TensorFlow Research Cloud program.
Created by Manuel Romero/@mrm8488
Made with ♥ in Spain