🚀 RoBERTuito-base-deacc
RoBERTuito is a pre-trained language model for Spanish social media text, trained on 500 million tweets following RoBERTa guidelines, making it well suited to user-generated content. It comes in three variants: cased, uncased, and uncased with accents removed, and it outperforms other pre-trained language models on a benchmark of Spanish user-generated text tasks.
🚀 Quick Start
RoBERTuito is not yet fully integrated into `huggingface/transformers`. To use it, first install `pysentimiento`:

```bash
pip install pysentimiento
```

Preprocess the text with `pysentimiento.preprocessing.preprocess_tweet` before feeding it to the tokenizer:
```python
from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
preprocessed_text = preprocess_tweet(text)
tokenizer.tokenize(preprocessed_text)
```
You can see a text classification example in the companion Colab notebook.
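Beyond tokenization, the preprocessed text can be fed to the base model to obtain contextual embeddings. Below is a minimal sketch; the use of `AutoModel` with PyTorch tensors is our assumption, not something shown in the original card:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

model_name = 'pysentimiento/robertuito-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Preprocess first, exactly as in the quick-start example above
text = preprocess_tweet("Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token: (1, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)
```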
✨ Key Features
- Targeted pre-training: trained specifically on Spanish social media text, so it handles user-generated content well.
- Three variants: cased, uncased, and uncased with accents removed, covering different needs (loading each variant is sketched below).
- Strong performance: outperforms other Spanish pre-trained language models such as BETO, BERTin, and RoBERTa-BNE on benchmark tasks including hate speech detection, sentiment and emotion analysis, and irony detection.
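All three variants load the same way; only the checkpoint name changes. A minimal sketch, assuming the hub ids follow the `pysentimiento/robertuito-base-*` pattern used in the examples in this card:

```python
from transformers import AutoTokenizer

# The three published variants (ids assumed from the examples in this card)
VARIANTS = [
    'pysentimiento/robertuito-base-cased',    # case-sensitive
    'pysentimiento/robertuito-base-uncased',  # lowercased
    'pysentimiento/robertuito-base-deacc',    # lowercased, accents removed
]

for name in VARIANTS:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.vocab_size)
```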
📦 Installation
To use RoBERTuito, first install `pysentimiento`:

```bash
pip install pysentimiento
```
💻 Usage Examples
Basic usage
```python
from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
preprocessed_text = preprocess_tweet(text)
tokenizer.tokenize(preprocessed_text)
```
Advanced usage
For text classification tasks, see the example in the companion notebook; a minimal setup is sketched below.
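As a starting point for that workflow, here is a minimal sketch of wiring RoBERTuito into `AutoModelForSequenceClassification`; the two-label setup and the example texts are illustrative assumptions, not part of the original card:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

model_name = 'pysentimiento/robertuito-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# A fresh classification head on top of the pre-trained encoder
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

texts = ["Esto es genial 🤣", "Qué día tan feo"]
# Remember to preprocess every tweet before tokenizing
inputs = tokenizer(
    [preprocess_tweet(t) for t in texts],
    padding=True,
    truncation=True,
    return_tensors="pt",
)
logits = model(**inputs).logits
print(logits.shape)  # (2, 2): one row per tweet, one column per label
```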
📚 Documentation
Model release

We release the pre-trained models on the Hugging Face model hub:

- pysentimiento/robertuito-base-cased
- pysentimiento/robertuito-base-uncased
- pysentimiento/robertuito-base-deacc
Masked LM testing

When testing the masked LM, bear in mind that spaces are encoded inside SentencePiece tokens. So, if you want to test

```
Este es un día<mask>
```

do not put a space between `día` and `<mask>`.
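To check this behavior end to end, the `fill-mask` pipeline can be used; this is a minimal sketch under our own assumptions (the pipeline call is not part of the original instructions):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model='pysentimiento/robertuito-base-cased')

# No space between "día" and <mask>: spaces live inside SentencePiece tokens
for pred in fill_mask("Este es un día<mask>"):
    print(f"{pred['token_str']!r}: {pred['score']:.3f}")
```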
Performance comparison

| Model | Hate speech detection | Sentiment analysis | Emotion analysis | Irony detection | Average score |
| --- | --- | --- | --- | --- | --- |
| robertuito-uncased | 0.801 ± 0.010 | 0.707 ± 0.004 | 0.551 ± 0.011 | 0.736 ± 0.008 | 0.6987 |
| robertuito-deacc | 0.798 ± 0.008 | 0.702 ± 0.004 | 0.543 ± 0.015 | 0.740 ± 0.006 | 0.6958 |
| robertuito-cased | 0.790 ± 0.012 | 0.701 ± 0.012 | 0.519 ± 0.032 | 0.719 ± 0.023 | 0.6822 |
| roberta-bne | 0.766 ± 0.015 | 0.669 ± 0.006 | 0.533 ± 0.011 | 0.723 ± 0.017 | 0.6726 |
| bertin | 0.767 ± 0.005 | 0.665 ± 0.003 | 0.518 ± 0.012 | 0.716 ± 0.008 | 0.6666 |
| beto-cased | 0.768 ± 0.012 | 0.665 ± 0.004 | 0.521 ± 0.012 | 0.706 ± 0.007 | 0.6651 |
| beto-uncased | 0.757 ± 0.012 | 0.649 ± 0.005 | 0.521 ± 0.006 | 0.702 ± 0.008 | 0.6571 |
📄 Citation
If you use RoBERTuito, please cite our paper:
```bibtex
@inproceedings{perez-etal-2022-robertuito,
    title = "{R}o{BERT}uito: a pre-trained language model for social media text in {S}panish",
    author = "P{\'e}rez, Juan Manuel and
      Furman, Dami{\'a}n Ariel and
      Alonso Alemany, Laura and
      Luque, Franco M.",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.785",
    pages = "7235--7243",
    abstract = "Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for natural language processing tasks. Recently, some works geared towards pre-training specially-crafted models for particular domains, such as scientific papers, medical documents, user-generated texts, among others. These domain-specific models have been shown to improve performance significantly in most tasks; however, for languages other than English, such models are not widely available. In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models in Spanish. In addition to this, our model has some cross-lingual abilities, achieving top results for English-Spanish tasks of the Linguistic Code-Switching Evaluation benchmark (LinCE) and also competitive performance against monolingual models in English Twitter tasks. To facilitate further research, we make RoBERTuito publicly available at the HuggingFace model hub together with the dataset used to pre-train it.",
}
```
⚠️ Important Note

RoBERTuito is not yet fully integrated into `huggingface/transformers`. Install `pysentimiento` first and preprocess the text before use.
💡 Usage Tips

When using RoBERTuito, preprocess input text with `pysentimiento.preprocessing.preprocess_tweet` for better results. For text classification and similar tasks, see the examples in the companion Colab notebook.