RoBERTuito: a pre-trained language model for processing Spanish social media text

Robertuito Base Deacc

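Developed by pysentimiento

RoBERTuito is a pre-trained language model for Spanish social media text. It follows the RoBERTa training recipe, was trained on 500 million tweets, and comes in three variants: cased, uncased, and uncased with accents removed (deacc).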

Released: 3/2/2022

Model Overview

RoBERTuito is a pre-trained language model optimized for user-generated content in Spanish. It outperforms comparable Spanish models on several social media text analysis tasks.

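Model Features

Optimized for social media

Trained specifically on Spanish social media text, so it handles the informal language of user-generated content well.

Strong across tasks

On all four benchmark tasks (hate speech detection, sentiment analysis, emotion analysis, and irony detection), the uncased and deaccented variants score higher than the comparable Spanish models in the benchmark table below.

Choice of variants

Cased, uncased, and uncased-plus-deaccented versions are available to suit different applications.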

Model Capabilities

Social media text understanding

Hate speech detection

Sentiment analysis

Emotion analysis

Irony detection

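Use Cases

Social media analysis

Hate speech detection

Identifies hate speech in Spanish social media content.

F1 score of 0.798 on the HatEval dataset

Sentiment analysis

Analyzes the sentiment polarity of Spanish tweets.

Score of 0.702 on the TASS 2020 dataset

Content moderation

Inappropriate content identification

Automatically identifies inappropriate content in Spanish social media.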

🚀 robertuito-base-deacc

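RoBERTuito is a pre-trained language model for Spanish social media text. It was trained on 500 million tweets following the RoBERTa guidelines and performs well across a range of tasks.

🚀 Quick Start

RoBERTuito is a pre-trained language model for user-generated content in Spanish, trained on 500 million tweets following the RoBERTa guidelines. It is available in three variants: cased, uncased, and uncased with accents removed (deacc).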
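
To make the difference between the three variants concrete, the sketch below applies an approximation of each variant's text normalization to a sample string. The function name `normalize_for_variant` is hypothetical and exists only for illustration. The real normalization happens inside each model's tokenizer, which may treat some characters (for example ñ) differently from this naive version.

```python
import unicodedata

def normalize_for_variant(text: str, variant: str) -> str:
    """Approximate the text normalization implied by each RoBERTuito variant.

    Hypothetical helper for illustration only; the actual tokenizers may differ.
    """
    if variant == "cased":
        return text                      # keep case and accents
    if variant == "uncased":
        return text.lower()              # fold case, keep accents
    if variant == "deacc":
        lowered = text.lower()
        decomposed = unicodedata.normalize("NFD", lowered)
        # Drop combining marks (category Mn); note this also maps ñ to n.
        return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    raise ValueError(f"unknown variant: {variant!r}")

sample = "Qué DÍA tan lindo"
for variant in ("cased", "uncased", "deacc"):
    print(variant, "->", normalize_for_variant(sample, variant))
```

In practice the choice is made by picking the matching checkpoint; the helper only shows what kind of input each one expects to see.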

Read the full paper | GitHub repository

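✨ Key Features

RoBERTuito was tested on a benchmark of tasks involving user-generated text in Spanish, where it outperforms other pre-trained language models such as BETO, BERTin, and RoBERTa-BNE. The four evaluation tasks are hate speech detection (SemEval 2019 Task 5, HatEval dataset), sentiment analysis and emotion analysis (TASS 2020 datasets), and irony detection (IroSvA 2019 dataset).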

Model	Hate speech detection	Sentiment analysis	Emotion analysis	Irony detection	Score
robertuito-uncased	0.801 ± 0.010	0.707 ± 0.004	0.551 ± 0.011	0.736 ± 0.008	0.6987
robertuito-deacc	0.798 ± 0.008	0.702 ± 0.004	0.543 ± 0.015	0.740 ± 0.006	0.6958
robertuito-cased	0.790 ± 0.012	0.701 ± 0.012	0.519 ± 0.032	0.719 ± 0.023	0.6822
roberta-bne	0.766 ± 0.015	0.669 ± 0.006	0.533 ± 0.011	0.723 ± 0.017	0.6726
bertin	0.767 ± 0.005	0.665 ± 0.003	0.518 ± 0.012	0.716 ± 0.008	0.6666
beto-cased	0.768 ± 0.012	0.665 ± 0.004	0.521 ± 0.012	0.706 ± 0.007	0.6651
beto-uncased	0.757 ± 0.012	0.649 ± 0.005	0.521 ± 0.006	0.702 ± 0.008	0.6571
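
The Score column appears to be the unweighted mean of the four task scores. That reading is inferred from the numbers and is not stated on the page. The short check below recomputes the mean from the rounded per-task values for four of the rows; because the inputs are rounded to three decimals, small differences from the reported score are expected.

```python
from statistics import mean

# Task scores copied from the table above (hate speech, sentiment, emotion, irony)
# together with the reported overall score.
rows = {
    "robertuito-uncased": ([0.801, 0.707, 0.551, 0.736], 0.6987),
    "robertuito-deacc":   ([0.798, 0.702, 0.543, 0.740], 0.6958),
    "robertuito-cased":   ([0.790, 0.701, 0.519, 0.719], 0.6822),
    "roberta-bne":        ([0.766, 0.669, 0.533, 0.723], 0.6726),
}

for name, (scores, reported) in rows.items():
    computed = mean(scores)
    # Compare the recomputed mean with the value reported in the table.
    print(f"{name}: computed {computed:.4f}, reported {reported:.4f}")
```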

The pre-trained models are available on the HuggingFace model hub.

📚 Documentation

Masked language model (Masked LM)

When testing the masked language model, keep in mind that spaces are encoded inside SentencePiece tokens. So, to test a prompt such as

Este es un día<mask>

do not put a space between día and <mask>.
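
A minimal sketch of that rule in code: the hypothetical helper `attach_mask` removes any whitespace directly before the mask token, and `top_predictions` passes the result to the `fill-mask` pipeline from transformers. Both function names are assumptions made for this example, and the model name is taken from this page's title. Calling `top_predictions` downloads the model, and its output depends on the checkpoint, so none is shown.

```python
import re

MASK = "<mask>"

def attach_mask(prompt: str, mask_token: str = MASK) -> str:
    """Remove whitespace directly before the mask token (hypothetical helper)."""
    return re.sub(r"\s+" + re.escape(mask_token), mask_token, prompt)

def top_predictions(prompt: str, model_name: str = "pysentimiento/robertuito-base-deacc"):
    """Return (token, score) candidates for a prompt; needs transformers installed."""
    from transformers import pipeline  # imported lazily: only needed when called
    fill = pipeline("fill-mask", model=model_name)
    return [(c["token_str"], c["score"]) for c in fill(attach_mask(prompt))]

print(attach_mask("Este es un día <mask>"))
# prints: Este es un día<mask>
```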

Usage

IMPORTANT: read this first

RoBERTuito is not yet fully integrated into huggingface/transformers. To use it, first install pysentimiento:

pip install pysentimiento

Then preprocess the text with pysentimiento.preprocessing.preprocess_tweet before feeding it to the tokenizer:

from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

# Load the tokenizer; swap in 'pysentimiento/robertuito-base-deacc' for the deaccented variant
tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
# Normalize user handles, hashtags, and emoji before tokenizing
preprocessed_text = preprocess_tweet(text)

tokenizer.tokenize(preprocessed_text)
# ['<s>','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji','</s>']

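Work is under way to integrate this preprocessing step into the tokenizer inside the transformers library.

A text classification example is available in the accompanying notebook.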
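
As a rough sketch of what such a classifier involves (this is not the notebook's code): the base model is loaded with a freshly initialized classification head, the tweet is preprocessed as described above, and the logits are turned into a label. The label names, the choice of three labels, and all helper names are assumptions for illustration. The head is untrained here, so the predictions of `classify` are meaningless until the model is fine-tuned on labeled data.

```python
import math

LABELS = ["NEG", "NEU", "POS"]  # hypothetical label set

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_label(logits, labels=LABELS):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

def classify(text, model_name="pysentimiento/robertuito-base-deacc"):
    """Preprocess, tokenize, and score one tweet; needs torch, transformers, pysentimiento."""
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    from pysentimiento.preprocessing import preprocess_tweet

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The classification head is randomly initialized here and must be fine-tuned.
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABELS)
    )
    inputs = tokenizer(preprocess_tweet(text), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return pick_label(logits)

# The pure-Python part can be tried without downloading anything.
print(pick_label([0.2, 1.5, -0.3]))
```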

Citation

If you use RoBERTuito, please cite the following paper:

@inproceedings{perez-etal-2022-robertuito,
    title = "{R}o{BERT}uito: a pre-trained language model for social media text in {S}panish",
    author = "P{\'e}rez, Juan Manuel  and
      Furman, Dami{\'a}n Ariel  and
      Alonso Alemany, Laura  and
      Luque, Franco M.",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.785",
    pages = "7235--7243",
    abstract = "Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for natural language processing tasks. Recently, some works geared towards pre-training specially-crafted models for particular domains, such as scientific papers, medical documents, user-generated texts, among others. These domain-specific models have been shown to improve performance significantly in most tasks; however, for languages other than English, such models are not widely available. In this work, we present RoBERTuito, a pre-trained language model for user-generated text in Spanish, trained on over 500 million tweets. Experiments on a benchmark of tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models in Spanish. In addition to this, our model has some cross-lingual abilities, achieving top results for English-Spanish tasks of the Linguistic Code-Switching Evaluation benchmark (LinCE) and also competitive performance against monolingual models in English Twitter tasks. To facilitate further research, we make RoBERTuito publicly available at the HuggingFace model hub together with the dataset used to pre-train it.",
}