xlmr_formality_classifier open-source model - multilingual text formality classification for English, French, Italian, and Portuguese

Xlmr Formality Classifier

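Developed by s-nlp

A multilingual text formality classification model based on XLM-Roberta, supporting English, French, Italian, and Portuguese

Text classification

Transformers

Multilingual #Multilingual formality classification #Text style detection #XLM-Roberta model

Downloads: 795

Released: 3/2/2022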

Model overview

This model detects the formality of a text and classifies the input as either "formal" or "informal". It was trained on XFORMAL, a multilingual formality classification dataset.

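Model features

Multilingual support

Supports formality classification in four languages: English, French, Italian, and Portuguese

Accuracy

Reaches 85.2% accuracy on English; accuracy on French, Italian, and Portuguese ranges from 76.3% to 79.6%

Transformer-based architecture

Uses XLM-Roberta-base as the base model, a pretrained multilingual Transformer encoder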

Model capabilities

Text formality classification

Multilingual text analysis

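Use cases

Text processing

Formal document screening

Automatically identifies and separates formal and informal documents

Can be used for automatic classification in document management systems

Writing assistance tools

Lets users check the formality of their text and receive writing advice

Helps improve writing quality and meet the formality requirements of the target setting

Content moderation

Content appropriateness checks

Identifies informal content that is unsuitable for formal settings

Can be used for automated moderation of forums and comment sections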

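🚀 Multilingual formality classification model

This is a multilingual model specialized in text formality classification. It builds on the research published in the paper cited below and classifies text in several languages as formal or informal, with accuracy between roughly 76% and 85% depending on the language (see Results).

🚀 Quick start

This model was presented in the paper "Detecting Text Formality: A Study of Text Classification Approaches". It is an XLM-Roberta-based classifier trained on XFORMAL, a multilingual formality classification dataset.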

Model information

Attribute	Details
Supported languages	en, fr, it, pt
Tags	formal or informal classification
License	OpenRAIL++
Base model	FacebookAI/xlm-roberta-base

Results

All languages

Class	Precision	Recall	F1 score	Support
0 (formal)	0.744912	0.927790	0.826354	108019
1 (informal)	0.889088	0.645630	0.748048	96845
Accuracy			0.794405	204864
Macro avg	0.817000	0.786710	0.787201	204864
Weighted avg	0.813068	0.794405	0.789337	204864
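
The summary rows in the all-languages table follow directly from the two per-class rows. As a quick sanity check, the sketch below recomputes the F1 scores, the macro and weighted averages, and the accuracy from the reported precision, recall, and sample counts. The variable names are illustrative, and tiny differences in the last digits are expected because the inputs are already rounded.

```python
# Per-class values copied from the all-languages table:
# class id -> (precision, recall, support). Class 0 is formal, class 1 is informal.
rows = {
    0: (0.744912, 0.927790, 108019),
    1: (0.889088, 0.645630, 96845),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total = sum(support for _, _, support in rows.values())
f1_per_class = {label: f1(p, r) for label, (p, r, _) in rows.items()}

# Macro average: plain mean over classes.
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)
# Weighted average: mean over classes weighted by support.
weighted_f1 = sum(f1_per_class[label] * s for label, (_, _, s) in rows.items()) / total
# Accuracy equals the support-weighted average of per-class recall.
accuracy = sum(r * s for _, r, s in rows.values()) / total

print(f"F1 per class: {f1_per_class}")
print(f"macro F1={macro_f1:.6f} weighted F1={weighted_f1:.6f} accuracy={accuracy:.6f}")
```

The recomputed values agree with the table to within rounding, which also confirms that the "weighted avg" recall and the accuracy row are the same quantity.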

English (EN)

Class	Precision	Recall	F1 score	Support
0 (formal)	0.800053	0.962981	0.873988	22151
1 (informal)	0.945106	0.725899	0.821124	19449
Accuracy			0.852139	41600
Macro avg	0.872579	0.844440	0.847556	41600
Weighted avg	0.867869	0.852139	0.849273	41600

French (FR)

Class	Precision	Recall	F1 score	Support
0 (formal)	0.746709	0.925738	0.826641	21505
1 (informal)	0.887305	0.650592	0.750731	19327
Accuracy			0.795504	40832
Macro avg	0.817007	0.788165	0.788686	40832
Weighted avg	0.813257	0.795504	0.790711	40832

Italian (IT)

Class	Precision	Recall	F1 score	Support
0 (formal)	0.721282	0.914669	0.806545	21528
1 (informal)	0.864887	0.607135	0.713445	19368
Accuracy			0.769024	40896
Macro avg	0.793084	0.760902	0.759995	40896
Weighted avg	0.789292	0.769024	0.762454	40896

Portuguese (PT)

Class	Precision	Recall	F1 score	Support
0 (formal)	0.717546	0.908167	0.801681	21637
1 (informal)	0.853628	0.599700	0.704481	19323
Accuracy			0.762646	40960
Macro avg	0.785587	0.753933	0.753081	40960
Weighted avg	0.781743	0.762646	0.755826	40960
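
In every language the recall for class 1 (informal, per the `id2formality` mapping in the usage example) is well below the recall for class 0 (formal). To make that concrete, the sketch below reconstructs approximate confusion-matrix counts from the reported recall and sample counts. The helper `confusion_counts` and the dictionary layout are illustrative, and the counts are rounded estimates rather than published figures.

```python
# (recall, support) per class plus accuracy, copied from the per-language tables.
reported = {
    "EN": {"formal": (0.962981, 22151), "informal": (0.725899, 19449), "accuracy": 0.852139},
    "FR": {"formal": (0.925738, 21505), "informal": (0.650592, 19327), "accuracy": 0.795504},
    "IT": {"formal": (0.914669, 21528), "informal": (0.607135, 19368), "accuracy": 0.769024},
    "PT": {"formal": (0.908167, 21637), "informal": (0.599700, 19323), "accuracy": 0.762646},
}


def confusion_counts(lang_rows: dict) -> dict:
    """Estimate confusion-matrix counts from per-class recall and support."""
    recall_formal, n_formal = lang_rows["formal"]
    recall_informal, n_informal = lang_rows["informal"]
    formal_ok = round(recall_formal * n_formal)        # formal texts predicted formal
    informal_ok = round(recall_informal * n_informal)  # informal texts predicted informal
    return {
        "formal_as_formal": formal_ok,
        "formal_as_informal": n_formal - formal_ok,
        "informal_as_informal": informal_ok,
        "informal_as_formal": n_informal - informal_ok,
    }


counts = {lang: confusion_counts(lang_rows) for lang, lang_rows in reported.items()}
for lang, c in counts.items():
    correct = c["formal_as_formal"] + c["informal_as_informal"]
    print(lang, c, f"accuracy={correct / sum(c.values()):.6f}")
```

The reconstructed accuracies match the tables, and in each language far more informal texts are labelled formal than the reverse, so most of the errors come from informal text being missed.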

💻 Usage example

Basic usage

import torch
from transformers import XLMRobertaTokenizerFast, XLMRobertaForSequenceClassification

# load tokenizer and model weights
tokenizer = XLMRobertaTokenizerFast.from_pretrained('s-nlp/xlmr_formality_classifier')
model = XLMRobertaForSequenceClassification.from_pretrained('s-nlp/xlmr_formality_classifier')

id2formality = {0: "formal", 1: "informal"}
texts = [
    "I like you. I love you",
    "Hey, what's up?",
    "Siema, co porabiasz?",
    "I feel deep regret and sadness about the situation in international politics.",
]

# prepare the input; pad only to the longest text in the batch
encoding = tokenizer(
    texts,
    add_special_tokens=True,
    return_token_type_ids=True,
    truncation=True,
    padding=True,
    return_tensors="pt",
)

# inference (no gradients are needed)
with torch.no_grad():
    output = model(**encoding)

# convert the logits of each text to per-class probabilities
formality_scores = [
    {id2formality[idx]: score for idx, score in enumerate(text_scores.tolist())}
    for text_scores in output.logits.softmax(dim=1)
]
print(formality_scores)

Example output:

[{'formal': 0.993225634098053, 'informal': 0.006774314679205418},
 {'formal': 0.8807966113090515, 'informal': 0.1192033663392067},
 {'formal': 0.936184287071228, 'informal': 0.06381577253341675},
 {'formal': 0.9986615180969238, 'informal': 0.0013385231141000986}]
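
Each dictionary in the output holds the probabilities for both classes, so an application still has to decide which label to report. The sketch below is one way to do that; `to_label` and its `informal_threshold` parameter are hypothetical helpers and not part of the model or of Transformers. The scores are the example output above rounded to four decimals, so the snippet runs without downloading the model.

```python
# Score dictionaries in the same shape as the example output
# (values rounded to four decimals).
formality_scores = [
    {"formal": 0.9932, "informal": 0.0068},
    {"formal": 0.8808, "informal": 0.1192},
    {"formal": 0.9362, "informal": 0.0638},
    {"formal": 0.9987, "informal": 0.0013},
]


def to_label(scores: dict, informal_threshold: float = 0.5) -> str:
    """Return "informal" if its probability reaches the threshold, else "formal".

    `informal_threshold` is a hypothetical tuning knob; 0.5 is equivalent to
    picking the class with the higher probability.
    """
    return "informal" if scores["informal"] >= informal_threshold else "formal"


labels = [to_label(s) for s in formality_scores]
print(labels)

# A lower threshold flags more texts as informal.
strict_labels = [to_label(s, informal_threshold=0.1) for s in formality_scores]
print(strict_labels)
```

Since the reported recall on the informal class is lower than on the formal class, some applications may prefer a threshold below 0.5. Any such value is an assumption that should be validated on your own data.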

📚 Citation

@inproceedings{dementieva-etal-2023-detecting,
    title = "Detecting Text Formality: A Study of Text Classification Approaches",
    author = "Dementieva, Daryna  and
      Babakov, Nikolay  and
      Panchenko, Alexander",
    editor = "Mitkov, Ruslan  and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.31",
    pages = "274--284",
    abstract = "Formality is one of the important characteristics of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Before, two large-scale datasets were introduced for multiple languages featuring formality annotation{---}GYAFC and X-FORMAL. However, they were primarily used for the training of style transfer models. At the same time, the detection of text formality on its own may also be a useful application. This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments {--} monolingual, multilingual, and cross-lingual. The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task, while Transformer-based classifiers are more stable to cross-lingual knowledge transfer.",
}