🚀 Text Formality Classification Model
This model classifies the formality level of text. Built on the XLM-Roberta architecture and trained on a multilingual dataset, it distinguishes formal from informal text across several languages, supporting downstream natural language processing tasks.
🚀 Quick Start
This is the model presented in the paper "Detecting Text Formality: A Study of Text Classification Approaches". It is an XLM-Roberta-based classifier trained on the XFORMAL multilingual formality classification dataset.
✨ Key Features
- Multilingual support: covers English (en), French (fr), Italian (it), and Portuguese (pt).
- Formality classification: labels text as either formal or informal.
📦 Installation
The model requires the transformers library. The usage example below returns PyTorch tensors (`return_tensors="pt"`), so torch is needed as well:

```bash
pip install transformers torch
```
💻 Usage Examples
Basic Usage
```python
from transformers import XLMRobertaTokenizerFast, XLMRobertaForSequenceClassification

# Load the tokenizer and the fine-tuned classifier
tokenizer = XLMRobertaTokenizerFast.from_pretrained('s-nlp/xlmr_formality_classifier')
model = XLMRobertaForSequenceClassification.from_pretrained('s-nlp/xlmr_formality_classifier')

# Mapping from class index to label
id2formality = {0: "formal", 1: "informal"}

texts = [
    "I like you. I love you",
    "Hey, what's up?",
    "Siema, co porabiasz?",
    "I feel deep regret and sadness about the situation in international politics.",
]

# Tokenize the batch and return PyTorch tensors
encoding = tokenizer(
    texts,
    add_special_tokens=True,
    return_token_type_ids=True,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

# Run the classifier
output = model(**encoding)

# Convert logits to per-class probabilities for each input text
formality_scores = [
    {id2formality[idx]: score for idx, score in enumerate(text_scores.tolist())}
    for text_scores in output.logits.softmax(dim=1)
]
print(formality_scores)
```
Running the code above produces the following output:

```text
[{'formal': 0.993225634098053, 'informal': 0.006774314679205418},
 {'formal': 0.8807966113090515, 'informal': 0.1192033663392067},
 {'formal': 0.936184287071228, 'informal': 0.06381577253341675},
 {'formal': 0.9986615180969238, 'informal': 0.0013385231141000986}]
```
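For quick experiments, the same checkpoint can also be loaded through the high-level pipeline API. This is a minimal sketch, not part of the original card; note that the label names in the output depend on the checkpoint's id2label config and may appear as LABEL_0 / LABEL_1 if that mapping is unset:

```python
# Minimal sketch (assumption, not from the original card): using the
# high-level text-classification pipeline instead of the manual loop above.
from transformers import pipeline

classifier = pipeline("text-classification", model="s-nlp/xlmr_formality_classifier")

# Label names depend on the checkpoint's id2label config; with no mapping
# they appear as LABEL_0 (formal) and LABEL_1 (informal).
print(classifier(["Hey, what's up?", "I feel deep regret about the situation."]))
```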
📚 Detailed Documentation
Model Evaluation Results
All Languages

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0 (formal) | 0.744912 | 0.927790 | 0.826354 | 108019 |
| 1 (informal) | 0.889088 | 0.645630 | 0.748048 | 96845 |
| Accuracy | | | 0.794405 | 204864 |
| Macro avg | 0.817000 | 0.786710 | 0.787201 | 204864 |
| Weighted avg | 0.813068 | 0.794405 | 0.789337 | 204864 |
English (EN)

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0 (formal) | 0.800053 | 0.962981 | 0.873988 | 22151 |
| 1 (informal) | 0.945106 | 0.725899 | 0.821124 | 19449 |
| Accuracy | | | 0.852139 | 41600 |
| Macro avg | 0.872579 | 0.844440 | 0.847556 | 41600 |
| Weighted avg | 0.867869 | 0.852139 | 0.849273 | 41600 |
French (FR)

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0 (formal) | 0.746709 | 0.925738 | 0.826641 | 21505 |
| 1 (informal) | 0.887305 | 0.650592 | 0.750731 | 19327 |
| Accuracy | | | 0.795504 | 40832 |
| Macro avg | 0.817007 | 0.788165 | 0.788686 | 40832 |
| Weighted avg | 0.813257 | 0.795504 | 0.790711 | 40832 |
Italian (IT)

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0 (formal) | 0.721282 | 0.914669 | 0.806545 | 21528 |
| 1 (informal) | 0.864887 | 0.607135 | 0.713445 | 19368 |
| Accuracy | | | 0.769024 | 40896 |
| Macro avg | 0.793084 | 0.760902 | 0.759995 | 40896 |
| Weighted avg | 0.789292 | 0.769024 | 0.762454 | 40896 |
Portuguese (PT)

| Class | Precision | Recall | F1-score | Support |
|-------|-----------|--------|----------|---------|
| 0 (formal) | 0.717546 | 0.908167 | 0.801681 | 21637 |
| 1 (informal) | 0.853628 | 0.599700 | 0.704481 | 19323 |
| Accuracy | | | 0.762646 | 40960 |
| Macro avg | 0.785587 | 0.753933 | 0.753081 | 40960 |
| Weighted avg | 0.781743 | 0.762646 | 0.755826 | 40960 |
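The tables above follow the layout of scikit-learn's classification_report. As a hedged sketch (not part of the original card), a comparable report can be generated for your own labeled split; `eval_texts` and `eval_labels` below are hypothetical placeholders, not the XFORMAL test data, which is not bundled with the model:

```python
# Hedged sketch: reproducing a per-class metrics report with scikit-learn.
# `eval_texts` and `eval_labels` (0 = formal, 1 = informal) are placeholders
# standing in for a real labeled evaluation split.
import torch
from sklearn.metrics import classification_report
from transformers import XLMRobertaTokenizerFast, XLMRobertaForSequenceClassification

tokenizer = XLMRobertaTokenizerFast.from_pretrained('s-nlp/xlmr_formality_classifier')
model = XLMRobertaForSequenceClassification.from_pretrained('s-nlp/xlmr_formality_classifier')

eval_texts = ["I hereby request your assistance.", "Hey, what's up?"]  # placeholder data
eval_labels = [0, 1]                                                   # placeholder gold labels

# Predict the most likely class for each text
encoding = tokenizer(eval_texts, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    preds = model(**encoding).logits.argmax(dim=1).tolist()

print(classification_report(eval_labels, preds, target_names=["formal", "informal"], digits=6))
```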
📄 License
This model is released under the OpenRAIL++ license, which supports the development of technologies that serve the public good, in both industrial and academic settings.
📖 Citation
If you use this model, please cite the following paper:
```bibtex
@inproceedings{dementieva-etal-2023-detecting,
    title = "Detecting Text Formality: A Study of Text Classification Approaches",
    author = "Dementieva, Daryna  and
      Babakov, Nikolay  and
      Panchenko, Alexander",
    editor = "Mitkov, Ruslan  and
      Angelova, Galia",
    booktitle = "Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing",
    month = sep,
    year = "2023",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2023.ranlp-1.31",
    pages = "274--284",
    abstract = "Formality is one of the important characteristics of text documents. The automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Before, two large-scale datasets were introduced for multiple languages featuring formality annotation{---}GYAFC and X-FORMAL. However, they were primarily used for the training of style transfer models. At the same time, the detection of text formality on its own may also be a useful application. This work proposes the first to our knowledge systematic study of formality detection methods based on statistical, neural-based, and Transformer-based machine learning methods and delivers the best-performing models for public usage. We conducted three types of experiments {--} monolingual, multilingual, and cross-lingual. The study shows the overcome of Char BiLSTM model over Transformer-based ones for the monolingual and multilingual formality classification task, while Transformer-based classifiers are more stable to cross-lingual knowledge transfer.",
}
```