# 🚀 Multilingual Politeness Classification Model

This model is based on `xlm-roberta-large`, fine-tuned on the English subset of the TyDiP dataset; see the original paper for details. It performs text classification, rating the politeness of text in multiple languages.
## 🚀 Quick Start

Fine-tuned from `xlm-roberta-large` on the English subset of the TyDiP dataset, the model can be used for politeness classification across many languages.
## ✨ Key Features

- **Multilingual support**: In the paper, the model was evaluated on English and nine other languages (Hindi, Korean, Spanish, Tamil, French, Vietnamese, Russian, Afrikaans, Hungarian). Given its strong performance and XLM-R's cross-lingual ability, the fine-tuned model is likely to work on additional languages as well.
- **Strong base model**: Fine-tuned from `xlm-roberta-large`, leveraging its pretrained language knowledge.
## 📦 Installation

No specific installation steps are documented. The model can be used via the `transformers` library; make sure it is installed:

```bash
pip install transformers
```
## 💻 Usage Examples

### Basic Usage

```python
from transformers import pipeline

classifier = pipeline(task="text-classification", model="Genius1237/xlm-roberta-large-tydip")
sentences = ["Could you please get me a glass of water", "mere liye पानी का एक गिलास ले आओ "]
print(classifier(sentences))
```
### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('Genius1237/xlm-roberta-large-tydip')
model = AutoModelForSequenceClassification.from_pretrained('Genius1237/xlm-roberta-large-tydip')

text = "Could you please get me a glass of water"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
prediction = torch.argmax(output.logits).item()
print(model.config.id2label[prediction])
```
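The advanced example returns only the argmax label. If confidence scores are needed as well, a softmax over the logits yields per-class probabilities. The sketch below demonstrates that step on made-up logits (illustrative values only, not actual model output) so it runs without downloading the model:

```python
import torch

# Hypothetical logits for two sentences from a binary politeness
# classifier (illustrative values, not real model output)
logits = torch.tensor([[2.0, -1.0],
                       [-0.5, 1.5]])

# Softmax converts logits into per-class probabilities that sum to 1
probs = torch.softmax(logits, dim=-1)

# The predicted class is the index of the highest probability;
# map it to a label name via model.config.id2label in the real pipeline
predictions = probs.argmax(dim=-1)
print(probs)
print(predictions)
```

In practice, replace the hypothetical `logits` tensor with `model(**encoded_input).logits` from the advanced example above.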
## 📚 Documentation

### Evaluation Results

Politeness classification accuracy on the TyDiP test set for the 10 languages:

| Language | Accuracy |
| --- | --- |
| English (en) | 0.892 |
| Hindi (hi) | 0.868 |
| Korean (ko) | 0.784 |
| Spanish (es) | 0.84 |
| Tamil (ta) | 0.78 |
| French (fr) | 0.82 |
| Vietnamese (vi) | 0.844 |
| Russian (ru) | 0.668 |
| Afrikaans (af) | 0.856 |
| Hungarian (hu) | 0.812 |
### Citation

```bibtex
@inproceedings{srinivasan-choi-2022-tydip,
    title = "{T}y{D}i{P}: A Dataset for Politeness Classification in Nine Typologically Diverse Languages",
    author = "Srinivasan, Anirudh and
      Choi, Eunsol",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.420",
    doi = "10.18653/v1/2022.findings-emnlp.420",
    pages = "5723--5738",
    abstract = "We study politeness phenomena in nine typologically diverse languages. Politeness is an important facet of communication and is sometimes argued to be cultural-specific, yet existing computational linguistic study is limited to English. We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples. We evaluate how well multilingual models can identify politeness levels {--} they show a fairly robust zero-shot transfer ability, yet fall short of estimated human accuracy significantly. We further study mapping the English politeness strategy lexicon into nine languages via automatic translation and lexicon induction, analyzing whether each strategy{'}s impact stays consistent across languages. Lastly, we empirically study the complicated relationship between formality and politeness through transfer experiments. We hope our dataset will support various research questions and applications, from evaluating multilingual models to constructing polite multilingual agents.",
}
```
## 📄 License

This project is licensed under the MIT License.