# 🚀 Multilingual Politeness Classification Model

This model is based on `xlm-roberta-large`, fine-tuned on the English subset of the TyDiP dataset; see the original paper for details. It performs text classification, rating the politeness of text in multiple languages.
## 🚀 Quick Start

Fine-tuned from `xlm-roberta-large` on the English subset of the TyDiP dataset, the model can be used for politeness classification across many languages.
## ✨ Key Features

- **Multilingual support**: In the paper, the model was evaluated on English and nine other languages (Hindi, Korean, Spanish, Tamil, French, Vietnamese, Russian, Afrikaans, Hungarian). Given its strong performance and XLM-R's cross-lingual ability, the fine-tuned model is likely to work on additional languages as well.
- **Strong base model**: Fine-tuned from `xlm-roberta-large`, leveraging its pretrained language knowledge.
## 📦 Installation

No specific installation steps are documented. The model can be used via the `transformers` library; make sure it is installed:

```bash
pip install transformers
```
## 💻 Usage Examples

### Basic Usage

```python
from transformers import pipeline

classifier = pipeline(task="text-classification", model="Genius1237/xlm-roberta-large-tydip")
sentences = ["Could you please get me a glass of water", "mere liye पानी का एक गिलास ले आओ "]
print(classifier(sentences))
```
### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained('Genius1237/xlm-roberta-large-tydip')
model = AutoModelForSequenceClassification.from_pretrained('Genius1237/xlm-roberta-large-tydip')

text = "Could you please get me a glass of water"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
prediction = torch.argmax(output.logits).item()
print(model.config.id2label[prediction])
```
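The advanced example returns only the argmax label. If confidence scores are needed as well, a softmax over the logits yields per-class probabilities. The sketch below demonstrates that step on made-up logits (illustrative values only, not actual model output) so it runs without downloading the model:

```python
import torch

# Hypothetical logits for two sentences from a binary politeness
# classifier (illustrative values, not real model output)
logits = torch.tensor([[2.0, -1.0],
                       [-0.5, 1.5]])

# Softmax converts logits into per-class probabilities that sum to 1
probs = torch.softmax(logits, dim=-1)

# The predicted class is the index of the highest probability;
# map it to a label name via model.config.id2label in the real pipeline
predictions = probs.argmax(dim=-1)
print(probs)
print(predictions)
```

In practice, replace the hypothetical `logits` tensor with `model(**encoded_input).logits` from the advanced example above.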
## 📚 Documentation

### Evaluation Results

Politeness classification accuracy on the TyDiP test set for the 10 languages:

| Language | Accuracy |
| --- | --- |
| English (en) | 0.892 |
| Hindi (hi) | 0.868 |
| Korean (ko) | 0.784 |
| Spanish (es) | 0.84 |
| Tamil (ta) | 0.78 |
| French (fr) | 0.82 |
| Vietnamese (vi) | 0.844 |
| Russian (ru) | 0.668 |
| Afrikaans (af) | 0.856 |
| Hungarian (hu) | 0.812 |
### Citation

```bibtex
@inproceedings{srinivasan-choi-2022-tydip,
    title = "{T}y{D}i{P}: A Dataset for Politeness Classification in Nine Typologically Diverse Languages",
    author = "Srinivasan, Anirudh and
      Choi, Eunsol",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.420",
    doi = "10.18653/v1/2022.findings-emnlp.420",
    pages = "5723--5738",
    abstract = "We study politeness phenomena in nine typologically diverse languages. Politeness is an important facet of communication and is sometimes argued to be cultural-specific, yet existing computational linguistic study is limited to English. We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples. We evaluate how well multilingual models can identify politeness levels {--} they show a fairly robust zero-shot transfer ability, yet fall short of estimated human accuracy significantly. We further study mapping the English politeness strategy lexicon into nine languages via automatic translation and lexicon induction, analyzing whether each strategy{'}s impact stays consistent across languages. Lastly, we empirically study the complicated relationship between formality and politeness through transfer experiments. We hope our dataset will support various research questions and applications, from evaluating multilingual models to constructing polite multilingual agents.",
}
```
## 📄 License

This project is licensed under the MIT License.