🚀 Multilingual Toxicity Classifier for 9 Languages (2024)
This model is a fine-tuned version of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) for binary toxicity classification, trained on our compiled dataset textdetox/multilingual_toxicity_dataset. It labels text as toxic or non-toxic in nine languages.
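📦 Model Information

| Property | Details |
|----------|---------|
| License | OpenRail++ |
| Training dataset | textdetox/multilingual_toxicity_dataset |
| Supported languages | English, Russian, Ukrainian, Spanish, German, Amharic, Arabic, Chinese, Hindi |
| Evaluation metric | F1 |
| Base model | FacebookAI/xlm-roberta-large |
| Tags | toxicity |
| Newer version | textdetox/xlmr-large-toxicity-classifier-v2 |
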
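
A minimal inference sketch, assuming the checkpoint loads through the standard `transformers` sequence-classification classes. The model ID below is the newer version named in this card; the `classify` and `softmax` helpers are illustrative names, not part of any released package. Label names are read from the checkpoint's own config rather than assumed here.

```python
import math

# Assumption: the newer checkpoint named in this card; swap in another ID as needed.
MODEL_ID = "textdetox/xlmr-large-toxicity-classifier-v2"


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(texts, model_id=MODEL_ID):
    """Return one (label, probability) pair per input text.

    The heavy imports are local so that `softmax` stays usable without them.
    Calling this function downloads the model weights on first use.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits

    results = []
    for row in logits.tolist():
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        # The label name comes from the checkpoint config, not from this sketch.
        results.append((model.config.id2label[best], probs[best]))
    return results
```

Calling `classify(["some text", "another text"])` would return a list with one label and its probability per input, provided `torch` and `transformers` are installed and the weights can be downloaded.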
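📊 Training and Evaluation
We first held out a balanced 20% test set to check the adequacy of the model, and then fine-tuned the model on the full data. The results on the test set are as follows: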
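
The held-out split described above can be sketched in plain Python. This is not the authors' script: it assumes "balanced" means an equal number of toxic and non-toxic rows per language, and the `(text, label, language)` row layout and the `balanced_holdout` name are hypothetical.

```python
import random
from collections import defaultdict


def balanced_holdout(rows, test_fraction=0.2, seed=0):
    """Split (text, label, language) rows into train and test lists.

    Assumption: within each language the test set takes the same number of
    rows from every class, so each per-language test subset is class-balanced.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        _, label, lang = row
        groups[(lang, label)].append(row)

    train, test = [], []
    for lang in sorted({key[0] for key in groups}):
        classes = [groups[key] for key in sorted(groups) if key[0] == lang]
        lang_total = sum(len(members) for members in classes)
        # Rows taken per class: an equal share of the language's test budget,
        # capped by the size of the smallest class.
        per_class = min(
            round(lang_total * test_fraction / len(classes)),
            min(len(members) for members in classes),
        )
        for members in classes:
            shuffled = members[:]
            rng.shuffle(shuffled)
            test.extend(shuffled[:per_class])
            train.extend(shuffled[per_class:])
    return train, test
```

With 100 rows per language split evenly between the two classes, this puts 10 toxic and 10 non-toxic rows per language into the test set and leaves the rest for training.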

| Language | Precision | Recall | F1 |
|----------|-----------|--------|----|
| All languages | 0.8713 | 0.8710 | 0.8710 |
| English | 0.9650 | 0.9650 | 0.9650 |
| Russian | 0.9791 | 0.9790 | 0.9790 |
| Ukrainian | 0.9267 | 0.9250 | 0.9251 |
| German | 0.8791 | 0.8760 | 0.8758 |
| Spanish | 0.8700 | 0.8700 | 0.8700 |
| Arabic | 0.7787 | 0.7780 | 0.7780 |
| Amharic | 0.7781 | 0.7780 | 0.7780 |
| Hindi | 0.9360 | 0.9360 | 0.9360 |
| Chinese | 0.7318 | 0.7320 | 0.7315 |

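
For readers who want to compute comparable numbers on their own predictions, here is a small sketch of precision, recall, and F1 for a binary task. The card does not state how the per-class scores were averaged; macro averaging over the two classes is an assumption here, and the function names are illustrative.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_prf(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class precision, recall, and F1 (assumed averaging)."""
    per_class = [binary_prf(y_true, y_pred, positive=label) for label in labels]
    return tuple(sum(values) / len(labels) for values in zip(*per_class))
```

On a class-balanced test set, macro and support-weighted averages coincide, which may be why the three columns in the table sit so close together.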
📖 Citation
If you would like to cite our work, please use the following references:
@inproceedings{dementieva2024overview,
title={Overview of the Multilingual Text Detoxification Task at PAN 2024},
author={Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
booktitle={Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
editor={Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'i}a Seco de Herrera},
year={2024},
organization={CEUR-WS.org}
}
@inproceedings{DBLP:conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24,
author = {Janek Bevendorff and
Xavier Bonet Casals and
Berta Chulvi and
Daryna Dementieva and
Ashraf Elnagar and
Dayne Freitag and
Maik Fr{\"{o}}be and
Damir Korencic and
Maximilian Mayerl and
Animesh Mukherjee and
Alexander Panchenko and
Martin Potthast and
Francisco Rangel and
Paolo Rosso and
Alisa Smirnova and
Efstathios Stamatatos and
Benno Stein and
Mariona Taul{\'{e}} and
Dmitry Ustalov and
Matti Wiegmann and
Eva Zangerle},
editor = {Nazli Goharian and
Nicola Tonellotto and
Yulan He and
Aldo Lipani and
Graham McDonald and
Craig Macdonald and
Iadh Ounis},
title = {Overview of {PAN} 2024: Multi-author Writing Style Analysis, Multilingual
Text Detoxification, Oppositional Thinking Analysis, and Generative
{AI} Authorship Verification - Extended Abstract},
booktitle = {Advances in Information Retrieval - 46th European Conference on Information
Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings,
Part {VI}},
series = {Lecture Notes in Computer Science},
volume = {14613},
pages = {3--10},
publisher = {Springer},
year = {2024},
url = {https://doi.org/10.1007/978-3-031-56072-9\_1},
doi = {10.1007/978-3-031-56072-9\_1},
timestamp = {Fri, 29 Mar 2024 23:01:36 +0100},
biburl = {https://dblp.org/rec/conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}