🚀 Multilingual Toxicity Classifier for 9 Languages (2024)
This model is a fine-tuned version of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) for binary toxicity classification, trained on our compiled dataset textdetox/multilingual_toxicity_dataset. It detects toxic content in text across nine languages.
📦 Model Information
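| Property | Details |
|---|---|
| License | OpenRail++ |
| Training dataset | textdetox/multilingual_toxicity_dataset |
| Supported languages | English, Russian, Ukrainian, Spanish, German, Amharic, Arabic, Chinese, Hindi |
| Evaluation metric | F1 |
| Base model | FacebookAI/xlm-roberta-large |
| Tags | toxicity |
| Newer version | textdetox/xlmr-large-toxicity-classifier-v2 |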
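As a rough sketch of how a checkpoint like this can be queried, the snippet below loads a sequence-classification model with the Hugging Face `transformers` library and converts its logits into a toxicity probability. The model ID is the newer version listed above. The helper names, the assumption that label index 1 means "toxic", and the 0.5 threshold are illustrative only and should be checked against the model's `config.id2label`.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toxic_probability(logits: Sequence[float], toxic_index: int = 1) -> float:
    """Probability of the toxic class. toxic_index=1 is an assumption."""
    return softmax(logits)[toxic_index]


def classify(texts: List[str],
             model_id: str = "textdetox/xlmr-large-toxicity-classifier-v2",
             threshold: float = 0.5) -> List[bool]:
    """Return True for each text whose toxic probability reaches the threshold.

    Downloads the checkpoint on first use, so it needs network access
    plus the `torch` and `transformers` packages.
    """
    # Imported lazily so the pure-Python helpers above work without them.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits  # shape: (len(texts), num_labels)
    return [toxic_probability(row) >= threshold for row in logits.tolist()]
```

Calling `classify(["You are amazing!"])` would return a one-element list of booleans. Raising the threshold trades recall for precision, which may be worth tuning per language given the spread in the evaluation scores.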
📊 Model Training and Evaluation
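We first held out a balanced 20% test split to check how well the model performs, and then fine-tuned the model on the full dataset. The results on the held-out test split are: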
| Language | Precision | Recall | F1 |
|---|---|---|---|
| All languages | 0.8713 | 0.8710 | 0.8710 |
| English | 0.9650 | 0.9650 | 0.9650 |
| Russian | 0.9791 | 0.9790 | 0.9790 |
| Ukrainian | 0.9267 | 0.9250 | 0.9251 |
| German | 0.8791 | 0.8760 | 0.8758 |
| Spanish | 0.8700 | 0.8700 | 0.8700 |
| Arabic | 0.7787 | 0.7780 | 0.7780 |
| Amharic | 0.7781 | 0.7780 | 0.7780 |
| Hindi | 0.9360 | 0.9360 | 0.9360 |
| Chinese | 0.7318 | 0.7320 | 0.7315 |
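The evaluation protocol in this section (hold out a balanced 20% test split, then report precision, recall, and F1) could be reproduced along the following lines. This is a minimal pure-Python sketch, not the authors' evaluation script: the function names are hypothetical, stratifying by label is one plausible reading of "balanced", and macro-averaging over the two classes is an assumption, since the averaging mode is not stated here.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (text, label); label 1 = toxic is an assumption


def stratified_split(examples: List[Example],
                     test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Hold out `test_fraction` of every label group as the test split."""
    by_label: Dict[int, List[Example]] = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)
    train: List[Example] = []
    test: List[Example] = []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        cut = round(len(group) * test_fraction)
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test


def class_scores(y_true: Sequence[int], y_pred: Sequence[int],
                 cls: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_scores(y_true: Sequence[int], y_pred: Sequence[int],
                 classes: Sequence[int] = (0, 1)) -> Tuple[float, ...]:
    """Unweighted mean of per-class precision, recall, and F1."""
    rows = [class_scores(y_true, y_pred, c) for c in classes]
    return tuple(sum(column) / len(rows) for column in zip(*rows))


# Toy data: 100 hypothetical examples, half toxic and half neutral.
data = [(f"text {i}", i % 2) for i in range(100)]
train, test = stratified_split(data)
print(len(train), len(test))  # 80 20

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(macro_scores(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

To obtain per-language rows like those in the table, the same scoring function would be applied separately to the test examples of each language.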
📖 Citation
If you would like to cite our work, please use the following references:
@inproceedings{dementieva2024overview,
title={Overview of the Multilingual Text Detoxification Task at PAN 2024},
author={Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
booktitle={Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
editor={Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'i}a Seco de Herrera},
year={2024},
organization={CEUR-WS.org}
}
@inproceedings{DBLP:conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24,
author = {Janek Bevendorff and
Xavier Bonet Casals and
Berta Chulvi and
Daryna Dementieva and
Ashraf Elnagar and
Dayne Freitag and
Maik Fr{\"{o}}be and
Damir Korencic and
Maximilian Mayerl and
Animesh Mukherjee and
Alexander Panchenko and
Martin Potthast and
Francisco Rangel and
Paolo Rosso and
Alisa Smirnova and
Efstathios Stamatatos and
Benno Stein and
Mariona Taul{\'{e}} and
Dmitry Ustalov and
Matti Wiegmann and
Eva Zangerle},
editor = {Nazli Goharian and
Nicola Tonellotto and
Yulan He and
Aldo Lipani and
Graham McDonald and
Craig Macdonald and
Iadh Ounis},
title = {Overview of {PAN} 2024: Multi-author Writing Style Analysis, Multilingual
Text Detoxification, Oppositional Thinking Analysis, and Generative
{AI} Authorship Verification - Extended Abstract},
booktitle = {Advances in Information Retrieval - 46th European Conference on Information
Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings,
Part {VI}},
series = {Lecture Notes in Computer Science},
volume = {14613},
pages = {3--10},
publisher = {Springer},
year = {2024},
url = {https://doi.org/10.1007/978-3-031-56072-9\_1},
doi = {10.1007/978-3-031-56072-9\_1},
timestamp = {Fri, 29 Mar 2024 23:01:36 +0100},
biburl = {https://dblp.org/rec/conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}