xlmr-large-toxicity-classifier: Open-Source Multilingual Toxicity Classifier

Xlmr Large Toxicity Classifier

Developed by textdetox

A multilingual toxicity classifier built on the xlm-roberta-large architecture. It detects toxic text in 9 languages.

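Task: text classification

Library: Transformers

Tags: multilingual support, multilingual toxicity detection, high-accuracy classification, social media content moderation

Downloads: 5,509

Released: 2/2/2024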

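Model Overview

The model detects toxic content in text. It supports nine languages: English, Russian, Ukrainian, Spanish, German, Amharic, Arabic, Chinese, and Hindi.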

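Model Features

Multilingual support

Detects toxic content in 9 languages, including widely used ones such as English, Russian, and Chinese.

High accuracy

On the held-out test set, F1 reaches 0.965 for English and 0.979 for Russian; the overall F1 across all languages is 0.871.

Balanced dataset

Trained on a curated multilingual toxicity dataset and evaluated on a balanced test set.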

Model Capabilities

Text toxicity detection

Multilingual text analysis

Content safety filtering

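Use Cases

Content moderation

Social media content filtering

Automatically detects and filters toxic comments on social media.

Identifies harmful content in multiple languages.

Online community management

Helps forum and community moderators identify inappropriate speech.

Supports multiple languages, covering a broad user base.

Academic research

Research on toxic language

Supports comparative studies of toxicity features across languages.

Provides standardized evaluation metrics.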
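The moderation use cases above come down to scoring each message and acting on a threshold. The following is a minimal sketch of that pattern, not the project's own pipeline: `moderate`, `score_fn`, the 0.5 threshold, and the `keyword_stub` scorer are all hypothetical names and values introduced for illustration. In practice `score_fn` would wrap the classifier and return the probability of the toxic class.

```python
from typing import Callable, Iterable, List, Tuple


def moderate(
    messages: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off; tune it per language and platform
) -> Tuple[List[str], List[Tuple[str, float]]]:
    """Split messages into (kept, flagged) by their toxicity score."""
    kept, flagged = [], []
    for message in messages:
        score = score_fn(message)
        if score >= threshold:
            flagged.append((message, score))
        else:
            kept.append(message)
    return kept, flagged


def keyword_stub(message: str) -> float:
    """Stand-in scorer for demonstration only; it is not the classifier."""
    return 0.9 if "idiot" in message.lower() else 0.1


kept, flagged = moderate(["Thanks for the help!", "You are an idiot."], keyword_stub)
print(kept)
print(flagged)
```

Because per-language F1 differs considerably (see the evaluation table), a single global threshold may be too strict for some languages and too lenient for others; routing flagged items to human review rather than deleting them outright is a reasonable default.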

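🚀 Multilingual Toxicity Classifier for 9 Languages (2024)

This model is a fine-tuned version of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) for binary toxicity classification, trained on our compiled dataset textdetox/multilingual_toxicity_dataset. It identifies toxic content in text written in any of the nine supported languages.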
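A minimal inference sketch follows, assuming the `transformers` and `torch` packages are installed. The checkpoint ID `textdetox/xlmr-large-toxicity-classifier` is an assumption inferred from the developer and model names on this page, and the helper names (`softmax`, `pick_label`, `classify`) are hypothetical. Label names are read from the checkpoint's config rather than hard-coded, because this page does not list them.

```python
import math

# Assumed checkpoint ID, inferred from the developer and model names on this page.
MODEL_ID = "textdetox/xlmr-large-toxicity-classifier"


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def classify(texts, model_id=MODEL_ID):
    """Score a list of texts; requires `transformers` and `torch`."""
    # Imported lazily so the helpers above work without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**batch).logits

    # Label names come from the checkpoint's own config.
    id2label = model.config.id2label
    return [pick_label(row, id2label) for row in logits.tolist()]
```

Calling `classify(["Have a nice day!"])` downloads the checkpoint on first use and returns one `(label, probability)` pair per input text. Since the base model is xlm-roberta-large, expect a download of a couple of gigabytes.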

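📦 Model Information

| Property | Details |
|---|---|
| License | OpenRail++ |
| Training dataset | textdetox/multilingual_toxicity_dataset |
| Supported languages | English, Russian, Ukrainian, Spanish, German, Amharic, Arabic, Chinese, Hindi |
| Evaluation metric | F1 |
| Base model | FacebookAI/xlm-roberta-large |
| Tags | toxicity |
| New version | textdetox/xlmr-large-toxicity-classifier-v2 |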

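📊 Training and Evaluation

We first held out a balanced 20% test set to check the model's adequacy, and then fine-tuned the model on the full data. The results on the test set are shown below: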

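| Language | Precision | Recall | F1 |
|---|---|---|---|
| All languages | 0.8713 | 0.8710 | 0.8710 |
| English | 0.9650 | 0.9650 | 0.9650 |
| Russian | 0.9791 | 0.9790 | 0.9790 |
| Ukrainian | 0.9267 | 0.9250 | 0.9251 |
| German | 0.8791 | 0.8760 | 0.8758 |
| Spanish | 0.8700 | 0.8700 | 0.8700 |
| Arabic | 0.7787 | 0.7780 | 0.7780 |
| Amharic | 0.7781 | 0.7780 | 0.7780 |
| Hindi | 0.9360 | 0.9360 | 0.9360 |
| Chinese | 0.7318 | 0.7320 | 0.7315 |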
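The evaluation procedure above (a balanced held-out split, then precision, recall, and F1 per language) can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation script: the source does not say how the split was balanced or which averaging the reported scores use, so the sketch balances by label and macro-averages over the two classes. The row layout `(language, gold_label, text)` and all function names are hypothetical.

```python
import random
from collections import defaultdict


def balanced_holdout(rows, fraction=0.2, seed=0):
    """Hold out roughly `fraction` of `rows`, with equal counts per label.

    Each row is a (language, gold_label, text) tuple; this layout is an
    assumption made for the example. `rows` must not be empty.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, row in enumerate(rows):
        by_label[row[1]].append(index)
    # Same number of held-out rows for every label, capped by the rarest label.
    per_label = min(
        int(len(rows) * fraction) // len(by_label),
        min(len(indices) for indices in by_label.values()),
    )
    test_idx = set()
    for indices in by_label.values():
        test_idx.update(rng.sample(indices, per_label))
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


def macro_prf(gold, pred):
    """Macro-averaged precision, recall, and F1 over the labels in `gold`."""
    labels = sorted(set(gold))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def per_language_scores(rows, predictions):
    """Group gold and predicted labels by language and score each group."""
    gold, pred = defaultdict(list), defaultdict(list)
    for (language, label, _text), guess in zip(rows, predictions):
        gold[language].append(label)
        pred[language].append(guess)
    return {lang: macro_prf(gold[lang], pred[lang]) for lang in gold}


# Toy data for illustration only; these are not the project's data or results.
toy_rows = [("en", 1, "a"), ("en", 1, "b"), ("en", 0, "c"), ("en", 0, "d")]
toy_predictions = [1, 0, 0, 0]
for lang, (p, r, f1) in per_language_scores(toy_rows, toy_predictions).items():
    print(f"{lang}\tP={p:.4f}\tR={r:.4f}\tF1={f1:.4f}")
```

With class-averaged scores, F1 is the mean of per-class F1 values rather than the harmonic mean of the averaged precision and recall, which is one possible reason a reported F1 can sit slightly below both of the other two columns (as in the German row).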

📖 Citation

If you would like to cite our work, please use the following references:

@inproceedings{dementieva2024overview,
  title={Overview of the Multilingual Text Detoxification Task at PAN 2024},
  author={Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
  booktitle={Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
  editor={Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'i}a Seco de Herrera},
  year={2024},
  organization={CEUR-WS.org}
}

@inproceedings{DBLP:conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24,
  author       = {Janek Bevendorff and
                  Xavier Bonet Casals and
                  Berta Chulvi and
                  Daryna Dementieva and
                  Ashraf Elnagar and
                  Dayne Freitag and
                  Maik Fr{\"{o}}be and
                  Damir Korencic and
                  Maximilian Mayerl and
                  Animesh Mukherjee and
                  Alexander Panchenko and
                  Martin Potthast and
                  Francisco Rangel and
                  Paolo Rosso and
                  Alisa Smirnova and
                  Efstathios Stamatatos and
                  Benno Stein and
                  Mariona Taul{\'{e}} and
                  Dmitry Ustalov and
                  Matti Wiegmann and
                  Eva Zangerle},
  editor       = {Nazli Goharian and
                  Nicola Tonellotto and
                  Yulan He and
                  Aldo Lipani and
                  Graham McDonald and
                  Craig Macdonald and
                  Iadh Ounis},
  title        = {Overview of {PAN} 2024: Multi-author Writing Style Analysis, Multilingual
                  Text Detoxification, Oppositional Thinking Analysis, and Generative
                  {AI} Authorship Verification - Extended Abstract},
  booktitle    = {Advances in Information Retrieval - 46th European Conference on Information
                  Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings,
                  Part {VI}},
  series       = {Lecture Notes in Computer Science},
  volume       = {14613},
  pages        = {3--10},
  publisher    = {Springer},
  year         = {2024},
  url          = {https://doi.org/10.1007/978-3-031-56072-9\_1},
  doi          = {10.1007/978-3-031-56072-9\_1},
  timestamp    = {Fri, 29 Mar 2024 23:01:36 +0100},
  biburl       = {https://dblp.org/rec/conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}