xlmr-large-toxicity-classifier: an open-source multilingual toxicity classifier

Xlmr Large Toxicity Classifier

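Developed by textdetox

A multilingual toxicity classifier built on the xlm-roberta-large architecture; it detects toxic text in 9 languages.

Text classification

Transformers

Tags: multilingual support, multilingual toxicity detection, high-accuracy classification, social media content moderation

Downloads: 5,509

Released: 2/2/2024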

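Model Overview

The model detects toxic content in text. It supports nine languages: English, Russian, Ukrainian, Spanish, German, Amharic, Arabic, Chinese, and Hindi.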

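Model Features

Multilingual support

Detects toxic content in 9 languages, including widely used ones such as English, Russian, and Chinese

High accuracy

Test-set F1 ranges from 0.7315 (Chinese) to 0.9790 (Russian), with English at 0.9650

Balanced dataset

Trained on a curated multilingual toxicity dataset and evaluated on a balanced test set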

Model Capabilities

Text toxicity detection

Multilingual text analysis

Content safety filtering

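Use Cases

Content moderation

Social media content filtering

Automatically detect and filter toxic comments on social media

Identifies harmful content across the supported languages

Online community management

Helps forum and community moderators spot inappropriate posts

Multilingual coverage reaches a broad user base

Academic research

Research on toxic language

Supports comparative studies of toxicity features across languages

Reports standard evaluation metrics (precision, recall, F1)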
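
For a cross-language comparison, the same metrics can be computed per language from gold labels and predictions. The sketch below is illustrative only: the function name `prf1` and the toy labels are assumptions, and it scores just the positive (toxic) class, whereas the source does not state which averaging scheme its reported figures use.

```python
from typing import Sequence, Tuple


def prf1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class (1 = toxic, 0 = neutral)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    # Guard against empty denominators (no positive predictions or no positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy (gold, predicted) labels per language, made up for illustration only.
toy = {
    "en": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "de": ([1, 0, 1, 0], [1, 1, 1, 0]),
}
for lang, (gold, pred) in toy.items():
    p, r, f = prf1(gold, pred)
    print(f"{lang}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Scoring each language separately keeps a weak language from being hidden behind a strong overall average.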

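🚀 Multilingual Toxicity Classifier for 9 Languages (2024)

This is a fine-tuned version of [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) for binary toxicity classification, trained on our compiled dataset textdetox/multilingual_toxicity_dataset. It identifies toxic content in text written in any of the nine supported languages.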
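
A minimal sketch of how the classifier might be loaded and queried with the Hugging Face `transformers` pipeline API. The helper names (`load_classifier`, `toxic_flags`), the 0.5 threshold, and the positive label name `"toxic"` are assumptions rather than details from the source; the real label names should be checked in the model's `config.id2label`.

```python
from typing import Callable, Dict, List

MODEL_ID = "textdetox/xlmr-large-toxicity-classifier"


def load_classifier() -> Callable[[List[str]], List[Dict]]:
    """Build a text-classification pipeline (downloads the weights on first use)."""
    # Imported lazily so toxic_flags() works without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID)


def toxic_flags(predictions: List[Dict], toxic_label: str = "toxic",
                threshold: float = 0.5) -> List[bool]:
    """Turn pipeline outputs ({'label': ..., 'score': ...}) into booleans.

    ASSUMPTION: the positive class is called `toxic_label`; verify this
    against classifier.model.config.id2label before relying on it.
    """
    return [p["label"] == toxic_label and p["score"] >= threshold
            for p in predictions]


# Example (needs network access and the transformers package):
#   classifier = load_classifier()
#   print(toxic_flags(classifier(["Have a nice day!", "Que tengas un buen día."])))
```

Keeping the label name and threshold as parameters makes it easy to adapt the helper once the actual label mapping is known.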

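📦 Model Information

Property	Details
License	OpenRail++
Training dataset	textdetox/multilingual_toxicity_dataset
Supported languages	English, Russian, Ukrainian, Spanish, German, Amharic, Arabic, Chinese, Hindi
Evaluation metric	F1
Base model	FacebookAI/xlm-roberta-large
Tags	toxicity
Newer version	textdetox/xlmr-large-toxicity-classifier-v2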

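📊 Training and Evaluation

We first held out a balanced 20% test set to check the adequacy of the model. The model was then fine-tuned on the full data. The results on the test set are as follows:

Language	Precision	Recall	F1
All languages	0.8713	0.8710	0.8710
English	0.9650	0.9650	0.9650
Russian	0.9791	0.9790	0.9790
Ukrainian	0.9267	0.9250	0.9251
German	0.8791	0.8760	0.8758
Spanish	0.8700	0.8700	0.8700
Arabic	0.7787	0.7780	0.7780
Amharic	0.7781	0.7780	0.7780
Hindi	0.9360	0.9360	0.9360
Chinese	0.7318	0.7320	0.7315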
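
The hold-out step behind these figures can be sketched as below. The source does not say how the balanced split was built, so this is only one plausible reading: hold out the same fraction of every (language, label) group so the test set mirrors the balance of the full data. The function name `balanced_split`, the tuple layout, and the seed are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str, int]  # (text, language code, label: 1 = toxic)


def balanced_split(rows: List[Example], test_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Hold out `test_fraction` of every (language, label) group."""
    rng = random.Random(seed)
    groups: Dict[Tuple[str, int], List[Example]] = defaultdict(list)
    for row in rows:
        groups[(row[1], row[2])].append(row)

    train: List[Example] = []
    test: List[Example] = []
    for key in sorted(groups):  # fixed group order keeps the split reproducible
        members = groups[key][:]
        rng.shuffle(members)
        cut = round(len(members) * test_fraction)
        test.extend(members[:cut])
        train.extend(members[cut:])
    return train, test


# Synthetic rows: 2 languages x 2 labels x 10 examples each.
rows = [(f"{lang}-{label}-{i}", lang, label)
        for lang in ("en", "uk") for label in (0, 1) for i in range(10)]
train, test = balanced_split(rows)
print(len(train), len(test))
```

In this reading the split serves evaluation only; as stated above, the released model was afterwards fine-tuned on all of the data.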

📖 Citation

If you would like to cite our work, please use the following references:

@inproceedings{dementieva2024overview,
  title={Overview of the Multilingual Text Detoxification Task at PAN 2024},
  author={Dementieva, Daryna and Moskovskiy, Daniil and Babakov, Nikolay and Ayele, Abinew Ali and Rizwan, Naquee and Schneider, Florian and Wang, Xintong and Yimam, Seid Muhie and Ustalov, Dmitry and Stakovskii, Elisei and Smirnova, Alisa and Elnagar, Ashraf and Mukherjee, Animesh and Panchenko, Alexander},
  booktitle={Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum},
  editor={Guglielmo Faggioli and Nicola Ferro and Petra Galu{\v{s}}{\v{c}}{\'a}kov{\'a} and Alba Garc{\'i}a Seco de Herrera},
  year={2024},
  organization={CEUR-WS.org}
}

@inproceedings{DBLP:conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24,
  author       = {Janek Bevendorff and
                  Xavier Bonet Casals and
                  Berta Chulvi and
                  Daryna Dementieva and
                  Ashraf Elnagar and
                  Dayne Freitag and
                  Maik Fr{\"{o}}be and
                  Damir Korencic and
                  Maximilian Mayerl and
                  Animesh Mukherjee and
                  Alexander Panchenko and
                  Martin Potthast and
                  Francisco Rangel and
                  Paolo Rosso and
                  Alisa Smirnova and
                  Efstathios Stamatatos and
                  Benno Stein and
                  Mariona Taul{\'{e}} and
                  Dmitry Ustalov and
                  Matti Wiegmann and
                  Eva Zangerle},
  editor       = {Nazli Goharian and
                  Nicola Tonellotto and
                  Yulan He and
                  Aldo Lipani and
                  Graham McDonald and
                  Craig Macdonald and
                  Iadh Ounis},
  title        = {Overview of {PAN} 2024: Multi-author Writing Style Analysis, Multilingual
                  Text Detoxification, Oppositional Thinking Analysis, and Generative
                  {AI} Authorship Verification - Extended Abstract},
  booktitle    = {Advances in Information Retrieval - 46th European Conference on Information
                  Retrieval, {ECIR} 2024, Glasgow, UK, March 24-28, 2024, Proceedings,
                  Part {VI}},
  series       = {Lecture Notes in Computer Science},
  volume       = {14613},
  pages        = {3--10},
  publisher    = {Springer},
  year         = {2024},
  url          = {https://doi.org/10.1007/978-3-031-56072-9\_1},
  doi          = {10.1007/978-3-031-56072-9\_1},
  timestamp    = {Fri, 29 Mar 2024 23:01:36 +0100},
  biburl       = {https://dblp.org/rec/conf/ecir/BevendorffCCDEFFKMMPPRRSSSTUWZ24.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}