🚀 Offensive Content Detection Model for Code-Mixed Kannada
This model detects offensive content in code-mixed Kannada. The "mono" in its name refers to the monolingual setting: the model is trained only on Kannada data (both pure and code-mixed). Its weights are initialized from pretrained XLM-RoBERTa-Base, further pretrained on the target dataset with masked language modeling, and then fine-tuned with a cross-entropy loss.
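The fine-tuning stage above optimizes a standard cross-entropy objective over the offensive/non-offensive classes. A minimal NumPy sketch of that objective follows; it is illustrative only (hypothetical logits, not the model's actual training code):

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the gold labels."""
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

# Example: confident, correct logits yield a near-zero loss
logits = np.array([[10.0, 0.0], [0.0, 10.0]])
labels = np.array([0, 1])
loss = cross_entropy(logits, labels)
```

During actual fine-tuning this loss is computed on the classifier head's logits and minimized by gradient descent.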
This was the best-performing of several models trained for the EACL 2021 shared task on Offensive Language Identification in Dravidian Languages. A genetic-algorithm-based ensemble of test predictions achieved the second-highest weighted F1 score on the leaderboard (weighted F1 on the held-out test set: this model 0.73, ensemble 0.74).
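The paper's exact genetic-algorithm setup is not reproduced here, but the idea of evolving per-model ensemble weights to maximize weighted F1 can be sketched as follows (a generic GA with elitism, uniform crossover, and Gaussian mutation; all function and parameter names are illustrative):

```python
import numpy as np
from sklearn.metrics import f1_score

def weighted_f1(weights, probs_list, labels):
    # Blend each model's class-probability matrix with the candidate weights
    blended = sum(w * p for w, p in zip(weights, probs_list))
    preds = blended.argmax(axis=1)
    return f1_score(labels, preds, average="weighted")

def ga_ensemble(probs_list, labels, pop_size=30, generations=40, seed=0):
    """Evolve a non-negative weight vector (one weight per model) that
    maximizes weighted F1 on a validation set."""
    rng = np.random.default_rng(seed)
    n_models = len(probs_list)
    pop = rng.random((pop_size, n_models))
    for _ in range(generations):
        fitness = np.array([weighted_f1(ind, probs_list, labels) for ind in pop])
        # Elitism: keep the top half, refill via crossover + mutation
        elite = pop[np.argsort(fitness)[::-1][: pop_size // 2]]
        children = []
        while len(children) < pop_size - len(elite):
            a = elite[rng.integers(len(elite))]
            b = elite[rng.integers(len(elite))]
            mask = rng.random(n_models) < 0.5       # uniform crossover
            child = np.where(mask, a, b)
            child = child + rng.normal(0, 0.1, n_models)  # Gaussian mutation
            children.append(np.clip(child, 0, None))      # keep weights >= 0
        pop = np.vstack([elite] + children)
    fitness = np.array([weighted_f1(ind, probs_list, labels) for ind in pop])
    best = pop[fitness.argmax()]
    return best / best.sum(), fitness.max()
```

In practice `probs_list` would hold the softmax outputs of each fine-tuned transformer on a validation split, and the evolved weights would then be applied to the test-set predictions.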
📚 Detailed Documentation
Paper details
Debjoy Saha, Naman Paharia, Debajit Chakraborty, Punyajoy Saha, and Animesh Mukherjee, "[Hate-Alert@DravidianLangTech-EACL2021: Ensembling strategies for Transformer-based Offensive language Detection](https://www.aclweb.org/anthology/2021.dravidianlangtech-1.38/)".
⚠️ Important Note
Please cite our paper in any published work that uses these resources.
Citation
@inproceedings{saha-etal-2021-hate,
title = "Hate-Alert@{D}ravidian{L}ang{T}ech-{EACL}2021: Ensembling strategies for Transformer-based Offensive language Detection",
author = "Saha, Debjoy and Paharia, Naman and Chakraborty, Debajit and Saha, Punyajoy and Mukherjee, Animesh",
booktitle = "Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages",
month = apr,
year = "2021",
address = "Kyiv",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2021.dravidianlangtech-1.38",
pages = "270--276",
abstract = "Social media often acts as breeding grounds for different forms of offensive content. For low resource languages like Tamil, the situation is more complex due to the poor performance of multilingual or language-specific models and lack of proper benchmark datasets. Based on this shared task {``}Offensive Language Identification in Dravidian Languages{''} at EACL 2021; we present an exhaustive exploration of different transformer models, We also provide a genetic algorithm technique for ensembling different models. Our ensembled models trained separately for each language secured the first position in Tamil, the second position in Kannada, and the first position in Malayalam sub-tasks. The models and codes are provided.",
}
📄 License
This project is licensed under the Apache-2.0 license.