deoffxlmr-mono-tamil open-source model: detecting offensive content in Tamil code-mixed text

Deoffxlmr Mono Tamil

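Developed by Hate-speech-CNERG

This model detects offensive content in Tamil code-mixed text. It is fine-tuned from XLM-Roberta-Base and was the best-performing single model its authors trained for the EACL 2021 shared task on offensive language identification in Dravidian languages.

Text classification

Transformers

Open-source license: Apache-2.0 #Tamil offensive content detection #code-mixed text processing #XLM-Roberta fine-tuning

Downloads: 100

Released: 3/2/2022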

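Model overview

A monolingual model built to identify offensive content in Tamil, both plain and code-mixed. It uses a Transformer architecture and reaches a weighted F1 score of 0.76 on the held-out test set of the EACL 2021 shared task.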

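Model features

Monolingual focus

Optimized specifically for Tamil, including code-mixed text, with the aim of outperforming multilingual models on Tamil-specific tasks

Ensemble strategy

The authors combined the test predictions of their models with a genetic-algorithm ensemble, which took first place in the Tamil sub-task of the shared task; this checkpoint is the best single model among them

Low-resource language solution

Offers an effective approach to offensive content detection for low-resource languages such as Tamil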

Model capabilities

Tamil text classification

Code-mixed text processing

Offensive content identification

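Use cases

Content moderation

Social media content filtering

Automatically detects offensive speech in Tamil social media posts

Weighted F1 score of 0.76 on the test set

Language research

Dravidian language analysis

Supports research into the characteristics of offensive language in low-resource languages such as Tamil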
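The weighted F1 score quoted for the content moderation use case averages the per-class F1 scores, weighting each class by how often it occurs in the true labels. The following is a minimal pure-Python sketch of that metric, not the shared task's official scorer; the label names in the usage note are hypothetical and chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Each class contributes in proportion to its share of the true labels.
        score += (n / total) * f1
    return score
```

For example, with true labels `["not_offensive", "not_offensive", "not_offensive", "offensive"]` and predictions `["not_offensive", "not_offensive", "offensive", "offensive"]`, the per-class F1 scores are 0.8 and 2/3, and the weighted result is 0.75 × 0.8 + 0.25 × 2/3 ≈ 0.767.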

🚀 Tamil offensive content detection model

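This model detects offensive content in Tamil code-mixed text. The "mono" in the name refers to the monolingual setting: the model was trained only on Tamil data (pure Tamil and code-mixed). The weights are initialized from pretrained XLM-Roberta-Base, then further pretrained on the target dataset with masked language modeling, and finally fine-tuned with a cross-entropy loss.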
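A fine-tuned checkpoint of this kind can be queried as an ordinary sequence classifier. The sketch below assumes the weights are published on the Hugging Face Hub as `Hate-speech-CNERG/deoffxlmr-mono-tamil` and that the `transformers` and `torch` packages are installed; the identifier and the helper names `softmax` and `classify` are assumptions made for illustration, not part of the original card.

```python
import math

# Assumption: the checkpoint is available on the Hugging Face Hub under this ID.
MODEL_ID = "Hate-speech-CNERG/deoffxlmr-mono-tamil"


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text, model_id=MODEL_ID):
    """Return (label, probability) for the most likely class of `text`."""
    # Imported lazily so that softmax() can be used without these packages.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Label names are read from the checkpoint's own config (id2label).
    return model.config.id2label[best], probs[best]
```

Calling `classify("...")` with a Tamil or code-mixed sentence downloads the weights on first use and returns the predicted label with its probability. The set of labels depends on the checkpoint's configuration, so it is worth inspecting `model.config.id2label` before relying on specific names.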

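This is the best-performing of several models trained for the EACL 2021 shared task on offensive language identification in Dravidian languages. A genetic-algorithm-based ensemble of the models' test predictions achieved the highest weighted F1 score on the leaderboard (weighted F1 on the held-out test set: this model 0.76, ensemble 0.78).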
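The card does not describe how the genetic algorithm combines predictions, so the following is only a sketch of one plausible approach, not the authors' implementation. It assumes each model outputs class probabilities, and it evolves one weight per model to maximize weighted F1 on a labeled validation set. The function names, population size, mutation scale, and crossover scheme are all illustrative assumptions.

```python
import random
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    score = 0.0
    for label, n in Counter(y_true).items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        denom = 2 * tp + fp + (n - tp)  # F1 = 2TP / (2TP + FP + FN)
        score += (n / len(y_true)) * (2 * tp / denom if denom else 0.0)
    return score


def ensemble_predict(weights, model_probs):
    """Weighted sum of per-model class probabilities, then argmax per example."""
    n_classes = len(model_probs[0][0])
    preds = []
    for i in range(len(model_probs[0])):
        combined = [sum(w * probs[i][c] for w, probs in zip(weights, model_probs))
                    for c in range(n_classes)]
        preds.append(max(range(n_classes), key=combined.__getitem__))
    return preds


def evolve_weights(model_probs, y_true, pop_size=20, generations=30, seed=0):
    """Search for ensemble weights with a simple genetic algorithm."""
    rng = random.Random(seed)
    n_models = len(model_probs)

    def fitness(w):
        return weighted_f1(y_true, ensemble_predict(w, model_probs))

    population = [[rng.random() for _ in range(n_models)] for _ in range(pop_size)]
    # Include each single model so the search starts no worse than the best one.
    population += [[float(i == j) for j in range(n_models)] for i in range(n_models)]

    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # selection: keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            j = rng.randrange(n_models)  # mutate one weight, clamped to [0, 1]
            child[j] = min(1.0, max(0.0, child[j] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children

    best = max(population, key=fitness)
    return best, fitness(best)
```

Here `model_probs` is a list with one entry per model, each entry holding a list of per-example class probability lists, and `y_true` holds integer class indices. Because the fitter half of the population always survives, the returned score is never lower than that of the best single model on the same validation data; whether it generalizes to a test set is a separate question.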

📚 Detailed documentation

About our paper

Debjoy Saha, Naman Paharia, Debajit Chakraborty, Punyajoy Saha, Animesh Mukherjee. "[Hate-Alert@DravidianLangTech-EACL2021: Ensembling strategies for Transformer-based Offensive language Detection](https://www.aclweb.org/anthology/2021.dravidianlangtech-1.38/)".

Please cite our paper in any published work that uses these resources.

@inproceedings{saha-etal-2021-hate,
    title = "Hate-Alert@{D}ravidian{L}ang{T}ech-{EACL}2021: Ensembling strategies for Transformer-based Offensive language Detection",
    author = "Saha, Debjoy and Paharia, Naman and Chakraborty, Debajit and Saha, Punyajoy and Mukherjee, Animesh",
    booktitle = "Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages",
    month = apr,
    year = "2021",
    address = "Kyiv",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.dravidianlangtech-1.38",
    pages = "270--276",
    abstract = "Social media often acts as breeding grounds for different forms of offensive content. For low resource languages like Tamil, the situation is more complex due to the poor performance of multilingual or language-specific models and lack of proper benchmark datasets. Based on this shared task {``}Offensive Language Identification in Dravidian Languages{''} at EACL 2021; we present an exhaustive exploration of different transformer models, We also provide a genetic algorithm technique for ensembling different models. Our ensembled models trained separately for each language secured the first position in Tamil, the second position in Kannada, and the first position in Malayalam sub-tasks. The models and codes are provided.",
}