deoffxlmr-mono-tamil Open-source Model - Accurately Detect Offensive Content in Tamil Code-mixed Texts

Deoffxlmr Mono Tamil

Developed by Hate-speech-CNERG

This model detects offensive content in Tamil code-mixed text. It is fine-tuned from XLM-Roberta-Base and was the best-performing of the models its authors trained for the EACL 2021 Shared Task on Offensive Language Identification in Dravidian Languages.

Text Classification

Transformers

Open Source License: Apache-2.0

Tags: Tamil Offensive Detection, Code-Mixed Text Processing, XLM-Roberta Fine-tuning

Downloads 100

Release Time: 3/2/2022

Model Overview

A monolingual model for identifying offensive content in Tamil, in both pure and code-mixed text. It uses a Transformer architecture and achieved a weighted F1 score of 0.76 on the shared task's hold-out test set.

Model Features

Monolingual Focus Optimization

Trained solely on Tamil data, both pure and code-mixed, rather than on several languages at once.

Ensembling Strategy

The authors' genetic-algorithm-based ensemble of models took first place in the Tamil sub-task of the shared task.

Low-Resource Language Solution

Provides an effective solution for offensive content detection in low-resource languages such as Tamil.

Model Capabilities

Tamil Text Classification

Code-Mixed Text Processing

Offensive Content Recognition

Use Cases

Content Moderation

Social Media Content Filtering

Automatically detects offensive speech in Tamil social media

Achieved a weighted F1 score of 0.76 on the test set
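For readers unfamiliar with the metric, weighted F1 is the mean of the per-class F1 scores, each weighted by how many true examples that class has. The following is a minimal pure-Python sketch of that definition; the function name and the toy labels `off` and `not` are illustrative assumptions and are not the label set of the shared task.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted `label` and the truth is `label`.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its share of the true labels.
        score += (n_true / total) * f1
    return score


# Toy example with hypothetical labels.
y_true = ["off", "off", "not", "not", "not"]
y_pred = ["off", "not", "not", "not", "off"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Because classes are weighted by support, a frequent class influences the score more than a rare one, which matters on imbalanced moderation data.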

Language Research

Dravidian Language Family Analysis

Used to study offensive language features in low-resource languages such as Tamil

🚀 Tamil Mono Offensive Content Detection Model

This model detects offensive content in Tamil code-mixed text. The "mono" in its name indicates a monolingual setting: the model is trained solely on Tamil data, both pure and code-mixed. Its weights are initialized from the pretrained XLM-Roberta-Base, further pretrained with Masked Language Modelling on the target dataset, and then fine-tuned with Cross-Entropy Loss.
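A fine-tuned classifier of this kind is usually run through the Hugging Face `transformers` text-classification pipeline. The sketch below shows one way that might look. The Hub ID `Hate-speech-CNERG/deoffxlmr-mono-tamil` is an assumption formed from the developer and model name on this page, and the helper names `classify` and `flag_for_review` are hypothetical. The model's actual label names are not listed here, so they are passed in by the caller rather than hard-coded.

```python
def classify(texts, model_id="Hate-speech-CNERG/deoffxlmr-mono-tamil"):
    """Run the classifier on a list of strings.

    Assumption: `model_id` is the Hub ID for this model; adjust it if the
    checkpoint is hosted under a different name.
    """
    # Imported lazily so the helper below works without transformers installed.
    from transformers import pipeline

    clf = pipeline("text-classification", model=model_id)
    # For a list input, returns one {"label": ..., "score": ...} dict per text.
    return clf(texts)


def flag_for_review(results, safe_labels, threshold=0.5):
    """Return indices of texts whose top label is not in `safe_labels`
    and whose score is at least `threshold`."""
    return [
        i
        for i, r in enumerate(results)
        if r["label"] not in safe_labels and r["score"] >= threshold
    ]
```

To use it, call `classify([...])` on a batch of comments, then pass the results to `flag_for_review` together with whichever label names the checkpoint treats as non-offensive. Those names can be read from the loaded model's `config.id2label` mapping.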

This model was the top performer among the multiple models trained for the EACL 2021 Shared Task on Offensive Language Identification in Dravidian Languages. The test predictions from a Genetic-Algorithm based ensemble achieved the highest weighted F1 score on the leaderboard. Weighted F1 score on the hold-out test set: this model 0.76, ensemble 0.78.
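The general idea of a genetic-algorithm ensemble can be sketched as a search over per-model mixing weights that maximizes a validation metric. The code below is a generic illustration under stated assumptions, not the authors' implementation: the selection, crossover, and mutation rules, all function names, and the hyperparameters are hypothetical, and plain accuracy is used as the fitness function for brevity where the shared task used weighted F1.

```python
import random


def ensemble_predict(weights, probs):
    """probs[m][i][c] is model m's probability for sample i, class c."""
    n_samples, n_classes = len(probs[0]), len(probs[0][0])
    preds = []
    for i in range(n_samples):
        mixed = [
            sum(w * probs[m][i][c] for m, w in enumerate(weights))
            for c in range(n_classes)
        ]
        preds.append(max(range(n_classes), key=mixed.__getitem__))
    return preds


def accuracy(weights, probs, y_true):
    preds = ensemble_predict(weights, probs)
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)


def normalize(w):
    total = sum(w)
    return [x / total for x in w] if total > 0 else [1 / len(w)] * len(w)


def evolve_weights(probs, y_true, pop_size=20, generations=30, seed=0):
    """Search for mixing weights with a simple elitist genetic algorithm."""
    rng = random.Random(seed)
    n_models = len(probs)

    def fitness(w):
        return accuracy(w, probs, y_true)

    # Start from each single model (one-hot weights) plus random mixtures.
    pop = [[1.0 if j == m else 0.0 for j in range(n_models)]
           for m in range(n_models)]
    while len(pop) < pop_size:
        pop.append(normalize([rng.random() for _ in range(n_models)]))

    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # selection; the best always survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [(x + y) / 2 for x, y in zip(a, b)]  # crossover
            k = rng.randrange(n_models)
            child[k] = max(0.0, child[k] + rng.gauss(0, 0.1))  # mutation
            children.append(normalize(child))
        pop = parents + children
    return max(pop, key=fitness)
```

Because the initial population contains every single model and the best individuals are carried over each generation, the returned weights never score below the best single model on the data used for the search. The weights should be tuned on validation data and then applied unchanged to the test set.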

📚 Documentation

For more details, see our paper:

Debjoy Saha, Naman Paharia, Debajit Chakraborty, Punyajoy Saha, Animesh Mukherjee. "Hate-Alert@DravidianLangTech-EACL2021: Ensembling strategies for Transformer-based Offensive language Detection".

⚠️ Important Note

Please cite our paper in any published work that uses any of these resources.

@inproceedings{saha-etal-2021-hate,
    title = "Hate-Alert@{D}ravidian{L}ang{T}ech-{EACL}2021: Ensembling strategies for Transformer-based Offensive language Detection",
    author = "Saha, Debjoy and Paharia, Naman and Chakraborty, Debajit and Saha, Punyajoy and Mukherjee, Animesh",
    booktitle = "Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages",
    month = apr,
    year = "2021",
    address = "Kyiv",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.dravidianlangtech-1.38",
    pages = "270--276",
    abstract = "Social media often acts as breeding grounds for different forms of offensive content. For low resource languages like Tamil, the situation is more complex due to the poor performance of multilingual or language-specific models and lack of proper benchmark datasets. Based on this shared task {``}Offensive Language Identification in Dravidian Languages{''} at EACL 2021; we present an exhaustive exploration of different transformer models, We also provide a genetic algorithm technique for ensembling different models. Our ensembled models trained separately for each language secured the first position in Tamil, the second position in Kannada, and the first position in Malayalam sub-tasks. The models and codes are provided.",
}

📄 License

This project is licensed under the Apache-2.0 license.
