# 🚀 Turkish Offensive Language Detection Model

This project provides a model for detecting offensive language in Turkish text. It leverages a fine-tuned BERT model and offers comprehensive preprocessing steps and usage examples.
## 🚀 Quick Start

### Installation

Install the necessary libraries:

```bash
pip install git+https://github.com/emres/turkish-deasciifier.git
pip install keras_preprocessing
```
### Model Initialization

```python
# Load model directly
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")

# Run on GPU if available; `device` is used during inference below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```
### Check Offensive Sentence

```python
import numpy as np

def is_offensive(sentence):
    # Label mapping used by the model
    d = {
        0: 'non-offensive',
        1: 'offensive'
    }
    # clean_text is the preprocessing function defined in the Usage Examples section below
    normalize_text = clean_text(sentence)
    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')
    test_sample = {k: v.to(device) for k, v in test_sample.items()}
    output = model(**test_sample)
    y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)
    print(normalize_text, "-->", d[y_pred[0]])
    return y_pred[0]

is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
```
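Alternatively, for a quick smoke test, the checkpoint can be driven through the standard transformers pipeline API. This is a minimal sketch, not part of the project's own examples: it skips the custom clean_text preprocessing described below, and the label names it prints come from the model's config (they may appear as generic LABEL_0/LABEL_1 if the config does not define readable names).

```python
from transformers import pipeline

# Minimal sketch: text-classification pipeline around the published checkpoint.
# Note: the project's custom preprocessing (clean_text) is NOT applied here.
clf = pipeline("text-classification", model="TURKCELL/bert-offensive-lang-detection-tr")
print(clf("iyi günler dilerim"))  # e.g. [{'label': ..., 'score': ...}]
```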
## ✨ Features

- Fine-Tuned Model: Fine-tuned from dbmdz/bert-base-turkish-128k-uncased on the OffensEval 2020 dataset.
- Comprehensive Preprocessing: Includes multiple preprocessing steps such as accented-character transformation, lowercase conversion, and removal of elements like URLs and emojis.
- High Accuracy: Achieves 89% accuracy on the test set.
## 📦 Installation

```bash
pip install git+https://github.com/emres/turkish-deasciifier.git
pip install keras_preprocessing
```
## 💻 Usage Examples

### Basic Usage
```python
# Import the necessary libraries
from turkish.deasciifier import Deasciifier
import re

def deasciifier(text):
    # Restore Turkish characters from ASCII-only text
    deasciifier = Deasciifier(text)
    return deasciifier.convert_to_turkish()

def remove_circumflex(text):
    # Map circumflexed characters to their plain equivalents
    circumflex_map = {
        'â': 'a',
        'î': 'i',
        'û': 'u',
        'ô': 'o',
        'Â': 'A',
        'Î': 'I',
        'Û': 'U',
        'Ô': 'O'
    }
    return ''.join(circumflex_map.get(c, c) for c in text)

def turkish_lower(text):
    # Lowercase with correct handling of Turkish-specific characters
    turkish_map = {
        'I': 'ı',
        'İ': 'i',
        'Ç': 'ç',
        'Ş': 'ş',
        'Ğ': 'ğ',
        'Ü': 'ü',
        'Ö': 'ö'
    }
    return ''.join(turkish_map.get(c, c).lower() for c in text)

def clean_text(text):
    # Remove circumflexed (accented) characters
    text = remove_circumflex(text)
    # Convert the text to lowercase
    text = turkish_lower(text)
    # Deasciify the text
    text = deasciifier(text)
    # Remove user mentions
    text = re.sub(r"@\S*", " ", text)
    # Remove hashtags
    text = re.sub(r'#\S+', ' ', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
    # Remove punctuation and text-based emoticons
    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
    # Remove emojis
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)
    # Collapse multiple whitespace characters into a single space
    text = re.sub(r'\s+', ' ', text).strip()
    return text
```
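As an illustration, here is roughly what clean_text does to a raw tweet. The example input and the expected output are ours, not from the project, and the exact result depends on the deasciifier's dictionary:

```python
# Illustrative only: mentions, hashtags, URLs, emoticons, and punctuation are
# stripped, and the remaining text is lowercased and deasciified.
print(clean_text("@USER Harika bir gün! :) https://example.com #mutlu"))
# Expected shape of the output: "harika bir gün"
```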
# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
import numpy as np
def is_offensive(sentence):
d = {
0: 'non-offensive',
1: 'offensive'
}
normalize_text = clean_text(sentence)
test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')
test_sample = {k: v.to(device) for k, v in test_sample.items()}
output = model(**test_sample)
y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)
print(normalize_text, "-->", d[y_pred[0]])
return y_pred[0]
is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")
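When scoring many texts, a batched variant avoids one forward pass per sentence. The helper below is a hypothetical sketch of ours, not part of the project; it assumes model, tokenizer, device, and clean_text are defined as above:

```python
import torch

def predict_batch(sentences, batch_size=32):
    # Hypothetical helper: classify a list of sentences in batches, returning 0/1 labels
    labels = []
    for i in range(0, len(sentences), batch_size):
        batch = [clean_text(s) for s in sentences[i:i + batch_size]]
        enc = tokenizer(batch, padding=True, truncation=True, max_length=256, return_tensors='pt')
        enc = {k: v.to(device) for k, v in enc.items()}
        with torch.no_grad():
            logits = model(**enc).logits
        labels.extend(logits.argmax(dim=1).cpu().tolist())
    return labels
```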
## 📚 Documentation

### Model Description

This model was fine-tuned from dbmdz/bert-base-turkish-128k-uncased on the OffensEval 2020 dataset. The offenseval-tr training split contains 31,756 annotated tweets; the test split contains a further 3,528 (see the distribution below).
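For context, fine-tuning such a checkpoint typically follows the standard transformers sequence-classification recipe. The sketch below is illustrative only: the actual training hyperparameters for this model are not published here, so the epochs, learning rate, batch size, and dataset variables shown are all assumptions.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Illustrative fine-tuning sketch; hyperparameters are assumptions, not the
# values used for the released model.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-128k-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-128k-uncased", num_labels=2)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset,  # hypothetical tokenized offenseval-tr splits
                  eval_dataset=test_dataset)
trainer.train()
```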
### Dataset Distribution

|       | Non-Offensive (0) | Offensive (1) |
|-------|-------------------|---------------|
| Train | 25,625            | 6,131         |
| Test  | 2,812             | 716           |
### Preprocessing Steps

| Process | Description |
|---------|-------------|
| Accented character transformation | Converting accented characters to their unaccented equivalents |
| Lowercase transformation | Converting all text to lowercase |
| Removing @user mentions | Removing @user-formatted mentions from the text |
| Removing hashtag expressions | Removing #hashtag-formatted expressions from the text |
| Removing URLs | Removing URLs from the text |
| Removing punctuation and punctuation-based emoticons | Removing punctuation marks and emoticons built from punctuation from the text |
| Removing emojis | Removing emojis from the text |
| Deasciification | Converting ASCII-only text into text with proper Turkish characters |
The contribution of each preprocessing step was analyzed individually; removing digits and keeping hashtags had no measurable effect on performance.
### Evaluation

Evaluation results on the test set are shown in the table below; the model achieves 89% accuracy.

#### Model Performance Metrics

| Class   | Precision | Recall | F1-score | Accuracy |
|---------|-----------|--------|----------|----------|
| Class 0 | 0.92      | 0.94   | 0.93     | 0.89     |
| Class 1 | 0.73      | 0.67   | 0.70     |          |
| Macro   | 0.83      | 0.80   | 0.81     |          |
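These per-class numbers follow the standard sklearn-style classification report. A minimal sketch of how such a report could be reproduced, assuming the offenseval-tr test split is already loaded into the hypothetical lists test_texts and test_labels, and using the predict_batch helper sketched above:

```python
from sklearn.metrics import classification_report

# test_texts / test_labels are hypothetical placeholders for the loaded
# offenseval-tr test split; predict_batch is the batched helper from above.
preds = predict_batch(test_texts)
print(classification_report(test_labels, preds,
                            target_names=['non-offensive', 'offensive'], digits=2))
```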
## 📄 License

This project is licensed under the MIT License.