bert-offensive-lang-detection-tr open-source model: free detection of offensive language in Turkish text

Bert Offensive Lang Detection Tr

Developed by TURKCELL

A BERT-based Turkish text classification model for detecting offensive language in text

Text classification

Transformers

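License: MIT #Turkish text classification #Social media content moderation #BERT fine-tuning

Downloads: 43

Published: 1/30/2024

Model overview

This model is fine-tuned from dbmdz/bert-base-turkish-128k-uncased specifically for offensive language detection in Turkish.

Model features

Optimized for Turkish

Tailored to the characteristics of Turkish, including character handling and text preprocessing

Comprehensive preprocessing pipeline

Includes several text-cleaning steps such as accent conversion, lowercasing, and user-mention removal

Imbalanced data handling

Tuned for an imbalanced dataset in which offensive samples are the minority

Model capabilities

Turkish text classification

Offensive language detection

Social media text analysis

Use cases

Content moderation

Social media comment moderation

Automatically identifies and filters offensive comments on social media

Can help reduce the manual moderation workload

Online community management

Detects inappropriate speech in forums and discussion boards

Helps maintain a healthy environment for online discussion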

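🚀 Turkish offensive language detection model

This project provides a model for detecting offensive language in Turkish. It is based on a pretrained BERT model fine-tuned on a task-specific dataset and identifies offensive content in text.

🚀 Quick start

Install the required libraries

pip install git+https://github.com/emres/turkish-deasciifier.git
pip install keras_preprocessing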

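Model initialization

# Load the tokenizer and fine-tuned model directly from the Hugging Face Hub
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")
model = AutoModelForSequenceClassification.from_pretrained("TURKCELL/bert-offensive-lang-detection-tr")

# Run on a GPU when one is available; is_offensive moves its inputs to the same device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

Check whether a sentence is offensive

import numpy as np

def is_offensive(sentence):
    # Mapping from predicted class id to label
    d = {
        0: 'non-offensive',
        1: 'offensive'
    }
    # clean_text is defined in the usage examples further down
    normalize_text = clean_text(sentence)
    test_sample = tokenizer([normalize_text], padding=True, truncation=True, max_length=256, return_tensors='pt')

    test_sample = {k: v.to(device) for k, v in test_sample.items()}

    # Inference only, so gradient tracking is disabled
    with torch.no_grad():
        output = model(**test_sample)
    y_pred = np.argmax(output.logits.detach().cpu().numpy(), axis=1)

    print(normalize_text, "-->", d[y_pred[0]])
    return y_pred[0]

is_offensive("@USER Mekanı cennet olsun, saygılar sayın avukatımız,iyi günler dilerim")
is_offensive("Bir Gün Gelecek Biriniz Bile Kalmayana Kadar Mücadeleye Devam Kökünüzü Kurutacağız !! #bebekkatilipkk")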
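
For moderation workflows it can help to look at the class probability rather than only the argmax label. The following is a minimal sketch, assuming the two logits are read from `output.logits` and that index 1 is the offensive class, as in the mapping above. `label_with_threshold` and its `threshold` parameter are hypothetical names introduced here, not part of the model's API, and the logits shown are made-up values.

```python
import math

def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_threshold(logits, threshold=0.5):
    """Hypothetical helper: flag text as offensive when P(class 1) >= threshold."""
    p_offensive = softmax(logits)[1]  # index 1 = 'offensive'
    label = "offensive" if p_offensive >= threshold else "non-offensive"
    return label, p_offensive

# Made-up logits for illustration; real ones would come from output.logits[0].tolist()
print(label_with_threshold([2.0, -1.0]))
print(label_with_threshold([0.2, 0.9], threshold=0.8))
```

Raising the threshold trades recall for precision, which may suit setups where flagged comments are removed automatically; a lower threshold may suit setups where flagged comments are only queued for human review.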

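✨ Key features

Fine-tuned from the dbmdz/bert-base-turkish-128k-uncased model.
Uses the OffensEval 2020 dataset, which contains 31,756 annotated tweets.
The performance impact of several preprocessing steps was analyzed to identify an effective preprocessing strategy.
Reaches 89% accuracy on the test set.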

💻 Usage examples

Basic usage

Preprocessing functions

from turkish.deasciifier import Deasciifier

def deasciifier(text):
    # Restore Turkish characters in text that was typed with ASCII letters only
    converter = Deasciifier(text)
    return converter.convert_to_turkish()

def remove_circumflex(text):
    # Replace circumflexed vowels with their plain equivalents
    circumflex_map = {
        'â': 'a',
        'î': 'i',
        'û': 'u',
        'ô': 'o',
        'Â': 'A',
        'Î': 'I',
        'Û': 'U',
        'Ô': 'O'
    }

    return ''.join(circumflex_map.get(c, c) for c in text)

def turkish_lower(text):
    # Map Turkish capitals explicitly: str.lower() alone turns 'I' into 'i' rather than 'ı'
    turkish_map = {
        'I': 'ı',
        'İ': 'i',
        'Ç': 'ç',
        'Ş': 'ş',
        'Ğ': 'ğ',
        'Ü': 'ü',
        'Ö': 'ö'
    }
    return ''.join(turkish_map.get(c, c).lower() for c in text)
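
To illustrate why a dedicated lowercasing step is used, the short sketch below compares Python's built-in `str.lower()` with the mapping above. The mapping is repeated in compact form so the snippet runs on its own; the example word is an arbitrary choice.

```python
def turkish_lower(text):
    # Same mapping as the function above, repeated so this snippet is self-contained
    turkish_map = {'I': 'ı', 'İ': 'i', 'Ç': 'ç', 'Ş': 'ş', 'Ğ': 'ğ', 'Ü': 'ü', 'Ö': 'ö'}
    return ''.join(turkish_map.get(c, c).lower() for c in text)

word = "ISPARTA"
print(word.lower())          # built-in lowercasing produces a dotted 'i'
print(turkish_lower(word))   # the mapping produces the dotless 'ı'

# Built-in lower() turns 'İ' (U+0130) into 'i' followed by a combining dot (U+0307)
print([hex(ord(c)) for c in "İ".lower()])
print([hex(ord(c)) for c in turkish_lower("İ")])
```

Because the base model is an uncased Turkish model, feeding it text lowercased with the locale-unaware built-in may produce tokens that differ from those it was trained on.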

Text cleaning function

import re

def clean_text(text):
    # Replace circumflexed characters with their plain equivalents
    text = remove_circumflex(text)
    # Convert the text to lowercase using Turkish casing rules
    text = turkish_lower(text)
    # Restore Turkish characters in ASCII-only text (deasciify)
    text = deasciifier(text)
    # Remove user mentions
    text = re.sub(r"@\S*", " ", text)
    # Remove hashtags
    text = re.sub(r'#\S+', ' ', text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", ' ', text, flags=re.MULTILINE)
    # Remove punctuation and emoticons made of punctuation
    text = re.sub(r'[^\w\s]|(:\)|:\(|:D|:P|:o|:O|;\))', ' ', text)
    # Remove emoji
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols and pictographs
                           u"\U0001F680-\U0001F6FF"  # transport and map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r' ', text)

    # Collapse runs of whitespace into a single space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text
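
The regex-based steps can be tried in isolation, without installing the deasciifier package. The sketch below reuses the patterns from `clean_text`, with the emoticon alternation left out because the punctuation class already covers it. `strip_social_markup` is a hypothetical helper name and the sample sentence is invented for illustration.

```python
import re

def strip_social_markup(text):
    """Hypothetical helper: only the regex-based steps of clean_text."""
    text = re.sub(r"@\S*", " ", text)                      # user mentions
    text = re.sub(r"#\S+", " ", text)                      # hashtags, including their text
    text = re.sub(r"http\S+|www\S+|https\S+", " ", text)   # URLs
    text = re.sub(r"[^\w\s]", " ", text)                   # remaining punctuation
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

sample = "@USER merhaba #etiket https://example.com dünya!"
print(strip_social_markup(sample))
```

Note that the hashtag step drops the hashtag's text as well as the `#` sign, so any signal carried by the hashtag itself does not reach the model.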

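📚 Detailed documentation

Model description

This model is fine-tuned from dbmdz/bert-base-turkish-128k-uncased on the OffensEval 2020 dataset. The Offenseval-tr dataset contains 31,756 annotated tweets.

Dataset distribution

	Non-offensive (0)	Offensive (1)
Training set	25625	6131
Test set	2812	716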
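
The counts in the table make the class imbalance concrete. The sketch below computes the share of offensive samples, the accuracy of a trivial majority-class baseline on the test set (roughly 80%, which gives context for the reported 89%), and inverse-frequency class weights. The model card does not say how the imbalance was handled during fine-tuning; class weighting is shown only as one common option and is an assumption, as are the function names.

```python
# Counts copied from the dataset distribution table
train = {"non_offensive": 25625, "offensive": 6131}
test = {"non_offensive": 2812, "offensive": 716}

def offensive_share(split):
    """Fraction of samples labelled offensive."""
    return split["offensive"] / sum(split.values())

def balanced_class_weights(split):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(split.values())
    return {name: total / (len(split) * count) for name, count in split.items()}

def majority_baseline_accuracy(split):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(split.values()) / sum(split.values())

print(f"train offensive share: {offensive_share(train):.3f}")
print(f"test offensive share:  {offensive_share(test):.3f}")
print("class weights:", {k: round(v, 3) for k, v in balanced_class_weights(train).items()})
print(f"majority baseline accuracy on test: {majority_baseline_accuracy(test):.3f}")
```

The training counts sum to 31,756, the figure quoted for the dataset. With this level of imbalance, per-class precision and recall are usually more informative than accuracy alone.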

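Preprocessing steps

Step	Description
Accented character conversion	Converts accented characters to their unaccented equivalents
Lowercasing	Converts all text to lowercase
@user mention removal	Removes @user-style mentions from the text
Hashtag removal	Removes #hashtag expressions from the text
URL removal	Removes URLs from the text
Punctuation and punctuated emoticon removal	Removes punctuation and emoticons made of punctuation from the text
Emoji removal	Removes emoji from the text
Deasciification	Converts ASCII-only text to text containing Turkish characters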