kcElectra_base_Bad_Sentence_Classifier: An Open-Source Model for Identifying Sensitive Korean Comments and Chat Content

Developed by JminJ

A Korean text classification model based on the ELECTRA architecture, designed to determine whether comments and chat content contain sensitive content.

Text Classification

Transformers

#Korean sensitive content detection #ELECTRA fine-tuning #Social media content moderation

Release Time: 4/7/2022

Model Overview

This model is fine-tuned from ELECTRA to detect inappropriate content (such as sensitive content and hate speech) in Korean text. It was trained on publicly available datasets, but the modified training data is not released because of copyright restrictions on the source data.

Model Features

Multi-dataset fusion training

Combines the Korean Unsmile and Korean HateSpeech datasets and relabels them into a binary classification format

Specific sensitive word processing

Relabels sentences that contain specific Korean sensitive words (e.g., '~노', '좆') as bad sentences

Multi-model comparison

Trains and compares performance using three different Korean ELECTRA models

Model Capabilities

Korean text classification

Sensitive content detection

Hate speech recognition

Use Cases

Content moderation

Social media comment filtering

Automatically identifies and filters inappropriate comments on social media

Validation accuracy of 88.49% (kcElectra_base model)

Chat content monitoring

Real-time monitoring of inappropriate speech in chat applications

🚀 Bad_text_classifier

This project presents a model that identifies whether comments and chats on the Internet contain sensitive content. The model is fine-tuned on publicly available data after the labels were modified and the datasets merged.

🚀 Quick Start

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
tokenizer = AutoTokenizer.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
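
Loading the model is only the first step: to classify a sentence, the text has to be tokenized, passed through the model, and the resulting logits mapped to a label. The following is a minimal sketch of that flow, not the author's code. It assumes PyTorch is installed alongside transformers and that the model's output indices follow the label scheme in the Data Label section (0 = bad sentence, 1 = not bad sentence). The helper names softmax, decode, and classify are hypothetical.

```python
import math

MODEL_NAME = "JminJ/kcElectra_base_Bad_Sentence_Classifier"

# Assumption: output index i corresponds to label id i from the Data Label section.
LABEL_NAMES = {0: "bad sentence", 1: "not bad sentence"}


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (label name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABEL_NAMES[best], probs[best]


def classify(text, model_name=MODEL_NAME):
    """Tokenize one sentence, run the classifier, and decode the result."""
    # Imported here so the helpers above work without the heavy dependencies.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():  # no gradients are needed for inference
        logits = model(**inputs).logits[0].tolist()
    return decode(logits)
```

Calling classify with a Korean sentence downloads the model on first use and returns a label name with its probability. When classifying many sentences, load the tokenizer and model once instead of on every call.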

✨ Features

This model determines whether Internet comments and chats contain sensitive content.
It is fine-tuned on publicly available Korean datasets.

📚 Documentation

Dataset

Data Label

0 : bad sentence
1 : not bad sentence

Used Datasets

Korean Unsmile Dataset
Korean HateSpeech Dataset

Dataset Processing Method

Two datasets that were not originally binary-classified were relabeled in a binary-classification format. Then, only the data with label 1 (not bad sentence) from the Korean HateSpeech Dataset was selected and merged with the processed Korean Unsmile Dataset.
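
The relabel-and-merge step can be sketched roughly as follows. This is an illustration under stated assumptions, not the author's preprocessing script: the row fields text, clean, and hate, and the value "none", are assumed layouts for the two datasets, and the function names are hypothetical.

```python
BAD, NOT_BAD = 0, 1  # label ids from the Data Label section


def binarize_unsmile(row):
    """Collapse an Unsmile row to binary (assumed field: clean = 0 or 1)."""
    label = NOT_BAD if row["clean"] == 1 else BAD
    return {"text": row["text"], "label": label}


def binarize_hatespeech(row):
    """Collapse a HateSpeech row to binary (assumed field: hate, "none" = clean)."""
    label = NOT_BAD if row["hate"] == "none" else BAD
    return {"text": row["text"], "label": label}


def build_dataset(unsmile_rows, hatespeech_rows):
    """Keep all Unsmile rows, but only the label-1 rows from HateSpeech."""
    unsmile = [binarize_unsmile(r) for r in unsmile_rows]
    hatespeech = [binarize_hatespeech(r) for r in hatespeech_rows]
    kept = [r for r in hatespeech if r["label"] == NOT_BAD]
    return unsmile + kept
```

Filtering HateSpeech down to label 1 means that dataset contributes only "not bad" examples, while every bad-sentence example comes from the Unsmile side.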

Some data in the Korean Unsmile Dataset that was originally labeled as "clean" was relabeled as 0 (bad sentence):

Sentences containing "~노" but not "이기" or "노무" were relabeled as 0 (bad sentence).
Sentences containing sexually explicit terms such as "좆" and "봊" were relabeled as 0 (bad sentence).
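
The two override rules above can be expressed as a small function applied to sentences originally labeled "clean". This is a rough sketch only: the source does not specify how "~노" was matched, so a plain substring check for "노" is assumed here, and the function and constant names are hypothetical.

```python
BAD, NOT_BAD = 0, 1  # label ids from the Data Label section

EXEMPT_TERMS = ("이기", "노무")  # sentences containing these skip the "~노" rule
SEXUAL_TERMS = ("좆", "봊")      # the examples named in the text; the full list is not given


def relabel_clean_sentence(text):
    """Return the new label for a sentence originally marked "clean"."""
    # Assumption: "~노" is approximated by a substring match on "노".
    has_no = "노" in text
    is_exempt = any(term in text for term in EXEMPT_TERMS)
    if has_no and not is_exempt:
        return BAD
    if any(term in text for term in SEXUAL_TERMS):
        return BAD
    return NOT_BAD
```

A substring match is deliberately crude and will also flag ordinary words that happen to contain "노", so a real pipeline would likely need a stricter pattern, for example one restricted to word endings.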

Model Training

Fine-tuning was performed using ElectraForSequenceClassification from Hugging Face transformers.
Three publicly available Korean ELECTRA models were each fine-tuned separately.

Used Models

kcElectra base
tunibElectra base
koElectra base

Model Validation Accuracy

Model	Validation Accuracy
kcElectra_base_fp16_wd_custom_dataset	0.8849
tunibElectra_base_fp16_wd_custom_dataset	0.8726
koElectra_base_fp16_wd_custom_dataset	0.8434

⚠️ Important Note

All models were trained with the same seed, learning rate (3e-06), weight decay lambda (0.001), and batch size (128).
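
For anyone reproducing the comparison, the shared hyperparameters can be captured in one configuration object so that all three runs differ only in the base model. The sketch below is an assumption-laden illustration: the source does not say whether the Hugging Face Trainer was used, the seed value is not stated (42 is a placeholder), and fp16 is inferred from the "fp16" in the model names. The class name FineTuneConfig is hypothetical; the dictionary keys are real TrainingArguments parameter names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters shared by all three fine-tuning runs."""

    learning_rate: float = 3e-06  # from the note above
    weight_decay: float = 0.001   # from the note above
    batch_size: int = 128         # from the note above
    seed: int = 42                # placeholder: the actual seed is not stated
    fp16: bool = True             # assumption, inferred from the model names
    num_labels: int = 2           # binary task: bad / not bad

    def trainer_kwargs(self):
        """Map the fields onto transformers.TrainingArguments keyword names."""
        return {
            "learning_rate": self.learning_rate,
            "weight_decay": self.weight_decay,
            # Equals the total batch size only when training on a single device.
            "per_device_train_batch_size": self.batch_size,
            "seed": self.seed,
            "fp16": self.fp16,
        }
```

Keeping the configuration frozen makes it harder to change a value between runs by accident, which matters when the goal is a like-for-like comparison of base models.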

Contact

jminju254@gmail.com

Github

https://github.com/JminJ/Bad_text_classifier

⚠️ Important Note

Due to copyright restrictions on the public source data, the modified data used for model training cannot be made public. The model's outputs do not reflect the author's opinions.
