BerTurk-SpamSMS Open-Source Spam SMS Detection Model

Developed by BaranKanat

A fine-tuned Turkish BERT model for spam SMS detection, used to classify Turkish SMS messages as spam or normal.

Text Classification

Transformers

Open Source License: Apache-2.0 #Turkish SMS detection #BERTurk fine-tuning #Spam message blocking

Downloads 45

Release Time: 1/14/2025

Model Overview

This model is fine-tuned from dbmdz/bert-base-turkish-128k-uncased for spam detection in Turkish SMS messages.

Model Features

Turkish-specific

A classification model specifically optimized for Turkish SMS messages.

BERT-based architecture

Utilizes the BERTurk pre-trained model for fine-tuning, offering robust text comprehension capabilities.

Balanced dataset

Trained using a balanced dataset of spam and normal SMS messages.

Model Capabilities

Turkish text classification

Spam message detection

SMS content analysis

Use Cases

Communication security

Spam SMS filtering

Automatically identifies and filters Turkish spam SMS messages.

Effectively reduces the number of spam messages received by users.

Communication platform content moderation

Helps communication platforms automatically detect and flag suspicious SMS messages.

Enhances platform content security and user experience.

🚀 Spam SMS Detection Model

This model is fine-tuned from dbmdz/bert-base-turkish-128k-uncased for SMS spam detection in Turkish. It classifies text messages as either Spam or Normal, which makes it usable for identifying unwanted Turkish-language SMS.

✨ Features

Binary Classification: Labels each Turkish SMS message as spam or normal.
Fine-Tuned Model: Built on the pre-trained BERTurk model (dbmdz/bert-base-turkish-128k-uncased) and fine-tuned specifically for Turkish SMS spam detection.

📦 Installation

The original README lists no installation steps. The usage example below imports `transformers` and `torch`, so both packages must be installed in the Python environment.

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
model_name = "BaranKanat/BerTurk-SpamSMS"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example spam message (Turkish input text, kept as-is).
test_sms = "2000 TL DENEME BONUSU KAZANDINIZ !!! YATIRIM SARTI YOK KAZANC ve CEKIM LIMITI YOK."

# Tokenize into PyTorch tensors; truncation guards against over-long messages.
inputs = tokenizer(test_sms, return_tensors="pt", truncation=True, padding=True)

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(**inputs)

# The index of the largest logit is the predicted class.
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1).item()

labels = ["Normal", "Spam"]  # 0: Normal, 1: Spam
print(f"Message: {test_sms}")
print(f"Result: {labels[predicted_class]} ({predicted_class})")
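The usage example reports only the winning class. When a confidence score is also wanted (for instance, to flag only high-confidence spam), the logits can be passed through a softmax. The following is a minimal, dependency-free sketch of that step. The logit values in it are hypothetical and are not real output from this model.

```python
import math
from typing import List, Sequence, Tuple

LABELS = ["Normal", "Spam"]  # same index order as the usage example: 0 = Normal, 1 = Spam


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the predicted label and the probability assigned to it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits, for illustration only (not real model output).
label, confidence = decode([-1.2, 2.3])
print(f"{label} ({confidence:.3f})")
```

With the real model, `outputs.logits[0].tolist()` yields the two-element list that `decode` expects.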

📚 Documentation

Training

The model was trained using the BERTurk tokenizer and classifier with the following configuration:

| Property | Details |
|----------|---------|
| Model | `dbmdz/bert-base-turkish-128k-uncased` |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Epochs | 4 |

The dataset used includes both spam and normal SMS messages, ensuring balanced representation.
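The model card does not say whether training used the Hugging Face `Trainer` or a custom loop. As a hedged sketch, the stated hyperparameters could be captured as below and mapped onto `transformers.TrainingArguments`, whose `Trainer` uses an AdamW optimizer by default. The batch size and output directory are assumptions that do not appear in the original.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    base_model: str = "dbmdz/bert-base-turkish-128k-uncased"  # from the model card
    learning_rate: float = 5e-5                               # from the model card
    num_train_epochs: int = 4                                 # from the model card
    per_device_train_batch_size: int = 16  # ASSUMPTION: not stated in the model card
    output_dir: str = "berturk-spam-sms"   # ASSUMPTION: hypothetical local path


def training_kwargs(cfg: FineTuneConfig) -> dict:
    """Keyword arguments for TrainingArguments (every field except base_model)."""
    kwargs = asdict(cfg)
    kwargs.pop("base_model")
    return kwargs


def build_training_arguments(cfg: FineTuneConfig):
    """Create TrainingArguments; transformers is imported lazily so the
    configuration can be inspected without the library installed."""
    from transformers import TrainingArguments

    return TrainingArguments(**training_kwargs(cfg))


cfg = FineTuneConfig()
print(training_kwargs(cfg))
```

Keeping the values in one frozen dataclass makes it easy to see which settings come from the model card and which are guesses.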

Performance

Evaluation metrics have not been published yet. The original model card lists each of the following as pending:

Accuracy: Will be updated after testing on more datasets.
F1-Score: Will be updated after testing on more datasets.
Precision: Will be updated after testing on more datasets.
Recall: Will be updated after testing on more datasets.
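Because no figures are reported, anyone evaluating the model on their own labeled messages will need to compute these four metrics themselves. The sketch below shows the standard definitions, treating Spam (class 1) as the positive class. The toy labels are made up for illustration and are not results for this model.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every division so empty or one-sided inputs return 0.0 instead of raising.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only; these are NOT results for this model.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```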

Dataset

The dataset used for fine-tuning is the Turkish SMS Collection Dataset, which is publicly available on [Kaggle](https://www.kaggle.com/datasets/onurkarasoy/turkish-sms-collection). It contains 2,536 spam messages and 2,215 normal (ham) messages.
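The class counts quoted above make it possible to check how balanced the dataset is. The short calculation below also derives the majority-class baseline, that is, the accuracy a classifier would get by always predicting the larger class. It uses only the two counts from the text and assumes nothing about how the data was split for training.

```python
# Class counts as quoted in the dataset description.
counts = {"spam": 2536, "normal": 2215}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f"{label}: {counts[label]} messages ({share:.1%})")
print(f"total: {total}, majority-class baseline accuracy: {majority_baseline:.1%}")
```

Any accuracy figure eventually reported for the model is only meaningful relative to this baseline.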

About the Dataset

The dataset is a collection of Turkish SMS messages tagged as spam or normal. It was collected from people of different age groups living in different regions of Turkey.

If you use this dataset, please cite: Karasoy, O., Ballı, S. Spam SMS Detection for Turkish Language with Deep Text Analysis and Deep Learning Methods. Arab J Sci Eng (2021). [https://doi.org/10.1007/s13369-021-06187-1](https://doi.org/10.1007/s13369-021-06187-1)

🔧 Technical Details

The model is fine-tuned from dbmdz/bert-base-turkish-128k-uncased using the BERTurk tokenizer and classifier. Training used the AdamW optimizer with a learning rate of 5e-5 for 4 epochs. The dataset contains both spam and normal SMS messages in roughly balanced proportions, which helps the model generalize.

📄 License

This model is licensed under the CreativeML OpenRAIL-M license.

Allowed: You can use, share, and modify the model for non-commercial purposes, as long as proper attribution is given.
Not Allowed: Commercial use or selling this model or its derivatives is strictly prohibited.

For more details, refer to the CreativeML OpenRAIL-M license terms.
