🚀 Spam SMS Detection Model
This model is fine-tuned from dbmdz/bert-base-turkish-128k-uncased for SMS spam detection in Turkish. It classifies text messages as either Spam or Normal, so that unwanted Turkish-language SMS can be identified automatically.
✨ Features
- Accurate Classification: Designed to classify Turkish SMS messages as spam or normal.
- Fine-Tuned Model: Built on the pre-trained dbmdz/bert-base-turkish-128k-uncased (BERTurk) model and fine-tuned specifically for Turkish SMS spam detection.
📦 Installation
No separate installation steps are provided. The usage example below relies on the transformers and torch Python packages.
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
model_name = "BaranKanat/BerTurk-SpamSMS"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample Turkish spam message (roughly: "You won a 2000 TL trial bonus!!! No deposit
# requirement, no limit on winnings or withdrawals.")
test_sms = "2000 TL DENEME BONUSU KAZANDINIZ !!! YATIRIM SARTI YOK KAZANC ve CEKIM LIMITI YOK."
inputs = tokenizer(test_sms, return_tensors="pt", truncation=True, padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Class index 0 is Normal, class index 1 is Spam
labels = ["Normal", "Spam"]
print(f"Message: {test_sms}")
print(f"Result: {labels[predicted_class]} ({predicted_class})")
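The basic example reports only the predicted class. If a confidence score is also useful, the two logits can be turned into probabilities with a softmax. The following is a minimal, dependency-free sketch of that step: `classify_logits` is a hypothetical helper, the label order is taken from the example above, and the logit values are made up for illustration. With PyTorch, `torch.softmax(logits, dim=1)` performs the same conversion.

```python
import math

LABELS = ["Normal", "Spam"]  # same order as in the usage example

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_logits(logits):
    """Hypothetical helper: return (label, probability) for one message's logits."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]

# Made-up logits for illustration only, not real model output.
label, prob = classify_logits([-1.2, 2.3])
print(f"{label} ({prob:.3f})")
```

A probability threshold other than the implicit 0.5 could then be applied if false positives and false negatives carry different costs.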
📚 Documentation
Training
The model was trained using the BERTurk tokenizer and classifier with the following configuration:
| Property | Details |
|---|---|
| Model | dbmdz/bert-base-turkish-128k-uncased |
| Optimizer | AdamW |
| Learning rate | 5e-5 |
| Epochs | 4 |
The dataset includes both spam and normal SMS messages in roughly equal proportions.
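The configuration above could be set up along the following lines. This is only a sketch, not the author's training script: `TrainConfig` and `build_model_and_optimizer` are hypothetical names, and the batch size is an assumption because it is not stated here. The heavy libraries are imported lazily so the configuration can be inspected without loading them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    base_model: str = "dbmdz/bert-base-turkish-128k-uncased"
    learning_rate: float = 5e-5  # from the configuration table
    epochs: int = 4              # from the configuration table
    batch_size: int = 16         # ASSUMPTION: not stated in this README
    num_labels: int = 2          # Normal / Spam

def build_model_and_optimizer(cfg: TrainConfig):
    """Load the base model with a 2-class head and create an AdamW optimizer."""
    # Imported here so that defining the config does not require torch/transformers.
    import torch
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        cfg.base_model, num_labels=cfg.num_labels
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)
    return model, optimizer

cfg = TrainConfig()
print(cfg)
```

Calling `build_model_and_optimizer(cfg)` downloads the base checkpoint; the training loop itself (batching, loss, evaluation) is left out because its details are not documented.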
Performance
Evaluation metrics have not been reported yet:
- Accuracy: Will be updated after testing on more datasets.
- F1-Score: Will be updated after testing on more datasets.
- Precision: Will be updated after testing on more datasets.
- Recall: Will be updated after testing on more datasets.
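Until those numbers are published, the four metrics can be computed on any labeled test set. The sketch below shows the standard definitions, treating Spam (label 1) as the positive class. `binary_metrics` is a hypothetical helper, and the label lists are toy values made up for illustration, not results from this model.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only (1 = Spam, 0 = Normal).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For spam filtering, precision and recall are usually worth reporting separately, since flagging a legitimate message and missing a spam message have different costs.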
Dataset
The dataset used for fine-tuning is the Turkish SMS Collection Dataset, which is publicly available on [Kaggle](https://www.kaggle.com/datasets/onurkarasoy/turkish-sms-collection). It contains 2,536 spam messages and 2,215 normal (ham) messages.
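Those counts also give a useful reference point for later evaluation. The short calculation below derives the class shares and the accuracy of a trivial classifier that always predicts the larger class. It assumes the counts describe the full set used for fine-tuning; how the data was split into training and test portions is not stated here.

```python
# Message counts reported for the Turkish SMS Collection Dataset.
spam_count = 2536
ham_count = 2215

total = spam_count + ham_count
spam_share = spam_count / total
ham_share = ham_count / total

# A classifier that always predicts the larger class reaches this accuracy,
# so a useful model needs to score clearly above it.
majority_baseline = max(spam_count, ham_count) / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}, normal share: {ham_share:.1%}")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```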
About the Dataset
The dataset is a collection of Turkish SMS messages tagged as spam or normal. It was collected from people of different age groups living in different regions of Turkey.
If you use this dataset, please cite:
Karasoy, O., Ballı, S. Spam SMS Detection for Turkish Language with Deep Text Analysis and Deep Learning Methods. Arab J Sci Eng (2021). [https://doi.org/10.1007/s13369-021-06187-1](https://doi.org/10.1007/s13369-021-06187-1)
🔧 Technical Details
The model is fine-tuned from dbmdz/bert-base-turkish-128k-uncased using the BERTurk tokenizer and classifier. Training used the AdamW optimizer with a learning rate of 5e-5 for 4 epochs. The dataset contains similar numbers of spam and normal SMS messages, which helps keep the classifier from favoring one class and supports better generalization.
📄 License
This model is licensed under the CreativeML OpenRAIL-M license.
- Allowed: You can use, share, and modify the model for non-commercial purposes, as long as proper attribution is given.
- Not Allowed: Commercial use or selling this model or its derivatives is strictly prohibited.
For more details, refer to the CreativeML OpenRAIL-M license terms.