BERT Turkish Text Classification Open-Source Model - Accurately Classify Turkish Texts into 7 Predefined Categories

Bert Turkish Text Classification

Developed by savasy

This is a Turkish text classification model fine-tuned from a BERT model. It classifies Turkish texts into 7 predefined categories.

Text Classification Other #Turkish BERT #Multi-class Classification #News Topic Recognition

Downloads 523

Release Time: 3/2/2022

Model Overview

This model is specifically designed for Turkish text classification tasks, supporting the classification of texts into 7 categories: World, Economy, Culture, Health, Politics, Sports, and Technology.

Model Features

Turkish Language Optimization

The model is fine-tuned from a Turkish BERT checkpoint (dbmdz/bert-base-turkish-cased) for Turkish text classification.

Multi-class Classification

Supports text classification into 7 different categories, covering major news domains.

Easy to Use

Loads through the transformers pipeline API, so it can be integrated into applications with a few lines of code.

Model Capabilities

Turkish Text Classification

Multi-class Prediction

Text Content Analysis

Use Cases

News Classification

Automatic News Classification

Automatically classify Turkish news into 7 predefined categories.

Accuracy figures are reported in the paper listed under Citation.

Content Analysis

Social Media Content Analysis

Analyze the topic distribution of Turkish social media content.
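
As a rough illustration of topic-distribution analysis, the sketch below tallies predicted categories over a batch of posts. The `predictions` list is made-up sample data in the shape the pipeline returns (one dict per text, with `label` and `score`), and `topic_distribution` is a hypothetical helper, not part of the model or of transformers. Category names follow the model's category mapping, with trailing spaces dropped for display.

```python
from collections import Counter

# Output codes mapped to category names (trailing spaces dropped for display)
CODE_TO_LABEL = {
    "LABEL_0": "dunya",
    "LABEL_1": "ekonomi",
    "LABEL_2": "kultur",
    "LABEL_3": "saglik",
    "LABEL_4": "siyaset",
    "LABEL_5": "spor",
    "LABEL_6": "teknoloji",
}

def topic_distribution(predictions):
    """Return {category: share of texts} for a list of pipeline-style predictions."""
    counts = Counter(CODE_TO_LABEL[p["label"]] for p in predictions)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

# Hypothetical predictions for four posts (not real model output)
predictions = [
    {"label": "LABEL_5", "score": 0.98},
    {"label": "LABEL_5", "score": 0.91},
    {"label": "LABEL_1", "score": 0.87},
    {"label": "LABEL_6", "score": 0.76},
]

print(topic_distribution(predictions))
# {'spor': 0.5, 'ekonomi': 0.25, 'teknoloji': 0.25}
```

In practice the `predictions` list would come from calling the pipeline on a list of texts.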

🚀 Turkish Text Classification

This model is a fine-tuned version of https://github.com/stefan-it/turkish-bert. It was fine-tuned on a text classification dataset with 7 categories.

Model Category Mapping

code_to_label={
 'LABEL_0': 'dunya ',
 'LABEL_1': 'ekonomi ',
 'LABEL_2': 'kultur ',
 'LABEL_3': 'saglik ',
 'LABEL_4': 'siyaset ',
 'LABEL_5': 'spor ',
 'LABEL_6': 'teknoloji '
}

📚 Documentation

Citation

Please cite the following works if you use this model:

@misc{yildirim2024finetuning,
      title={Fine-tuning Transformer-based Encoder for Turkish Language Understanding Tasks}, 
      author={Savas Yildirim},
      year={2024},
      eprint={2401.17396},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@book{yildirim2021mastering,
  title={Mastering Transformers: Build state-of-the-art models from scratch with advanced natural language processing techniques},
  author={Yildirim, Savas and Asgari-Chenaghlu, Meysam},
  year={2021},
  publisher={Packt Publishing Ltd}
}

Data

The following Turkish benchmark dataset is used for fine-tuning: https://www.kaggle.com/savasy/ttc4900

🚀 Quick Start

Begin by installing transformers as follows:

pip install transformers

Basic Usage

# import libraries
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# load the tokenizer and the model; the first call downloads them,
# so it takes time depending on your internet connection
tokenizer = AutoTokenizer.from_pretrained("savasy/bert-turkish-text-classification")
model = AutoModelForSequenceClassification.from_pretrained("savasy/bert-turkish-text-classification")

# make pipeline ("sentiment-analysis" is an alias of the "text-classification" task)
nlp = pipeline("text-classification", model=model, tokenizer=tokenizer)

# apply model
nlp("bla bla")
# [{'label': 'LABEL_2', 'score': 0.4753005802631378}]

code_to_label = {
 'LABEL_0': 'dunya ',
 'LABEL_1': 'ekonomi ',
 'LABEL_2': 'kultur ',
 'LABEL_3': 'saglik ',
 'LABEL_4': 'siyaset ',
 'LABEL_5': 'spor ',
 'LABEL_6': 'teknoloji '
}

code_to_label[nlp("bla bla")[0]['label']]
# > 'kultur '
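
The `bla bla` call above returns a score of about 0.48, so it can be useful to treat low-confidence predictions differently from confident ones. A minimal sketch of that idea follows. `decode_prediction` and the 0.6 threshold are illustrative assumptions, not part of the model card, and the second sample dict only mimics the pipeline's output format.

```python
# Category names in label-index order, taken from the mapping above
# (trailing spaces dropped for display)
CATEGORIES = ["dunya", "ekonomi", "kultur", "saglik", "siyaset", "spor", "teknoloji"]
CODE_TO_LABEL = {f"LABEL_{i}": name for i, name in enumerate(CATEGORIES)}

def decode_prediction(prediction, min_score=0.6):
    """Map one pipeline result to a category name, or None if its score is below min_score."""
    if prediction["score"] < min_score:
        return None
    return CODE_TO_LABEL[prediction["label"]]

# Result shown above for "bla bla": below the assumed threshold
print(decode_prediction({"label": "LABEL_2", "score": 0.4753005802631378}))  # None

# Hypothetical confident prediction
print(decode_prediction({"label": "LABEL_5", "score": 0.97}))  # spor
```

The threshold is a design choice that depends on the application. A value tuned on held-out data is preferable to a fixed guess.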

🔧 Technical Details

Model Training

# loading data for Turkish text classification
import pandas as pd
# https://www.kaggle.com/savasy/ttc4900
df = pd.read_csv("7allV03.csv")
df.columns = ["labels", "text"]
df.labels = pd.Categorical(df.labels)

train_df = ...
eval_df = ...

# model
from simpletransformers.classification import ClassificationModel
import torch
import sklearn.metrics

# use the GPU when one is available
cuda_available = torch.cuda.is_available()

model_args = {
    "use_early_stopping": True,
    "early_stopping_delta": 0.01,
    "early_stopping_metric": "mcc",
    "early_stopping_metric_minimize": False,
    "early_stopping_patience": 5,
    "evaluate_during_training_steps": 1000,
    "fp16": False,
    "num_train_epochs": 3
}

model = ClassificationModel(
    "bert", 
    "dbmdz/bert-base-turkish-cased",
    use_cuda=cuda_available, 
    args=model_args, 
    num_labels=7
)
model.train_model(train_df, acc=sklearn.metrics.accuracy_score)

For other model types and training options, see https://simpletransformers.ai/.
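
The train/evaluation split is elided in the training snippet above, and the author's actual procedure is not stated. One common approach is a stratified split that keeps the category proportions the same in both parts. The sketch below shows the idea in plain Python. `stratified_split`, the 20% evaluation fraction, the seed, and the sample rows are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(rows, eval_fraction=0.2, seed=42):
    """Split (label, text) rows into train and evaluation lists, class by class."""
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append((label, text))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, evaluation = [], []
    for items in by_label.values():
        rng.shuffle(items)
        # at least one evaluation example per class
        n_eval = max(1, round(len(items) * eval_fraction))
        evaluation.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, evaluation

# Hypothetical rows: 10 sports texts and 5 economy texts
rows = [("spor", f"spor haberi {i}") for i in range(10)]
rows += [("ekonomi", f"ekonomi haberi {i}") for i in range(5)]

train, evaluation = stratified_split(rows)
print(len(train), len(evaluation))  # 12 3
```

With a pandas DataFrame, scikit-learn's `train_test_split` with its `stratify` argument gives the same kind of split in one call.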

Detailed Usage

For detailed usage of Turkish text classification, see the Python notebook.
