RoBERTa-TR-medium-morph-44k Open Source Model - Empowering Turkish Natural Language Processing Tasks

Developed by ctoraman

A RoBERTa model for Turkish, pre-trained with a masked language modeling objective on morphologically tokenized text and suitable for Turkish NLP tasks.

Release Time: 3/9/2022

Model Overview

This model is a RoBERTa variant for Turkish. It uses morphological-level tokenization (via the Zemberek morphological analyzer) and uncased (case-insensitive) input, and it is suitable for a range of Turkish text processing tasks.

Model Features

Morphological-Level Tokenization

Uses the Zemberek Turkish morphological analyzer to segment text, so that tokens follow Turkish morpheme boundaries.

Case-Insensitive Format

The model is uncased: input is treated case-insensitively, which simplifies preprocessing.

Medium-Scale Architecture

Uses an 8-layer Transformer with 8 attention heads and a hidden size of 512, trading some capacity for lower computational cost.

Model Capabilities

Turkish text understanding

Masked language modeling

Sequence classification (requires fine-tuning)

Use Cases

Natural Language Processing

Turkish Text Classification

Fine-tune the model for tasks such as news classification and sentiment analysis.

Language Model Pretraining

Serves as a base model for transfer learning in Turkish NLP tasks.

🚀 RoBERTa Turkish medium Morph-level 44k (uncased)

A model pre-trained on Turkish text using a masked language modeling (MLM) objective. Built on the RoBERTa architecture, it is intended as a base for Turkish natural language processing tasks.

🚀 Quick Start

The model is uncased and was pre-trained with a masked language modeling (MLM) objective. The pretraining corpus is the Turkish split of OSCAR, further filtered and cleaned.

✨ Features

Model Architecture: Similar to bert-medium, with 8 layers, 8 heads, and a hidden size of 512.
Tokenization: Uses a Morph-level algorithm. Text is split according to a Turkish morphological analyzer (Zemberek).
Vocabulary Size: Approximately 43.6k.
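
As a rough illustration of what these dimensions imply, the sketch below derives the per-head dimension and a back-of-the-envelope parameter count. Only the 8 layers, 8 heads, 512 hidden size, and the roughly 43.6k vocabulary come from the model card. The feed-forward size (four times the hidden size), the 514 position embeddings, and the rounded vocabulary of 43,600 are assumptions made for illustration, so the result is an estimate rather than the model's actual parameter count.

```python
# Values stated in the model card.
NUM_LAYERS = 8
NUM_HEADS = 8
HIDDEN = 512

# Assumptions for illustration only (not stated in the model card).
VOCAB = 43_600         # rounded from "approximately 43.6k"
MAX_POSITIONS = 514    # borrowed from the example max length
FFN = 4 * HIDDEN       # common Transformer convention


def head_dim(hidden: int, heads: int) -> int:
    """Dimension of each attention head."""
    return hidden // heads


def estimate_parameters(layers: int, hidden: int, ffn: int,
                        vocab: int, positions: int) -> int:
    """Rough encoder parameter count, ignoring any task-specific head."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and a single token-type embedding, plus LayerNorm.
    embeddings = vocab * hidden + positions * hidden + hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers of the feed-forward block, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + layers * per_layer


if __name__ == "__main__":
    print("head dimension:", head_dim(HIDDEN, NUM_HEADS))
    total = estimate_parameters(NUM_LAYERS, HIDDEN, FFN, VOCAB, MAX_POSITIONS)
    print(f"estimated parameters: {total / 1e6:.1f}M")
```

Under these assumptions the embedding table accounts for a large share of the total, which is why vocabulary size matters noticeably for a model of this depth.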

📦 Installation

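The original model card gives no specific installation steps. The usage example relies on the Hugging Face transformers library, which provides AutoModel and PreTrainedTokenizerFast.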

💻 Usage Examples

Basic Usage

The following code can be used for model loading and tokenization. Note that the example max length (514) can be changed:

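from transformers import AutoModel, PreTrainedTokenizerFast

# Placeholders: replace with the model location and the tokenizer file.
model_path = "[model_path]"
file_path = "[file_path]"

model = AutoModel.from_pretrained(model_path)
# For sequence classification (num_classes is the number of labels in your task):
# from transformers import AutoModelForSequenceClassification
# model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=num_classes)

# Load the tokenizer from its file and register the special tokens.
tokenizer = PreTrainedTokenizerFast(tokenizer_file=file_path)
tokenizer.mask_token = "[MASK]"
tokenizer.cls_token = "[CLS]"
tokenizer.sep_token = "[SEP]"
tokenizer.pad_token = "[PAD]"
tokenizer.unk_token = "[UNK]"
tokenizer.bos_token = "[CLS]"
tokenizer.eos_token = "[SEP]"
tokenizer.model_max_length = 514  # example value; can be changed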

📚 Documentation

Important Note

This model requires a preprocessing step before use, because the tokenizer file is not a morphological analyzer. That is, the tokenizer file alone cannot split a test dataset into morphemes. Users must first process any test dataset with a Turkish morphological analyzer (Zemberek in this case) before running evaluation.
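
The preprocessing step described above could be organized roughly as in the following sketch. It is not the authors' pipeline: `toy_segmenter` and `TOY_LEXICON` are hypothetical stand-ins for a real analyzer such as Zemberek, and joining morphemes with spaces is an assumption about the input format, which the model card does not specify. The Turkish-aware lowercasing is also an assumption, included because the model is uncased and Python's default `lower()` does not map the Turkish dotted and dotless I correctly.

```python
from typing import Callable, List

# Hypothetical toy lexicon standing in for a real morphological analyzer.
TOY_LEXICON = {
    "evlerde": ["ev", "ler", "de"],
    "kitaplar": ["kitap", "lar"],
}


def toy_segmenter(word: str) -> List[str]:
    """Return the morphemes of a word, or the word itself if unknown."""
    return TOY_LEXICON.get(word, [word])


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules for I/ı and İ/i."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def preprocess(text: str, segment: Callable[[str], List[str]]) -> str:
    """Lowercase the text, split each word into morphemes, and rejoin.

    Assumption: morphemes are separated by single spaces before the
    text is passed to the tokenizer.
    """
    morphemes: List[str] = []
    for word in turkish_lower(text).split():
        morphemes.extend(segment(word))
    return " ".join(morphemes)


if __name__ == "__main__":
    print(preprocess("Evlerde kitaplar", toy_segmenter))  # ev ler de kitap lar
```

In practice, `segment` would wrap a call into the morphological analyzer, and the resulting string would then be passed to the tokenizer loaded in the usage example.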

Details and Performance

Details and performance comparisons can be found in this paper: https://arxiv.org/abs/2204.08832

BibTeX Entry and Citation Info

@misc{https://doi.org/10.48550/arxiv.2204.08832,
  doi = {10.48550/ARXIV.2204.08832},
  url = {https://arxiv.org/abs/2204.08832},
  author = {Toraman, Cagri and Yilmaz, Eyup Halit and Şahinuç, Furkan and Ozcelik, Oguzhan},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {Impact of Tokenization on Language Models: An Analysis for Turkish},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}

📄 License

The model is licensed under cc-by-nc-sa-4.0.

Property	Details
Model Type	RoBERTa Turkish medium Morph-level 44k (uncased)
Training Data	Filtered and cleaned Turkish split of OSCAR
