🚀 RoBERTa Turkish medium Morph-level 44k (uncased)
A RoBERTa model pre-trained on Turkish text with a masked language modeling (MLM) objective, intended for Turkish natural language processing tasks.
🚀 Quick Start
This model is pre-trained on Turkish with a masked language modeling (MLM) objective and is uncased. The pre-training corpus is the Turkish split of OSCAR, further filtered and cleaned.
✨ Features
- Model Architecture: Similar to bert-medium, with 8 layers, 8 attention heads, and a hidden size of 512.
- Tokenization: Morph-level. Text is split into morphemes by a Turkish morphological analyzer (Zemberek).
- Vocabulary Size: Approximately 43.6k.
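To make the sizes listed above concrete, here is a rough back-of-the-envelope sketch of the per-head dimension and encoder parameter count. Only the layer count, head count, and hidden size come from this card. The exact vocabulary size (43,600), the feed-forward size (2048, i.e. 4 x hidden as in bert-medium), and the position count (514, borrowed from the example max length) are assumptions, so treat the result as an estimate rather than the model's actual parameter count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderShape:
    num_layers: int = 8             # from the model card
    num_heads: int = 8              # from the model card
    hidden_size: int = 512          # from the model card
    vocab_size: int = 43_600        # ASSUMPTION: card only says "approximately 43.6k"
    intermediate_size: int = 2048   # ASSUMPTION: 4 x hidden, as in bert-medium
    max_positions: int = 514        # ASSUMPTION: taken from the example max length

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden vector.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    def embedding_params(self) -> int:
        # Token and position embeddings plus one LayerNorm (weight and bias).
        # Token-type embeddings are ignored in this estimate.
        h = self.hidden_size
        return (self.vocab_size + self.max_positions) * h + 2 * h

    def layer_params(self) -> int:
        h, f = self.hidden_size, self.intermediate_size
        attention = 4 * (h * h + h)               # Q, K, V, output projections with biases
        feed_forward = (h * f + f) + (f * h + h)  # two dense layers with biases
        layer_norms = 2 * (2 * h)                 # two LayerNorms per layer
        return attention + feed_forward + layer_norms

    def total_params(self) -> int:
        return self.embedding_params() + self.num_layers * self.layer_params()


shape = EncoderShape()
print(f"head dim: {shape.head_dim}")
print(f"estimated encoder parameters: {shape.total_params():,}")
```

Under these assumptions the embedding table accounts for a large share of the estimate, which is typical for a small encoder paired with a vocabulary of this size.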
📦 Installation
The model card lists no specific installation steps. The usage example relies on the Hugging Face `transformers` library.
💻 Usage Examples
Basic Usage
The following code loads the model and the tokenizer. The maximum length (514) is only an example and can be changed:
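from transformers import AutoModel, PreTrainedTokenizerFast

# Replace the bracketed placeholders with your own paths.
model = AutoModel.from_pretrained("[model_path]")
tokenizer = PreTrainedTokenizerFast(tokenizer_file="[file_path]")

# Register the special tokens on the tokenizer.
tokenizer.mask_token = "[MASK]"
tokenizer.cls_token = "[CLS]"
tokenizer.sep_token = "[SEP]"
tokenizer.pad_token = "[PAD]"
tokenizer.unk_token = "[UNK]"
tokenizer.bos_token = "[CLS]"
tokenizer.eos_token = "[SEP]"

# Example maximum sequence length; adjust as needed.
tokenizer.model_max_length = 514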
📚 Documentation
Important Note
This model requires a pre-processing step before use, because the tokenizer file is not a morphological analyzer: it cannot split raw text into morphemes on its own. Any test dataset must first be processed with a Turkish morphological analyzer (Zemberek in this case) before running evaluation.
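The pre-processing step described above could be organized roughly as follows. This is a minimal sketch, not the authors' pipeline: `toy_segmenter` is a hypothetical stand-in for a real Zemberek call, the space-separated output format is an assumption, and the Turkish-aware lowercasing is an assumption based on the model being uncased.

```python
from typing import Callable, List

# Any callable that maps one word to its list of morphemes.
Segmenter = Callable[[str], List[str]]


def turkish_lower(text: str) -> str:
    # ASSUMPTION: inputs should be lowercased because the model is uncased.
    # Turkish maps "I" to dotless "ı" and "İ" to "i", which plain
    # str.lower() does not do, so handle those two letters first.
    return text.replace("İ", "i").replace("I", "ı").lower()


def toy_segmenter(word: str) -> List[str]:
    # HYPOTHETICAL stand-in for a morphological analyzer such as Zemberek.
    # A real analyzer would replace this lookup table.
    table = {
        "evlerde": ["ev", "ler", "de"],
        "kitaplar": ["kitap", "lar"],
    }
    return table.get(word, [word])


def preprocess(text: str, segment: Segmenter) -> str:
    # Lowercase, split on whitespace, segment each word, then join the
    # morphemes with spaces (ASSUMPTION: the expected input format).
    morphemes: List[str] = []
    for word in turkish_lower(text).split():
        morphemes.extend(segment(word))
    return " ".join(morphemes)


print(preprocess("Evlerde kitaplar", toy_segmenter))
```

Passing the segmenter in as a callable keeps the sketch independent of any particular analyzer binding; the pre-processed string is what would then be handed to the tokenizer.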
Details and Performance
Details and performance comparisons are available in this paper:
https://arxiv.org/abs/2204.08832
BibTeX Entry and Citation Info
@misc{https://doi.org/10.48550/arxiv.2204.08832,
doi = {10.48550/ARXIV.2204.08832},
url = {https://arxiv.org/abs/2204.08832},
author = {Toraman, Cagri and Yilmaz, Eyup Halit and Şahinuç, Furkan and Ozcelik, Oguzhan},
keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
title = {Impact of Tokenization on Language Models: An Analysis for Turkish},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}
}
📄 License
The model is licensed under cc-by-nc-sa-4.0.
| Property | Details |
|---|---|
| Model Type | RoBERTa Turkish medium Morph-level 44k (uncased) |
| Training Data | Filtered and cleaned Turkish split of OSCAR |