roberta-base-turkish-uncased Open Source Model - Empowering Turkish Natural Language Processing, Free to Use

Roberta Base Turkish Uncased

Developed by burakaytan

RoBERTa base model pre-trained on Turkish, trained with 38GB of Turkish corpus

Large Language Model

Transformers

Open Source License: MIT

Tags: #Turkish pre-training #Cloze prediction #Large-scale corpus training

Downloads 57

Release Time : 4/20/2022

Model Overview

This is a RoBERTa base model for Turkish. It is primarily used for masked language modeling, that is, predicting masked tokens in Turkish text, and it can also serve as an encoder for Turkish text understanding tasks.

Model Features

Large-scale Turkish pre-training

Trained with 38GB of Turkish corpus (including Wikipedia, OSCAR corpus, and news website data)

High-performance hardware training

Training completed on high-performance hardware equipped with Intel Xeon Gold processors and Tesla V100 GPUs

Optimized Turkish language processing

Specifically optimized for Turkish language characteristics, enabling better handling of Turkish text

Model Capabilities

Turkish text understanding

Masked language modeling

Text completion

Semantic analysis

Use Cases

Text completion

Cloze application

Predict masked words in sentences

Predicts masked words in Turkish sentences and returns ranked candidates with scores

Semantic analysis

Text similarity calculation

Calculate semantic similarity between Turkish texts
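
The page does not say how similarity is computed. One common approach, shown here only as a minimal sketch and not as the author's method, is to mean-pool the encoder's last hidden states into one vector per text and compare vectors with cosine similarity. The helper names `embed` and `cosine_similarity` are hypothetical, and a masked language model that has not been fine-tuned for similarity may give only rough scores.

```python
import math
from typing import List, Sequence

MODEL_ID = "burakaytan/roberta-base-turkish-uncased"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def embed(texts: List[str]) -> List[List[float]]:
    """Return one mean-pooled vector per text (hypothetical helper)."""
    # Imported here so cosine_similarity stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden)

    # Average over real tokens only; padding positions have mask value 0.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return pooled.tolist()
```

To compare two sentences, pass both to `embed` in one call and feed the two returned vectors to `cosine_similarity`. Calling `embed` downloads the model on first use.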

🚀 RoBERTaTurk

RoBERTaTurk is a pre-trained Turkish language model built on the RoBERTa architecture. It has been trained on a large-scale Turkish corpus for use in Turkish natural language processing tasks.

🚀 Quick Start

📦 Installation

With the transformers library installed, load the tokenizer and model with the following code:

from transformers import AutoTokenizer, AutoModelForMaskedLM
  
tokenizer = AutoTokenizer.from_pretrained("burakaytan/roberta-base-turkish-uncased")
model = AutoModelForMaskedLM.from_pretrained("burakaytan/roberta-base-turkish-uncased")

💻 Usage Examples

Basic Usage

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="burakaytan/roberta-base-turkish-uncased",
    tokenizer="burakaytan/roberta-base-turkish-uncased"
)

fill_mask("iki ülke arasında <mask> başladı")

[{'sequence': 'iki ülke arasında savaş başladı',
  'score': 0.3013845384120941,
  'token': 1359,
  'token_str': ' savaş'},
 {'sequence': 'iki ülke arasında müzakereler başladı',
  'score': 0.1058429479598999,
  'token': 30439,
  'token_str': ' müzakereler'},
 {'sequence': 'iki ülke arasında görüşmeler başladı',
  'score': 0.07718811184167862,
  'token': 4916,
  'token_str': ' görüşmeler'},
 {'sequence': 'iki ülke arasında kriz başladı',
  'score': 0.07174749672412872,
  'token': 3908,
  'token_str': ' kriz'},
 {'sequence': 'iki ülke arasında çatışmalar başladı',
  'score': 0.05678590387105942,
  'token': 19346,
  'token_str': ' çatışmalar'}]
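
The pipeline returns a list of dictionaries, and each `token_str` carries a leading space. As a minimal sketch of post-processing, the snippet below filters the predictions shown above by score and strips that leading space. The function name `clean_predictions` and the `min_score` threshold are hypothetical choices for illustration, not part of the transformers API.

```python
from typing import Dict, List, Tuple

# Predictions copied from the fill_mask output above.
predictions: List[Dict] = [
    {"sequence": "iki ülke arasında savaş başladı",
     "score": 0.3013845384120941, "token": 1359, "token_str": " savaş"},
    {"sequence": "iki ülke arasında müzakereler başladı",
     "score": 0.1058429479598999, "token": 30439, "token_str": " müzakereler"},
    {"sequence": "iki ülke arasında görüşmeler başladı",
     "score": 0.07718811184167862, "token": 4916, "token_str": " görüşmeler"},
    {"sequence": "iki ülke arasında kriz başladı",
     "score": 0.07174749672412872, "token": 3908, "token_str": " kriz"},
    {"sequence": "iki ülke arasında çatışmalar başladı",
     "score": 0.05678590387105942, "token": 19346, "token_str": " çatışmalar"},
]


def clean_predictions(preds: List[Dict], min_score: float = 0.1) -> List[Tuple[str, float]]:
    """Keep candidates scoring at least min_score (an arbitrary threshold)
    and return (word, score) pairs with the leading space removed."""
    return [
        (p["token_str"].strip(), p["score"])
        for p in preds
        if p["score"] >= min_score
    ]


for word, score in clean_predictions(predictions):
    print(f"{word}: {score:.4f}")
```

With the default threshold only the first two candidates remain. Lowering `min_score` keeps more of the list.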

📚 Documentation

✨ Features

This is a Turkish RoBERTa base model pretrained on Turkish Wikipedia, Turkish OSCAR, and some news websites. The final training corpus is 38 GB in size and contains 329,720,508 sentences. Thanks to Turkcell, the model was trained for 2.5M steps on a machine with an Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256GB RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.

🔧 Technical Details

The training corpus combines Turkish Wikipedia, Turkish OSCAR, and some news websites, for a total of 38 GB and 329,720,508 sentences. Training ran for 2.5M steps on the following hardware: Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz, 256GB RAM, and 2 x GV100GL [Tesla V100 PCIe 32GB] GPUs.

📄 License

This project is licensed under the MIT License.

📚 Citation

To cite this model, use the following BibTeX entry:

@inproceedings{aytan2022comparison,
  title={Comparison of Transformer-Based Models Trained in Turkish and Different Languages on Turkish Natural Language Processing Problems},
  author={Aytan, Burak and Sakar, C Okan},
  booktitle={2022 30th Signal Processing and Communications Applications Conference (SIU)},
  pages={1--4},
  year={2022},
  organization={IEEE}
}
