# 🚀 Transformer Based Punctuation Restoration Models for Turkish
This project aims to correctly place pre-decided punctuation marks in a given Turkish text. It provides three pre-trained transformer models that predict period (.), comma (,), and question (?) marks.
Liked our work? Give us a ⭐ on GitHub!
You can find the BERT model used in the paper Transformer Based Punctuation Restoration for Turkish. The aim of this work is to correctly place pre-decided punctuation marks in a given text. We present three pre-trained transformer models to predict period (.), comma (,), and question (?) marks for the Turkish language.
## 🚀 Quick Start

### 💻 Usage Examples

#### Basic Usage
Recommended usage is via Hugging Face. You can run inference with the pre-trained BERT model using the following code:
```python
from transformers import pipeline

pipe = pipeline(task="token-classification", model="uygarkurt/bert-restore-punctuation-turkish")

sample_text = "Türkiye toprakları üzerindeki ilk yerleşmeler Yontma Taş Devri'nde başlar Doğu Trakya'da Traklar olmak üzere Hititler Frigler Lidyalılar ve Dor istilası sonucu Yunanistan'dan kaçan Akalar tarafından kurulan İyon medeniyeti gibi çeşitli eski Anadolu medeniyetlerinin ardından Makedonya kralı Büyük İskender'in egemenliğiyle ve fetihleriyle birlikte Helenistik Dönem başladı"

out = pipe(sample_text)
```
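The pipeline returns a list of token-level predictions rather than the punctuated string itself. Below is a minimal post-processing sketch; it assumes each entry exposes `word` and `entity` keys and that the labels are named `PERIOD`, `COMMA`, and `QUESTION` — these label names are assumptions, so check the model card for the exact label set.

```python
# Sketch: rebuild punctuated text from token-classification output.
# ASSUMPTION: label names "PERIOD", "COMMA", "QUESTION" and dict keys
# "word"/"entity" — verify against the model card before relying on this.
PUNCT = {"PERIOD": ".", "COMMA": ",", "QUESTION": "?"}

def restore(entries):
    words = []
    for e in entries:
        # Append the predicted mark (if any) directly after the word.
        words.append(e["word"] + PUNCT.get(e["entity"], ""))
    return " ".join(words)

# Toy output in the assumed format:
sample = [
    {"word": "Nasılsın", "entity": "QUESTION"},
    {"word": "iyiyim", "entity": "COMMA"},
    {"word": "teşekkürler", "entity": "PERIOD"},
]
print(restore(sample))  # Nasılsın? iyiyim, teşekkürler.
```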
To use a different pre-trained model, simply replace the `model` argument with one of the other available models we provide.
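For instance, switching to ELECTRA only changes the model ID. The `MODELS` dictionary below is a hypothetical convenience mapping (not part of the project); only the model IDs themselves come from the Available Models list:

```python
# Hypothetical helper mapping short names to the project's Hugging Face
# model IDs; the IDs are from the "Available Models" section.
MODELS = {
    "bert": "uygarkurt/bert-restore-punctuation-turkish",
    "electra": "uygarkurt/electra-restore-punctuation-turkish",
    "convbert": "uygarkurt/convbert-restore-punctuation-turkish",
}

# e.g.:
# pipe = pipeline(task="token-classification", model=MODELS["electra"])
```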
## 📚 Documentation

### 📦 Data
The dataset is provided in the `data/` directory as train, validation, and test splits. It can be summarized as follows:
| Property | Train | Validation | Test |
|---|---|---|---|
| Total | 1471806 | 180326 | 182487 |
| Period (.) | 124817 | 15306 | 15524 |
| Comma (,) | 98194 | 11980 | 12242 |
| Question (?) | 9816 | 1199 | 1255 |
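As a quick sanity check on class balance, the training counts above can be turned into a distribution; question marks are a small minority class:

```python
# Class distribution of punctuation marks in the training split,
# using the counts from the table above.
train = {"period": 124817, "comma": 98194, "question": 9816}
total = sum(train.values())  # 232827 labeled marks

for mark, count in train.items():
    print(f"{mark}: {count / total:.1%}")
# period: 53.6%
# comma: 42.2%
# question: 4.2%
```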
## ✨ Available Models
We experimented with BERT, ELECTRA, and ConvBERT. The pre-trained models can be accessed via Hugging Face.
- BERT: https://huggingface.co/uygarkurt/bert-restore-punctuation-turkish
- ELECTRA: https://huggingface.co/uygarkurt/electra-restore-punctuation-turkish
- ConvBERT: https://huggingface.co/uygarkurt/convbert-restore-punctuation-turkish
## 🔧 Results
Precision, recall, and F1 scores for each model and punctuation mark are summarized below.
| Model | PERIOD P | PERIOD R | PERIOD F1 | COMMA P | COMMA R | COMMA F1 | QUESTION P | QUESTION R | QUESTION F1 | OVERALL P | OVERALL R |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 0.972602 | 0.947504 | 0.959952 | 0.576145 | 0.700010 | 0.632066 | 0.927642 | 0.911342 | 0.919420 | 0.825506 | 0.852952 |
| ELECTRA | 0.972602 | 0.948689 | 0.960497 | 0.576800 | 0.710208 | 0.636590 | 0.920325 | 0.921074 | 0.920699 | 0.823242 | 0.859990 |
| ConvBERT | 0.972731 | 0.946791 | 0.959585 | 0.576964 | 0.708124 | 0.635851 | 0.922764 | 0.913849 | 0.918285 | 0.824153 | 0.856254 |
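The OVERALL columns appear consistent with a macro average over the three punctuation classes — this is an inference from the numbers, not something stated in the source. For example, BERT's overall recall reproduces exactly:

```python
# BERT per-class recalls from the table: period, comma, question.
recalls = [0.947504, 0.700010, 0.911342]

# Macro average: unweighted mean over the classes.
overall_recall = sum(recalls) / len(recalls)
print(round(overall_recall, 6))  # 0.852952, matching the OVERALL R column
```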
## 📄 License
This project is licensed under the MIT License.
## 🔗 Citation
```bibtex
@INPROCEEDINGS{10286690,
  author={Kurt, Uygar and Çayır, Aykut},
  booktitle={2023 8th International Conference on Computer Science and Engineering (UBMK)},
  title={Transformer Based Punctuation Restoration for Turkish},
  year={2023},
  volume={},
  number={},
  pages={169-174},
  doi={10.1109/UBMK59864.2023.10286690}
}
```