# 🚀 Transformer Based Punctuation Restoration Models for Turkish
This project aims to correctly place pre-decided punctuation marks in a given Turkish text. It provides three pre-trained transformer models that predict period (.), comma (,), and question (?) marks.
Liked our work? Give us a ⭐ on GitHub!
You can find the BERT model used in the paper Transformer Based Punctuation Restoration for Turkish. The aim of this work is to correctly place pre-decided punctuation marks in a given text. We present three pre-trained transformer models to predict period (.), comma (,), and question (?) marks for the Turkish language.
## 🚀 Quick Start

### 💻 Usage Examples

#### Basic Usage
Recommended usage is via Hugging Face. You can run inference with the pre-trained BERT model using the following code:
```python
from transformers import pipeline

pipe = pipeline(task="token-classification", model="uygarkurt/bert-restore-punctuation-turkish")

sample_text = "Türkiye toprakları üzerindeki ilk yerleşmeler Yontma Taş Devri'nde başlar Doğu Trakya'da Traklar olmak üzere Hititler Frigler Lidyalılar ve Dor istilası sonucu Yunanistan'dan kaçan Akalar tarafından kurulan İyon medeniyeti gibi çeşitli eski Anadolu medeniyetlerinin ardından Makedonya kralı Büyük İskender'in egemenliğiyle ve fetihleriyle birlikte Helenistik Dönem başladı"

out = pipe(sample_text)
```
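The pipeline returns a list of token-level predictions rather than the punctuated string itself. Below is a minimal post-processing sketch; it assumes each entry exposes `word` and `entity` keys and that the labels are named `PERIOD`, `COMMA`, and `QUESTION` — these label names are assumptions, so check the model card for the exact label set.

```python
# Sketch: rebuild punctuated text from token-classification output.
# ASSUMPTION: label names "PERIOD", "COMMA", "QUESTION" and dict keys
# "word"/"entity" — verify against the model card before relying on this.
PUNCT = {"PERIOD": ".", "COMMA": ",", "QUESTION": "?"}

def restore(entries):
    words = []
    for e in entries:
        # Append the predicted mark (if any) directly after the word.
        words.append(e["word"] + PUNCT.get(e["entity"], ""))
    return " ".join(words)

# Toy output in the assumed format:
sample = [
    {"word": "Nasılsın", "entity": "QUESTION"},
    {"word": "iyiyim", "entity": "COMMA"},
    {"word": "teşekkürler", "entity": "PERIOD"},
]
print(restore(sample))  # Nasılsın? iyiyim, teşekkürler.
```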
To use a different pre-trained model, simply replace the `model` argument with one of the other available models we provide.
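For instance, switching to ELECTRA only changes the model ID. The `MODELS` dictionary below is a hypothetical convenience mapping (not part of the project); only the model IDs themselves come from the Available Models list:

```python
# Hypothetical helper mapping short names to the project's Hugging Face
# model IDs; the IDs are from the "Available Models" section.
MODELS = {
    "bert": "uygarkurt/bert-restore-punctuation-turkish",
    "electra": "uygarkurt/electra-restore-punctuation-turkish",
    "convbert": "uygarkurt/convbert-restore-punctuation-turkish",
}

# e.g.:
# pipe = pipeline(task="token-classification", model=MODELS["electra"])
```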
## 📚 Documentation

### 📦 Data
The dataset is provided in the `data/` directory as train, validation, and test splits. It can be summarized as follows:
| Property | Train | Validation | Test |
|---|---|---|---|
| Total | 1471806 | 180326 | 182487 |
| Period (.) | 124817 | 15306 | 15524 |
| Comma (,) | 98194 | 11980 | 12242 |
| Question (?) | 9816 | 1199 | 1255 |
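As a quick sanity check on class balance, the training counts above can be turned into a distribution; question marks are a small minority class:

```python
# Class distribution of punctuation marks in the training split,
# using the counts from the table above.
train = {"period": 124817, "comma": 98194, "question": 9816}
total = sum(train.values())  # 232827 labeled marks

for mark, count in train.items():
    print(f"{mark}: {count / total:.1%}")
# period: 53.6%
# comma: 42.2%
# question: 4.2%
```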
## ✨ Available Models
We experimented with BERT, ELECTRA, and ConvBERT. The pre-trained models can be accessed via Hugging Face.
- BERT: https://huggingface.co/uygarkurt/bert-restore-punctuation-turkish
- ELECTRA: https://huggingface.co/uygarkurt/electra-restore-punctuation-turkish
- ConvBERT: https://huggingface.co/uygarkurt/convbert-restore-punctuation-turkish
## 🔧 Results
Precision, recall, and F1 scores for each model and punctuation mark are summarized below.
| Model | PERIOD P | PERIOD R | PERIOD F1 | COMMA P | COMMA R | COMMA F1 | QUESTION P | QUESTION R | QUESTION F1 | OVERALL P | OVERALL R |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT | 0.972602 | 0.947504 | 0.959952 | 0.576145 | 0.700010 | 0.632066 | 0.927642 | 0.911342 | 0.919420 | 0.825506 | 0.852952 |
| ELECTRA | 0.972602 | 0.948689 | 0.960497 | 0.576800 | 0.710208 | 0.636590 | 0.920325 | 0.921074 | 0.920699 | 0.823242 | 0.859990 |
| ConvBERT | 0.972731 | 0.946791 | 0.959585 | 0.576964 | 0.708124 | 0.635851 | 0.922764 | 0.913849 | 0.918285 | 0.824153 | 0.856254 |
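The OVERALL columns appear consistent with a macro average over the three punctuation classes — this is an inference from the numbers, not something stated in the source. For example, BERT's overall recall reproduces exactly:

```python
# BERT per-class recalls from the table: period, comma, question.
recalls = [0.947504, 0.700010, 0.911342]

# Macro average: unweighted mean over the classes.
overall_recall = sum(recalls) / len(recalls)
print(round(overall_recall, 6))  # 0.852952, matching the OVERALL R column
```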
## 📄 License
This project is licensed under the MIT License.
## 🔗 Citation
```bibtex
@INPROCEEDINGS{10286690,
  author={Kurt, Uygar and Çayır, Aykut},
  booktitle={2023 8th International Conference on Computer Science and Engineering (UBMK)},
  title={Transformer Based Punctuation Restoration for Turkish},
  year={2023},
  volume={},
  number={},
  pages={169-174},
  doi={10.1109/UBMK59864.2023.10286690}
}
```