Fullstop Punctuation Multilingual Sonar Base

Developed by oliverguhr

This model is used to predict punctuation marks in English, Italian, French, German, and Dutch texts, and is particularly suitable for restoring punctuation in transcribed spoken language.

Sequence Labeling

Transformers

Supports Multiple Languages

Open Source License: MIT

#Multilingual punctuation restoration #European Parliament text adaptation #High-precision F1 score

Downloads 6,181

Release Time: 5/17/2022

Model Overview

A multilingual punctuation prediction model based on the Transformer architecture, capable of restoring punctuation marks such as periods, commas, question marks, hyphens, and colons.

Model Features

Multilingual support

Supports punctuation prediction for five languages: English, German, French, Italian, and Dutch.

High-precision prediction

Predicts periods with F1 scores of 0.917 to 0.959 and question marks with F1 scores of 0.736 to 0.829 across the five languages; colons and hyphens score considerably lower.

Optimized for political speeches

The model is trained on the European Parliament dataset and is particularly suitable for processing political speech texts.

Model Capabilities

Text punctuation restoration

Multilingual text processing

Punctuation prediction

Use Cases

Speech transcription

Punctuation restoration for meeting records

Add punctuation to meeting transcription texts without punctuation

Reported macro-averaged F1 reaches 0.784 for German and Dutch (0.718 to 0.742 for English, French, and Italian)

Education

Language learning assistance

Help language learners understand the correct use of punctuation

🚀 Quick Start

This model predicts the punctuation of English, Italian, French, German, and Dutch texts. It was developed to restore the punctuation of transcribed spoken language.

This multilingual model was trained on the Europarl dataset provided by the SEPP-NLG Shared Task. For Dutch, the SoNaR dataset was included as well.

⚠️ Important Note

Please note that the Europarl dataset consists of political speeches. Therefore, the model might perform differently on texts from other domains.

The model restores the following punctuation markers: "." "," "?" "-" ":"
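Because the training data is political speech, it can help to measure the model on text from your own domain. The following is a minimal sketch, not part of the `deepmultilingualpunctuation` package: the hypothetical helper `labels_from_reference` turns an already punctuated reference text into (word, label) pairs, using "0" for "no marker", so that they can be compared with the model's predictions. It assumes that markers are attached to the end of a word or stand alone between spaces.

```python
# The five markers the model restores.
MARKERS = ".,?-:"


def labels_from_reference(text):
    """Split punctuated text into (word, label) pairs; '0' means no marker.

    Hypothetical helper for building reference labels, not a library function.
    """
    pairs = []
    for token in text.split():
        word = token.rstrip(MARKERS)
        # The first stripped character is the label; '0' if nothing was stripped.
        label = token[len(word)] if len(word) < len(token) else "0"
        if word:
            pairs.append((word, label))
        elif pairs:
            # Standalone marker such as " - ": attach it to the previous word.
            pairs[-1] = (pairs[-1][0], label)
    return pairs


reference = "Ist das eine Frage, Frau Müller?"
pairs = labels_from_reference(reference)
words = [word for word, _ in pairs]  # unpunctuated input for the model
print(pairs)
```

The word list can be passed to the model, and the predicted labels can then be compared position by position with the reference labels.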

✨ Features

Supports multiple languages including English, Italian, French, German, and Dutch.
Trained on large-scale datasets for better performance.
Can restore various punctuation markers.

📦 Installation

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

💻 Usage Examples

Basic Usage

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

Output:

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Advanced Usage

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)

Output:

[['My', '0', 0.99998856], ['name', '0', 0.9999708], ['is', '0', 0.99975926], ['Clara', '0', 0.6117834], ['and', '0', 0.9999014], ['I', '0', 0.9999808], ['live', '0', 0.9999666], ['in', '0', 0.99990165], ['Berkeley', ',', 0.9941764], ['California', '.', 0.9952892], ['Ist', '0', 0.9999577], ['das', '0', 0.9999678], ['eine', '0', 0.99998224], ['Frage', ',', 0.9952265], ['Frau', '0', 0.99995995], ['Müller', '?', 0.972517]]
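Each entry in this output is a word, its predicted label ("0" for no marker), and a confidence score. As a minimal sketch of how such a list could be post-processed, the block below rebuilds the punctuated string and lists predictions below a confidence threshold. The helpers `join_labeled_words` and `low_confidence` and the threshold of 0.9 are assumptions made for illustration; they are not part of the package.

```python
# Output of model.predict(), copied from the example above.
labeled_words = [
    ["My", "0", 0.99998856], ["name", "0", 0.9999708], ["is", "0", 0.99975926],
    ["Clara", "0", 0.6117834], ["and", "0", 0.9999014], ["I", "0", 0.9999808],
    ["live", "0", 0.9999666], ["in", "0", 0.99990165], ["Berkeley", ",", 0.9941764],
    ["California", ".", 0.9952892], ["Ist", "0", 0.9999577], ["das", "0", 0.9999678],
    ["eine", "0", 0.99998224], ["Frage", ",", 0.9952265], ["Frau", "0", 0.99995995],
    ["Müller", "?", 0.972517],
]


def join_labeled_words(labeled_words):
    """Append each predicted marker to its word; '0' means no marker."""
    parts = []
    for word, label, _score in labeled_words:
        parts.append(word if label == "0" else word + label)
    return " ".join(parts)


def low_confidence(labeled_words, threshold=0.9):
    """Return the predictions whose score is below the (assumed) threshold."""
    return [(word, label, score) for word, label, score in labeled_words if score < threshold]


print(join_labeled_words(labeled_words))
print(low_confidence(labeled_words))
```

For the list above, only the prediction for "Clara" (score 0.6117834) falls below the threshold, which makes it a candidate for manual review.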

📚 Documentation

Results

Performance differs across the individual punctuation markers because hyphens and colons are often optional and can be replaced by either a comma or a full stop. The label "0" means that no punctuation mark follows the word. The model achieves the following F1 scores for each language:

Label	English	German	French	Italian	Dutch
0	0.990	0.996	0.991	0.988	0.994
.	0.924	0.951	0.921	0.917	0.959
?	0.825	0.829	0.800	0.736	0.817
,	0.798	0.937	0.811	0.778	0.813
:	0.535	0.608	0.578	0.544	0.657
-	0.345	0.384	0.353	0.344	0.464
macro average	0.736	0.784	0.742	0.718	0.784
micro average	0.975	0.987	0.977	0.972	0.983
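The macro average row is consistent with an unweighted mean of the six per-label F1 scores in each column. The short sketch below recomputes it from the values in the table; the dictionary layout and the function name `macro_average` are chosen here for illustration.

```python
# Per-label F1 scores from the table, in the order: 0 . ? , : -
f1_scores = {
    "English": [0.990, 0.924, 0.825, 0.798, 0.535, 0.345],
    "German": [0.996, 0.951, 0.829, 0.937, 0.608, 0.384],
    "French": [0.991, 0.921, 0.800, 0.811, 0.578, 0.353],
    "Italian": [0.988, 0.917, 0.736, 0.778, 0.544, 0.344],
    "Dutch": [0.994, 0.959, 0.817, 0.813, 0.657, 0.464],
}


def macro_average(scores):
    """Unweighted mean: every label counts equally, however rare it is."""
    return sum(scores) / len(scores)


for language, scores in f1_scores.items():
    print(f"{language}: {macro_average(scores):.3f}")
```

Because every label carries the same weight, the weak hyphen and colon scores pull the macro average down. The micro average is much higher, presumably because the frequent "0" label dominates it.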

Languages

Models

Languages	Model
English, Italian, French and German	oliverguhr/fullstop-punctuation-multilang-large
English, Italian, French, German and Dutch	oliverguhr/fullstop-punctuation-multilingual-sonar-base
Dutch	oliverguhr/fullstop-dutch-sonar-punctuation-prediction

Community Models

Languages	Model
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian	kredor/punctuate-all
Catalan	softcatala/fullstop-catalan-punctuation-prediction

You can use different models by setting the model parameter:

model = PunctuationModel(model="oliverguhr/fullstop-dutch-punctuation-prediction")
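A small lookup can make this choice explicit. The sketch below copies the two tables above into a dictionary; `models_for` is a hypothetical helper, not part of the package, and it only lists candidate models without ranking them.

```python
# Model names and languages copied from the tables above.
MODEL_LANGUAGES = {
    "oliverguhr/fullstop-punctuation-multilang-large":
        {"English", "Italian", "French", "German"},
    "oliverguhr/fullstop-punctuation-multilingual-sonar-base":
        {"English", "Italian", "French", "German", "Dutch"},
    "oliverguhr/fullstop-dutch-sonar-punctuation-prediction":
        {"Dutch"},
    "kredor/punctuate-all":
        {"English", "German", "French", "Spanish", "Bulgarian", "Italian",
         "Polish", "Dutch", "Czech", "Portuguese", "Slovak", "Slovenian"},
    "softcatala/fullstop-catalan-punctuation-prediction":
        {"Catalan"},
}


def models_for(language):
    """Return every listed model that covers the language, in table order."""
    return [name for name, languages in MODEL_LANGUAGES.items() if language in languages]


print(models_for("Dutch"))
print(models_for("Catalan"))
```

Any of the returned names can be passed as the `model` parameter of `PunctuationModel`.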

📄 License

This project is licensed under the MIT license.

How to cite us

@article{guhr-EtAl:2021:fullstop,
  title={FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  booktitle      = {Proceedings of the Swiss Text Analytics Conference 2021},
  month          = {June},
  year           = {2021},
  address        = {Winterthur, Switzerland},
  publisher      = {CEUR Workshop Proceedings},  
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}

@misc{https://doi.org/10.48550/arxiv.2301.03319,
  doi = {10.48550/ARXIV.2301.03319},
  url = {https://arxiv.org/abs/2301.03319},
  author = {Vandeghinste, Vincent and Guhr, Oliver},
  keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7},
  title = {FullStop:Punctuation and Segmentation Prediction for Dutch with Transformers},
  publisher = {arXiv},
  year = {2023},  
  copyright = {Creative Commons Attribution Share Alike 4.0 International}
}
