fullstop-punctuation-multilang-large Open-source Model - Free Recovery of Punctuation Structures in Transcribed Spoken English, Italian, French and German

Fullstop Punctuation Multilang Large

Developed by oliverguhr

A multilingual model for predicting punctuation in English, Italian, French, and German texts, designed to restore the punctuation structure of transcribed speech.

Sequence Labeling

Transformers

Supports Multiple Languages

Open Source License: MIT

#Multilingual punctuation restoration #Political speech transcription optimization #European Parliament data training

Downloads 375.32k

Release Time: 3/2/2022

Model Overview

This model is trained on the European Parliament dataset and can restore punctuation marks such as periods, commas, question marks, hyphens, and colons.

Model Features

Multilingual support

Supports punctuation prediction for four languages: English, Italian, French, and German.

Multiple punctuation prediction

Capable of predicting various punctuation marks including periods, commas, question marks, hyphens, and colons.

Easy to use

Provides a simple Python package that supports processing texts of any length.

Model Capabilities

Text punctuation prediction

Multilingual text processing

Punctuation restoration for transcribed speech

Use Cases

Text processing

Punctuation restoration for transcribed text

Restores punctuation structure for transcribed speech to improve readability.

The restored text contains correct punctuation marks such as periods and commas.

🚀 Multilingual Punctuation Prediction Model

This model is designed to predict the punctuation of English, Italian, French, and German texts. It was developed to restore punctuation in transcribed spoken language, offering a practical solution for enhancing text readability.

📋 Metadata

Supported Languages: English, German, French, Italian, Multilingual
Tags: Punctuation prediction, Punctuation
Datasets: Europarl Dataset
License: MIT

🚀 Quick Start

This model predicts the punctuation of English, Italian, French and German texts. We developed it to restore the punctuation of transcribed spoken language.

This multilanguage model was trained on the Europarl Dataset provided by the SEPP-NLG Shared Task. Please note that this dataset consists of political speeches. Therefore the model might perform differently on texts from other domains.

The model restores the following punctuation markers: "." "," "?" "-" ":"

📦 Installation

To get started, install the package from PyPI:

pip install deepmultilingualpunctuation

💻 Usage Examples

##### Basic Usage

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

Output

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

##### Advanced Usage

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
# Clean the raw text before prediction
clean_text = model.preprocess(text)
# Each entry is [word, predicted label, confidence score]
labeled_words = model.predict(clean_text)
print(labeled_words)

Output

[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]
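Each entry in this output is a `[word, label, score]` triple, and the label `'0'` marks a word that is not followed by punctuation. As a minimal sketch, the triples could be joined back into text and low-confidence predictions flagged for manual review as shown below. The helper names `join_labeled_words` and `low_confidence` and the 0.9 threshold are illustrative assumptions, not part of the package.

```python
# Output copied from the advanced usage example above.
labeled_words = [
    ["My", "0", 0.9999887], ["name", "0", 0.99998665], ["is", "0", 0.9998579],
    ["Clara", "0", 0.6752215], ["and", "0", 0.99990904], ["I", "0", 0.9999877],
    ["live", "0", 0.9999839], ["in", "0", 0.9999515], ["Berkeley", ",", 0.99800044],
    ["California", ".", 0.99534047], ["Ist", "0", 0.99998784], ["das", "0", 0.99999154],
    ["eine", "0", 0.9999918], ["Frage", ",", 0.99622655], ["Frau", "0", 0.9999889],
    ["Müller", "?", 0.99863917],
]


def join_labeled_words(labeled_words):
    """Attach each predicted marker to its word; label "0" adds nothing."""
    tokens = []
    for word, label, _score in labeled_words:
        tokens.append(word if label == "0" else word + label)
    return " ".join(tokens)


def low_confidence(labeled_words, threshold=0.9):
    """Return the (word, label, score) entries whose score is below the threshold."""
    return [
        (word, label, score)
        for word, label, score in labeled_words
        if score < threshold
    ]


print(join_labeled_words(labeled_words))
print(low_confidence(labeled_words))
```

With the sample output, the joined text matches the result of the basic usage example, and only the entry for "Clara" falls below the threshold.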

📈 Results

Performance differs between the individual punctuation markers because hyphens and colons are often optional and can be replaced by either a comma or a full stop. The model achieves the following F1 scores for the different languages:

Label	EN	DE	FR	IT
0	0.991	0.997	0.992	0.989
.	0.948	0.961	0.945	0.942
?	0.890	0.893	0.871	0.832
,	0.819	0.945	0.831	0.798
:	0.575	0.652	0.620	0.588
-	0.425	0.435	0.431	0.421
macro average	0.775	0.814	0.782	0.762
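The macro average row is the unweighted mean of the six per-label F1 scores, so the rare `-` and `:` labels count as much as the frequent `0` label. The short sketch below recomputes that row from the table values. The dictionary layout is only an illustrative choice.

```python
# Per-label F1 scores copied from the table above.
f1_scores = {
    "0": {"EN": 0.991, "DE": 0.997, "FR": 0.992, "IT": 0.989},
    ".": {"EN": 0.948, "DE": 0.961, "FR": 0.945, "IT": 0.942},
    "?": {"EN": 0.890, "DE": 0.893, "FR": 0.871, "IT": 0.832},
    ",": {"EN": 0.819, "DE": 0.945, "FR": 0.831, "IT": 0.798},
    ":": {"EN": 0.575, "DE": 0.652, "FR": 0.620, "IT": 0.588},
    "-": {"EN": 0.425, "DE": 0.435, "FR": 0.431, "IT": 0.421},
}

languages = ["EN", "DE", "FR", "IT"]

# Macro average: plain mean over labels, ignoring how often each label occurs.
macro_average = {
    lang: round(sum(row[lang] for row in f1_scores.values()) / len(f1_scores), 3)
    for lang in languages
}

print(macro_average)
# {'EN': 0.775, 'DE': 0.814, 'FR': 0.782, 'IT': 0.762}
```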

📚 Documentation

##### Models

Languages	Model
English, Italian, French and German	oliverguhr/fullstop-punctuation-multilang-large
English, Italian, French, German and Dutch	oliverguhr/fullstop-punctuation-multilingual-sonar-base
Dutch	oliverguhr/fullstop-dutch-sonar-punctuation-prediction

##### Community Models

Languages	Model
English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian	kredor/punctuate-all
Catalan	softcatala/fullstop-catalan-punctuation-prediction
Welsh	techiaith/fullstop-welsh-punctuation-prediction

You can use different models by setting the model parameter:

model = PunctuationModel(model="oliverguhr/fullstop-dutch-punctuation-prediction")
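To choose a model by language, the tables above can be turned into a small lookup. This is only a sketch: `MODELS_BY_LANGUAGE` and `pick_model` are hypothetical names, the language codes are ISO 639-1, and where several listed models cover a language the mapping picks just one of them. The final call is commented out because it downloads the model weights.

```python
# Model identifiers copied from the tables above.
MULTILANG_LARGE = "oliverguhr/fullstop-punctuation-multilang-large"
PUNCTUATE_ALL = "kredor/punctuate-all"

MODELS_BY_LANGUAGE = {
    "en": MULTILANG_LARGE,
    "it": MULTILANG_LARGE,
    "fr": MULTILANG_LARGE,
    "de": MULTILANG_LARGE,
    "nl": "oliverguhr/fullstop-dutch-sonar-punctuation-prediction",
    "ca": "softcatala/fullstop-catalan-punctuation-prediction",
    "cy": "techiaith/fullstop-welsh-punctuation-prediction",
    # Languages covered only by the broader community model.
    "es": PUNCTUATE_ALL,
    "bg": PUNCTUATE_ALL,
    "pl": PUNCTUATE_ALL,
    "cs": PUNCTUATE_ALL,
    "pt": PUNCTUATE_ALL,
    "sk": PUNCTUATE_ALL,
    "sl": PUNCTUATE_ALL,
}


def pick_model(language_code):
    """Return a model identifier for a language code, or raise ValueError."""
    code = language_code.lower()
    if code not in MODELS_BY_LANGUAGE:
        raise ValueError(f"No listed model covers language code {language_code!r}")
    return MODELS_BY_LANGUAGE[code]


print(pick_model("nl"))

# Usage with the package (downloads the model on first use):
# from deepmultilingualpunctuation import PunctuationModel
# model = PunctuationModel(model=pick_model("nl"))
```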

🔧 Technical Details

If you're interested in the complete code of the research project, you can take a look at this repository.

There is also a guide on how to fine-tune this model for your own data or language.

📄 License

This project is licensed under the MIT license.

📖 References

@inproceedings{guhr-EtAl:2021:fullstop,
  title     = {FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
  booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
  month     = {June},
  year      = {2021},
  address   = {Winterthur, Switzerland},
  publisher = {CEUR Workshop Proceedings},
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}
