🚀 Multilingual Punctuation Prediction Model
This model is designed to predict punctuation in English, Italian, French, German, and Dutch texts, aiming to restore punctuation in transcribed spoken language.
🚀 Quick Start
This multilingual model was trained on the Europarl dataset provided by the SEPP-NLG Shared Task. For Dutch, the SoNaR dataset was also included.
⚠️ Important Note
Note that the Europarl dataset consists of political speeches, so the model might perform differently on texts from other domains.
The model restores the following punctuation markers: "." "," "?" "-" ":"
✨ Features
- Supports multiple languages including English, Italian, French, German, and Dutch.
- Trained on large-scale datasets for better performance.
- Can restore various punctuation markers.
📦 Installation
To get started, install the package from PyPI:
pip install deepmultilingualpunctuation
💻 Usage Examples
Basic Usage
from deepmultilingualpunctuation import PunctuationModel
model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)
output
My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?
Advanced Usage
from deepmultilingualpunctuation import PunctuationModel
model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilingual-sonar-base")
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)
output
[['My', '0', 0.99998856], ['name', '0', 0.9999708], ['is', '0', 0.99975926], ['Clara', '0', 0.6117834], ['and', '0', 0.9999014], ['I', '0', 0.9999808], ['live', '0', 0.9999666], ['in', '0', 0.99990165], ['Berkeley', ',', 0.9941764], ['California', '.', 0.9952892], ['Ist', '0', 0.9999577], ['das', '0', 0.9999678], ['eine', '0', 0.99998224], ['Frage', ',', 0.9952265], ['Frau', '0', 0.99995995], ['Müller', '?', 0.972517]]
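The labeled output above can be turned back into text by appending each predicted marker to its word. The following is a minimal sketch, not the library's own implementation: the helper `join_labeled_words` and its `min_score` parameter are hypothetical names introduced here, and it assumes each entry has the `[word, label, score]` shape shown in the output, with `'0'` meaning "no punctuation".

```python
# Hypothetical helper (not part of deepmultilingualpunctuation): rebuilds text
# from [word, label, score] triples like those returned by model.predict above.

def join_labeled_words(labeled_words, min_score=0.0):
    """Append each predicted marker to its word and join with spaces.

    The label '0' means "no punctuation". Predicted markers whose score is
    below min_score are dropped, which trades recall for precision.
    """
    tokens = []
    for word, label, score in labeled_words:
        if label != "0" and score >= min_score:
            tokens.append(f"{word}{label}")
        else:
            tokens.append(word)
    return " ".join(tokens)


# Sample data copied from the output shown above.
labeled_words = [
    ["My", "0", 0.99998856], ["name", "0", 0.9999708], ["is", "0", 0.99975926],
    ["Clara", "0", 0.6117834], ["and", "0", 0.9999014], ["I", "0", 0.9999808],
    ["live", "0", 0.9999666], ["in", "0", 0.99990165], ["Berkeley", ",", 0.9941764],
    ["California", ".", 0.9952892], ["Ist", "0", 0.9999577], ["das", "0", 0.9999678],
    ["eine", "0", 0.99998224], ["Frage", ",", 0.9952265], ["Frau", "0", 0.99995995],
    ["Müller", "?", 0.972517],
]

print(join_labeled_words(labeled_words))
# Keep only markers predicted with a score of at least 0.99.
print(join_labeled_words(labeled_words, min_score=0.99))
```

With the default threshold, the result matches the Basic Usage output. With `min_score=0.99`, the final question mark (score 0.972517) is dropped, which illustrates how the score could be used to filter less certain predictions.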
📚 Documentation
Results
Performance varies by punctuation marker: hyphens and colons are often optional and can be replaced by a comma or a full stop, so their scores are lower. The model achieves the following F1 scores for each language:
| Label | English | German | French | Italian | Dutch |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.990 | 0.996 | 0.991 | 0.988 | 0.994 |
| . | 0.924 | 0.951 | 0.921 | 0.917 | 0.959 |
| ? | 0.825 | 0.829 | 0.800 | 0.736 | 0.817 |
| , | 0.798 | 0.937 | 0.811 | 0.778 | 0.813 |
| : | 0.535 | 0.608 | 0.578 | 0.544 | 0.657 |
| - | 0.345 | 0.384 | 0.353 | 0.344 | 0.464 |
| macro average | 0.736 | 0.784 | 0.742 | 0.718 | 0.784 |
| micro average | 0.975 | 0.987 | 0.977 | 0.972 | 0.983 |
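The macro average rows are consistent with an unweighted mean of the six per-label F1 scores, which is why the weak hyphen and colon scores pull them well below the micro averages. The short check below recomputes them from the table values; the dictionary layout and the name `macro_f1` are illustrative assumptions, not part of the package.

```python
# Per-label F1 scores copied from the results table, in the label order
# "0", ".", "?", ",", ":", "-".
f1_by_language = {
    "English": [0.990, 0.924, 0.825, 0.798, 0.535, 0.345],
    "German":  [0.996, 0.951, 0.829, 0.937, 0.608, 0.384],
    "French":  [0.991, 0.921, 0.800, 0.811, 0.578, 0.353],
    "Italian": [0.988, 0.917, 0.736, 0.778, 0.544, 0.344],
    "Dutch":   [0.994, 0.959, 0.817, 0.813, 0.657, 0.464],
}


def macro_f1(scores):
    """Unweighted mean of per-label F1 scores (every label counts equally)."""
    return sum(scores) / len(scores)


for language, scores in f1_by_language.items():
    print(f"{language}: {macro_f1(scores):.3f}")
```

The micro average cannot be recomputed this way because it depends on per-label counts, which the table does not report.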
Languages
Models
Community Models
| Languages | Model |
| --- | --- |
| English, German, French, Spanish, Bulgarian, Italian, Polish, Dutch, Czech, Portuguese, Slovak, Slovenian | kredor/punctuate-all |
| Catalan | softcatala/fullstop-catalan-punctuation-prediction |
You can use different models by setting the model parameter:
model = PunctuationModel(model = "oliverguhr/fullstop-dutch-punctuation-prediction")
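When an application handles several languages, the model name can be looked up before it is passed to `PunctuationModel`. The sketch below is one possible arrangement, not a feature of the package: `MODEL_BY_LANGUAGE` and `pick_model` are hypothetical names, and the assignment of languages to models is an assumption based only on the model names and language lists mentioned in this document.

```python
# Hypothetical lookup table; the model names come from this document, while the
# choice of which model to use for each language is an illustrative assumption.
MULTILINGUAL = "oliverguhr/fullstop-punctuation-multilingual-sonar-base"
COMMUNITY_ALL = "kredor/punctuate-all"

MODEL_BY_LANGUAGE = {
    "english": MULTILINGUAL,
    "italian": MULTILINGUAL,
    "french": MULTILINGUAL,
    "german": MULTILINGUAL,
    "dutch": "oliverguhr/fullstop-dutch-punctuation-prediction",
    "catalan": "softcatala/fullstop-catalan-punctuation-prediction",
    "spanish": COMMUNITY_ALL,
    "bulgarian": COMMUNITY_ALL,
    "polish": COMMUNITY_ALL,
    "czech": COMMUNITY_ALL,
    "portuguese": COMMUNITY_ALL,
    "slovak": COMMUNITY_ALL,
    "slovenian": COMMUNITY_ALL,
}


def pick_model(language):
    """Return a model name for the language, or raise for unlisted languages."""
    key = language.strip().lower()
    if key not in MODEL_BY_LANGUAGE:
        raise ValueError(f"No punctuation model listed for language: {language!r}")
    return MODEL_BY_LANGUAGE[key]


print(pick_model("Dutch"))
print(pick_model("Catalan"))
```

The returned string would then be used as `PunctuationModel(model=pick_model("Dutch"))`. Raising an error for unlisted languages, rather than falling back to a default model, avoids silently applying a model to a language it was not trained on.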
📄 License
This project is licensed under the MIT license.
How to cite us
@article{guhr-EtAl:2021:fullstop,
title={FullStop: Multilingual Deep Models for Punctuation Prediction},
author = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
booktitle = {Proceedings of the Swiss Text Analytics Conference 2021},
month = {June},
year = {2021},
address = {Winterthur, Switzerland},
publisher = {CEUR Workshop Proceedings},
url = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}
@misc{https://doi.org/10.48550/arxiv.2301.03319,
doi = {10.48550/ARXIV.2301.03319},
url = {https://arxiv.org/abs/2301.03319},
author = {Vandeghinste, Vincent and Guhr, Oliver},
keywords = {Computation and Language (cs.CL), Artificial Intelligence (cs.AI), FOS: Computer and information sciences, FOS: Computer and information sciences, I.2.7},
title = {FullStop: Punctuation and Segmentation Prediction for Dutch with Transformers},
publisher = {arXiv},
year = {2023},
copyright = {Creative Commons Attribution Share Alike 4.0 International}
}