🚀 HerBERT
HerBERT is a BERT-based Language Model trained on Polish corpora using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words. It offers an effective solution for Polish language processing tasks, leveraging advanced training techniques. For more details, please refer to: HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish.
Model training and experiments were conducted with transformers in version 2.9.
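To make the Sentence Structural Objective more concrete, here is a minimal sketch of how training pairs for such an objective could be built. It assumes the three-way formulation (second segment follows the first, precedes it, or comes from another document) with equal sampling of the three cases; the function name, label ids, and sample sentences are hypothetical and are not taken from the HerBERT training code.

```python
import random

# Label ids are an assumption made for this sketch.
NEXT, PREVIOUS, RANDOM = 0, 1, 2


def make_sso_example(doc, other_doc, index, rng):
    """Build one (first, second, label) triple from sentence `index` of `doc`.

    `doc` and `other_doc` are lists of sentences; `index` must leave room
    for a following sentence, i.e. 0 <= index < len(doc) - 1.
    """
    first, following = doc[index], doc[index + 1]
    label = rng.choice([NEXT, PREVIOUS, RANDOM])
    if label == NEXT:
        # Keep the original order: the second segment follows the first.
        return first, following, NEXT
    if label == PREVIOUS:
        # Swap the order: the second segment precedes the first.
        return following, first, PREVIOUS
    # Pair the sentence with one drawn from a different document.
    return first, rng.choice(other_doc), RANDOM


rng = random.Random(0)
doc = ["Ala ma kota.", "Kot ma Alę.", "Oboje są zadowoleni."]
other_doc = ["Warszawa jest stolicą Polski.", "Wisła płynie przez Kraków."]
examples = [make_sso_example(doc, other_doc, 0, rng) for _ in range(6)]
```

A classifier head would then predict the label from the encoded pair; that part is omitted here.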
🚀 Quick Start
HerBERT is a BERT-based language model for Polish. The allegro/herbert-base-cased checkpoint can be loaded through the transformers library with AutoTokenizer and AutoModel, as shown in the usage example below.
✨ Features
- Advanced Training Techniques: Trained using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words.
- Multiple Corpora Utilization: Trained on six different Polish language corpora, ensuring broad language coverage.
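The idea behind dynamic masking of whole words can be sketched in a few lines of plain Python. This is only an illustration, not the HerBERT training code: the `<mask>` symbol, the `</w>` end-of-word suffix, the 15% masking rate, and the helper names are all assumptions, and the usual step of replacing some selected tokens with random or unchanged tokens is left out.

```python
import random

MASK = "<mask>"       # placeholder mask symbol, an assumption for this sketch
END_OF_WORD = "</w>"  # end-of-word suffix, also an assumption


def group_into_words(tokens):
    """Return lists of token indices, one list per whole word."""
    words, current = [], []
    for i, token in enumerate(tokens):
        current.append(i)
        if token.endswith(END_OF_WORD):
            words.append(current)
            current = []
    if current:  # trailing subwords without an end-of-word marker
        words.append(current)
    return words


def mask_whole_words(tokens, rng, mask_prob=0.15):
    """Mask every subword of each randomly selected word."""
    words = group_into_words(tokens)
    n_to_mask = max(1, round(len(words) * mask_prob))
    chosen = rng.sample(words, n_to_mask)
    masked = list(tokens)
    for word in chosen:
        for i in word:
            masked[i] = MASK
    return masked


tokens = ["Her", "BERT</w>", "to</w>", "model</w>", "ję", "zy", "kowy</w>"]
rng = random.Random(0)
# Dynamic masking: a new mask is drawn every time the sequence is seen.
epoch_1 = mask_whole_words(tokens, rng)
epoch_2 = mask_whole_words(tokens, rng)
```

Because the mask is sampled on every call rather than fixed once during preprocessing, the same sentence can be masked differently across epochs, and a word split into several subwords is always hidden as a unit.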
📦 Installation
The model can be used with the transformers library. You can install it via the following command:
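pip install transformers
Training used version 2.9, but the usage example below relies on HerbertTokenizerFast and the padding='longest' argument, both of which were added in later releases, so a more recent version is needed to run it.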
📚 Documentation
Corpus
HerBERT was trained on six different corpora available for the Polish language.
Tokenizer
The training dataset was tokenized into subwords using a character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.
We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.
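For readers unfamiliar with character-level byte-pair encoding, the following toy sketch shows how merge rules can be learned from a word list. It is a simplified stand-in written in plain Python, not the tokenizers library implementation; the `</w>` end-of-word marker, the helper names, and the tiny corpus are assumptions made for illustration.

```python
from collections import Counter

END_OF_WORD = "</w>"  # end-of-word marker, an assumption for this sketch


def word_to_symbols(word):
    """Split a word into characters, marking the last one as word-final."""
    return tuple(word[:-1]) + (word[-1] + END_OF_WORD,)


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged_vocab[tuple(out)] = merged_vocab.get(tuple(out), 0) + freq
    return merged_vocab


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words."""
    vocab = Counter(word_to_symbols(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


corpus = "kot kota kotem kot pies psa kot".split()
merges, vocab = learn_bpe(corpus, num_merges=3)
```

On this corpus the frequent word "kot" collapses into a single symbol after two merges, while rarer forms such as "kotem" stay split into several subwords. A real vocabulary of 50k tokens is built the same way, only with far more text and merges.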
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModel

# Load the pretrained HerBERT tokenizer and model.
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode one sentence pair and run it through the model.
output = model(
    **tokenizer.batch_encode_plus(
        [
            (
                "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
                "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
            )
        ],
        padding='longest',
        add_special_tokens=True,
        return_tensors='pt'
    )
)
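The padding='longest' argument in the example above pads every encoded sequence to the length of the longest one in the batch. With a single sentence pair it adds no padding, but it matters as soon as several inputs are batched together. A rough pure-Python picture of the effect, assuming right padding and a hypothetical pad id of 0 (the token ids are made up and this is not the library's implementation):

```python
def pad_to_longest(batch, pad_id=0):
    """Right-pad lists of token ids to the longest sequence in the batch."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[5, 8, 13, 2], [5, 21, 2]]  # made-up token ids
padded = pad_to_longest(batch)
```

The attention mask is what lets the model tell padding apart from real tokens, which is why the tokenizer returns it alongside the input ids.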
📄 License
CC BY 4.0
🔗 Citation
If you use this model, please cite the following paper:
@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert and
      Rybak, Piotr and
      Wr{\'o}blewska, Alina and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}
👥 Authors
The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.
You can contact us at: klejbenchmark@allegro.pl