🚀 Polbert - Polish BERT
The Polish version of the BERT language model is available! It comes in cased and uncased variants, both accessible via the HuggingFace transformers library.
🚀 Quick Start
The Polish BERT, Polbert, is now available in two variants: cased and uncased. You can download and use them via the HuggingFace transformers library. It's recommended to use the cased model. More details about the differences and benchmark results are provided below.
✨ Features
Cased and uncased variants
- Uncased Model:
- This variant was trained first; a few issues were identified after publication.
- Lowercasing through the BERT tokenizer incorrectly tokenizes some Polish characters and accents. This has little impact on sequence classification but may significantly affect token classification tasks (a quick check is sketched after this list).
- Many duplicates in the Open Subtitles dataset, which dominates the training corpus.
- Whole Word Masking was not used.
- Cased Model:
- Improves on the uncased model by correctly tokenizing all Polish characters and accents.
- Duplicates removed from the Open Subtitles dataset, resulting in a smaller but more balanced corpus.
- Trained with Whole Word Masking.
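To see the tokenization difference mentioned above for yourself, a minimal check along these lines should work (the word `żółć` is just an arbitrary example containing Polish diacritics):

```python
from transformers import BertTokenizer

# With the uncased model, lowercasing in the BERT tokenizer can also strip
# accents, altering Polish diacritics; the cased tokenizer keeps them intact.
uncased_tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
cased_tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")

word = "żółć"  # arbitrary example with Polish diacritics
print(uncased_tokenizer.tokenize(word))  # diacritics may be lost
print(cased_tokenizer.tokenize(word))    # diacritics preserved
```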
📦 Installation
Polbert is released via the HuggingFace Transformers library. To install the library, follow the official HuggingFace installation guide (typically `pip install transformers`).
💻 Usage Examples
Basic Usage
Uncased
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
    print(pred)

# Output:
# {'sequence': '[CLS] adam mickiewicz wielkim polskim poeta był. [SEP]', 'score': 0.47196975350379944, 'token': 26596}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim bohaterem был. [SEP]', 'score': 0.09127858281135559, 'token': 10953}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.0647173821926117, 'token': 5182}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.05232388526201248, 'token': 24293}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim politykiem był. [SEP]', 'score': 0.04554257541894913, 'token': 44095}
```
Cased
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-cased-v1")
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} был."):
    print(pred)

# Output:
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.5391148328781128, 'token': 37120}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim człowiekiem был. [SEP]', 'score': 0.11683262139558792, 'token': 6810}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim bohaterem был. [SEP]', 'score': 0.06021466106176376, 'token': 17709}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim mistrzem был. [SEP]', 'score': 0.051870670169591904, 'token': 14652}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim artystą был. [SEP]', 'score': 0.031787533313035965, 'token': 35680}
```
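Beyond the fill-mask pipeline, the checkpoints can be loaded with other task heads from transformers for fine-tuning. Here is a minimal sketch for sequence classification; the `num_labels=2` head is an arbitrary illustration and is randomly initialized, not part of Polbert:

```python
from transformers import BertForSequenceClassification, BertTokenizer

# Load Polbert with a fresh (randomly initialized) classification head.
# num_labels=2 is an arbitrary placeholder; set it to match your task.
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
model = BertForSequenceClassification.from_pretrained(
    "dkleczek/bert-base-polish-cased-v1", num_labels=2
)

inputs = tokenizer("Polbert to polski model językowy.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```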
📚 Documentation
Pre-training corpora
Uncased
Corpus | Lines | Words | Characters |
---|---|---|---|
Polish subset of Open Subtitles | 236635408 | 1431199601 | 7628097730 |
Polish subset of ParaCrawl | 8470950 | 176670885 | 1163505275 |
Polish Parliamentary Corpus | 9799859 | 121154785 | 938896963 |
Polish Wikipedia - Feb 2020 | 8014206 | 132067986 | 1015849191 |
Total | 262920423 | 1861093257 | 10746349159 |
Cased
Corpus | Lines | Words | Characters |
---|---|---|---|
Polish subset of Open Subtitles (Deduplicated) | 41998942 | 213590656 | 1424873235 |
Polish subset of ParaCrawl | 8470950 | 176670885 | 1163505275 |
Polish Parliamentary Corpus | 9799859 | 121154785 | 938896963 |
Polish Wikipedia - Feb 2020 | 8014206 | 132067986 | 1015849191 |
Total | 68283960 | 646479197 | 4543124667 |
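The deduplicated Open Subtitles figures above reflect the duplicate removal mentioned earlier. The exact preprocessing pipeline is not published here; one plausible approach is a simple exact line-level pass, sketched below (the approach and file names are assumptions for illustration only):

```python
# Minimal sketch of exact line-level deduplication (assumed approach,
# not necessarily the pipeline used for Polbert; file names are hypothetical).
def dedup_lines(in_path: str, out_path: str) -> None:
    seen = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            key = line.strip()
            if key and key not in seen:
                seen.add(key)
                dst.write(line)

dedup_lines("open_subtitles_pl.txt", "open_subtitles_pl.dedup.txt")
```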
Pre-training details
Uncased
- Trained with code from Google BERT's github repository (https://github.com/google-research/bert).
- Follows the bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters).
- Training set-up: 1 million training steps in total.
- 100,000 steps - 128 sequence length, batch size 512, learning rate 1e-4 (10,000 steps warmup).
- 800,000 steps - 128 sequence length, batch size 512, learning rate 5e-5.
- 100,000 steps - 512 sequence length, batch size 256, learning rate 2e-5.
- Trained on a single Google Cloud TPU v3-8.
Cased
- Similar approach to the uncased model, with the addition of Whole Word Masking.
- Training set-up:
- 100,000 steps - 128 sequence length, batch size 2048, learning rate 1e-4 (10,000 steps warmup).
- 100,000 steps - 128 sequence length, batch size 2048, learning rate 5e-5.
- 100,000 steps - 512 sequence length, batch size 256, learning rate 2e-5.
Evaluation
The KLEJ benchmark is used for evaluation. The following results are achieved by running standard evaluation scripts using both cased and uncased variants of Polbert.
Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR |
---|---|---|---|---|---|---|---|---|---|---|
Polbert cased | 81.7 | 93.6 | 93.4 | 93.8 | 52.7 | 87.4 | 71.1 | 59.1 | 98.6 | 85.2 |
Polbert uncased | 81.4 | 90.1 | 93.9 | 93.5 | 55.0 | 88.1 | 68.8 | 59.4 | 98.8 | 85.4 |
⚠️ Important Note
Note that the uncased model outperforms the cased model on some tasks. This may be due to the oversampling of the Open Subtitles dataset and its similarity to the data in some of these tasks. Since all of these benchmark tasks are sequence classification, the relative strength of the cased model is not very apparent here (see the token classification sketch below).
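For token-level tasks, where the cased model's correct handling of Polish characters matters most, the checkpoint can be loaded with a token classification head. A minimal sketch (the label count is an arbitrary placeholder and the head is randomly initialized):

```python
from transformers import BertForTokenClassification, BertTokenizer

# num_labels=9 is an arbitrary placeholder (e.g. a BIO tag set); adjust to your task.
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
model = BertForTokenClassification.from_pretrained(
    "dkleczek/bert-base-polish-cased-v1", num_labels=9
)

inputs = tokenizer("Adam Mickiewicz urodził się w Zaosiu.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```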
💡 Usage Tip
The data used to train the model is biased and may reflect stereotypes related to gender, ethnicity, etc. Please take care to consider and mitigate these biases when using the model for downstream tasks.
🔧 Technical Details
The model is based on the BERT architecture and is trained on various Polish corpora. Different training set-ups and techniques (such as Whole Word Masking) are used for the cased and uncased variants.
📄 License
No license information is provided in the original README.
Acknowledgements
- Thanks to Google TensorFlow Research Cloud (TFRC) for providing free TPU credits.
- Thanks to Timo Möller from deepset for sharing tips and scripts based on their experience training the German BERT model.
- Big thanks to Allegro for releasing the KLEJ Benchmark and specifically to Piotr Rybak for help with the evaluation and pointing out some tokenization issues.
- Thanks to Rachel Thomas, Jeremy Howard and Sylvain Gugger from fastai for their NLP and Deep Learning courses.
Author
Darek Kłeczek - contact me on Twitter @dk21
References
- https://github.com/google-research/bert
- https://github.com/narusemotoki/srx_segmenter
- SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
- KLEJ benchmark

