🚀 Polbert - Polish BERT
The Polish version of the BERT language model is available! It comes in cased and uncased variants, both accessible via the HuggingFace transformers library.
🚀 Quick Start
The Polish BERT, Polbert, is now available in two variants: cased and uncased. You can download and use them via the HuggingFace transformers library. It's recommended to use the cased model. More details about the differences and benchmark results are provided below.
✨ Features
Cased and uncased variants
- Uncased Model:
- This variant was trained first; a few issues were identified after publication.
- Lowercasing through the BERT tokenizer incorrectly tokenizes some Polish characters and accents. This has little impact on sequence classification but may significantly affect token classification tasks (a quick check is sketched after this list).
- Many duplicates in the Open Subtitles dataset, which dominates the training corpus.
- Whole Word Masking was not used.
- Cased Model:
- Improves on the uncased model by correctly tokenizing all Polish characters and accents.
- Duplicates removed from the Open Subtitles dataset, resulting in a smaller but more balanced corpus.
- Trained with Whole Word Masking.
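To see the tokenization difference mentioned above for yourself, a minimal check along these lines should work (the word `żółć` is just an arbitrary example containing Polish diacritics):

```python
from transformers import BertTokenizer

# With the uncased model, lowercasing in the BERT tokenizer can also strip
# accents, altering Polish diacritics; the cased tokenizer keeps them intact.
uncased_tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
cased_tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")

word = "żółć"  # arbitrary example with Polish diacritics
print(uncased_tokenizer.tokenize(word))  # diacritics may be lost
print(cased_tokenizer.tokenize(word))    # diacritics preserved
```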
📦 Installation
Polbert is released via the HuggingFace Transformers library. To install the library, follow the official HuggingFace installation guide (typically `pip install transformers`).
💻 Usage Examples
Basic Usage
Uncased
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-uncased-v1")
nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} był."):
    print(pred)

# Output:
# {'sequence': '[CLS] adam mickiewicz wielkim polskim poeta był. [SEP]', 'score': 0.47196975350379944, 'token': 26596}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim bohaterem был. [SEP]', 'score': 0.09127858281135559, 'token': 10953}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim człowiekiem był. [SEP]', 'score': 0.0647173821926117, 'token': 5182}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.05232388526201248, 'token': 24293}
# {'sequence': '[CLS] adam mickiewicz wielkim polskim politykiem był. [SEP]', 'score': 0.04554257541894913, 'token': 44095}
```
Cased
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("dkleczek/bert-base-polish-cased-v1")
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in nlp(f"Adam Mickiewicz wielkim polskim {nlp.tokenizer.mask_token} был."):
    print(pred)

# Output:
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim pisarzem był. [SEP]', 'score': 0.5391148328781128, 'token': 37120}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim człowiekiem был. [SEP]', 'score': 0.11683262139558792, 'token': 6810}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim bohaterem был. [SEP]', 'score': 0.06021466106176376, 'token': 17709}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim mistrzem был. [SEP]', 'score': 0.051870670169591904, 'token': 14652}
# {'sequence': '[CLS] Adam Mickiewicz wielkim polskim artystą был. [SEP]', 'score': 0.031787533313035965, 'token': 35680}
```
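Beyond the fill-mask pipeline, the checkpoints can be loaded with other task heads from transformers for fine-tuning. Here is a minimal sketch for sequence classification; the `num_labels=2` head is an arbitrary illustration and is randomly initialized, not part of Polbert:

```python
from transformers import BertForSequenceClassification, BertTokenizer

# Load Polbert with a fresh (randomly initialized) classification head.
# num_labels=2 is an arbitrary placeholder; set it to match your task.
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
model = BertForSequenceClassification.from_pretrained(
    "dkleczek/bert-base-polish-cased-v1", num_labels=2
)

inputs = tokenizer("Polbert to polski model językowy.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```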
📚 Documentation
Pre-training corpora
Uncased
Corpus | Lines | Words | Characters |
---|---|---|---|
Polish subset of Open Subtitles | 236635408 | 1431199601 | 7628097730 |
Polish subset of ParaCrawl | 8470950 | 176670885 | 1163505275 |
Polish Parliamentary Corpus | 9799859 | 121154785 | 938896963 |
Polish Wikipedia - Feb 2020 | 8014206 | 132067986 | 1015849191 |
Total | 262920423 | 1861093257 | 10746349159 |
Cased
Corpus | Lines | Words | Characters |
---|---|---|---|
Polish subset of Open Subtitles (Deduplicated) | 41998942 | 213590656 | 1424873235 |
Polish subset of ParaCrawl | 8470950 | 176670885 | 1163505275 |
Polish Parliamentary Corpus | 9799859 | 121154785 | 938896963 |
Polish Wikipedia - Feb 2020 | 8014206 | 132067986 | 1015849191 |
Total | 68283960 | 646479197 | 4543124667 |
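The deduplicated Open Subtitles figures above reflect the duplicate removal mentioned earlier. The exact preprocessing pipeline is not published here; one plausible approach is a simple exact line-level pass, sketched below (the approach and file names are assumptions for illustration only):

```python
# Minimal sketch of exact line-level deduplication (assumed approach,
# not necessarily the pipeline used for Polbert; file names are hypothetical).
def dedup_lines(in_path: str, out_path: str) -> None:
    seen = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            key = line.strip()
            if key and key not in seen:
                seen.add(key)
                dst.write(line)

dedup_lines("open_subtitles_pl.txt", "open_subtitles_pl.dedup.txt")
```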
Pre-training details
Uncased
- Trained with code from Google BERT's github repository (https://github.com/google-research/bert).
- Follows the bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters).
- Training set-up: 1 million training steps in total.
- 100,000 steps - 128 sequence length, batch size 512, learning rate 1e-4 (10,000 steps warmup).
- 800,000 steps - 128 sequence length, batch size 512, learning rate 5e-5.
- 100,000 steps - 512 sequence length, batch size 256, learning rate 2e-5.
- Trained on a single Google Cloud TPU v3-8.
Cased
- Similar approach to the uncased model, with the addition of Whole Word Masking.
- Training set-up:
- 100,000 steps - 128 sequence length, batch size 2048, learning rate 1e-4 (10,000 steps warmup).
- 100,000 steps - 128 sequence length, batch size 2048, learning rate 5e-5.
- 100,000 steps - 512 sequence length, batch size 256, learning rate 2e-5.
Evaluation
The KLEJ benchmark is used for evaluation. The following results are achieved by running standard evaluation scripts using both cased and uncased variants of Polbert.
Model | Average | NKJP-NER | CDSC-E | CDSC-R | CBD | PolEmo2.0-IN | PolEmo2.0-OUT | DYK | PSC | AR |
---|---|---|---|---|---|---|---|---|---|---|
Polbert cased | 81.7 | 93.6 | 93.4 | 93.8 | 52.7 | 87.4 | 71.1 | 59.1 | 98.6 | 85.2 |
Polbert uncased | 81.4 | 90.1 | 93.9 | 93.5 | 55.0 | 88.1 | 68.8 | 59.4 | 98.8 | 85.4 |
⚠️ Important Note
Note that the uncased model outperforms the cased model on some tasks. This may be due to the oversampling of the Open Subtitles dataset and its similarity to the data in some of these tasks. Since all of these benchmark tasks are sequence classification, the relative strength of the cased model is not very apparent here (see the token classification sketch below).
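For token-level tasks, where the cased model's correct handling of Polish characters matters most, the checkpoint can be loaded with a token classification head. A minimal sketch (the label count is an arbitrary placeholder and the head is randomly initialized):

```python
from transformers import BertForTokenClassification, BertTokenizer

# num_labels=9 is an arbitrary placeholder (e.g. a BIO tag set); adjust to your task.
tokenizer = BertTokenizer.from_pretrained("dkleczek/bert-base-polish-cased-v1")
model = BertForTokenClassification.from_pretrained(
    "dkleczek/bert-base-polish-cased-v1", num_labels=9
)

inputs = tokenizer("Adam Mickiewicz urodził się w Zaosiu.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, sequence_length, num_labels)
```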
💡 Usage Tip
The data used to train the model is biased and may reflect stereotypes related to gender, ethnicity, etc. Please take care to consider and mitigate these biases when using the model for downstream tasks.
🔧 Technical Details
The model is based on the BERT architecture and is trained on various Polish corpora. Different training set-ups and techniques (such as Whole Word Masking) are used for the cased and uncased variants.
📄 License
No license information is provided in the original README.
Acknowledgements
- Thanks to Google TensorFlow Research Cloud (TFRC) for providing free TPU credits.
- Thanks to Timo Möller from deepset for sharing tips and scripts based on their experience training the German BERT model.
- Big thanks to Allegro for releasing the KLEJ Benchmark and specifically to Piotr Rybak for help with the evaluation and pointing out some tokenization issues.
- Thanks to Rachel Thomas, Jeremy Howard and Sylvain Gugger from fastai for their NLP and Deep Learning courses.
Author
Darek Kłeczek - contact me on Twitter @dk21
References
- https://github.com/google-research/bert
- https://github.com/narusemotoki/srx_segmenter
- SRX rules file for sentence splitting in Polish, written by Marcin Miłkowski: https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx
- KLEJ benchmark

