herbert-base-cased: An Open-Source Pre-trained Polish Language Model

Herbert Base Cased

Developed by Allegro

HerBERT is a pre-trained Polish language model based on the BERT architecture, trained with dynamic whole word masking and the Sentence Structural Objective (SSO).

Category: Large Language Model
Tags: Polish pretraining, dynamic whole word masking, character-level BPE

Downloads 84.18k

Release Time : 3/2/2022

Model Overview

HerBERT is an efficiently pre-trained Transformer encoder for Polish. After fine-tuning it is primarily used for natural language understanding tasks such as text classification and question answering. As an encoder-only BERT model, it is not designed for free-form text generation.

Model Features

Polish optimization

Designed and trained specifically for Polish.

Dynamic whole word masking

Pre-training masks all subword pieces of a selected word together, and the mask is re-sampled each time a sequence is seen rather than fixed once during preprocessing.
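The idea can be illustrated with a minimal sketch. It assumes subword tokens that mark the end of a word with a `</w>` suffix (the convention used by character-level BPE), a masking probability of 0.15, and a `<mask>` placeholder; all three are illustrative assumptions, not values confirmed by the source. The helper names `group_subwords` and `mask_whole_words` are hypothetical, and the sketch omits BERT-style random or unchanged replacements.

```python
import random


def group_subwords(tokens, end_marker="</w>"):
    """Group token indices into words; a word ends at a token carrying end_marker."""
    words, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if tok.endswith(end_marker):
            words.append(current)
            current = []
    if current:  # trailing pieces with no end marker form a final word
        words.append(current)
    return words


def mask_whole_words(tokens, mask_prob=0.15, mask_token="<mask>", rng=None):
    """Mask every subword of a word together (mask_prob is an assumed value)."""
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    for word in group_subwords(tokens):
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


tokens = ["ślep", "y</w>", "dziad</w>", "prowadz", "ony</w>"]

# "Dynamic" masking: a fresh mask is drawn every time the sample is served,
# so the same sentence is masked differently across epochs.
for epoch in range(3):
    print(mask_whole_words(tokens, mask_prob=0.5, rng=random.Random(epoch)))
```

Because the decision is made per word rather than per token, a word such as "ślep" + "y" is either fully visible or fully hidden, so the model cannot recover it from its own remaining pieces.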

Sentence structure objective

In addition to the Masked Language Modelling (MLM) task, the model is trained with the Sentence Structural Objective (SSO), a sentence-level objective intended to improve its understanding of how sentences relate to each other.
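The source does not spell out how SSO training pairs are built. The sketch below assumes the formulation introduced in StructBERT, where the second segment is the next sentence, the previous sentence, or a sentence from another document, each with equal probability, and the model predicts which of the three it is. The helper name `make_sso_example` and the label ids are hypothetical.

```python
import random

# Assumed three-way labelling; the ids are arbitrary.
LABELS = {"next": 0, "previous": 1, "random": 2}


def make_sso_example(doc, idx, other_docs, rng):
    """Build one (first, second, label) pair around the sentence doc[idx]."""
    if not 1 <= idx <= len(doc) - 2:
        raise ValueError("idx needs both a previous and a next sentence")
    choice = rng.choice(["next", "previous", "random"])
    if choice == "next":
        second = doc[idx + 1]
    elif choice == "previous":
        second = doc[idx - 1]
    else:
        # Sample a sentence from a different document
        second = rng.choice(rng.choice(other_docs))
    return doc[idx], second, LABELS[choice]


doc = ["Pierwsze zdanie.", "Drugie zdanie.", "Trzecie zdanie."]
other_docs = [["Zdanie z innego dokumentu."]]
print(make_sso_example(doc, 1, other_docs, random.Random(0)))
```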

Large-scale training data

Trained on over 8.5 billion Polish tokens, covering various text types.

Model Capabilities

Polish text understanding

Polish text representation (contextual embeddings)

Polish text classification

Polish question answering systems

Use Cases

Natural language processing

Polish text classification

Can be used for tasks such as sentiment analysis and topic classification in Polish.

Polish question answering system

Build question answering systems for Polish content.

Polish text representation

Extract contextual embeddings of Polish text for use in downstream models.

🚀 HerBERT

HerBERT is a BERT-based language model trained on Polish corpora using Masked Language Modelling (MLM) and the Sentence Structural Objective (SSO) with dynamic masking of whole words. For more details, please refer to: HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish.

Model training and experiments were conducted with version 2.9 of the transformers library.

🚀 Quick Start

HerBERT can be loaded through the transformers library with AutoTokenizer and AutoModel, using the model identifier allegro/herbert-base-cased.

✨ Features

Advanced Training Techniques: Trained using Masked Language Modelling (MLM) and Sentence Structural Objective (SSO) with dynamic masking of whole words.
Multiple Corpora Utilization: Trained on six different Polish language corpora, ensuring broad language coverage.

📦 Installation

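The model can be used with the transformers library. The usage example in this document returns PyTorch tensors, so PyTorch is required as well. Although training was done with version 2.9, that release does not include HerbertTokenizerFast or the padding='longest' argument, so install a recent version instead:

pip install transformers torch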

📚 Documentation

Corpus

HerBERT was trained on six different corpora available for the Polish language:

Corpus	Tokens	Documents
CCNet Middle	3243M	7.9M
CCNet Head	2641M	7.0M
National Corpus of Polish	1357M	3.9M
Open Subtitles	1056M	1.1M
Wikipedia	260M	1.4M
Wolne Lektury	41M	5.5k
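As a quick check on the figures above, the following sketch sums the per-corpus counts and computes each corpus's share of the training tokens. The numbers are copied from the table; the variable names are illustrative.

```python
# (tokens in millions, documents) per corpus, copied from the table above
corpora = {
    "CCNet Middle": (3243, 7_900_000),
    "CCNet Head": (2641, 7_000_000),
    "National Corpus of Polish": (1357, 3_900_000),
    "Open Subtitles": (1056, 1_100_000),
    "Wikipedia": (260, 1_400_000),
    "Wolne Lektury": (41, 5_500),
}

total_tokens_m = sum(tokens for tokens, _ in corpora.values())
total_docs = sum(docs for _, docs in corpora.values())
shares = {name: tokens / total_tokens_m for name, (tokens, _) in corpora.items()}

print(f"Total tokens: {total_tokens_m}M")  # 8598M, i.e. about 8.6 billion
print(f"Total documents: {total_docs:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%} of tokens")
```

The total of 8598M tokens is consistent with the "over 8.5 billion tokens" figure quoted for the model, and the two CCNet subsets together account for roughly two thirds of it.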

Tokenizer

The training dataset was tokenized into subwords using character-level byte-pair encoding (CharBPETokenizer) with a vocabulary size of 50k tokens. The tokenizer itself was trained with the tokenizers library.
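To illustrate what character-level byte-pair encoding does, here is a toy merge learner in plain Python. It is a teaching sketch under simplifying assumptions (whitespace-split words, a `</w>` end-of-word marker, a handful of merges), not the implementation in the tokenizers library and not the procedure used to build HerBERT's 50k vocabulary. The function name `learn_bpe` is hypothetical.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    # Start from characters; the last character of each word carries "</w>".
    vocab = Counter()
    for w in words:
        if w:
            vocab[tuple(w[:-1]) + (w[-1] + "</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically for determinism
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


print(learn_bpe(["droga", "drogi", "drodze", "dom"], num_merges=3))
```

Frequent character sequences such as a shared stem are merged first, which is why inflected forms of the same Polish word tend to share leading subword pieces.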

We kindly encourage you to use the Fast version of the tokenizer, namely HerbertTokenizerFast.

💻 Usage Examples

Basic Usage

from transformers import AutoTokenizer, AutoModel

# Load the pre-trained tokenizer and encoder
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-base-cased")
model = AutoModel.from_pretrained("allegro/herbert-base-cased")

# Encode a batch of sentence pairs; each tuple is one (first, second) pair
encoded = tokenizer.batch_encode_plus(
    [
        (
            "A potem szedł środkiem drogi w kurzawie, bo zamiatał nogami, ślepy dziad prowadzony przez tłustego kundla na sznurku.",
            "A potem leciał od lasu chłopak z butelką, ale ten ujrzawszy księdza przy drodze okrążył go z dala i biegł na przełaj pól do karczmy."
        )
    ],
    padding='longest',        # pad to the longest sequence in the batch
    add_special_tokens=True,
    return_tensors='pt'       # return PyTorch tensors
)

# Run the encoder on the encoded batch
output = model(**encoded)
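Passing a tuple to the tokenizer encodes the two sentences as a single sequence pair. The sketch below shows the resulting layout, assuming the XLM-style format that the transformers documentation describes for the HerBERT tokenizer (`<s> A </s> B </s>`, with segment id 0 for the first part and 1 for the second). The helper `build_pair_layout` is hypothetical and works on already-split tokens; it does not call the real tokenizer.

```python
def build_pair_layout(tokens_a, tokens_b, bos="<s>", sep="</s>"):
    """Return (tokens, token_type_ids) for a sentence pair (assumed layout)."""
    tokens = [bos] + tokens_a + [sep] + tokens_b + [sep]
    # Segment 0 covers <s>, sentence A and its separator; segment 1 covers the rest
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, type_ids = build_pair_layout(["ślepy", "dziad"], ["chłopak", "z", "butelką"])
for tok, seg in zip(tokens, type_ids):
    print(seg, tok)
```

To confirm the actual layout for a given installation, decode the `input_ids` returned by the real tokenizer and compare.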

📄 License

CC BY 4.0

🔗 Citation

If you use this model, please cite the following paper:

@inproceedings{mroczkowski-etal-2021-herbert,
    title = "{H}er{BERT}: Efficiently Pretrained Transformer-based Language Model for {P}olish",
    author = "Mroczkowski, Robert  and
      Rybak, Piotr  and
      Wr{\'o}blewska, Alina  and
      Gawlik, Ireneusz",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.bsnlp-1.1",
    pages = "1--10",
}

👥 Authors

The model was trained by the Machine Learning Research Team at Allegro and the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences.

You can contact us at: klejbenchmark@allegro.pl
