
HerBERT Large Cased

Developed by allegro
HerBERT is a Polish pre-trained language model based on the BERT architecture, trained using dynamic whole word masking and sentence structure objectives.
Downloads: 1,272
Release Time: 3/2/2022

Model Overview

HerBERT is a Polish pre-trained language model based on the BERT architecture; this Large, cased variant is suited to a wide range of Polish natural language processing tasks.
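For orientation, here is a minimal usage sketch with the Hugging Face transformers library. The repository id allegro/herbert-large-cased is assumed from the model name and publisher above, and the 1024-dimensional hidden size is assumed from the standard BERT-large configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed repo id, inferred from the model name and publisher above
MODEL_ID = "allegro/herbert-large-cased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a Polish sentence and extract contextual embeddings
inputs = tokenizer("Kraków jest pięknym miastem.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, seq_len, 1024])
```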

Model Features

Dynamic whole word masking
Pre-trained with a masked language modeling objective that masks whole words rather than individual subword tokens, with masking patterns re-sampled dynamically during training.
Sentence structure objective
Incorporates a sentence structural objective (SSO) during training to improve the model's grasp of sentence-level structure.
Large-scale training corpus
Trained on six Polish corpora, covering a wide range of text types and domains.
Efficient tokenizer
Uses character-level byte-pair encoding (CharBPETokenizer) with a 50k-subword vocabulary, improving processing efficiency; see the tokenization sketch after this list.
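To see the CharBPE segmentation in practice, the following sketch tokenizes a Polish phrase. The repo id is assumed as above, and the printed subword splits are illustrative, not guaranteed.

```python
from transformers import AutoTokenizer

# Assumed repo id; the tokenizer bundled with the checkpoint is the
# character-level BPE model described above (50k subword vocabulary)
tokenizer = AutoTokenizer.from_pretrained("allegro/herbert-large-cased")

tokens = tokenizer.tokenize("Przetwarzanie języka naturalnego")
print(tokens)                 # illustrative output; actual splits may differ
print(tokenizer.vocab_size)   # expected to be on the order of 50k
```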

Model Capabilities

Polish text understanding
Polish text generation
Masked language modeling (see the sketch below)
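The masked language modeling capability can be exercised directly through the fill-mask pipeline. This is a minimal sketch, assuming the released checkpoint ships with the language-modeling head the pipeline requires.

```python
from transformers import pipeline

# Assumes the checkpoint includes the masked-LM head needed by fill-mask
fill_mask = pipeline("fill-mask", model="allegro/herbert-large-cased")

# Use the tokenizer's own mask token rather than hard-coding it
masked = f"Stolicą Polski jest {fill_mask.tokenizer.mask_token}."
for pred in fill_mask(masked):
    print(pred["token_str"], round(pred["score"], 3))
```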

Use Cases

Natural language processing
Text classification
Used for Polish text classification tasks such as sentiment analysis and topic classification; see the fine-tuning sketch after this list.
Named entity recognition
Identifies named entities in Polish text, such as person names, locations, and organization names.
Machine translation
Serves as a component in Polish machine translation systems to improve translation quality.
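As one concrete route into the classification use case above, a sequence classification head can be attached and fine-tuned. The label count below is a hypothetical choice, and the newly initialized head must be trained before its predictions are meaningful.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "allegro/herbert-large-cased"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# num_labels=3 is a hypothetical choice (e.g. negative/neutral/positive);
# the classification head is randomly initialized and requires fine-tuning
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=3)

inputs = tokenizer("Ten film był znakomity!", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 3); untrained until fine-tuned
```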