hplt_bert_base_sk: Open-Source Slovak BERT Model for Masked Language Modeling

Hplt Bert Base Sk

Developed by HPLT

A monolingual Slovak BERT model released by the HPLT project, built on the LTG-BERT architecture and suited to masked language modeling.

Large Language Model

Transformers

Open Source License: Apache-2.0

Tags: #Slovak-specific #Masked Language Model #Monolingual BERT

Downloads 23

Release Time: 4/22/2024

Model Overview

This is a monolingual Slovak BERT model trained on the HPLT 1.2 data release. It uses the LTG-BERT architecture, a modified version of BERT, and is primarily designed for masked language modeling.

Model Features

Monolingual Optimization

Trained specifically for Slovak on the Slovak portion of the HPLT data

Improved Architecture

Adopts LTG-BERT, a modified version of the classic BERT architecture

Intermediate Checkpoints

Provides 10 intermediate checkpoints from training, released in separate branches, for analyzing how the model evolves

Model Capabilities

Masked Language Modeling

Text Understanding

Sequence Classification

Token Classification

Question Answering

Multiple Choice Tasks

Use Cases

Natural Language Processing

Text Completion

Predicting masked words

The usage example is expected to fill the mask with 'place' to complete the sentence

Text Classification

Classifying Slovak texts

🚀 HPLT Bert for Slovak

This is a monolingual encoder-only language model from the HPLT project's initial release. It is a masked language model, specifically LTG-BERT, a modified version of the classic BERT model. One monolingual LTG-BERT model was trained for each major language in the HPLT 1.2 data release, for a total of 75 models.

✨ Features

Same Hyper-parameters: All HPLT encoder-only models use the same hyper-parameters, roughly following the BERT-base setup. Hidden size is 768, attention heads are 12, layers are 12, and vocabulary size is 32768.
Custom Tokenizers: Each model uses its own tokenizer trained on language-specific HPLT data.
Multiple Implementations: The following classes are implemented: AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoModelForQuestionAnswering and AutoModelForMultipleChoice.
Intermediate Checkpoints: 10 intermediate checkpoints are released for each model, one every 3125 training steps, each in a separate branch.

📦 Installation

This model currently needs a custom wrapper from modeling_ltgbert.py, so load it with trust_remote_code=True.

💻 Usage Examples

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_sk")
# trust_remote_code=True lets transformers load the custom wrapper from modeling_ltgbert.py
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_sk", trust_remote_code=True)

mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
output_p = model(**input_text)
# Replace each [MASK] position with the highest-scoring token; keep every other token unchanged
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: '[CLS] It's a beautiful place.[SEP]'
print(tokenizer.decode(output_text[0].tolist()))
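The basic example keeps only the single highest-scoring token at the masked position. To look at several candidates instead, the scores at that position can be ranked. The following is a minimal sketch in plain Python: `top_k_indices` is a hypothetical helper (not part of transformers), and the toy scores are made-up values standing in for one row of `output_p.logits`.

```python
def top_k_indices(scores, k=5):
    """Return the indices of the k largest scores, best first."""
    if k <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy scores standing in for the logits at one [MASK] position (assumed values).
toy_scores = [0.1, 2.5, -1.0, 3.2, 0.7]
print(top_k_indices(toy_scores, k=3))  # [3, 1, 4]
```

With the real model, the scores for a mask position could be taken as `output_p.logits[0, pos].tolist()`, where `pos` is the index at which `input_text.input_ids[0]` equals `mask_id`. The returned indices are token ids and can be passed to `tokenizer.convert_ids_to_tokens`.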

Advanced Usage

You can load a specific model revision with transformers using the argument revision:

model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_sk", revision="step21875", trust_remote_code=True)

You can list all revisions of the model with the following code:

from huggingface_hub import list_repo_refs
out = list_repo_refs("HPLT/hplt_bert_base_sk")
print([b.name for b in out.branches])
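Checkpoints are released every 3125 training steps and the example revision is named step21875, so the ten branch names plausibly follow a `step<N>` pattern. The sketch below generates names under that assumption. The pattern is inferred from that single example, and `expected_checkpoint_branches` is a hypothetical helper, so the result should be checked against what `list_repo_refs` actually reports.

```python
def expected_checkpoint_branches(interval=3125, count=10, prefix="step"):
    """Build branch names such as 'step3125', assuming a 'step<N>' naming pattern."""
    return [f"{prefix}{interval * i}" for i in range(1, count + 1)]


branches = expected_checkpoint_branches()
print(branches[0], branches[-1])  # step3125 step31250
print("step21875" in branches)    # True
```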

📚 Documentation

See sizes of the training corpora, evaluation results and more in our language model training report.

The training code and the training statistics of all 75 runs are also available.

🔧 Technical Details

All HPLT encoder-only models use the same hyper-parameters, roughly following the BERT-base setup:

hidden size: 768
attention heads: 12
layers: 12
vocabulary size: 32768

Every model uses its own tokenizer trained on language-specific HPLT data.
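Two quantities follow directly from the numbers above: the per-head dimension and the size of the token embedding matrix. The short sketch below works through the arithmetic. It assumes the hidden size is split evenly across the attention heads, as is standard for BERT-style models, and that there is one hidden-size vector per vocabulary entry; the variable names are illustrative.

```python
hidden_size = 768
attention_heads = 12
vocabulary_size = 32768

# Assumes the hidden size is divided evenly among the attention heads.
head_dim = hidden_size // attention_heads

# One hidden_size-dimensional vector per vocabulary entry.
embedding_parameters = vocabulary_size * hidden_size

print(head_dim)              # 64
print(embedding_parameters)  # 25165824
```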

📄 License

This project is licensed under the Apache-2.0 license.

Cite us

@inproceedings{samuel-etal-2023-trained,
    title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      {\O}vrelid, Lilja  and
      Velldal, Erik",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.146",
    doi = "10.18653/v1/2023.findings-eacl.146",
    pages = "1954--1974"
}

@inproceedings{de-gibert-etal-2024-new-massive,
    title = "A New Massive Multilingual Dataset for High-Performance Language Technologies",
    author = {de Gibert, Ona  and
      Nail, Graeme  and
      Arefyev, Nikolay  and
      Ba{\~n}{\'o}n, Marta  and
      van der Linde, Jelmer  and
      Ji, Shaoxiong  and
      Zaragoza-Bernabeu, Jaume  and
      Aulamo, Mikko  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Kutuzov, Andrey  and
      Pyysalo, Sampo  and
      Oepen, Stephan  and
      Tiedemann, J{\"o}rg},
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.100",
    pages = "1116--1128",
    abstract = "We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of {\mbox{$\approx$}} 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.",
}
