hplt_bert_base_sk: an open-source Slovak BERT model for masked language modeling

Hplt Bert Base Sk

Developed by HPLT

A monolingual Slovak BERT model released by the HPLT project. It is trained with the LTG-BERT architecture and is suited to masked language modeling.

Large language model

Transformers

Open-source license: Apache-2.0 #Slovak-only #masked-language-model #monolingual-BERT

Downloads: 23

Release date: 4/22/2024

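Model overview

This is a monolingual Slovak BERT model trained on the HPLT 1.2 data release. It uses LTG-BERT, a modified BERT architecture, and is intended mainly for masked language modeling.

Model features

Monolingual training

Trained specifically for Slovak, using the Slovak portion of the HPLT dataset.

Modified architecture

Uses LTG-BERT, a modification of the classic BERT model that its authors report performs better than the original BERT (Samuel et al., 2023).

Intermediate checkpoints

Ten intermediate checkpoints from the training run are provided, which makes it easier to analyze how the model evolves during training.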

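Model capabilities

Masked language modeling

Text understanding

Sequence classification

Token classification

Question answering

Multiple choice

Use cases

Natural language processing

Text completion

Predicts words hidden behind a mask token.

In the usage example, the model predicts 'place' to complete the sentence.

Text classification

Classifies Slovak text (after a classification head has been fine-tuned).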

🚀 HPLT Bert for Slovak

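This is one of the encoder-only monolingual language models trained as a first release by the HPLT project. It is a masked language model; specifically, it uses LTG-BERT, a modification of the classic BERT model.

A monolingual LTG-BERT model has been trained for every major language in the HPLT 1.2 data release (75 models in total).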

💻 Usage examples

Basic usage

This model currently needs the custom wrapper from modeling_ltgbert.py, so it has to be loaded with trust_remote_code=True.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# LTG-BERT is implemented in modeling_ltgbert.py inside the model repository,
# so the model has to be loaded with trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_sk")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_sk", trust_remote_code=True)

mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")

# Inference only, so gradients are not needed.
with torch.no_grad():
    output_p = model(**input_text)

# Keep the original tokens and replace only the [MASK] positions with the top prediction.
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: '[CLS] It's a beautiful place.[SEP]'
print(tokenizer.decode(output_text[0].tolist()))
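The torch.where call in the snippet above keeps every original token id and substitutes the highest-scoring vocabulary entry only where the input holds [MASK]. The sketch below reproduces that selection in plain Python on toy data, so the logic can be followed without downloading the model. The mask id, vocabulary size, and scores are all made up for illustration; fill_masks and one_peak are hypothetical helper names, not part of transformers.

```python
# Toy illustration of the mask-filling step. All ids and scores are invented.
MASK_ID = 4      # hypothetical id for "[MASK]"; the real id comes from the tokenizer
VOCAB_SIZE = 10  # hypothetical toy vocabulary size


def fill_masks(input_ids, logits, mask_id):
    """Keep original token ids; at mask positions use the argmax over the vocabulary."""
    filled = []
    for token_id, scores in zip(input_ids, logits):
        if token_id == mask_id:
            # Index of the highest score, i.e. argmax over the vocabulary dimension.
            filled.append(max(range(len(scores)), key=scores.__getitem__))
        else:
            filled.append(token_id)
    return filled


def one_peak(index):
    """Toy score vector with a single highest entry at `index`."""
    return [1.0 if i == index else 0.0 for i in range(VOCAB_SIZE)]


input_ids = [1, 7, MASK_ID, 9, 2]
# The model scores every position, but only the masked one is used.
logits = [one_peak(3), one_peak(3), one_peak(8), one_peak(3), one_peak(3)]

filled = fill_masks(input_ids, logits, MASK_ID)
print(filled)  # [1, 7, 8, 9, 2]
```

Predictions at unmasked positions are discarded, which is why the decoded output differs from the input only at the masked slot.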

The following classes are currently implemented: AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoModelForQuestionAnswering, and AutoModelForMultipleChoice.

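Intermediate checkpoints

For each model, we release 10 intermediate checkpoints, saved every 3125 training steps, in separate branches. The branches follow the naming convention stepXXX, for example step18750.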
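As a quick check of the naming convention, the expected branch names can be enumerated from the two numbers given above. This is only a sketch: it assumes the first checkpoint is saved after one full interval (step3125), which the text does not state explicitly, so the list should be compared against the repository's actual branches.

```python
INTERVAL = 3125       # training steps between checkpoints (from the text)
NUM_CHECKPOINTS = 10  # number of intermediate checkpoints (from the text)

# Assumption: checkpoints are saved at INTERVAL, 2 * INTERVAL, ..., 10 * INTERVAL.
checkpoint_branches = [f"step{INTERVAL * i}" for i in range(1, NUM_CHECKPOINTS + 1)]
print(checkpoint_branches)
```

Under that assumption, the step18750 example corresponds to the sixth checkpoint.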

You can load a specific model revision with transformers by passing the revision argument:

model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_sk", revision="step21875", trust_remote_code=True)

You can list all revisions of the model with the following code:

from huggingface_hub import list_repo_refs
out = list_repo_refs("HPLT/hplt_bert_base_sk")
print([b.name for b in out.branches])
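The branch list returned above also contains non-checkpoint branches such as main, and plain string sorting would place step18750 before step3125. A small helper like the one below, a sketch with the hypothetical name sorted_step_branches, keeps only the stepXXX branches and orders them by training step. The sample names are illustrative and not fetched from the Hub.

```python
import re

# Matches the stepXXX naming convention described above.
STEP_PATTERN = re.compile(r"^step(\d+)$")


def sorted_step_branches(branch_names):
    """Return only the stepXXX branches, ordered by their numeric training step."""
    steps = []
    for name in branch_names:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]


# Hypothetical sample of branch names, used only for illustration.
sample = ["main", "step9375", "step21875", "step3125", "step18750"]
print(sorted_step_branches(sample))
# ['step3125', 'step9375', 'step18750', 'step21875']
```

With the real repository, the list built from out.branches can be passed in directly, and each returned name can be used as the revision argument.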

📚 Documentation

All HPLT encoder-only models use the same hyperparameters, roughly following the BERT-base setup:

Hidden size: 768
Attention heads: 12
Layers: 12
Vocabulary size: 32768
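Two derived quantities follow from these four values by simple arithmetic: the per-head dimension and the size of a token-embedding matrix. The sketch below works them out. EncoderShape is a hypothetical container written for this illustration, not the model's actual configuration class, and the embedding count assumes a single vocabulary-by-hidden-size matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderShape:
    """Hypothetical holder for the listed hyperparameters (not the real config class)."""

    hidden_size: int = 768
    num_attention_heads: int = 12
    num_layers: int = 12
    vocab_size: int = 32768

    @property
    def head_dim(self) -> int:
        # The hidden size is split evenly across the attention heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def token_embedding_params(self) -> int:
        # Assumption: one vocab_size x hidden_size token-embedding matrix.
        return self.vocab_size * self.hidden_size


shape = EncoderShape()
print(shape.head_dim)                # 64
print(shape.token_embedding_params)  # 25165824
```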

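Each model uses its own tokenizer trained on language-specific HPLT data. See the language model training report for details on training corpus sizes, evaluation results, and more.

Training code

Training statistics of all 75 runs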

📄 License

This project is released under the Apache 2.0 license.

Citation

@inproceedings{samuel-etal-2023-trained,
    title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      {\O}vrelid, Lilja  and
      Velldal, Erik",
    editor = "Vlachos, Andreas  and
      Augenstein, Isabelle",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.146",
    doi = "10.18653/v1/2023.findings-eacl.146",
    pages = "1954--1974"
}

@inproceedings{de-gibert-etal-2024-new-massive,
    title = "A New Massive Multilingual Dataset for High-Performance Language Technologies",
    author = {de Gibert, Ona  and
      Nail, Graeme  and
      Arefyev, Nikolay  and
      Ba{\~n}{\'o}n, Marta  and
      van der Linde, Jelmer  and
      Ji, Shaoxiong  and
      Zaragoza-Bernabeu, Jaume  and
      Aulamo, Mikko  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Kutuzov, Andrey  and
      Pyysalo, Sampo  and
      Oepen, Stephan  and
      Tiedemann, J{\"o}rg},
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.100",
    pages = "1116--1128",
    abstract = "We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of {\mbox{$\approx$}} 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.",
}