ElhBERTeu: an open-source Basque BERT model trained on a multi-domain corpus, with good benchmark results

ElhBERTeu

Developed by orai-nlp

ElhBERTeu is a BERT model developed for Basque. It was trained on a multi-domain corpus and performs well on the BasqueGLUE benchmark.

Large language model

Transformers

Other: #Basque language understanding #Multi-domain pre-training #Monolingual BERT

Release date: 5/6/2022

Model overview

ElhBERTeu is a base-sized, cased, monolingual BERT model for Basque. It is designed for natural language understanding tasks and has 124M parameters in total.

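Model features

Multi-domain training corpus

Basque text from diverse domains, including news, Wikipedia, science, and literature, totalling 575M tokens.

Training procedure

Pre-trained for 1M steps on a TPU with a sequence length of 512 and batch_size set to 256.

Strong benchmark performance

Achieves an average score of 73.71 on the BasqueGLUE benchmark, slightly ahead of the comparable BERTeus model (73.23).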

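Model capabilities

Basque text understanding

Named entity recognition

Intent classification

Slot filling

Text classification

Question answering systems

Word sense disambiguation

Coreference resolution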
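
For several of the capabilities listed above, fine-tuning usually starts by attaching a task-specific head to the pre-trained encoder. The sketch below shows one possible way to do that with the Hugging Face `transformers` Auto classes. The hub identifier `orai-nlp/ElhBERTeu`, the task names, and the helper `load_for_task` are assumptions made for illustration; they are not taken from the model card. Word sense disambiguation and coreference resolution are left out because the appropriate head depends on how the task is framed.

```python
from typing import Optional

# Assumed Hugging Face hub identifier (not stated in the text above).
MODEL_ID = "orai-nlp/ElhBERTeu"

# Hypothetical mapping from task type to a transformers Auto class name.
TASK_TO_AUTO_CLASS = {
    "named_entity_recognition": "AutoModelForTokenClassification",
    "slot_filling": "AutoModelForTokenClassification",
    "intent_classification": "AutoModelForSequenceClassification",
    "text_classification": "AutoModelForSequenceClassification",
    "question_answering": "AutoModelForQuestionAnswering",
}


def load_for_task(task: str, num_labels: Optional[int] = None):
    """Load the tokenizer and the encoder with a freshly initialised task head."""
    import transformers  # imported lazily so the mapping is usable without it

    auto_class = getattr(transformers, TASK_TO_AUTO_CLASS[task])
    tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
    # num_labels is only meaningful for classification-style heads.
    kwargs = {} if num_labels is None else {"num_labels": num_labels}
    model = auto_class.from_pretrained(MODEL_ID, **kwargs)
    return tokenizer, model
```

The head returned this way is randomly initialised, so the model still has to be fine-tuned on labelled data before its predictions are meaningful.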

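Use cases

Natural language processing

Basque text classification

Automatically classify Basque news, scientific literature, and similar texts

Achieves an F1 score of 78.05 on the BHTC task

Basque question answering systems

Build intelligent QA applications for Basque

Achieves an accuracy of 73.84 on the QNLI task

Linguistic research

Basque linguistic analysis

Supports linguistic research on Basque grammar, semantics, and related areas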

🚀 ElhBERTeu

This is the BERT model for Basque introduced in BasqueGLUE: A Natural Language Understanding Benchmark for Basque.
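
As a quick way to try the model, the following minimal sketch queries it through the `transformers` fill-mask pipeline. It assumes the checkpoint is published on the Hugging Face hub as `orai-nlp/ElhBERTeu` and that it ships with a masked-language-modelling head; neither is confirmed in the text above. The helper names and the example sentence are hypothetical.

```python
# Assumed Hugging Face hub identifier (not stated in the text above).
MODEL_ID = "orai-nlp/ElhBERTeu"


def make_masked_sentence(mask_token: str) -> str:
    """Return a simple Basque sentence with one masked word."""
    return f"Donostia Euskal Herriko {mask_token} bat da."


def top_predictions(k: int = 5):
    """Return the k most likely fillers for the masked word as (token, score) pairs."""
    from transformers import pipeline  # imported lazily; downloads weights on first use

    fill = pipeline("fill-mask", model=MODEL_ID)
    # Use the tokenizer's own mask token instead of hard-coding "[MASK]".
    sentence = make_masked_sentence(fill.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill(sentence, top_k=k)]
```

Calling `top_predictions()` requires network access the first time, since the weights are fetched from the hub.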

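To train ElhBERTeu, we collected corpora from several domains: updated (2021) national and local news sources, Basque Wikipedia, as well as new news sources and texts from other domains such as science (both academic and popular), literature, and subtitles. Details of the corpora and their sizes are shown in the following table. Texts from news sources were oversampled (duplicated), as was done when training BERTeus. In total, 575M tokens were used to pre-train ElhBERTeu.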

Domain	Size
News	2 x 224M
Wikipedia	40M
Science	58M
Literature	24M
Others	7M
Total	575M
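
The effect of oversampling on the corpus mix can be checked with a few lines of arithmetic. The sketch below copies the per-domain sizes from the table (in millions of tokens) and applies the 2x duplication to news. The variable names are illustrative only. Note that the rounded per-domain figures add up to 577M rather than the reported 575M, presumably because of rounding in the table.

```python
# Token counts in millions, copied from the table above.
corpus_m_tokens = {
    "News": 224,
    "Wikipedia": 40,
    "Science": 58,
    "Literature": 24,
    "Others": 7,
}

# News text is duplicated once (2 x 224M); every other domain is used as-is.
oversampling = {"News": 2}

effective = {
    domain: size * oversampling.get(domain, 1)
    for domain, size in corpus_m_tokens.items()
}
total = sum(effective.values())

# Share of each domain in the effective training mix, in percent.
shares = {domain: round(100 * size / total, 1) for domain, size in effective.items()}

for domain, size in effective.items():
    print(f"{domain:<11}{size:>5}M  {shares[domain]:>5}%")
print(f"{'Total':<11}{total:>5}M")
```

With these figures, news accounts for roughly three quarters of the tokens seen during pre-training.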

ElhBERTeu is a base-sized, cased, monolingual BERT model for Basque with a vocabulary size of 50K and 124M parameters in total.
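
The 124M figure is consistent with a standard BERT-base layout and a 50K vocabulary, as the rough count below suggests. This is a sketch under assumptions: 12 layers, hidden size 768, feed-forward size 3072, 512 positions, and a vocabulary of exactly 50,000 entries. Only "base", "50K", and the sequence length of 512 come from the text; the other dimensions are the usual BERT-base defaults, and the masked-language-modelling head is not counted.

```python
# Assumed BERT-base dimensions; the vocabulary is taken as exactly 50,000.
vocab_size = 50_000
hidden = 768
ffn = 3072
layers = 12
max_positions = 512
type_vocab = 2

layer_norm = 2 * hidden  # scale and bias

# Embeddings: word, position, and segment tables plus one LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

# One encoder layer: Q, K, V, and output projections (weights + biases),
# two feed-forward projections, and two LayerNorms.
attention = 4 * (hidden * hidden + hidden)
feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
per_layer = attention + feed_forward + 2 * layer_norm

# Pooler: one dense layer applied to the [CLS] representation.
pooler = hidden * hidden + hidden

total = embeddings + layers * per_layer + pooler
print(f"{total:,} parameters (about {total / 1e6:.1f}M)")
```

Under these assumptions the word-embedding table alone accounts for 38.4M parameters, which is why a 50K vocabulary yields a larger total than the original 30K-vocabulary BERT-base.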

A medium-sized model is also available: ElhBERTeu-medium

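ElhBERTeu was trained following the design decisions made for BERTeus. The tokenizer and the hyperparameter settings were kept the same (batch_size=256). The only difference is that the full pre-training of the model (1M steps) was performed with a sequence length of 512 on a v3-8 TPU.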
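
To give a sense of scale, the training budget implied by these settings can be estimated as follows. This is a back-of-the-envelope sketch: it assumes every sequence is filled to the full 512 tokens with no padding, which the text does not state, so the result is an upper bound. The 575M corpus size is the total reported for the pre-training data.

```python
steps = 1_000_000             # full pre-training length
batch_size = 256              # sequences per step
seq_len = 512                 # tokens per sequence (assumed fully packed)
corpus_tokens = 575_000_000   # reported size of the pre-training corpus

# Upper bound on the number of token positions processed during pre-training.
token_positions = steps * batch_size * seq_len

# Approximate number of passes over the (oversampled) corpus.
passes = token_positions / corpus_tokens

print(f"{token_positions:,} token positions, about {passes:.0f} passes over the corpus")
```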

The model has been evaluated on the recently created BasqueGLUE NLU benchmark:

Model	AVG	NERC (F1)	F_intent (F1)	F_slot (F1)	BHTC (F1)	BEC (F1)	Vaxx (MF1)	QNLI (acc)	WiC (acc)	coref (acc)
BERTeus	73.23	81.92	82.52	74.34	78.26	69.43	59.30	74.26	70.71	68.31
ElhBERTeu	73.71	82.30	82.24	75.64	78.05	69.89	63.81	73.84	71.71	65.93
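
The AVG column can be reproduced as the unweighted mean of the nine task scores, and the same data gives a per-task comparison between the two models. The short sketch below copies the scores from the table; the variable names are illustrative.

```python
tasks = ["NERC", "F_intent", "F_slot", "BHTC", "BEC", "Vaxx", "QNLI", "WiC", "coref"]

# Scores copied from the table above, in task order.
scores = {
    "BERTeus":   [81.92, 82.52, 74.34, 78.26, 69.43, 59.30, 74.26, 70.71, 68.31],
    "ElhBERTeu": [82.30, 82.24, 75.64, 78.05, 69.89, 63.81, 73.84, 71.71, 65.93],
}

# Unweighted mean over the nine tasks, rounded as in the AVG column.
averages = {model: round(sum(vals) / len(vals), 2) for model, vals in scores.items()}

# Per-task difference (ElhBERTeu minus BERTeus); positive means ElhBERTeu is ahead.
deltas = {
    task: round(elh - ber, 2)
    for task, ber, elh in zip(tasks, scores["BERTeus"], scores["ElhBERTeu"])
}
elh_wins = [task for task, delta in deltas.items() if delta > 0]

print(averages)
print(deltas)
print("ElhBERTeu ahead on:", elh_wins)
```

The comparison shows that the overall gain is small and unevenly distributed: ElhBERTeu is ahead on five of the nine tasks, with the largest margin on Vaxx, while BERTeus remains ahead on the other four.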

If you use this model, please cite the following paper:

G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa. BasqueGLUE: A Natural Language Understanding Benchmark for Basque. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022). June 2022. Marseille, France.

@InProceedings{urbizu2022basqueglue,
  author    = {Urbizu, Gorka  and  San Vicente, Iñaki  and  Saralegi, Xabier  and  Agerri, Rodrigo  and  Soroa, Aitor},
  title     = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1603--1612},
  abstract  = {Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.},
  url       = {https://aclanthology.org/2022.lrec-1.172}
}