ElhBERTeu: Open-Source BERT Model for Basque - Trained on a Multi-Domain Corpus, Evaluated on BasqueGLUE

Developed by orai-nlp

ElhBERTeu is a BERT model for Basque, trained on a multi-domain corpus. It averages 73.71 on the BasqueGLUE benchmark.

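Release date: 5/6/2022

Model overview

ElhBERTeu is a base-size, cased, monolingual BERT model for Basque, designed for natural language understanding tasks, with 124M parameters in total.

Model highlights

Multi-domain training corpus

Combines Basque text from news, Wikipedia, science, literature and other domains, for a total of 575M tokens.

Training setup

The full 1M-step pretraining was run on TPU with a sequence length of 512 and batch_size set to 256.

Benchmark results

Averages 73.71 on the BasqueGLUE benchmark, compared with 73.23 for the comparable model BERTeus.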

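Model capabilities

Basque text understanding

Named entity recognition

Intent classification

Slot filling

Text classification

Question answering

Word sense disambiguation

Coreference resolution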

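Use cases

Natural language processing

Basque text classification

Automatically classify Basque news, scientific literature and similar text

F1 of 78.05 on the BHTC task

Basque question answering

Build Basque question-answering applications

Accuracy of 73.84 on the QNLI task

Linguistic research

Basque language analysis

Supports linguistic research on Basque grammar, semantics and related topics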

🚀 ElhBERTeu

ElhBERTeu is a BERT model for Basque, introduced in the paper BasqueGLUE: A Natural Language Understanding Benchmark for Basque. It is intended for Basque natural language understanding tasks.

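✨ Key Features

Multi-domain training corpus: trained on corpora from several domains, including updated (2021) national and local news sources, Basque Wikipedia, and novel news sources and texts from other domains such as science (both academic and popular), literature and subtitles.
Several model sizes: a base version and a medium version are available; the medium version is ElhBERTeu-medium.
Benchmark results: averages 73.71 on the BasqueGLUE natural language understanding benchmark, compared with 73.23 for BERTeus.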
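As an illustration of how a model like this is typically loaded, here is a minimal sketch using the Hugging Face Transformers `AutoTokenizer` and `AutoModel` classes. The Hub identifier `orai-nlp/ElhBERTeu` is an assumption pieced together from the developer and model names on this page, and the helper names `load_elhberteu` and `embed` are hypothetical.

```python
# Assumed Hub identifier: built from the developer name and the model name,
# not stated explicitly in this document.
MODEL_ID = "orai-nlp/ElhBERTeu"


def load_elhberteu(model_id: str = MODEL_ID):
    """Load the tokenizer and the bare encoder (downloads weights on first call)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def embed(sentence: str, tokenizer, model):
    """Return the final-layer hidden states for a single sentence."""
    import torch

    # Truncate to the 512-token sequence length used in pretraining.
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state  # (1, sequence_length, hidden_size)
```

Calling `tok, enc = load_elhberteu()` and then `embed("Kaixo, zer moduz?", tok, enc)` should give one vector per token. For a downstream task such as classification or tagging, the same identifier would normally be passed to a task-specific class (for example `AutoModelForSequenceClassification`) and fine-tuned.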

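📚 Documentation

Training corpus

To train ElhBERTeu, we collected corpora from several domains. Text from news sources was oversampled (duplicated), as was done when training BERTeus. In total, 575M tokens were used to pretrain ElhBERTeu. The corpora and their sizes are listed in the table below: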

Domain	Size (tokens)
News	2 x 224M
Wikipedia	40M
Science	58M
Literature	24M
Others	7M
Total	575M
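To make the effect of oversampling concrete, the sketch below recomputes each domain's share of the training mix from the table. The figures are copied from the table; the function names are hypothetical.

```python
# Token counts in millions, copied from the table above.
CORPUS_M = {
    "News": 224,  # size of one copy, before oversampling
    "Wikipedia": 40,
    "Science": 58,
    "Literature": 24,
    "Others": 7,
}
NEWS_COPIES = 2  # news text is duplicated, per the "2 x 224M" entry


def effective_sizes(corpus=CORPUS_M, news_copies=NEWS_COPIES):
    """Return per-domain sizes after duplicating the news portion."""
    sizes = dict(corpus)
    sizes["News"] *= news_copies
    return sizes


def shares(sizes):
    """Return each domain's fraction of the total token count."""
    total = sum(sizes.values())
    return {domain: n / total for domain, n in sizes.items()}


sizes = effective_sizes()
total = sum(sizes.values())
for domain, frac in shares(sizes).items():
    print(f"{domain:<11}{sizes[domain]:>5}M {frac:6.1%}")
print(f"{'Sum':<11}{total:>5}M")
```

The per-domain figures add up to 577M, slightly above the reported total of 575M; the difference is presumably due to rounding of the individual entries. After oversampling, news makes up about 78% of the mix.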

Model parameters

ElhBERTeu is a base-size, cased, monolingual BERT model for Basque, with a vocabulary of 50K tokens and 124M parameters in total.
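The 124M figure is consistent with a standard BERT-base encoder and a 50K vocabulary, as the following back-of-the-envelope count shows. This is a sketch under assumptions: it uses the usual BERT-base dimensions (12 layers, hidden size 768, feed-forward size 3072, 512 positions), a vocabulary of exactly 50,000 entries, and it counts the pooler but not the masked-language-modeling head. None of these details are confirmed by the text above.

```python
def bert_param_count(
    vocab_size: int,
    hidden: int = 768,         # assumed BERT-base hidden size
    layers: int = 12,          # assumed BERT-base depth
    intermediate: int = 3072,  # assumed feed-forward size
    max_positions: int = 512,
    type_vocab: int = 2,
    with_pooler: bool = True,
) -> int:
    """Count trainable parameters of a BERT encoder (without the MLM head)."""
    # Word, position and segment embeddings, plus one LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden  # + LayerNorm
    # Feed-forward block: up projection, down projection, LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + 2 * hidden
    )
    encoder = layers * (attention + feed_forward)

    pooler = hidden * hidden + hidden if with_pooler else 0
    return embeddings + encoder + pooler


total = bert_param_count(vocab_size=50_000)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count comes to 124,441,344, which rounds to the 124M stated above. Roughly 38M of those parameters sit in the word-embedding matrix, so the vocabulary size has a noticeable effect on the total.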

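Training setup

ElhBERTeu was trained following the design decisions of BERTeus. The tokenizer and hyperparameter settings were kept the same (batch_size = 256). The only difference is that the full pretraining of the model (1M steps) was performed with a sequence length of 512 on a v3-8 TPU.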
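The settings above imply a rough training budget, which can be worked out as follows. This is only an upper-bound estimate: it assumes that batch_size counts sequences per step and that every sequence is filled to the full 512 tokens, neither of which is stated explicitly.

```python
STEPS = 1_000_000            # pretraining steps, from the text above
BATCH_SIZE = 256             # assumed to mean sequences per step
SEQ_LEN = 512                # tokens per sequence
CORPUS_TOKENS = 575_000_000  # reported corpus size


def token_positions(steps: int, batch_size: int, seq_len: int) -> int:
    """Upper bound on token positions processed, assuming no padding."""
    return steps * batch_size * seq_len


seen = token_positions(STEPS, BATCH_SIZE, SEQ_LEN)
passes = seen / CORPUS_TOKENS

print(f"token positions processed: {seen:,}")
print(f"approximate passes over the corpus: {passes:.0f}")
```

Under these assumptions the model processes about 131 billion token positions, or on the order of 228 passes over the 575M-token corpus. The true number of passes is lower if sequences are padded.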

Model evaluation

The model was evaluated on the recently created BasqueGLUE natural language understanding benchmark. The results are as follows:

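Model	Average	NERC: named entity recognition (F1)	Intent classification (F1)	Slot filling (F1)	BHTC: topic classification (F1)	BEC: sentiment analysis (F1)	Vaxx: stance detection (MF1)	QNLI: question answering NLI (acc)	WiC: word sense disambiguation (acc)	Coreference resolution (acc)
BERTeus	73.23	81.92	82.52	74.34	78.26	69.43	59.30	74.26	70.71	68.31
ElhBERTeu	73.71	82.30	82.24	75.64	78.05	69.89	63.81	73.84	71.71	65.93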
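The reported averages can be reproduced as the unweighted mean of the nine task scores, which also makes the per-task comparison easy to inspect. The sketch below copies the scores from the table; treating the average as a plain mean is an assumption, although it matches the reported values.

```python
from statistics import mean

TASKS = ["NERC", "Intent", "Slot", "BHTC", "BEC", "Vaxx", "QNLI", "WiC", "Coref"]

# Scores copied from the table above, in the same task order.
SCORES = {
    "BERTeus": [81.92, 82.52, 74.34, 78.26, 69.43, 59.30, 74.26, 70.71, 68.31],
    "ElhBERTeu": [82.30, 82.24, 75.64, 78.05, 69.89, 63.81, 73.84, 71.71, 65.93],
}

# Unweighted mean over the nine tasks, rounded to two decimals.
averages = {model: round(mean(vals), 2) for model, vals in SCORES.items()}

# Per-task difference, positive where ElhBERTeu scores higher.
deltas = {
    task: round(e - b, 2)
    for task, b, e in zip(TASKS, SCORES["BERTeus"], SCORES["ElhBERTeu"])
}
elhberteu_wins = [task for task, d in deltas.items() if d > 0]

print(averages)
print(deltas)
print("ElhBERTeu ahead on:", ", ".join(elhberteu_wins))
```

The breakdown shows that the higher average does not come from a uniform gain: ElhBERTeu is ahead on five of the nine tasks, with the largest margin on Vaxx, while BERTeus is ahead on intent classification, BHTC, QNLI and coreference resolution.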

Citation

If you use this model, please cite the following paper:

G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa. BasqueGLUE: A Natural Language Understanding Benchmark for Basque. In proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022). June 2022. Marseille, France

@InProceedings{urbizu2022basqueglue,
  author    = {Urbizu, Gorka  and  San Vicente, Iñaki  and  Saralegi, Xabier  and  Agerri, Rodrigo  and  Soroa, Aitor},
  title     = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1603--1612},
  abstract  = {Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.},
  url       = {https://aclanthology.org/2022.lrec-1.172}
}