ElhBERTeu: an open-source BERT model for Basque, trained on multi-domain corpora and evaluated on BasqueGLUE

ElhBERTeu

Developed by orai-nlp

ElhBERTeu is a BERT model for Basque, trained on multi-domain corpora, that averages 73.71 on the BasqueGLUE benchmark.

Transformers

Release date: 5/6/2022

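Model Overview

ElhBERTeu is a base-sized, cased, monolingual BERT model for Basque, designed for natural language understanding tasks, with 124M parameters in total.

Model Features

Multi-domain training corpora

Combines Basque text from news, Wikipedia, science, literature, and other domains, 575M tokens in total.

Training setup

The full pre-training (1M steps) was run on a TPU with a sequence length of 512 throughout and batch_size = 256.

Benchmark results

Averages 73.71 on the BasqueGLUE benchmark, ahead of the comparable BERTeus model (73.23).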

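Model Capabilities

Basque text understanding

Named entity recognition

Intent classification

Slot filling

Text classification

Question answering

Word sense disambiguation

Coreference resolution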

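Use Cases

Natural language processing

Basque text classification

Automatic classification of Basque news, scientific literature, and similar texts

F1 of 78.05 on the BHTC task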
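A classification setup along these lines could be sketched with the Transformers library as follows. This is a minimal sketch, not the authors' fine-tuning code: the Hub identifier `orai-nlp/ElhBERTeu` is an assumption pieced together from the developer and model names, and `build_classifier` and `encode` are hypothetical helper names.

```python
MODEL_ID = "orai-nlp/ElhBERTeu"  # assumed Hub identifier; not given in the text above


def build_classifier(num_labels, model_id=MODEL_ID):
    """Load the tokenizer and the encoder with a newly initialised classification head."""
    # Imported lazily so that encode() can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
    return tokenizer, model


def encode(tokenizer, texts, max_length=512):
    """Tokenize a batch, truncating to the 512-token length used in pre-training."""
    return tokenizer(
        texts,
        truncation=True,
        max_length=max_length,
        padding=True,
        return_tensors="pt",  # PyTorch tensors; assumes torch is installed
    )
```

Calling `build_classifier(num_labels=...)` with the number of categories in your dataset returns a model whose classification head is untrained. It has to be fine-tuned on labelled data before use, and the 78.05 F1 above refers to the authors' evaluation, not to this sketch.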

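Basque question answering

Building question-answering applications for Basque

Accuracy of 73.84 on the QNLI task

Linguistic research

Basque language analysis

Supports research on Basque grammar, semantics, and other linguistic questions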

🚀 ElhBERTeu

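ElhBERTeu is a BERT model for Basque, introduced in the paper BasqueGLUE: A Natural Language Understanding Benchmark for Basque. It targets natural language understanding for Basque and can serve as the base encoder for Basque language processing tasks.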
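As a quick way to try the model, the following sketch queries it through the Transformers `fill-mask` pipeline. It assumes the checkpoint is published on the Hugging Face Hub as `orai-nlp/ElhBERTeu` and ships its masked-language-modelling head; neither is stated above. `mask_word`, `load_fill_mask`, and `demo` are hypothetical helper names.

```python
MODEL_ID = "orai-nlp/ElhBERTeu"  # assumed Hub identifier; not given in the text above


def mask_word(sentence, word, mask_token="[MASK]"):
    """Replace the first occurrence of `word` in `sentence` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, mask_token, 1)


def load_fill_mask(model_id=MODEL_ID):
    """Build a fill-mask pipeline; downloads the checkpoint on first use."""
    # Imported lazily so that mask_word() works without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def demo():
    """Print the five most likely fillers for a masked Basque word."""
    fill = load_fill_mask()
    # Use the tokenizer's own mask token instead of hard-coding it.
    text = mask_word("Euskara hizkuntza bat da.", "hizkuntza", fill.tokenizer.mask_token)
    for candidate in fill(text, top_k=5):
        print(candidate["token_str"], round(candidate["score"], 3))
```

Running `demo()` requires network access and the `transformers` package with a backend such as PyTorch installed.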

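✨ Key Features

Multi-domain training corpora: trained on corpora from several domains, including updated (2021) national and local news sources, Basque Wikipedia, and novel news sources and texts from other domains such as science (academic and popular), literature, and subtitles.
Multiple model sizes: available in a base and a medium-sized version; the medium-sized version is ElhBERTeu-medium.
Benchmark results: averages 73.71 on the BasqueGLUE natural language understanding benchmark.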

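📚 Detailed Documentation

Training corpora

To train ElhBERTeu, we collected corpora from several domains. Text from news sources was oversampled (duplicated), as was done when training BERTeus. In total, 575M tokens were used to pre-train ElhBERTeu. The corpora and their sizes are listed below:

Domain	Size
News	2 x 224M
Wikipedia	40M
Science	58M
Literature	24M
Others	7M
Total	575M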
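The effect of oversampling on the corpus mix can be worked out from the table. The sketch below uses the rounded per-domain figures, so it totals 577M, not the reported 575M; the 2M gap is presumably rounding in the table, which is an assumption.

```python
# Token counts in millions, copied from the corpus table.
CORPUS_M = {"News": 224, "Wikipedia": 40, "Science": 58, "Literature": 24, "Others": 7}
# News is duplicated once, i.e. counted twice.
OVERSAMPLING = {"News": 2}


def effective_tokens(corpus, factors):
    """Apply the per-domain oversampling factor (default 1) to each size."""
    return {domain: size * factors.get(domain, 1) for domain, size in corpus.items()}


def shares(sizes):
    """Return each domain's fraction of the total."""
    total = sum(sizes.values())
    return {domain: size / total for domain, size in sizes.items()}


EFFECTIVE = effective_tokens(CORPUS_M, OVERSAMPLING)
TOTAL_M = sum(EFFECTIVE.values())
SHARE_WITH = shares(EFFECTIVE)
SHARE_WITHOUT = shares(CORPUS_M)

print(f"total after oversampling: {TOTAL_M}M")
print(f"news share: {SHARE_WITHOUT['News']:.1%} -> {SHARE_WITH['News']:.1%}")
```

With these figures, news accounts for roughly 63% of the raw text and roughly 78% of the text seen in pre-training.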

Model parameters

ElhBERTeu is a base-sized, cased, monolingual BERT model for Basque, with a vocabulary of 50K and 124M parameters in total.
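The 124M figure can be cross-checked with a short parameter count. This sketch assumes the standard BERT-base layout (12 layers, hidden size 768, feed-forward size 3072, 512 positions, 2 segment types, a pooler) and a vocabulary of exactly 50,000. Only "base" and "50K" come from the text above; the remaining values are assumptions.

```python
def bert_param_count(vocab=50_000, hidden=768, layers=12, ffn=3072,
                     max_positions=512, type_vocab=2):
    """Count encoder parameters for a BERT-style model (no task head)."""
    # Word, position and segment embeddings, plus one LayerNorm (weight + bias).
    embeddings = (vocab + max_positions + type_vocab) * hidden + 2 * hidden
    # Q, K, V and output projections with biases, plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two feed-forward projections with biases, plus one LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + 2 * hidden
    # Pooler: a dense layer applied to the first token's hidden state.
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


TOTAL = bert_param_count()
print(f"{TOTAL:,} parameters (~{TOTAL / 1e6:.0f}M)")
```

Under these assumptions the count comes to 124,441,344, which is consistent with the 124M reported. About 38.4M of that is the word-embedding matrix alone.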

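Training setup

ElhBERTeu was trained following the design decisions made for BERTeus. The tokenizer and the hyperparameter settings were kept the same (batch_size = 256); the only difference is that the full pre-training (1M steps) was performed with a sequence length of 512 on a v3-8 TPU.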
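The stated settings imply a rough training budget. The sketch below computes an upper bound: it assumes every sequence is filled to the full 512 positions, and it treats the 575M corpus figure as being in the same units as model input positions, which the text does not confirm.

```python
BATCH_SIZE = 256             # from the training setup
SEQ_LEN = 512                # from the training setup
STEPS = 1_000_000            # from the training setup
CORPUS_TOKENS = 575_000_000  # reported corpus size, after oversampling

# Positions processed per optimisation step, if every sequence is full.
TOKENS_PER_STEP = BATCH_SIZE * SEQ_LEN
# Upper bound on positions processed over the whole run.
TOTAL_TOKENS = TOKENS_PER_STEP * STEPS
# Approximate number of passes over the corpus under the assumptions above.
PASSES = TOTAL_TOKENS / CORPUS_TOKENS

print(f"{TOKENS_PER_STEP:,} positions per step")
print(f"{TOTAL_TOKENS:,} positions in total")
print(f"about {PASSES:.0f} passes over the corpus")
```

This gives 131,072 positions per step and about 131 billion in total, or roughly 228 passes over the corpus. The real number of passes depends on padding and on how subword tokens relate to the reported token count.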

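Evaluation

The model was evaluated on the recently created BasqueGLUE natural language understanding benchmark, with the following results:

Model	Average	Named entity recognition (NERC)	Intent classification	Slot filling	Topic classification (BHTC)	Sentiment analysis (BEC)	Stance detection (Vaxx)	QNLI	Word in Context (WiC)	Coreference resolution
		F1	F1	F1	F1	F1	MF1	Acc	Acc	Acc
BERTeus	73.23	81.92	82.52	74.34	78.26	69.43	59.30	74.26	70.71	68.31
ElhBERTeu	73.71	82.30	82.24	75.64	78.05	69.89	63.81	73.84	71.71	65.93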
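The averages and the per-task differences can be recomputed from the table. The sketch below uses the scores exactly as reported; the short task labels are shorthand chosen here, not names from the benchmark.

```python
TASKS = ["NERC", "Intent", "Slot", "BHTC", "BEC", "Vaxx", "QNLI", "WiC", "Coref"]
SCORES = {
    "BERTeus":   [81.92, 82.52, 74.34, 78.26, 69.43, 59.30, 74.26, 70.71, 68.31],
    "ElhBERTeu": [82.30, 82.24, 75.64, 78.05, 69.89, 63.81, 73.84, 71.71, 65.93],
}


def average(scores):
    """Unweighted mean over the nine tasks, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


def deltas(model, baseline):
    """Per-task difference model minus baseline, rounded to two decimals."""
    return {task: round(m - b, 2) for task, m, b in zip(TASKS, model, baseline)}


AVERAGES = {name: average(scores) for name, scores in SCORES.items()}
DELTAS = deltas(SCORES["ElhBERTeu"], SCORES["BERTeus"])
WINS = [task for task, diff in DELTAS.items() if diff > 0]

print(AVERAGES)
print(DELTAS)
print("ElhBERTeu ahead on:", WINS)
```

The unweighted means reproduce the reported averages (73.23 and 73.71). ElhBERTeu is ahead on five of the nine tasks, with the largest gain on Vaxx (+4.51) and the largest drop on coreference resolution (-2.38).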

Citation

If you use this model, please cite the following paper:

G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa. BasqueGLUE: A Natural Language Understanding Benchmark for Basque. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022). June 2022. Marseille, France.

@InProceedings{urbizu2022basqueglue,
  author    = {Urbizu, Gorka  and  San Vicente, Iñaki  and  Saralegi, Xabier  and  Agerri, Rodrigo  and  Soroa, Aitor},
  title     = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1603--1612},
  abstract  = {Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.},
  url       = {https://aclanthology.org/2022.lrec-1.172}
}