ltg-bert-babylm open-source language model: optimized for medium-sized corpora

Ltg Bert Babylm

Developed by ltg

A BERT variant trained on the 100M-word BabyLM Challenge dataset, optimized for performance on medium-sized corpora

Large language model

Transformers

English #medium-sized-corpus-optimization #English-language-modeling #reproducible-benchmark

Downloads: 594

Released: 1/8/2024

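Model overview

LTG-BERT is a BERT variant whose architecture was developed and evaluated on the British National Corpus (BNC), a medium-sized but high-quality corpus. According to the authors' paper, pre-training on this corpus can outperform the original BERT model. The checkpoint on this page is the LTG-BERT baseline trained on the 100M-word BabyLM Challenge dataset.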

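Model features

Optimized for medium-sized corpora

Designed for training on a medium-sized (about 100 million words) but high-quality corpus, the British National Corpus

Outperforms the original BERT

Scores higher than the original BERT model on multiple evaluation tasks

Reproducible research design

Validated with a fair, reproducible experimental design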

Model capabilities

Text representation learning

Contextual understanding

Language model pre-training

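Use cases

Natural language processing research

Language model benchmarking

Serves as a baseline model for training on medium-sized corpora

Provides comparable performance metrics

Educational applications

English language teaching aid

Language model applications built on a standard English corpus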

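🚀 LTG-BERT for the BabyLM Challenge

This is the LTG-BERT baseline model trained on the 100M-word BabyLM Challenge dataset. It gives NLP research a reference point for models trained on a fixed, limited amount of data.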

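🚀 Quick start

This project is the LTG-BERT baseline model trained on the 100M-word BabyLM Challenge dataset.

Paper: "Trained on 100 million words and still in shape: BERT meets British National Corpus"
GitHub repository: [ltgoslo/ltg-bert](https://github.com/ltgoslo/ltg-bert)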
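
The page gives no loading instructions, so the following is only a minimal sketch of how a masked-token query might look with the Hugging Face `transformers` library. Several things here are assumptions rather than facts from this page: the repository id `ltg/ltg-bert-babylm` (inferred from the page title and developer name), the need for `trust_remote_code=True` (assumed because LTG-BERT is a custom architecture), and the helper names `build_masked_prompt` and `top_k_fill`, which are hypothetical.

```python
# Sketch only. The repository id is an ASSUMPTION inferred from the page
# title and developer name; it is not stated anywhere in the original card.
ASSUMED_REPO_ID = "ltg/ltg-bert-babylm"


def build_masked_prompt(template: str, mask_token: str) -> str:
    """Replace the single '{mask}' placeholder with the tokenizer's mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


def top_k_fill(template: str, k: int = 5, repo_id: str = ASSUMED_REPO_ID):
    """Return the k highest-scoring tokens for the masked position (hypothetical helper)."""
    # Imported lazily so build_masked_prompt works without these packages.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # ASSUMPTION: the checkpoint ships custom modeling code, so
    # trust_remote_code=True is required. Review that code before enabling it.
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForMaskedLM.from_pretrained(repo_id, trust_remote_code=True)
    model.eval()

    text = build_masked_prompt(template, tokenizer.mask_token)
    inputs = tokenizer(text, return_tensors="pt")

    # Locate the masked position in the encoded sequence.
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0]
    if len(mask_positions) == 0:
        raise ValueError("mask token was not found after tokenization")

    with torch.no_grad():
        # Pass only the two standard tensors; ASSUMES the model returns an
        # output object with a .logits field of shape (1, seq_len, vocab_size).
        logits = model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        ).logits

    top_ids = logits[0, mask_positions[0]].topk(k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)
```

Calling `top_k_fill("The cat sat on the {mask}.")` would download the checkpoint and return a list of candidate tokens, provided the assumed repository id and loading options are correct. The model is loaded inside the function so that the prompt helper can be used and checked without network access.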

📄 License

This project is released under the CC-BY-4.0 license.

📚 Detailed documentation

Citation

If you use this project, please cite the following publication:

@inproceedings{samuel-etal-2023-trained,
    title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      {\O}vrelid, Lilja  and
      Velldal, Erik",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.146",
    pages = "1954--1974",
    abstract = "While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source {--} the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.",
}

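Information table

Property	Details
Model type	BERT variant (LTG-BERT)
Training data	100M-word BabyLM Challenge dataset
Paper	"Trained on 100 million words and still in shape: BERT meets British National Corpus"
GitHub repository	[ltgoslo/ltg-bert](https://github.com/ltgoslo/ltg-bert)
License	CC-BY-4.0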