🚀 LTG-BERT for the BabyLM Challenge
This is the LTG-BERT baseline model trained on the 100-million-word BabyLM Challenge dataset. It serves as a data-efficient masked language modeling baseline, showing what can be achieved when pre-training is restricted to a modestly sized, curated corpus.
🚀 Quick Start
This project provides the LTG-BERT baseline model trained on the 100-million-word BabyLM Challenge dataset.
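A minimal usage sketch, assuming the model can be loaded through the Hugging Face transformers library. The model identifier below is a placeholder for this repository's actual ID, and trust_remote_code=True is assumed because LTG-BERT uses a custom architecture; adjust both to match this repository.

```python
# Minimal sketch: load the model and fill in a masked token.
# NOTE: "ltg/ltg-bert-babylm" is a placeholder -- replace it with this repository's model ID.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "ltg/ltg-bert-babylm"  # placeholder, not the confirmed ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True is assumed to be needed for the custom LTG-BERT architecture.
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

text = "The capital of Norway is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring token at the [MASK] position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```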
📄 License
This project is released under the CC-BY-4.0 license.
📚 Documentation
Citation
If you use this project, please cite the following publication:
@inproceedings{samuel-etal-2023-trained,
title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
author = "Samuel, David and
Kutuzov, Andrey and
{\O}vrelid, Lilja and
Velldal, Erik",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-eacl.146",
pages = "1954--1974",
abstract = "While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source {--} the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.",
}
Information Table