🚀 LTG-BERT for the BabyLM Challenge
This is the LTG-BERT baseline model trained on the 100-million-word BabyLM Challenge dataset. It serves as a data-efficient masked language modeling baseline, showing what can be achieved when pre-training is restricted to a modestly sized, curated corpus.
🚀 Quick Start
This project provides the LTG-BERT baseline model trained on the 100-million-word BabyLM Challenge dataset.
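A minimal usage sketch, assuming the model can be loaded through the Hugging Face transformers library. The model identifier below is a placeholder for this repository's actual ID, and trust_remote_code=True is assumed because LTG-BERT uses a custom architecture; adjust both to match this repository.

```python
# Minimal sketch: load the model and fill in a masked token.
# NOTE: "ltg/ltg-bert-babylm" is a placeholder -- replace it with this repository's model ID.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "ltg/ltg-bert-babylm"  # placeholder, not the confirmed ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True is assumed to be needed for the custom LTG-BERT architecture.
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)

text = "The capital of Norway is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring token at the [MASK] position.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```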
📄 License
This project is released under the CC-BY-4.0 license.
📚 Documentation
Citation
If you use this project, please cite the following publication:
@inproceedings{samuel-etal-2023-trained,
title = "Trained on 100 million words and still in shape: {BERT} meets {B}ritish {N}ational {C}orpus",
author = "Samuel, David and
Kutuzov, Andrey and
{\O}vrelid, Lilja and
Velldal, Erik",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
month = may,
year = "2023",
address = "Dubrovnik, Croatia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-eacl.146",
pages = "1954--1974",
abstract = "While modern masked language models (LMs) are trained on ever larger corpora, we here explore the effects of down-scaling training to a modestly-sized but representative, well-balanced, and publicly available English text source {--} the British National Corpus. We show that pre-training on this carefully curated corpus can reach better performance than the original BERT model. We argue that this type of corpora has great potential as a language modeling benchmark. To showcase this potential, we present fair, reproducible and data-efficient comparative studies of LMs, in which we evaluate several training objectives and model architectures and replicate previous empirical results in a systematic way. We propose an optimized LM architecture called LTG-BERT.",
}
Information Table