🚀 IceBERT
IceBERT is an Icelandic language model trained with fairseq on the RoBERTa-base architecture. It can be used for a variety of downstream NLP tasks.
🚀 Quick Start
The model was trained with fairseq using the RoBERTa-base architecture. It is one of several models we have trained for Icelandic; see the paper cited below for further details. The training data is listed in the table below, and a minimal usage sketch follows the table.
| Dataset | Size | Tokens |
|---|---|---|
| Icelandic Gigaword Corpus v20.05 (IGC) | 8.2 GB | 1,388M |
| Icelandic Common Crawl Corpus (IC3) | 4.9 GB | 824M |
| Greynir News articles | 456 MB | 76M |
| Icelandic Sagas | 9 MB | 1.7M |
| Open Icelandic e-books (Rafbókavefurinn) | 14 MB | 2.6M |
| Data from the medical library of Landspítali hospital | 33 MB | 5.2M |
| Student theses from Icelandic universities (Skemman) | 2.2 GB | 367M |
| Total | 15.8 GB | 2,664M |
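As an illustration, the sketch below shows how a RoBERTa-style checkpoint like this one is typically queried for masked-token prediction with the Hugging Face transformers library. The model identifier `mideind/IceBERT` and the example sentence are assumptions made for this sketch; check the Hub for the exact published ID.

```python
from transformers import pipeline

# Assumed Hub identifier for the published IceBERT checkpoint; verify on the Hub.
MODEL_ID = "mideind/IceBERT"

# RoBERTa-style masked language model -> use the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Read the mask token from the tokenizer ("<mask>" for RoBERTa-style vocabularies)
# instead of hard-coding it.
mask = fill_mask.tokenizer.mask_token

# Example Icelandic sentence: "Iceland is a <mask> in the North Atlantic."
predictions = fill_mask(f"Ísland er {mask} í Norður-Atlantshafi.")

for p in predictions:
    print(f"{p['token_str']!r}  score={p['score']:.3f}")
```

Each prediction is a dictionary containing the proposed token, its score, and the completed sequence; for downstream tasks such as tagging or classification, the same checkpoint would instead be loaded with a task-specific head and fine-tuned.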
📚 Documentation
The model is described in detail in the paper https://arxiv.org/abs/2201.05601. Please cite the paper if you use the model.
```bibtex
@inproceedings{snaebjarnarson-etal-2022-warm,
title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
author = "Sn{\ae}bjarnarson, V{\'e}steinn and
S{\'\i}monarson, Haukur Barri and
Ragnarsson, P{\'e}tur Orri and
Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja and
J{\'o}nsson, Haukur and
Thorsteinsson, Vilhjalmur and
Einarsson, Hafsteinn",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.464",
pages = "4356--4366",
abstract = "We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high quality texts found online by targeting the Icelandic top-level-domain .is. Several other public data sources are also collected for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar in baselines for Icelandic, we manually translate and adapt the WinoGrande commonsense reasoning dataset. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages, by comparison with models trained on a curated corpus. We further show that initializing models using existing multilingual models can lead to state-of-the-art results for some downstream tasks.",
}
```
📄 License
This model is released under the CC BY 4.0 license.