🚀 IceBERT
IceBERT is an Icelandic language model trained with fairseq on the RoBERTa-base architecture. It can be used for a variety of downstream NLP tasks.
🚀 Quick Start
The model was trained with fairseq using the RoBERTa-base architecture. It is one of several models we have trained for Icelandic; see the paper cited below for further details. The training data is listed in the table below, and a minimal usage sketch follows the table.
| Dataset | Size | Tokens |
|---|---|---|
| Icelandic Gigaword Corpus v20.05 (IGC) | 8.2 GB | 1,388M |
| Icelandic Common Crawl Corpus (IC3) | 4.9 GB | 824M |
| Greynir News articles | 456 MB | 76M |
| Icelandic Sagas | 9 MB | 1.7M |
| Open Icelandic e-books (Rafbókavefurinn) | 14 MB | 2.6M |
| Data from the medical library of Landspítali hospital | 33 MB | 5.2M |
| Student theses from Icelandic universities (Skemman) | 2.2 GB | 367M |
| Total | 15.8 GB | 2,664M |
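As an illustration, the sketch below shows how a RoBERTa-style checkpoint like this one is typically queried for masked-token prediction with the Hugging Face transformers library. The model identifier `mideind/IceBERT` and the example sentence are assumptions made for this sketch; check the Hub for the exact published ID.

```python
from transformers import pipeline

# Assumed Hub identifier for the published IceBERT checkpoint; verify on the Hub.
MODEL_ID = "mideind/IceBERT"

# RoBERTa-style masked language model -> use the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Read the mask token from the tokenizer ("<mask>" for RoBERTa-style vocabularies)
# instead of hard-coding it.
mask = fill_mask.tokenizer.mask_token

# Example Icelandic sentence: "Iceland is a <mask> in the North Atlantic."
predictions = fill_mask(f"Ísland er {mask} í Norður-Atlantshafi.")

for p in predictions:
    print(f"{p['token_str']!r}  score={p['score']:.3f}")
```

Each prediction is a dictionary containing the proposed token, its score, and the completed sequence; for downstream tasks such as tagging or classification, the same checkpoint would instead be loaded with a task-specific head and fine-tuned.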
📚 Documentation
The model is described in detail in the paper https://arxiv.org/abs/2201.05601. Please cite the paper if you use the model.
```bibtex
@inproceedings{snaebjarnarson-etal-2022-warm,
title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
author = "Sn{\ae}bjarnarson, V{\'e}steinn and
S{\'\i}monarson, Haukur Barri and
Ragnarsson, P{\'e}tur Orri and
Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja and
J{\'o}nsson, Haukur and
Thorsteinsson, Vilhjalmur and
Einarsson, Hafsteinn",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.464",
pages = "4356--4366",
abstract = "We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high quality texts found online by targeting the Icelandic top-level-domain .is. Several other public data sources are also collected for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar in baselines for Icelandic, we manually translate and adapt the WinoGrande commonsense reasoning dataset. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages, by comparison with models trained on a curated corpus. We further show that initializing models using existing multilingual models can lead to state-of-the-art results for some downstream tasks.",
}
```
📄 License
This model is released under the CC BY 4.0 license.