IceBERT: an open-source Icelandic language model trained on a large text corpus to support Icelandic-language applications


IceBERT

Developed by mideind

An Icelandic masked language model based on the RoBERTa-base architecture, trained on about 16 GB of Icelandic text.

Large language model

Transformers

Other tags: #Icelandic-specific #Trained on a large corpus #Optimized for downstream NLP tasks

Downloads: 1,203

Release date: 3/2/2022

Model overview

A pretrained language model designed specifically for Icelandic, suitable for a range of natural language processing tasks.

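Model features

Large-scale Icelandic training data

Icelandic corpora from seven different sources were combined, for a total of 15.8 GB of text.

Multi-domain coverage

The training data span several text types, including news, medical literature, academic theses, and classic literature.

Strong downstream performance

The model reaches state-of-the-art performance on tasks such as part-of-speech tagging and named entity recognition.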

Model capabilities

Text completion

Language understanding

Context prediction
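Because IceBERT is a masked language model, context prediction amounts to filling a masked position in a sentence. The sketch below shows one way this could look with the Transformers `fill-mask` pipeline. It rests on assumptions that the page does not state: that the checkpoint is available on the Hugging Face Hub as `mideind/IceBERT`, and that the tokenizer uses the RoBERTa-style `<mask>` token. The helper names `mask_word` and `top_predictions` are made up for this illustration.

```python
from typing import List

# Assumption: Hub identifier for the checkpoint; the page only names the
# developer (mideind) and the model (IceBERT).
MODEL_ID = "mideind/IceBERT"


def mask_word(sentence: str, target: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `target` with the mask token.

    `<mask>` is the usual RoBERTa mask token (an assumption here); the
    loaded tokenizer's `mask_token` attribute is the authoritative value.
    """
    if target not in sentence:
        raise ValueError(f"{target!r} does not occur in the sentence")
    return sentence.replace(target, mask_token, 1)


def top_predictions(masked_sentence: str, k: int = 5) -> List[str]:
    """Return the k most likely fillers for a sentence with one mask."""
    # Imported lazily so mask_word works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    results = fill_mask(masked_sentence, top_k=k)
    # Each result is a dict; "token_str" holds the predicted token text.
    return [r["token_str"].strip() for r in results]


# Build a masked input; no model is downloaded at this point.
masked = mask_word("Reykjavík er höfuðborg Íslands.", "höfuðborg")
print(masked)
```

Calling `top_predictions(masked)` downloads the checkpoint on first use and returns the highest-scoring candidate tokens for the masked position.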

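Use cases

Natural language processing

Part-of-speech tagging

Automatically identifies the part of speech of each word in Icelandic text.

State of the art

Named entity recognition

Identifies named entities such as person and place names in Icelandic text.

State of the art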
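Part-of-speech tagging and named entity recognition are both token classification tasks, and fine-tuning a subword model for them usually requires mapping word-level labels onto subword tokens. The following is a minimal sketch of one common convention (label the first subword of each word, ignore the rest), not the procedure used by the IceBERT authors. The `word_ids` list imitates what a fast Transformers tokenizer returns from `word_ids()`; the example values and label ids are hypothetical.

```python
from typing import List, Optional

# -100 is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level labels to subword level.

    The first subword of each word keeps the word's label. Special tokens
    (word id None) and continuation subwords receive IGNORE_INDEX so they
    do not contribute to the loss.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[wid])
        previous = wid
    return aligned


# Hypothetical three-word sentence: word 0 splits into two subwords,
# word 2 into three; None marks the start and end special tokens.
example_word_ids = [None, 0, 0, 1, 2, 2, 2, None]
example_labels = [1, 0, 2]  # hypothetical tag ids, one per word
print(align_labels(example_word_ids, example_labels))
```

The aligned list can then be passed as `labels` alongside the tokenized input when training a token classification head on top of the encoder.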

Text analysis

Grammatical error detection

Detects grammatical errors in Icelandic text.

Strong performance

🚀 IceBERT

IceBERT is an Icelandic language model trained with fairseq using the RoBERTa-base architecture. It can be used for a range of downstream natural language processing tasks.

🚀 Quick Start

The model was trained with fairseq using the RoBERTa-base architecture. It is one of several models we have trained for Icelandic; see the paper cited below for details. The training data are listed in the following table.

Dataset	Size	Tokens
Icelandic Gigaword Corpus v20.05 (IGC)	8.2 GB	1,388M
Icelandic Common Crawl Corpus (IC3)	4.9 GB	824M
Greynir News articles	456 MB	76M
Icelandic Sagas	9 MB	1.7M
Open Icelandic e-books (Rafbókavefurinn)	14 MB	2.6M
Data from the medical library of Landspítali	33 MB	5.2M
Student theses from Icelandic universities (Skemman)	2.2 GB	367M
Total	15.8 GB	2,664M
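As a quick consistency check on the table, the per-source figures can be summed and compared with the totals row. The sketch below copies the numbers from the table and assumes 1 GB = 1000 MB for the unit conversion, which the source does not specify; the short source names are abbreviations chosen for this example.

```python
# (source, size in GB, tokens in millions), copied from the table above.
# Assumption: MB values are converted with 1 GB = 1000 MB.
SOURCES = [
    ("IGC", 8.2, 1388.0),
    ("IC3", 4.9, 824.0),
    ("Greynir News", 0.456, 76.0),
    ("Icelandic Sagas", 0.009, 1.7),
    ("Rafbókavefurinn", 0.014, 2.6),
    ("Landspítali", 0.033, 5.2),
    ("Skemman", 2.2, 367.0),
]

total_gb = sum(size for _, size, _ in SOURCES)
total_tokens_m = sum(tokens for _, _, tokens in SOURCES)

# Share of each source in the corpus, by bytes and by tokens.
for name, size, tokens in SOURCES:
    print(f"{name:<16} {size / total_gb:6.1%} of bytes, "
          f"{tokens / total_tokens_m:6.1%} of tokens")

print(f"total: {total_gb:.3f} GB, {total_tokens_m:.1f}M tokens")
```

The sums agree with the totals row after rounding, and the two largest sources (IGC and IC3) together account for more than 80% of the corpus by size.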

📚 Documentation

The paper at https://arxiv.org/abs/2201.05601 describes the model in detail. If you use the model, please cite it.

@inproceedings{snaebjarnarson-etal-2022-warm,
    title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
    author = "Sn{\ae}bjarnarson, V{\'e}steinn  and
      S{\'\i}monarson, Haukur Barri  and
      Ragnarsson, P{\'e}tur Orri  and
      Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja  and
      J{\'o}nsson, Haukur  and
      Thorsteinsson, Vilhjalmur  and
      Einarsson, Hafsteinn",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.464",
    pages = "4356--4366",
    abstract = "We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high quality texts found online by targeting the Icelandic top-level-domain .is. Several other public data sources are also collected for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar in baselines for Icelandic, we manually translate and adapt the WinoGrande commonsense reasoning dataset. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages, by comparison with models trained on a curated corpus. We further show that initializing models using existing multilingual models can lead to state-of-the-art results for some downstream tasks.",
}