IceBERT Open-Source Icelandic Model - Leveraging Large-Scale Text Data for Icelandic Language Processing

Developed by mideind

Icelandic masked language model using the RoBERTa-base architecture, trained on about 16 GB of Icelandic text

Large Language Model

Transformers

Other

#Icelandic-specific #Large-scale corpus training #NLP downstream task optimization

Downloads 1,203

Release Time: 3/2/2022

Model Overview

Pre-trained language model specifically designed for Icelandic, suitable for various natural language processing tasks

Model Features

Large-scale Icelandic training data

Combines 7 sources of Icelandic text, totaling 15.8 GB

Multi-domain coverage

Training data includes various text types such as news, medical literature, academic papers, and classical literature

Outstanding downstream task performance

Reported state-of-the-art results on tasks such as part-of-speech tagging and named entity recognition

Model Capabilities

Text completion

Language understanding

Context prediction

Use Cases

Natural Language Processing

Part-of-speech tagging

Automatically identify the part-of-speech of words in Icelandic text

Achieved state-of-the-art performance

Named entity recognition

Identify entities such as person names and locations in Icelandic text

Achieved state-of-the-art performance

Text analysis

Grammar error detection

Detect grammatical errors in Icelandic text

State-of-the-art performance reported in the accompanying paper

🚀 IceBERT

IceBERT is a language model trained for Icelandic using the RoBERTa-base architecture. It offers high-quality language processing capabilities for Icelandic text.
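As a rough sketch of how a masked language model like this can be queried, the snippet below uses the Hugging Face `transformers` fill-mask pipeline. The Hub identifier `mideind/IceBERT`, the helper names `mask_sentence` and `top_predictions`, and the example sentence are assumptions made for illustration, not details taken from this page.

```python
from typing import List, Tuple


def mask_sentence(template: str, mask_token: str = "<mask>") -> str:
    """Replace the single '{mask}' placeholder in template with the model's mask token."""
    if template.count("{mask}") != 1:
        raise ValueError("template must contain exactly one '{mask}' placeholder")
    return template.replace("{mask}", mask_token)


def top_predictions(
    template: str,
    model_id: str = "mideind/IceBERT",  # assumed Hub identifier
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k most likely fillers for the masked position with their scores."""
    # Imported lazily so mask_sentence stays usable without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # Take the mask token from the tokenizer instead of hard-coding it.
    sentence = mask_sentence(template, fill_mask.tokenizer.mask_token)
    results = fill_mask(sentence, top_k=k)
    return [(r["token_str"].strip(), r["score"]) for r in results]


# Example call (downloads the model on first use, so it is left commented out):
# for token, score in top_predictions("Reykjavík er höfuðborg {mask}."):
#     print(f"{token}\t{score:.3f}")
```

Reading the mask token from the tokenizer keeps the helper usable with other masked language models whose mask token differs from `<mask>`.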

🚀 Quick Start

This model was trained with fairseq using the RoBERTa-base architecture. It is one of many models we have trained for Icelandic. For further details, refer to the paper cited under Documentation. The training data is listed in the table below.

Property	Details
Model Type	RoBERTa-base
Training Data	See the table below

Dataset	Size	Tokens
Icelandic Gigaword Corpus v20.05 (IGC)	8.2 GB	1,388M
Icelandic Common Crawl Corpus (IC3)	4.9 GB	824M
Greynir News articles	456 MB	76M
Icelandic Sagas	9 MB	1.7M
Open Icelandic e-books (Rafbókavefurinn)	14 MB	2.6M
Data from the medical library of Landspitali	33 MB	5.2M
Student theses from Icelandic universities (Skemman)	2.2 GB	367M
Total	15.8 GB	2,664M
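The totals in the table can be checked with a few lines of arithmetic. The sketch below copies the per-source figures from the table, sums them, and prints each source's share of the corpus by size. The short source names and the conversion of 1 GB = 1000 MB are assumptions made for this illustration.

```python
# (size in MB, tokens in millions), copied from the table above.
# GB values are converted assuming 1 GB = 1000 MB.
CORPORA = {
    "IGC": (8200, 1388),
    "IC3": (4900, 824),
    "Greynir": (456, 76),
    "Sagas": (9, 1.7),
    "E-books": (14, 2.6),
    "Medical": (33, 5.2),
    "Skemman": (2200, 367),
}


def totals(corpora):
    """Return (total size in GB, total tokens in millions)."""
    size_mb = sum(size for size, _ in corpora.values())
    tokens_m = sum(tokens for _, tokens in corpora.values())
    return size_mb / 1000, tokens_m


def size_shares(corpora):
    """Return each source's fraction of the total corpus size."""
    total_mb = sum(size for size, _ in corpora.values())
    return {name: size / total_mb for name, (size, _) in corpora.items()}


total_gb, total_tokens_m = totals(CORPORA)
print(f"Total size: {total_gb:.2f} GB, total tokens: {total_tokens_m:.1f}M")

# List sources from largest to smallest share of the corpus.
for name, share in sorted(size_shares(CORPORA).items(), key=lambda kv: -kv[1]):
    print(f"{name:<8} {share:6.1%}")
```

The per-source figures sum to about 15.81 GB and 2,664.5M tokens, consistent with the rounded totals in the table.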

📚 Documentation

The model is described in the paper at https://arxiv.org/abs/2201.05601. Please cite the paper if you make use of the model.

@inproceedings{snaebjarnarson-etal-2022-warm,
    title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
    author = "Sn{\ae}bjarnarson, V{\'e}steinn  and
      S{\'\i}monarson, Haukur Barri  and
      Ragnarsson, P{\'e}tur Orri  and
      Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja  and
      J{\'o}nsson, Haukur  and
      Thorsteinsson, Vilhjalmur  and
      Einarsson, Hafsteinn",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.464",
    pages = "4356--4366",
    abstract = "We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks, including part-of-speech tagging, named entity recognition, grammatical error detection and constituency parsing. To train the models we introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high quality texts found online by targeting the Icelandic top-level-domain .is. Several other public data sources are also collected for a total of 16GB of Icelandic text. To enhance the evaluation of model performance and to raise the bar in baselines for Icelandic, we manually translate and adapt the WinoGrande commonsense reasoning dataset. Through these efforts we demonstrate that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages, by comparison with models trained on a curated corpus. We further show that initializing models using existing multilingual models can lead to state-of-the-art results for some downstream tasks.",
}

📄 License

This model is released under the CC-BY-4.0 license.
