XLMR-MaCoCu-is: A Large Pre-trained Icelandic Language Model
XLMR-MaCoCu-is is a large pre-trained language model trained specifically on Icelandic texts. It continues training from the XLM-RoBERTa-large model and was developed as part of the MaCoCu project, using only the data crawled during the project. The main developer is Rik van Noord from the University of Groningen.
Quick Start
How to use
```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")    # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")  # TensorFlow
```
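As a minimal sketch of what you can do with the loaded PyTorch model, the snippet below encodes an Icelandic sentence and mean-pools the last hidden states into a single sentence embedding. The example sentence and the pooling choice are illustrative only, not an official recommendation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")

# Illustrative Icelandic sentence; any Icelandic text works here.
sentence = "Reykjavík er höfuðborg Íslands."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into one sentence embedding.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1024]) for this large model
```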
Features
- Icelandic Focus: Trained on 4.4GB of Icelandic text (688M tokens), making it well-suited for Icelandic language tasks.
- Continued Training: Built upon the XLM-RoBERTa-large model, leveraging its pre-trained knowledge.
- Same Vocabulary: Shares the same vocabulary as the original XLMR-large model (see the tokenizer check after this list).
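As a quick sanity check of the shared-vocabulary claim, the tokenizer should produce the same token ids as the original xlm-roberta-large tokenizer. The sentence below is only an illustration.

```python
from transformers import AutoTokenizer

# Compare tokenisation against the original XLM-R-large tokenizer.
macocu_tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

text = "Þetta er íslensk setning."  # "This is an Icelandic sentence."
print(macocu_tokenizer(text)["input_ids"] == xlmr_tokenizer(text)["input_ids"])  # expected: True
```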
Installation
The model and tokenizer are loaded through the transformers library. If you have not installed it yet:

```bash
pip install transformers
```
Documentation
Model description
XLMR-MaCoCu-is was trained for 75,000 steps with a batch size of 1,024. The training and fine-tuning procedures are described in detail in our GitHub repo.
Data
For training, all Icelandic data from the monolingual Icelandic MaCoCu corpus was used. After de-duplication, 4.4GB of text (688M tokens) remained.
Benchmark performance
We tested the performance of XLMR-MaCoCu-is on the UPOS, XPOS, NER and COPA benchmarks.
- Data Sources:
  - For UPOS and XPOS, data from the Universal Dependencies project was used.
  - For NER, data from the MIM-GOLD-NER data set was used.
  - For COPA, the English data set was automatically translated using Google Translate.
- Comparison Models: We compared its performance with the strong multilingual models XLMR-base and XLMR-large, as well as the monolingual IceBERT model.
- Scores: Scores are averages of three runs, except for COPA, which uses 10 runs. The same hyperparameter settings were used for all models. A minimal fine-tuning sketch is shown after the results table below.
| Property | Details |
|----------|---------|
| Model Type | XLMR-MaCoCu-is, a large pre-trained language model for Icelandic |
| Training Data | 4.4GB of Icelandic text (688M tokens) from the MaCoCu corpus |
| Model | UPOS (Dev) | UPOS (Test) | XPOS (Dev) | XPOS (Test) | NER (Dev) | NER (Test) | COPA (Test) |
|-------|------------|-------------|------------|-------------|-----------|------------|-------------|
| XLM-R-base | 96.8 | 96.5 | 94.6 | 94.3 | 85.3 | 89.7 | 55.2 |
| XLM-R-large | 97.0 | 96.7 | 94.9 | 94.7 | 88.5 | 91.7 | 54.3 |
| IceBERT | 96.4 | 96.0 | 94.0 | 93.7 | 83.8 | 89.7 | 54.6 |
| XLMR-MaCoCu-is | 97.3 | 97.0 | 95.4 | 95.1 | 90.8 | 93.2 | 59.6 |
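The actual fine-tuning scripts used for these benchmarks are described in the GitHub repo. The sketch below is only a minimal illustration of how the model can be loaded with a token-classification head for a task such as NER; the label set is an assumed BIO scheme (the real labels come from MIM-GOLD-NER), and the head is randomly initialised until fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed BIO label set for illustration only.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/XLMR-MaCoCu-is",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

inputs = tokenizer("Jón býr í Reykjavík.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The classification head is untrained here, so these predictions only become
# meaningful after fine-tuning on labelled data.
predicted_ids = logits.argmax(dim=-1)[0].tolist()
print([model.config.id2label[i] for i in predicted_ids])
```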
License
This model is licensed under the CC0-1.0 license.
Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).
Citation
If you use this model, please cite the following paper:
@inproceedings{non-etal-2022-macocu,
title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
author = "Ba{\~n}{\'o}n, Marta and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Garc{\'\i}a-Romero, Cristian and
Kuzman, Taja and
Ljube{\v{s}}i{\'c}, Nikola and
van Noord, Rik and
Sempere, Leopoldo Pla and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Rupnik, Peter and
Suchomel, V{\'\i}t and
Toral, Antonio and
van der Werff, Tobias and
Zaragoza, Jaume",
booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2022",
address = "Ghent, Belgium",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2022.eamt-1.41",
pages = "303--304"
}