XLMR-MaCoCu-is: A Large Pre-trained Icelandic Language Model
XLMR-MaCoCu-is is a large pre-trained language model trained specifically on Icelandic texts. It continues training from the XLM-RoBERTa-large model and was developed as part of the MaCoCu project, using only the data crawled during the project. The main developer is Rik van Noord from the University of Groningen.
Quick Start
How to use
```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")    # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")  # TensorFlow
```
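As a minimal sketch of what you can do with the loaded PyTorch model, the snippet below encodes an Icelandic sentence and mean-pools the last hidden states into a single sentence embedding. The example sentence and the pooling choice are illustrative only, not an official recommendation.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-is")

# Illustrative Icelandic sentence; any Icelandic text works here.
sentence = "Reykjavík er höfuðborg Íslands."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into one sentence embedding.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1024]) for this large model
```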
Features
- Icelandic Focus: Trained on 4.4GB of Icelandic text (688M tokens), making it well-suited for Icelandic language tasks.
- Continued Training: Built upon the XLM-RoBERTa-large model, leveraging its pre-trained knowledge.
- Same Vocabulary: Shares the same vocabulary as the original XLMR-large model (see the tokenizer check after this list).
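As a quick sanity check of the shared-vocabulary claim, the tokenizer should produce the same token ids as the original xlm-roberta-large tokenizer. The sentence below is only an illustration.

```python
from transformers import AutoTokenizer

# Compare tokenisation against the original XLM-R-large tokenizer.
macocu_tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
xlmr_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

text = "Þetta er íslensk setning."  # "This is an Icelandic sentence."
print(macocu_tokenizer(text)["input_ids"] == xlmr_tokenizer(text)["input_ids"])  # expected: True
```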
Installation
The model and tokenizer are loaded through the transformers library. If you have not installed it yet:

```bash
pip install transformers
```
Documentation
Model description
XLMR-MaCoCu-is was trained for 75,000 steps with a batch size of 1,024. The training and fine-tuning procedures are described in detail in our GitHub repo.
Data
For training, all Icelandic data from the monolingual Icelandic MaCoCu corpus was used. After de-duplication, 4.4GB of text (688M tokens) remained.
Benchmark performance
We tested the performance of XLMR-MaCoCu-is on the UPOS, XPOS, NER and COPA benchmarks.
- Data Sources:
  - For UPOS and XPOS, data from the Universal Dependencies project was used.
  - For NER, data from the MIM-GOLD-NER data set was used.
  - For COPA, the English data set was automatically translated using Google Translate.
- Comparison Models: We compared its performance with the strong multilingual models XLMR-base and XLMR-large, as well as the monolingual IceBERT model.
- Scores: Scores are averages of three runs, except for COPA, which uses 10 runs. The same hyperparameter settings were used for all models. A minimal fine-tuning sketch is shown after the results table below.
| Property | Details |
|----------|---------|
| Model Type | XLMR-MaCoCu-is, a large pre-trained language model for Icelandic |
| Training Data | 4.4GB of Icelandic text (688M tokens) from the MaCoCu corpus |
| Model | UPOS (Dev) | UPOS (Test) | XPOS (Dev) | XPOS (Test) | NER (Dev) | NER (Test) | COPA (Test) |
|-------|------------|-------------|------------|-------------|-----------|------------|-------------|
| XLM-R-base | 96.8 | 96.5 | 94.6 | 94.3 | 85.3 | 89.7 | 55.2 |
| XLM-R-large | 97.0 | 96.7 | 94.9 | 94.7 | 88.5 | 91.7 | 54.3 |
| IceBERT | 96.4 | 96.0 | 94.0 | 93.7 | 83.8 | 89.7 | 54.6 |
| XLMR-MaCoCu-is | 97.3 | 97.0 | 95.4 | 95.1 | 90.8 | 93.2 | 59.6 |
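The actual fine-tuning scripts used for these benchmarks are described in the GitHub repo. The sketch below is only a minimal illustration of how the model can be loaded with a token-classification head for a task such as NER; the label set is an assumed BIO scheme (the real labels come from MIM-GOLD-NER), and the head is randomly initialised until fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed BIO label set for illustration only.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-is")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/XLMR-MaCoCu-is",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

inputs = tokenizer("Jón býr í Reykjavík.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The classification head is untrained here, so these predictions only become
# meaningful after fine-tuning on labelled data.
predicted_ids = logits.argmax(dim=-1)[0].tolist()
print([model.config.id2label[i] for i in predicted_ids])
```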
License
This model is licensed under the CC0-1.0 license.
Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).
Citation
If you use this model, please cite the following paper:
@inproceedings{non-etal-2022-macocu,
title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
author = "Ba{\~n}{\'o}n, Marta and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Garc{\'\i}a-Romero, Cristian and
Kuzman, Taja and
Ljube{\v{s}}i{\'c}, Nikola and
van Noord, Rik and
Sempere, Leopoldo Pla and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Rupnik, Peter and
Suchomel, V{\'\i}t and
Toral, Antonio and
van der Werff, Tobias and
Zaragoza, Jaume",
booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2022",
address = "Ghent, Belgium",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2022.eamt-1.41",
pages = "303--304"
}