XLMR-MaltBERTa
XLMR-MaltBERTa is a large pre-trained language model trained on Maltese texts. It was developed as part of the MaCoCu project, aiming to provide better language processing capabilities for Maltese.
Quick Start
You can use the following code to load the model:
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")      # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaltBERTa")    # TensorFlow (alternative to the PyTorch line above)
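Since the model was pre-trained with a masked-language-modelling objective, you can also query it through the fill-mask pipeline, assuming the MLM head was saved with the checkpoint. The Maltese example sentence below is purely illustrative and not taken from the model card:

from transformers import pipeline

# Loads the checkpoint with its masked-language-modelling head (assumed to be present)
fill_mask = pipeline("fill-mask", model="RVN/XLMR-MaltBERTa")

# Illustrative example: "Malta is a <mask> country."
print(fill_mask("Malta hija pajjiż <mask>."))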
Features
- Based on XLM-RoBERTa-large: Continued training from the XLM-RoBERTa-large model.
- Large-scale Training: Trained on 3.2GB of text (439M tokens) for 50,000 steps with a batch size of 1,024.
- Same Vocabulary: Uses the same vocabulary as the original XLMR-large model.
- Comparable Data: Trained on the same data as MaltBERTa, which was itself trained from scratch using the RoBERTa architecture.
Installation
The model can be used via the transformers library, which you can install with pip:
pip install transformers
Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")
input_text = "Your input text here"
inputs = tokenizer(input_text, return_tensors='pt')  # tokenize to PyTorch tensors
outputs = model(**inputs)                            # forward pass; returns the hidden states
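The outputs object holds token-level hidden states; a common way to turn them into a single sentence vector is mean pooling over the last hidden state while masking out padding. This is a minimal sketch, not part of the original model card:

import torch

# Mean-pool the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1024]) for this large model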
Documentation
Model Description
XLMR-MaltBERTa is a large pre-trained language model trained on Maltese texts, created by continuing training from the XLM-RoBERTa-large model. It was developed as part of the MaCoCu project; the main developer is Rik van Noord from the University of Groningen.
The model was trained on 3.2GB of text, equal to 439M tokens, for 50,000 steps with a batch size of 1,024. It uses the same vocabulary as the original XLMR-large model and was trained on the same data as MaltBERTa, though that model was trained from scratch using the RoBERTa architecture.
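Purely as an illustration of what continued masked-language-model training from XLM-R-large looks like with the transformers Trainer API (toy data and illustrative hyperparameters, not the authors' actual setup):

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Toy stand-in for the 3.2GB Maltese corpus
texts = ["Malta hija gżira fil-Baħar Mediterran.",
         "Il-Belt Valletta hija l-kapitali ta' Malta."]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")  # continue from XLM-R-large

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = Dataset.from_dict({"text": texts}).map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

# Illustrative arguments; the real model was trained for 50,000 steps with a batch size of 1,024
args = TrainingArguments(output_dir="xlmr-maltberta-demo",
                         per_device_train_batch_size=2, max_steps=10)
trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()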
The training and fine-tuning procedures are described in detail in our GitHub repo.
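For UPOS/XPOS tagging, fine-tuning adds a token-classification head on top of the encoder. A minimal, hedged sketch of that setup (the exact scripts and hyperparameters are in the repo):

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
# 17 labels = the universal POS tagset; XPOS would use the treebank-specific tagset instead
model = AutoModelForTokenClassification.from_pretrained("RVN/XLMR-MaltBERTa", num_labels=17)

From there, training follows the standard transformers token-classification recipe on the Universal Dependencies Maltese treebank.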
Data
For training, all Maltese data from the MaCoCu, Oscar, and mc4 corpora were used. After de-duplicating the data, a total of 3.2GB of text remained.
Benchmark Performance
The performance of XLMR-MaltBERTa was tested on the UPOS and XPOS benchmarks of the Universal Dependencies project, as well as on a Google-Translated version of the COPA dataset (see our GitHub repo for details).
Its performance was compared to the strong multilingual models XLM-R-base and XLM-R-large (note that Maltese was not one of their training languages), as well as to the recently introduced Maltese language models BERTu and mBERTu, and to our own MaltBERTa.
Scores are averages of three runs for UPOS/XPOS and 10 runs for COPA. The same hyperparameter settings were used for all models for UPOS/XPOS, while for COPA, optimization was done on the dev set.
| Model          | UPOS (Dev) | UPOS (Test) | XPOS (Dev) | XPOS (Test) | COPA (Test) |
|----------------|------------|-------------|------------|-------------|-------------|
| XLM-R-base     | 93.6       | 93.2        | 93.4       | 93.2        | 52.2        |
| XLM-R-large    | 94.9       | 94.4        | 95.1       | 94.7        | 54.0        |
| BERTu          | 97.5       | 97.6        | 95.7       | 95.8        | 55.6        |
| mBERTu         | 97.7       | 97.8        | 97.9       | 98.1        | 52.6        |
| MaltBERTa      | 95.7       | 95.8        | 96.1       | 96.0        | 53.7        |
| XLMR-MaltBERTa | 97.7       | 98.1        | 98.1       | 98.2        | 54.4        |
License
This model is licensed under the CC0-1.0 license.
Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).
Citation
If you use this model, please cite the following paper:
@inproceedings{non-etal-2022-macocu,
title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
author = "Ba{\~n}{\'o}n, Marta and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Garc{\'\i}a-Romero, Cristian and
Kuzman, Taja and
Ljube{\v{s}}i{\'c}, Nikola and
van Noord, Rik and
Sempere, Leopoldo Pla and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Rupnik, Peter and
Suchomel, V{\'\i}t and
Toral, Antonio and
van der Werff, Tobias and
Zaragoza, Jaume",
booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2022",
address = "Ghent, Belgium",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2022.eamt-1.41",
pages = "303--304"
}