MaltBERTa
MaltBERTa is a large pre-trained language model trained from scratch on Maltese texts, developed as part of the MaCoCu project. It provides high-quality language processing capabilities for Maltese language tasks.
Quick Start
The following code demonstrates how to use MaltBERTa in Python. You can use it with both PyTorch and TensorFlow.
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/MaltBERTa")

# PyTorch
model = AutoModel.from_pretrained("RVN/MaltBERTa")

# TensorFlow
model = TFAutoModel.from_pretrained("RVN/MaltBERTa")
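Beyond loading the model, a common sanity check is masked-token prediction. The sketch below is a minimal example, assuming the published checkpoint includes its masked-language-modelling head (as RoBERTa-style checkpoints usually do); the Maltese sentence is only an illustrative placeholder.

```python
# Minimal masked-token prediction sketch; assumes the checkpoint ships its
# masked-language-modelling head. The example sentence is a placeholder.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="RVN/MaltBERTa")

# Use the tokenizer's own mask token instead of hardcoding "<mask>".
mask = fill_mask.tokenizer.mask_token
sentence = f"Malta hija {mask} fil-Mediterran."

for prediction in fill_mask(sentence, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```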
Features
- Maltese-specific training: MaltBERTa is trained from scratch on Maltese texts, which makes it well-suited for Maltese language processing tasks.
- RoBERTa architecture: it uses the RoBERTa architecture, which is known for its strong performance in natural language processing.
- Large-scale training: trained on 3.2GB of text (439M tokens) for 100,000 steps with a batch size of 1,024.
Installation
MaltBERTa is used through the Hugging Face transformers library. If it is not already installed, install it with pip install transformers, then load the model as shown in the Quick Start section above.
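Once the library is installed and the model loads, a typical next step is to extract contextual embeddings. The following is a minimal PyTorch sketch; the Maltese sentence is only a placeholder.

```python
# Minimal sketch: extract contextual token embeddings with PyTorch.
# The example sentence is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/MaltBERTa")
model = AutoModel.from_pretrained("RVN/MaltBERTa")

inputs = tokenizer("Bonġu, kif int?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```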
Documentation
Model Details
MaltBERTa is a large pre-trained language model developed by Rik van Noord from the University of Groningen as part of the MaCoCu project. It was trained on a large amount of Maltese text data.
Training Data
For training, all Maltese data from the MaCoCu, Oscar, and mc4 corpora were used. After de-duplication, a total of 3.2GB of text remained. Experiments showed that incorporating all data led to better performance than using only data from the .mt domain in Oscar and mc4.
Training Procedure
The training and fine-tuning procedures are described in detail on our GitHub repo. MaltBERTa was trained for 100,000 steps with a batch size of 1,024 on 3.2GB of text (439M tokens).
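As a rough illustration of how such a model is adapted to a downstream task (for example POS tagging, as in the benchmarks below), here is a minimal single-step fine-tuning sketch. It is not the authors' recipe from the GitHub repo: the label set, sentence, and random labels are placeholders, and a real setup would use Universal Dependencies data with proper word-to-subtoken label alignment.

```python
# Minimal, self-contained sketch of one token-classification training step
# (e.g. UPOS tagging). NOT the authors' exact recipe; labels, the sentence,
# and hyperparameters are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["NOUN", "VERB", "ADJ", "DET", "PUNCT"]  # placeholder tag set

tokenizer = AutoTokenizer.from_pretrained("RVN/MaltBERTa")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/MaltBERTa", num_labels=len(labels)  # classification head is newly initialised
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One toy example: every sub-token gets a random placeholder label.
inputs = tokenizer("Bonġu, kif int?", return_tensors="pt")
inputs["labels"] = torch.randint(len(labels), inputs["input_ids"].shape)

# Single training step: forward pass, loss, backward pass, parameter update.
model.train()
loss = model(**inputs).loss
loss.backward()
optimizer.step()
print(float(loss))
```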
Benchmark Performance
We tested MaltBERTa on the UPOS and XPOS benchmarks of the Universal Dependencies project and on a Google-translated version of the COPA dataset. We compared its performance with strong multilingual models (XLM-R-base and XLM-R-large) and other Maltese language models (BERTu, mBERTu).
| Property | Details |
| --- | --- |
| Model Type | Pre-trained language model (RoBERTa architecture) |
| Training Data | 3.2GB of Maltese text from the MaCoCu, Oscar, and mc4 corpora |
| Model | UPOS (Dev) | UPOS (Test) | XPOS (Dev) | XPOS (Test) | COPA (Test) |
| --- | --- | --- | --- | --- | --- |
| XLM-R-base | 93.6 | 93.2 | 93.4 | 93.2 | 52.2 |
| XLM-R-large | 94.9 | 94.4 | 95.1 | 94.7 | 54.0 |
| BERTu | 97.5 | 97.6 | 95.7 | 95.8 | 55.6 |
| mBERTu | 97.7 | 97.8 | 97.9 | 98.1 | 52.6 |
| MaltBERTa | 95.7 | 95.8 | 96.1 | 96.0 | 53.7 |
License
This project is licensed under the CC0-1.0 license.
Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).
Citation
If you use this model, please cite the following paper:
@inproceedings{non-etal-2022-macocu,
title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
author = "Ba{\~n}{\'o}n, Marta and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Garc{\'\i}a-Romero, Cristian and
Kuzman, Taja and
Ljube{\v{s}}i{\'c}, Nikola and
van Noord, Rik and
Sempere, Leopoldo Pla and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Rupnik, Peter and
Suchomel, V{\'\i}t and
Toral, Antonio and
van der Werff, Tobias and
Zaragoza, Jaume",
booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2022",
address = "Ghent, Belgium",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2022.eamt-1.41",
pages = "303--304"
}