XLMR-MaltBERTa
XLMR-MaltBERTa is a large pre-trained language model trained on Maltese texts. It was developed as part of the MaCoCu project, aiming to provide better language processing capabilities for Maltese.
Quick Start
You can use the following code to load the model:
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")      # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaltBERTa")    # TensorFlow (alternative to the PyTorch line above)
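Since the model was pre-trained with a masked-language-modelling objective, you can also query it through the fill-mask pipeline, assuming the MLM head was saved with the checkpoint. The Maltese example sentence below is purely illustrative and not taken from the model card:

from transformers import pipeline

# Loads the checkpoint with its masked-language-modelling head (assumed to be present)
fill_mask = pipeline("fill-mask", model="RVN/XLMR-MaltBERTa")

# Illustrative example: "Malta is a <mask> country."
print(fill_mask("Malta hija pajjiż <mask>."))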
Features
- Based on XLM-RoBERTa-large: Continued training from the XLM-RoBERTa-large model.
- Large-scale Training: Trained on 3.2GB of text (439M tokens) for 50,000 steps with a batch size of 1,024.
- Same Vocabulary: Uses the same vocabulary as the original XLMR-large model.
- Comparable Data: Trained on the same data as MaltBERTa, which was itself trained from scratch using the RoBERTa architecture.
Installation
The model can be used via the transformers library, which you can install with pip:
pip install transformers
Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
model = AutoModel.from_pretrained("RVN/XLMR-MaltBERTa")
input_text = "Your input text here"
inputs = tokenizer(input_text, return_tensors='pt')  # tokenize to PyTorch tensors
outputs = model(**inputs)                            # forward pass; returns the hidden states
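The outputs object holds token-level hidden states; a common way to turn them into a single sentence vector is mean pooling over the last hidden state while masking out padding. This is a minimal sketch, not part of the original model card:

import torch

# Mean-pool the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1024]) for this large model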
Documentation
Model Description
XLMR-MaltBERTa is a large pre-trained language model trained on Maltese texts, created by continuing training from the XLM-RoBERTa-large model. It was developed as part of the MaCoCu project; the main developer is Rik van Noord from the University of Groningen.
The model was trained on 3.2GB of text, equal to 439M tokens, for 50,000 steps with a batch size of 1,024. It uses the same vocabulary as the original XLMR-large model and was trained on the same data as MaltBERTa, though that model was trained from scratch using the RoBERTa architecture.
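Purely as an illustration of what continued masked-language-model training from XLM-R-large looks like with the transformers Trainer API (toy data and illustrative hyperparameters, not the authors' actual setup):

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Toy stand-in for the 3.2GB Maltese corpus
texts = ["Malta hija gżira fil-Baħar Mediterran.",
         "Il-Belt Valletta hija l-kapitali ta' Malta."]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")  # continue from XLM-R-large

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = Dataset.from_dict({"text": texts}).map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

# Illustrative arguments; the real model was trained for 50,000 steps with a batch size of 1,024
args = TrainingArguments(output_dir="xlmr-maltberta-demo",
                         per_device_train_batch_size=2, max_steps=10)
trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()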
The training and fine-tuning procedures are described in detail in our GitHub repo.
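For UPOS/XPOS tagging, fine-tuning adds a token-classification head on top of the encoder. A minimal, hedged sketch of that setup (the exact scripts and hyperparameters are in the repo):

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaltBERTa")
# 17 labels = the universal POS tagset; XPOS would use the treebank-specific tagset instead
model = AutoModelForTokenClassification.from_pretrained("RVN/XLMR-MaltBERTa", num_labels=17)

From there, training follows the standard transformers token-classification recipe on the Universal Dependencies Maltese treebank.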
Data
For training, all Maltese data from the MaCoCu, Oscar, and mc4 corpora were used. After de-duplicating the data, a total of 3.2GB of text remained.
Benchmark Performance
The performance of XLMR-MaltBERTa was tested on the UPOS and XPOS benchmarks of the Universal Dependencies project, as well as on a Google-Translated version of the COPA dataset (see our GitHub repo for details).
Its performance was compared to the strong multilingual models XLM-R-base and XLM-R-large (note that Maltese was not one of their training languages), as well as to the recently introduced Maltese language models BERTu and mBERTu, and to our own MaltBERTa.
Scores are averages of three runs for UPOS/XPOS and 10 runs for COPA. The same hyperparameter settings were used for all models for UPOS/XPOS, while for COPA, optimization was done on the dev set.
| Model          | UPOS (Dev) | UPOS (Test) | XPOS (Dev) | XPOS (Test) | COPA (Test) |
|----------------|------------|-------------|------------|-------------|-------------|
| XLM-R-base     | 93.6       | 93.2        | 93.4       | 93.2        | 52.2        |
| XLM-R-large    | 94.9       | 94.4        | 95.1       | 94.7        | 54.0        |
| BERTu          | 97.5       | 97.6        | 95.7       | 95.8        | 55.6        |
| mBERTu         | 97.7       | 97.8        | 97.9       | 98.1        | 52.6        |
| MaltBERTa      | 95.7       | 95.8        | 96.1       | 96.0        | 53.7        |
| XLMR-MaltBERTa | 97.7       | 98.1        | 98.1       | 98.2        | 54.4        |
License
This model is licensed under the CC0-1.0 license.
Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).
Citation
If you use this model, please cite the following paper:
@inproceedings{non-etal-2022-macocu,
title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
author = "Ba{\~n}{\'o}n, Marta and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Garc{\'\i}a-Romero, Cristian and
Kuzman, Taja and
Ljube{\v{s}}i{\'c}, Nikola and
van Noord, Rik and
Sempere, Leopoldo Pla and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Rupnik, Peter and
Suchomel, V{\'\i}t and
Toral, Antonio and
van der Werff, Tobias and
Zaragoza, Jaume",
booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2022",
address = "Ghent, Belgium",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2022.eamt-1.41",
pages = "303--304"
}