MaltBERTa
MaltBERTa is a large pre-trained language model trained from scratch on Maltese texts, developed as part of the MaCoCu project. It provides high-quality language processing capabilities for Maltese language tasks.
Quick Start
The following code demonstrates how to use MaltBERTa in Python. You can use it with both PyTorch and TensorFlow.
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/MaltBERTa")

# PyTorch
model = AutoModel.from_pretrained("RVN/MaltBERTa")

# TensorFlow
model = TFAutoModel.from_pretrained("RVN/MaltBERTa")
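Beyond loading the model, a common sanity check is masked-token prediction. The sketch below is a minimal example, assuming the published checkpoint includes its masked-language-modelling head (as RoBERTa-style checkpoints usually do); the Maltese sentence is only an illustrative placeholder.

```python
# Minimal masked-token prediction sketch; assumes the checkpoint ships its
# masked-language-modelling head. The example sentence is a placeholder.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="RVN/MaltBERTa")

# Use the tokenizer's own mask token instead of hardcoding "<mask>".
mask = fill_mask.tokenizer.mask_token
sentence = f"Malta hija {mask} fil-Mediterran."

for prediction in fill_mask(sentence, top_k=5):
    print(prediction["token_str"], round(prediction["score"], 3))
```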
Features
- Maltese-specific training: MaltBERTa is trained from scratch on Maltese texts, which makes it well-suited for Maltese language processing tasks.
- RoBERTa architecture: it uses the RoBERTa architecture, which is known for its strong performance in natural language processing.
- Large-scale training: trained on 3.2GB of text (439M tokens) for 100,000 steps with a batch size of 1,024.
Installation
MaltBERTa is used through the Hugging Face transformers library. If it is not already installed, install it with pip install transformers, then load the model as shown in the Quick Start section above.
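Once the library is installed and the model loads, a typical next step is to extract contextual embeddings. The following is a minimal PyTorch sketch; the Maltese sentence is only a placeholder.

```python
# Minimal sketch: extract contextual token embeddings with PyTorch.
# The example sentence is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/MaltBERTa")
model = AutoModel.from_pretrained("RVN/MaltBERTa")

inputs = tokenizer("Bonġu, kif int?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```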
Documentation
Model Details
MaltBERTa is a large pre-trained language model developed by Rik van Noord from the University of Groningen as part of the MaCoCu project. It was trained on a large amount of Maltese text data.
Training Data
For training, all Maltese data from the MaCoCu, Oscar, and mc4 corpora were used. After de-duplication, a total of 3.2GB of text remained. Experiments showed that incorporating all data led to better performance than using only data from the .mt domain in Oscar and mc4.
Training Procedure
The training and fine-tuning procedures are described in detail on our GitHub repo. MaltBERTa was trained for 100,000 steps with a batch size of 1,024 on 3.2GB of text (439M tokens).
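As a rough illustration of how such a model is adapted to a downstream task (for example POS tagging, as in the benchmarks below), here is a minimal single-step fine-tuning sketch. It is not the authors' recipe from the GitHub repo: the label set, sentence, and random labels are placeholders, and a real setup would use Universal Dependencies data with proper word-to-subtoken label alignment.

```python
# Minimal, self-contained sketch of one token-classification training step
# (e.g. UPOS tagging). NOT the authors' exact recipe; labels, the sentence,
# and hyperparameters are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["NOUN", "VERB", "ADJ", "DET", "PUNCT"]  # placeholder tag set

tokenizer = AutoTokenizer.from_pretrained("RVN/MaltBERTa")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/MaltBERTa", num_labels=len(labels)  # classification head is newly initialised
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One toy example: every sub-token gets a random placeholder label.
inputs = tokenizer("Bonġu, kif int?", return_tensors="pt")
inputs["labels"] = torch.randint(len(labels), inputs["input_ids"].shape)

# Single training step: forward pass, loss, backward pass, parameter update.
model.train()
loss = model(**inputs).loss
loss.backward()
optimizer.step()
print(float(loss))
```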
Benchmark Performance
We tested MaltBERTa on the UPOS and XPOS benchmarks of the Universal Dependencies project and on a Google-translated version of the COPA dataset. We compared its performance with strong multilingual models (XLM-R-base and XLM-R-large) and other Maltese language models (BERTu, mBERTu).
| Property | Details |
| --- | --- |
| Model Type | Pre-trained language model (RoBERTa architecture) |
| Training Data | 3.2GB of Maltese text from the MaCoCu, Oscar, and mc4 corpora |
| Model | UPOS (Dev) | UPOS (Test) | XPOS (Dev) | XPOS (Test) | COPA (Test) |
| --- | --- | --- | --- | --- | --- |
| XLM-R-base | 93.6 | 93.2 | 93.4 | 93.2 | 52.2 |
| XLM-R-large | 94.9 | 94.4 | 95.1 | 94.7 | 54.0 |
| BERTu | 97.5 | 97.6 | 95.7 | 95.8 | 55.6 |
| mBERTu | 97.7 | 97.8 | 97.9 | 98.1 | 52.6 |
| MaltBERTa | 95.7 | 95.8 | 96.1 | 96.0 | 53.7 |
License
This project is licensed under the CC0-1.0 license.
Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).
Citation
If you use this model, please cite the following paper:
@inproceedings{non-etal-2022-macocu,
title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
author = "Ba{\~n}{\'o}n, Marta and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Garc{\'\i}a-Romero, Cristian and
Kuzman, Taja and
Ljube{\v{s}}i{\'c}, Nikola and
van Noord, Rik and
Sempere, Leopoldo Pla and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Rupnik, Peter and
Suchomel, V{\'\i}t and
Toral, Antonio and
van der Werff, Tobias and
Zaragoza, Jaume",
booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2022",
address = "Ghent, Belgium",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2022.eamt-1.41",
pages = "303--304"
}