
XLMR BERTovski

Developed by MaCoCu
A language model pretrained on large-scale Bulgarian and Macedonian texts, part of the MaCoCu project
Release Time: 8/11/2022

Model Overview

XLMR-BERTovski is a Bulgarian and Macedonian language model built by continued pretraining of XLM-RoBERTa-large, intended as a base model for natural language processing tasks in both languages.
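Since the model is a continued-pretraining checkpoint of XLM-RoBERTa-large, it can in principle be loaded with the standard Hugging Face `transformers` masked-LM classes. The Hub id `MaCoCu/XLMR-BERTovski` below is an assumption based on the project name; check the MaCoCu project page for the exact identifier.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed Hub id (not stated in this card); verify before use.
MODEL_ID = "MaCoCu/XLMR-BERTovski"

def load_bertovski(model_id: str = MODEL_ID):
    """Load the tokenizer and masked-LM head.

    The weights are those of XLM-RoBERTa-large after continued
    pretraining on Bulgarian and Macedonian text, so the model is
    typically fine-tuned further for tagging, NER, etc.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForMaskedLM.from_pretrained(model_id)
    return tokenizer, model
```

The function is only defined here, not called, because the checkpoint is a large download; fine-tuning would swap `AutoModelForMaskedLM` for a task head such as `AutoModelForTokenClassification`.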

Model Features

Large-scale bilingual pretraining
Trained on 74 GB of Bulgarian and Macedonian text, totaling over 7 billion tokens
Optimized data sampling
The smaller Macedonian corpus was sampled twice as often, balancing training between the two languages
High-quality training data
Data from the .bg and .mk domains was strictly filtered to exclude low-quality machine-translated content
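The doubled-sampling scheme described above can be sketched in plain Python. The corpus contents and sizes here are toy placeholders, not the actual MaCoCu data:

```python
# Illustrative sketch of the oversampling idea: the smaller
# Macedonian corpus is repeated so it contributes more evenly
# to the combined training stream.

def build_training_stream(bg_docs, mk_docs, mk_factor=2):
    """Return a combined document list in which each Macedonian
    document appears mk_factor times (the card's doubled sampling)."""
    return list(bg_docs) + list(mk_docs) * mk_factor

# Toy corpora: 6 Bulgarian documents, 2 Macedonian documents.
bg = ["bg_doc_%d" % i for i in range(6)]
mk = ["mk_doc_%d" % i for i in range(2)]

stream = build_training_stream(bg, mk)
# Macedonian now contributes 4 of 10 documents instead of 2 of 8.
```

In practice such oversampling is usually done at the batch-sampling level rather than by literally duplicating files, but the effect on the language mix is the same.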

Model Capabilities

Part-of-speech tagging (UPOS/XPOS)
Named entity recognition (NER)
Common sense reasoning (COPA)
Bulgarian text processing
Macedonian text processing

Use Cases

Language analysis
Bulgarian part-of-speech tagging
Performing part-of-speech tagging on Bulgarian texts
Test set accuracy reached 99.5% (UPOS)
Macedonian named entity recognition
Identifying named entities in Macedonian texts
Test set F1 score reached 96.3%
Language understanding
Common sense reasoning tasks
Solving COPA common sense reasoning problems in Bulgarian and Macedonian
Accuracy reached 54.6% and 55.6% respectively
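The two headline metrics above, token-level accuracy for UPOS tagging and F1 for NER, can be sketched with toy label sequences (the real benchmark data is not reproduced here):

```python
# Minimal sketch of the evaluation metrics reported in this card:
# token-level accuracy (tagging) and micro-F1 over non-O labels
# (a simplified token-level NER convention; official NER scoring
# is usually entity-span based).

def accuracy(gold, pred):
    """Fraction of positions where the predicted tag matches gold."""
    assert len(gold) == len(pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def micro_f1(gold, pred, outside="O"):
    """F1 over non-O token labels."""
    tp = sum(g == p != outside for g, p in zip(gold, pred))
    fp = sum(p != outside and g != p for g, p in zip(gold, pred))
    fn = sum(g != outside and g != p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy gold/predicted sequences.
gold = ["B-PER", "I-PER", "O", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "B-LOC"]
```

The reported numbers (99.5% UPOS accuracy, 96.3% NER F1, 54.6%/55.6% COPA accuracy) would come from applying metrics like these to the respective test sets.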