XLMR-MaCoCu-tr: A Turkish Pre-trained Language Model
XLMR-MaCoCu-tr is a large pre-trained language model trained on Turkish texts. It was developed by continuing training from the XLM-RoBERTa-large model as part of the MaCoCu project, using only data crawled during the project. The lead developer is Rik van Noord from the University of Groningen.
Quick Start
Basic Usage
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-tr")

# PyTorch
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-tr")

# TensorFlow
model = TFAutoModel.from_pretrained("RVN/XLMR-MaCoCu-tr")
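As a minimal illustration (not part of the original model card), the snippet below runs an arbitrary Turkish sentence through the PyTorch model and inspects the contextual token embeddings it produces; the example sentence is arbitrary.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-tr")
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-tr")

# Encode an arbitrary example sentence ("Ankara is the capital of Turkey.").
inputs = tokenizer("Ankara Türkiye'nin başkentidir.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings with shape (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)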
Features
- Turkish Focus: Trained specifically on Turkish texts, making it well-suited for Turkish language tasks.
- Model Continuity: Built upon the XLM-RoBERTa-large model, leveraging its pre-training advantages.
- Project-Specific Data: Utilizes only data crawled within the MaCoCu project, ensuring data relevance.
Installation
No separate installation of the model is needed beyond the transformers library (for example, pip install transformers, plus torch or tensorflow depending on the backend you use). The tokenizer and model can then be loaded with the commands shown in the Basic Usage section above.
Documentation
- Training Details: XLMR-MaCoCu-tr was trained on 35 GB of Turkish text, equivalent to 4.4B tokens. It was trained for 70,000 steps with a batch size of 1,024 and uses the same vocabulary as the original XLMR-large model.
- Training and Fine-tuning: The training and fine-tuning procedures are described in detail on our GitHub repo.
- Data Source: For training, all Turkish data from the monolingual Turkish MaCoCu corpus was used. After de-duplication, 35 GB of text (4.4 billion tokens) remained.
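The exact pipeline is documented in the GitHub repo. Purely as an illustrative sketch (not the authors' scripts), continued masked-language-model training from XLM-R-large with the transformers Trainer could look roughly like the following, where turkish_corpus.txt is a placeholder for the de-duplicated MaCoCu data and the learning rate is an assumed value:

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the public XLM-R-large checkpoint and keep its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

# Placeholder corpus: one Turkish text per line.
dataset = load_dataset("text", data_files={"train": "turkish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-modeling objective with 15% masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlmr-macocu-tr",
    max_steps=70_000,                 # matches the reported number of training steps
    per_device_train_batch_size=8,    # the reported batch size of 1,024 requires accumulation or many devices
    learning_rate=1e-4,               # assumed value, not taken from the paper
)

trainer = Trainer(model=model, args=args, train_dataset=train_data, data_collator=collator)
trainer.train()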
Technical Details
- Model Training: The model was trained on 35 GB of Turkish text (4.4B tokens) for 70,000 steps with a batch size of 1,024, continuing from the XLM-RoBERTa-large checkpoint.
- Benchmark Testing: The performance of XLMR-MaCoCu-tr was tested on the UPOS, XPOS, and NER benchmarks from the Universal Dependencies project and on COPA. The comparison was made with strong multilingual models (XLM-R-base and XLM-R-large) and the monolingual BERTurk model.
| Property | Details |
|----------|---------|
| Model Type | XLMR-MaCoCu-tr (a large pre-trained language model for Turkish) |
| Training Data | 35 GB of Turkish text (4.4B tokens) from the MaCoCu project |
| Training Steps | 70,000 |
| Batch Size | 1,024 |
License
The model is licensed under the CC0-1.0 license.
Benchmark Performance
We tested the performance of XLMR-MaCoCu-tr on multiple benchmarks. Scores are averages over multiple runs (three runs for most tasks, ten for COPA). We used the same hyperparameter settings for all models for POS/NER and optimized the learning rate for each model on the dev set for COPA. For COPA, scores are reported on machine-translated (MT) and human-translated (HT) test sets.
| Model | UPOS (Dev) | UPOS (Test) | XPOS (Dev) | XPOS (Test) | NER (Dev) | NER (Test) | COPA (Test, MT) | COPA (Test, HT) |
|---|---|---|---|---|---|---|---|---|
| XLM-R-base | 89.0 | 89.0 | 90.4 | 90.6 | 92.8 | 92.6 | 56.0 | 53.2 |
| XLM-R-large | 89.4 | 89.3 | 90.8 | 90.7 | 94.1 | 94.1 | 52.1 | 50.5 |
| BERTurk | 88.2 | 88.4 | 89.7 | 89.6 | 92.6 | 92.6 | 57.0 | 56.4 |
| XLMR-MaCoCu-tr | 89.1 | 89.4 | 90.7 | 90.5 | 94.4 | 94.4 | 60.7 | 58.5 |
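As a hedged sketch of how one might fine-tune the model for a token-level task such as NER or POS tagging (the label set below is a placeholder and data loading is omitted; this is not the evaluation code behind the table above):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder label inventory; the real UD POS and NER label sets differ.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-tr")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/XLMR-MaCoCu-tr",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# From here, tokenize the training sentences, align the word-level labels to
# sub-word tokens, and train with the transformers Trainer as usual.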
Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).
Citation
If you use this model, please cite the following paper:
@inproceedings{non-etal-2022-macocu,
title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
author = "Ba{\~n}{\'o}n, Marta and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Garc{\'\i}a-Romero, Cristian and
Kuzman, Taja and
Ljube{\v{s}}i{\'c}, Nikola and
van Noord, Rik and
Sempere, Leopoldo Pla and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Rupnik, Peter and
Suchomel, V{\'\i}t and
Toral, Antonio and
van der Werff, Tobias and
Zaragoza, Jaume",
booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2022",
address = "Ghent, Belgium",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2022.eamt-1.41",
pages = "303--304"
}