XLMR-MaCoCu-tr: A Turkish Pre-trained Language Model
XLMR-MaCoCu-tr is a large pre-trained language model trained on Turkish texts. It was developed by continuing training from the XLM-RoBERTa-large model as part of the MaCoCu project, using only data crawled during the project. The lead developer is Rik van Noord from the University of Groningen.
Quick Start
Basic Usage
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-tr")

# PyTorch
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-tr")

# TensorFlow
model = TFAutoModel.from_pretrained("RVN/XLMR-MaCoCu-tr")
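As a minimal illustration (not part of the original model card), the snippet below runs an arbitrary Turkish sentence through the PyTorch model and inspects the contextual token embeddings it produces; the example sentence is arbitrary.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-tr")
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-tr")

# Encode an arbitrary example sentence ("Ankara is the capital of Turkey.").
inputs = tokenizer("Ankara Türkiye'nin başkentidir.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings with shape (batch_size, sequence_length, hidden_size).
print(outputs.last_hidden_state.shape)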
Features
- Turkish Focus: Trained specifically on Turkish texts, making it well-suited for Turkish language tasks.
- Model Continuity: Built upon the XLM-RoBERTa-large model, leveraging its pre-training advantages.
- Project-Specific Data: Utilizes only data crawled within the MaCoCu project, ensuring data relevance.
Installation
No separate installation of the model is needed beyond the transformers library (for example, pip install transformers, plus torch or tensorflow depending on the backend you use). The tokenizer and model can then be loaded with the commands shown in the Basic Usage section above.
Documentation
- Training Details: XLMR-MaCoCu-tr was trained on 35 GB of Turkish text, equivalent to 4.4B tokens. It was trained for 70,000 steps with a batch size of 1,024 and uses the same vocabulary as the original XLMR-large model.
- Training and Fine-tuning: The training and fine-tuning procedures are described in detail on our GitHub repo.
- Data Source: For training, all Turkish data from the monolingual Turkish MaCoCu corpus was used. After de-duplication, 35 GB of text (4.4 billion tokens) remained.
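The exact pipeline is documented in the GitHub repo. Purely as an illustrative sketch (not the authors' scripts), continued masked-language-model training from XLM-R-large with the transformers Trainer could look roughly like the following, where turkish_corpus.txt is a placeholder for the de-duplicated MaCoCu data and the learning rate is an assumed value:

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the public XLM-R-large checkpoint and keep its vocabulary.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

# Placeholder corpus: one Turkish text per line.
dataset = load_dataset("text", data_files={"train": "turkish_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-modeling objective with 15% masking.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="xlmr-macocu-tr",
    max_steps=70_000,                 # matches the reported number of training steps
    per_device_train_batch_size=8,    # the reported batch size of 1,024 requires accumulation or many devices
    learning_rate=1e-4,               # assumed value, not taken from the paper
)

trainer = Trainer(model=model, args=args, train_dataset=train_data, data_collator=collator)
trainer.train()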
Technical Details
- Model Training: The model was trained on 35 GB of Turkish text (4.4B tokens) for 70,000 steps with a batch size of 1,024, continuing from the XLM-RoBERTa-large checkpoint.
- Benchmark Testing: The performance of XLMR-MaCoCu-tr was tested on the UPOS, XPOS, and NER benchmarks from the Universal Dependencies project and on COPA. The comparison was made with strong multilingual models (XLM-R-base and XLM-R-large) and the monolingual BERTurk model.
| Property | Details |
|----------|---------|
| Model Type | XLMR-MaCoCu-tr (a large pre-trained language model for Turkish) |
| Training Data | 35 GB of Turkish text (4.4B tokens) from the MaCoCu project |
| Training Steps | 70,000 |
| Batch Size | 1,024 |
License
The model is licensed under the CC0-1.0 license.
Benchmark Performance
We tested the performance of XLMR-MaCoCu-tr on multiple benchmarks. Scores are averages over multiple runs (three runs for most tasks, ten for COPA). We used the same hyperparameter settings for all models for POS/NER and optimized the learning rate for each model on the dev set for COPA. For COPA, scores are reported on machine-translated (MT) and human-translated (HT) test sets.
| Model | UPOS (Dev) | UPOS (Test) | XPOS (Dev) | XPOS (Test) | NER (Dev) | NER (Test) | COPA (Test, MT) | COPA (Test, HT) |
|---|---|---|---|---|---|---|---|---|
| XLM-R-base | 89.0 | 89.0 | 90.4 | 90.6 | 92.8 | 92.6 | 56.0 | 53.2 |
| XLM-R-large | 89.4 | 89.3 | 90.8 | 90.7 | 94.1 | 94.1 | 52.1 | 50.5 |
| BERTurk | 88.2 | 88.4 | 89.7 | 89.6 | 92.6 | 92.6 | 57.0 | 56.4 |
| XLMR-MaCoCu-tr | 89.1 | 89.4 | 90.7 | 90.5 | 94.4 | 94.4 | 60.7 | 58.5 |
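As a hedged sketch of how one might fine-tune the model for a token-level task such as NER or POS tagging (the label set below is a placeholder and data loading is omitted; this is not the evaluation code behind the table above):

from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder label inventory; the real UD POS and NER label sets differ.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-tr")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/XLMR-MaCoCu-tr",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# From here, tokenize the training sentences, align the word-level labels to
# sub-word tokens, and train with the transformers Trainer as usual.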
Acknowledgements
Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union's Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).
Citation
If you use this model, please cite the following paper:
@inproceedings{non-etal-2022-macocu,
title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
author = "Ba{\~n}{\'o}n, Marta and
Espl{\`a}-Gomis, Miquel and
Forcada, Mikel L. and
Garc{\'\i}a-Romero, Cristian and
Kuzman, Taja and
Ljube{\v{s}}i{\'c}, Nikola and
van Noord, Rik and
Sempere, Leopoldo Pla and
Ram{\'\i}rez-S{\'a}nchez, Gema and
Rupnik, Peter and
Suchomel, V{\'\i}t and
Toral, Antonio and
van der Werff, Tobias and
Zaragoza, Jaume",
booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
month = jun,
year = "2022",
address = "Ghent, Belgium",
publisher = "European Association for Machine Translation",
url = "https://aclanthology.org/2022.eamt-1.41",
pages = "303--304"
}