Kanarya-750M: Turkish Language Model
Kanarya is a pre-trained Turkish GPT-J 750M model. As part of the Turkish Data Depository initiative, the Kanarya family offers two versions: Kanarya-2B (the larger one) and Kanarya-0.7B (the smaller one, i.e. the 750M model described here). Both are trained on a large-scale Turkish text corpus filtered from the OSCAR and mC4 datasets. The training data, sourced from news, articles, and websites, forms a diverse and high-quality dataset. These models are trained using a JAX/Flax implementation of the [GPT-J](https://github.com/kingoflolz/mesh-transformer-jax) architecture. They are pre-trained only and designed to be fine-tuned on a wide array of Turkish NLP tasks.

✨ Features
- Pre-trained Turkish Model: Specifically tailored for the Turkish language, offering a solid foundation for various NLP tasks.
- Two Model Versions: The Kanarya family provides flexibility with different model sizes to suit different requirements.
- Diverse Training Data: Trained on data from multiple sources, ensuring a broad understanding of the Turkish language.
- GPT-J Architecture: Utilizes the well-known GPT-J architecture implemented in JAX/Flax.
📦 Installation
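No official installation steps are provided. Assuming the checkpoint is published in a `transformers`-compatible format on the Hugging Face Hub, installing the `transformers` library together with a backend such as PyTorch (for example `pip install transformers torch`) should be enough to load it; treat this as a sketch rather than guidance from the original authors.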
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Model Name | Kanarya-750M |
| Model Size | 750M parameters |
| Training Data | OSCAR, mC4 |
| Language | Turkish |
| Layers | 12 |
| Hidden Size | 2048 |
| Number of Heads | 16 |
| Context Size | 2048 |
| Positional Embeddings | Rotary |
| Vocabulary Size | 32,768 |
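As a minimal sketch, the hyperparameters above would map onto a `GPTJConfig` from the `transformers` library roughly as follows, assuming the checkpoint uses the standard GPT-J layout; the `rotary_dim` value is an assumption, since it is not listed in the table.

```python
from transformers import GPTJConfig

# Hypothetical reconstruction of the Kanarya-750M configuration from the table above.
# Field names follow transformers' GPTJConfig, not the original JAX/Flax training code.
config = GPTJConfig(
    vocab_size=32_768,   # Vocabulary Size
    n_positions=2048,    # Context Size
    n_embd=2048,         # Hidden Size
    n_layer=12,          # Layers
    n_head=16,           # Number of Heads
    rotary_dim=64,       # Rotary positional embeddings; the dimension is an assumption
)
```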
Intended Use
This model is pre-trained on Turkish text data and is meant to be fine-tuned for a wide range of Turkish NLP tasks, such as text generation, translation, summarization, etc. It should not be used for downstream tasks without fine-tuning.
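As a rough illustration, the sketch below fine-tunes the model as a causal language model with the `transformers` `Trainer`. The Hub id `asafaya/kanarya-750m` and the plain-text file `train.txt` are assumptions for the example; adapt the data loading and hyperparameters to your actual task.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "asafaya/kanarya-750m"  # hypothetical Hub id; replace with the real one

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers usually lack a pad token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical task-specific Turkish corpus, one document per line.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="kanarya-750m-finetuned", num_train_epochs=1),
    train_dataset=tokenized["train"],
    # mlm=False produces standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```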
Limitations and Ethical Considerations
The model, despite being trained on a high-quality and diverse Turkish text corpus, may generate toxic, biased, or unethical content. Users are strongly advised to use the model responsibly and ensure the generated content is appropriate for the use case. Please report any issues.
Citation
If you use the model, please cite the following paper:
@inproceedings{safaya-etal-2022-mukayese,
    title = "Mukayese: {T}urkish {NLP} Strikes Back",
    author = "Safaya, Ali and
      Kurtulu{\c{s}}, Emirhan and
      Goktogan, Arda and
      Yuret, Deniz",
    editor = "Muresan, Smaranda and
      Nakov, Preslav and
      Villavicencio, Aline",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2022",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-acl.69",
    doi = "10.18653/v1/2022.findings-acl.69",
    pages = "846--863",
}
Acknowledgments
During this work, Ali Safaya was supported by a KUIS AI Center fellowship. Additionally, the pre-training of these models was carried out at the TUBITAK ULAKBIM High Performance and Grid Computing Center ([TRUBA](https://www.truba.gov.tr/index.php/en/main-page/) resources).
💻 Usage Examples
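Below is a minimal text-generation sketch using the `transformers` library, assuming the checkpoint is published on the Hugging Face Hub under a hypothetical id such as `asafaya/kanarya-750m`. Since the model is pre-trained only, expect plain text continuations rather than instruction-following behaviour.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "asafaya/kanarya-750m"  # hypothetical Hub id; replace with the real one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Kanarya is pre-trained only, so plain text continuation is the most meaningful
# check before fine-tuning on a downstream task.
prompt = "Türkiye'nin en kalabalık şehri"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since GPT-J is natively supported by `transformers`, the checkpoint could in principle also be loaded with the Flax classes (e.g. `FlaxAutoModelForCausalLM`) to stay in the JAX ecosystem, provided Flax weights are published.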
📄 License
The model is licensed under the Apache 2.0 License. It is free to use for any purpose, including commercial use. We encourage users to contribute to the model and report any issues. However, the model is provided "as is" without warranty of any kind.