VBART Model Card
VBART is the first large-scale sequence-to-sequence LLM pre-trained from scratch on Turkish corpora. Developed by VNGRS in February 2023, it excels in conditional text generation tasks after fine-tuning. Despite its relatively small size, it outperforms its multilingual counterparts. This repository provides pre-trained TensorFlow and Safetensors weights of VBART-Small-Base.
Quick Start
This README offers a detailed introduction to the VBART model, including its description, training details, and citation information.
Features
- Turkish-specific: Pre-trained on Turkish corpora, making it highly effective for Turkish language tasks.
- Conditional text generation: Capable of tasks like text summarization, paraphrasing, and title generation after fine-tuning.
- High performance: Outperforms multilingual counterparts despite its smaller size.
Installation
No installation steps are provided in the original document, so this section is skipped.
Usage Examples
The original document does not include runnable code examples.
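As an illustration only, the snippet below is a minimal, hedged loading sketch. It assumes the weights are published under the Hugging Face repo id `vngrs-ai/VBART-Small-Base` and that the checkpoint loads through the standard `AutoModelForSeq2SeqLM` interface; adjust the repo id and generation settings to your setup.

```python
# A minimal loading sketch, not an official example. The repo id below is an
# assumption; the base model is only pre-trained with a denoising objective,
# so raw generation is mainly a sanity check before fine-tuning.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "vngrs-ai/VBART-Small-Base"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("İstanbul, Türkiye'nin en kalabalık şehridir.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```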
Documentation
Model Description
VBART is the first sequence-to-sequence LLM pre-trained on Turkish corpora from scratch on a large scale. It was pre-trained by VNGRS in February 2023. When fine-tuned, the model can perform conditional text generation tasks such as text summarization, paraphrasing, and title generation. It outperforms its multilingual counterparts despite being much smaller than other implementations.
| Property | Details |
|----------|---------|
| Developed by | VNGRS-AI |
| Model Type | Transformer encoder-decoder based on mBART architecture |
| Language(s) (NLP) | Turkish |
| License | CC BY-NC-SA 4.0 |
| Paper | [arXiv](https://arxiv.org/abs/2403.01308) |
Training Details
Training Data
The base model is pre-trained on [vngrs-web-corpus](https://huggingface.co/datasets/vngrs-ai/vngrs-web-corpus), which is curated by cleaning and filtering the Turkish portions of the [OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and mC4 datasets. Both consist of unstructured web-crawl documents; more information can be found on their respective pages. The data is filtered using a set of heuristics and rules, explained in the appendix of our paper.
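For readers who want to inspect the corpus, here is a small hedged sketch using the `datasets` library. It assumes the dataset is reachable under the id `vngrs-ai/vngrs-web-corpus` linked above, that a `train` split exists, and that streaming access is permitted; field names and access gating may differ in practice.

```python
# A hedged sketch of streaming a few records from the pre-training corpus.
from datasets import load_dataset

corpus = load_dataset("vngrs-ai/vngrs-web-corpus", split="train", streaming=True)
for i, record in enumerate(corpus):
    print(record)      # each record is one cleaned web-crawl document
    if i == 2:         # only peek at the first three records
        break
```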
Limitations
This model is the pre - trained base model and is capable of masked language modeling. Its purpose is to serve as the base model to be fine - tuned for downstream tasks.
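Because the checkpoint is positioned as a base for downstream fine-tuning, the following is a minimal, hedged fine-tuning skeleton built on the Hugging Face `Seq2SeqTrainer`. The repo id, the toy dataset, and all training hyperparameters are placeholders for illustration; they are not the fine-tuning recipe used by the VBART authors.

```python
# A hedged fine-tuning skeleton for a downstream summarization task.
# Assumptions: the base weights load via AutoModelForSeq2SeqLM under the repo
# id "vngrs-ai/VBART-Small-Base", and `train_data` is replaced with your own
# {"text": ..., "summary": ...} records.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_id = "vngrs-ai/VBART-Small-Base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

train_data = [{"text": "...", "summary": "..."}]  # placeholder records

def preprocess(batch):
    # Tokenize inputs and targets; lengths here are illustrative defaults.
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["summary"],
                              truncation=True, max_length=128)["input_ids"]
    return enc

train_ds = (Dataset.from_list(train_data)
            .map(preprocess, batched=True, remove_columns=["text", "summary"]))

args = Seq2SeqTrainingArguments(
    output_dir="vbart-small-finetuned",
    per_device_train_batch_size=8,
    learning_rate=5e-5,       # placeholder, not the paper's setting
    num_train_epochs=3,
    fp16=True,                # mirrors the fp16 regime above; requires a GPU
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()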
Training Procedure
Pre-trained for a total of 52B tokens.
Hardware
- GPUs: 8 x Nvidia A100 (80 GB)
Software
- TensorFlow
Hyperparameters
- Pretraining:
- Training regime: fp16 mixed precision
- Training objective: Span masking (mask lengths sampled from a Poisson distribution with λ = 3.5, masking 30% of tokens; a simplified sketch follows this list)
- Optimizer: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- Scheduler: Custom scheduler from the original Transformer paper (20,000 warm-up steps)
- Dropout: 0.1
- Initial learning rate: 5e-6
- Training tokens: 52B
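To make the noising objective concrete, here is a simplified, hedged sketch of span masking as described above: span lengths are drawn from a Poisson(λ = 3.5) distribution, roughly 30% of tokens are covered, and each span collapses into a single mask token in the style of mBART text infilling. The function name, the `<mask>` string, and the span-placement strategy are illustrative choices, not the exact pre-training code.

```python
# Illustrative span masking: not the actual VBART noising implementation.
import numpy as np

def span_mask(tokens, mask_token="<mask>", mask_ratio=0.3, poisson_lambda=3.5, seed=0):
    """Collapse random token spans into single mask tokens."""
    rng = np.random.default_rng(seed)
    budget = int(round(len(tokens) * mask_ratio))   # ~30% of tokens may be masked
    # Starting a span with probability mask_ratio / poisson_lambda means that,
    # with an average span length of poisson_lambda, roughly mask_ratio of all
    # tokens end up inside a masked span.
    start_p = mask_ratio / poisson_lambda
    out, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < start_p:
            span = min(max(1, int(rng.poisson(poisson_lambda))), budget, len(tokens) - i)
            out.append(mask_token)      # the whole span becomes one mask token
            budget -= span
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out

example = "VBART Türkçe metinler üzerinde sıfırdan ön eğitilmiş bir modeldir .".split()
print(span_mask(example))
```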
License
The model is released under the CC BY-NC-SA 4.0 license.
Technical Details
The model is a Transformer encoder-decoder based on the mBART architecture. Pre-training on Turkish corpora gives it an edge in Turkish language tasks. The training process involves filtering data from multiple web corpora and pre-training with the hyperparameters listed above.
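The card names the architecture but not its dimensions. As a purely illustrative sketch, the snippet below shows how an mBART-style encoder-decoder can be instantiated with the Hugging Face `MBart` classes; every dimension shown is a placeholder, not the actual VBART-Small configuration.

```python
# A hedged illustration of "mBART-style encoder-decoder" in code. All sizes
# below are placeholders chosen for readability; they are NOT the real
# VBART-Small hyperparameters, which are not listed in this card.
from transformers import MBartConfig, MBartForConditionalGeneration

config = MBartConfig(
    vocab_size=32_000,           # placeholder vocabulary size
    d_model=512,                 # placeholder hidden size
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,
    decoder_ffn_dim=2048,
    dropout=0.1,                 # matches the dropout listed under Hyperparameters
)
model = MBartForConditionalGeneration(config)
print(f"{model.num_parameters():,} parameters")
```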
Citation
```bibtex
@article{turker2024vbart,
  title={VBART: The Turkish LLM},
  author={Turker, Meliksah and Ari, Erdi and Han, Aydin},
  journal={arXiv preprint arXiv:2403.01308},
  year={2024}
}
```