# VBART Model Card
VBART is the first large-scale sequence-to-sequence LLM pre-trained from scratch on Turkish corpora, offering high-performance text generation capabilities.
## Quick Start
This README provides detailed information about the VBART model, including its description, training details, and citation information.
## Features
- VBART is the first large-scale sequence-to-sequence LLM pre-trained from scratch on Turkish corpora.
- Capable of conditional text generation tasks such as text summarization, paraphrasing, and title generation after fine-tuning.
- Outperforms its multilingual counterparts despite being smaller in size.
## Installation
No model-specific installation steps are documented; the pre-trained TensorFlow and Safetensors weights are distributed directly in this repository.
## Usage Examples
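The card itself does not ship official code examples, so the snippet below is only a minimal sketch of how the checkpoint could be loaded for masked-span infilling. The repository ID `vngrs-ai/VBART-Medium-Base`, the use of the generic `Auto*` classes, and the mask-token fallback are assumptions rather than documented usage; the downstream tasks listed above require fine-tuning first.

```python
# A minimal loading/inference sketch, not taken from the original card. It assumes the
# weights are published on the Hugging Face Hub under an ID such as
# "vngrs-ai/VBART-Medium-Base" and that the checkpoint loads with the generic
# AutoTokenizer / AutoModelForSeq2SeqLM classes of the `transformers` library.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "vngrs-ai/VBART-Medium-Base"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# The base model was pre-trained with span masking, so without fine-tuning it is only
# suited to infilling masked spans; summarization, paraphrasing, and title generation
# all require task-specific fine-tuning.
mask = tokenizer.mask_token or "<mask>"  # fall back to a common mask literal if unset
text = f"İstanbul, Türkiye'nin en kalabalık {mask} biridir."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```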
## Documentation
### Model Description
VBART is the first sequence-to-sequence LLM pre-trained from scratch on Turkish corpora at large scale. It was pre-trained by VNGRS in February 2023. When fine-tuned, the model is capable of conditional text generation tasks such as text summarization, paraphrasing, and title generation, and it outperforms its multilingual counterparts despite being much smaller.
This repository contains the pre-trained TensorFlow and Safetensors weights of VBART-Medium-Base.
| Property | Details |
|----------|---------|
| Developed by | VNGRS-AI |
| Model Type | Transformer encoder-decoder based on the mBART architecture |
| Language(s) (NLP) | Turkish |
| License | CC BY-NC-SA 4.0 |
| Paper | [arXiv:2403.01308](https://arxiv.org/abs/2403.01308) |
### Training Details
#### Training Data
The base model is pre-trained on [vngrs-web-corpus](https://huggingface.co/datasets/vngrs-ai/vngrs-web-corpus), which is curated by cleaning and filtering the Turkish portions of the [OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and mC4 datasets. Both consist of documents of unstructured web-crawl data; more information can be found on their respective dataset pages. The data is filtered using a set of heuristics and rules explained in the appendix of our paper.
#### Limitations
This model is the pre-trained base model and is capable of masked language modeling. Its purpose is to serve as the base model to be fine-tuned for downstream tasks.
#### Training Procedure
Pre-trained for a total of 63B tokens.
Hardware:
- GPUs: 8 x Nvidia A100-80GB
Software:
- TensorFlow
Hyperparameters:
Pretraining
- Training regime: fp16 mixed precision
- Training objective: Span masking (mask-span lengths sampled from a Poisson distribution with λ = 3.5, masking 30% of tokens); a toy sketch of this noising step follows this list
- Optimizer: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- Scheduler: Custom scheduler from the original Transformer paper (20,000 warm-up steps)
- Dropout: 0.1
- Initial learning rate: 5e-6
- Training tokens: 63B
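To make the pre-training objective above more concrete, here is a toy sketch of BART-style span masking with the stated Poisson(λ = 3.5) span lengths and a 30% masking budget. It is an illustration under those assumptions, not the authors' actual noising code, whose exact sampling details may differ.

```python
# A toy sketch of the span-masking objective described above; this is not the authors'
# pre-training code. Span lengths are drawn from Poisson(λ = 3.5) and spans are chosen
# until roughly 30% of the tokens are covered, with each span replaced by a single mask
# token (BART-style span infilling). The <mask> literal is illustrative only.
import numpy as np

def span_mask(tokens, mask_token="<mask>", mask_ratio=0.3, poisson_lambda=3.5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(tokens)
    budget = int(round(mask_ratio * n))  # number of original tokens to cover with masks
    is_masked = [False] * n
    covered = 0
    while covered < budget:
        length = min(max(1, int(rng.poisson(poisson_lambda))), budget - covered)
        start = int(rng.integers(0, n - length + 1))
        if any(is_masked[start:start + length]):
            continue  # simple sketch: skip spans that overlap an already chosen one
        for i in range(start, start + length):
            is_masked[i] = True
        covered += length
    out, i = [], 0
    while i < n:
        if is_masked[i]:
            out.append(mask_token)  # collapse the whole masked span into one mask token
            while i < n and is_masked[i]:
                i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

print(span_mask("Ankara Türkiye'nin başkenti ve ikinci en kalabalık şehridir".split()))
```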
## Technical Details
VBART is pre-trained on Turkish corpora using a Transformer encoder-decoder architecture based on mBART. The pre-training run covers 63B tokens of carefully curated and filtered web data and uses span masking as its objective together with a custom learning-rate schedule, which together account for its strong performance on conditional text generation tasks after fine-tuning.
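As an illustration of the scheduler referenced above, the sketch below implements the learning-rate formula from the original Transformer paper ("Attention Is All You Need") with the card's 20,000 warm-up steps. The model dimension is an assumed value for this example, and how VBART's reported 5e-6 initial learning rate enters the schedule is not specified on the card.

```python
# Learning-rate schedule from the original Transformer paper, shown only to illustrate
# the "custom scheduler ... (20,000 warm-up steps)" noted above. d_model = 1024 is an
# assumed value; VBART's exact scheduler variant is not documented in this card.
def transformer_lr(step: int, d_model: int = 1024, warmup_steps: int = 20_000) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 5_000, 20_000, 100_000):
    print(s, transformer_lr(s))
```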
## License
The VBART model is released under the CC BY-NC-SA 4.0 license.
## Citation
```bibtex
@article{turker2024vbart,
  title={VBART: The Turkish LLM},
  author={Turker, Meliksah and Ari, Erdi and Han, Aydin},
  journal={arXiv preprint arXiv:2403.01308},
  year={2024}
}
```