VBART Model Card
VBART is the first large-scale sequence-to-sequence LLM pre-trained from scratch on Turkish corpora. Developed by VNGRS in February 2023, it excels in conditional text generation tasks after fine-tuning. Despite its relatively small size, it outperforms its multilingual counterparts. This repository provides pre-trained TensorFlow and Safetensors weights of VBART-Small-Base.
Quick Start
This README offers a detailed introduction to the VBART model, including its description, training details, and citation information.
Features
- Turkish-specific: Pre-trained on Turkish corpora, making it highly effective for Turkish language tasks.
- Conditional text generation: Capable of tasks like text summarization, paraphrasing, and title generation after fine-tuning.
- High performance: Outperforms multilingual counterparts despite its smaller size.
Installation
No installation steps are provided in the original document, so this section is skipped.
Usage Examples
The original document does not include runnable code examples.
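As an illustration only, the snippet below is a minimal, hedged loading sketch. It assumes the weights are published under the Hugging Face repo id `vngrs-ai/VBART-Small-Base` and that the checkpoint loads through the standard `AutoModelForSeq2SeqLM` interface; adjust the repo id and generation settings to your setup.

```python
# A minimal loading sketch, not an official example. The repo id below is an
# assumption; the base model is only pre-trained with a denoising objective,
# so raw generation is mainly a sanity check before fine-tuning.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "vngrs-ai/VBART-Small-Base"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("İstanbul, Türkiye'nin en kalabalık şehridir.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```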
Documentation
Model Description
VBART is the first sequence-to-sequence LLM pre-trained on Turkish corpora from scratch on a large scale. It was pre-trained by VNGRS in February 2023. When fine-tuned, the model can perform conditional text generation tasks such as text summarization, paraphrasing, and title generation. It outperforms its multilingual counterparts despite being much smaller than other implementations.
| Property | Details |
|----------|---------|
| Developed by | VNGRS-AI |
| Model Type | Transformer encoder-decoder based on mBART architecture |
| Language(s) (NLP) | Turkish |
| License | CC BY-NC-SA 4.0 |
| Paper | [arXiv](https://arxiv.org/abs/2403.01308) |
Training Details
Training Data
The base model is pre-trained on [vngrs-web-corpus](https://huggingface.co/datasets/vngrs-ai/vngrs-web-corpus), which is curated by cleaning and filtering the Turkish portions of the [OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and mC4 datasets. Both consist of unstructured web-crawl documents; more information can be found on their respective pages. The data is filtered using a set of heuristics and rules, explained in the appendix of our paper.
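For readers who want to inspect the corpus, here is a small hedged sketch using the `datasets` library. It assumes the dataset is reachable under the id `vngrs-ai/vngrs-web-corpus` linked above, that a `train` split exists, and that streaming access is permitted; field names and access gating may differ in practice.

```python
# A hedged sketch of streaming a few records from the pre-training corpus.
from datasets import load_dataset

corpus = load_dataset("vngrs-ai/vngrs-web-corpus", split="train", streaming=True)
for i, record in enumerate(corpus):
    print(record)      # each record is one cleaned web-crawl document
    if i == 2:         # only peek at the first three records
        break
```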
Limitations
This model is the pre - trained base model and is capable of masked language modeling. Its purpose is to serve as the base model to be fine - tuned for downstream tasks.
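Because the checkpoint is positioned as a base for downstream fine-tuning, the following is a minimal, hedged fine-tuning skeleton built on the Hugging Face `Seq2SeqTrainer`. The repo id, the toy dataset, and all training hyperparameters are placeholders for illustration; they are not the fine-tuning recipe used by the VBART authors.

```python
# A hedged fine-tuning skeleton for a downstream summarization task.
# Assumptions: the base weights load via AutoModelForSeq2SeqLM under the repo
# id "vngrs-ai/VBART-Small-Base", and `train_data` is replaced with your own
# {"text": ..., "summary": ...} records.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_id = "vngrs-ai/VBART-Small-Base"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

train_data = [{"text": "...", "summary": "..."}]  # placeholder records

def preprocess(batch):
    # Tokenize inputs and targets; lengths here are illustrative defaults.
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["summary"],
                              truncation=True, max_length=128)["input_ids"]
    return enc

train_ds = (Dataset.from_list(train_data)
            .map(preprocess, batched=True, remove_columns=["text", "summary"]))

args = Seq2SeqTrainingArguments(
    output_dir="vbart-small-finetuned",
    per_device_train_batch_size=8,
    learning_rate=5e-5,       # placeholder, not the paper's setting
    num_train_epochs=3,
    fp16=True,                # mirrors the fp16 regime above; requires a GPU
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()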
Training Procedure
Pre-trained for a total of 52B tokens.
Hardware
- GPUs: 8 x Nvidia A100 (80 GB)
Software
- TensorFlow
Hyperparameters
- Pretraining:
- Training regime: fp16 mixed precision
- Training objective: Span masking (mask lengths sampled from a Poisson distribution with λ = 3.5, masking 30% of tokens; a simplified sketch follows this list)
- Optimizer: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- Scheduler: Custom scheduler from the original Transformer paper (20,000 warm-up steps)
- Dropout: 0.1
- Initial learning rate: 5e-6
- Training tokens: 52B
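To make the noising objective concrete, here is a simplified, hedged sketch of span masking as described above: span lengths are drawn from a Poisson(λ = 3.5) distribution, roughly 30% of tokens are covered, and each span collapses into a single mask token in the style of mBART text infilling. The function name, the `<mask>` string, and the span-placement strategy are illustrative choices, not the exact pre-training code.

```python
# Illustrative span masking: not the actual VBART noising implementation.
import numpy as np

def span_mask(tokens, mask_token="<mask>", mask_ratio=0.3, poisson_lambda=3.5, seed=0):
    """Collapse random token spans into single mask tokens."""
    rng = np.random.default_rng(seed)
    budget = int(round(len(tokens) * mask_ratio))   # ~30% of tokens may be masked
    # Starting a span with probability mask_ratio / poisson_lambda means that,
    # with an average span length of poisson_lambda, roughly mask_ratio of all
    # tokens end up inside a masked span.
    start_p = mask_ratio / poisson_lambda
    out, i = [], 0
    while i < len(tokens):
        if budget > 0 and rng.random() < start_p:
            span = min(max(1, int(rng.poisson(poisson_lambda))), budget, len(tokens) - i)
            out.append(mask_token)      # the whole span becomes one mask token
            budget -= span
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out

example = "VBART Türkçe metinler üzerinde sıfırdan ön eğitilmiş bir modeldir .".split()
print(span_mask(example))
```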
License
The model is released under the CC BY-NC-SA 4.0 license.
Technical Details
The model is a Transformer encoder-decoder based on the mBART architecture. Pre-training on Turkish corpora gives it an edge in Turkish language tasks. The training process involves filtering data from multiple web corpora and pre-training with the hyperparameters listed above.
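The card names the architecture but not its dimensions. As a purely illustrative sketch, the snippet below shows how an mBART-style encoder-decoder can be instantiated with the Hugging Face `MBart` classes; every dimension shown is a placeholder, not the actual VBART-Small configuration.

```python
# A hedged illustration of "mBART-style encoder-decoder" in code. All sizes
# below are placeholders chosen for readability; they are NOT the real
# VBART-Small hyperparameters, which are not listed in this card.
from transformers import MBartConfig, MBartForConditionalGeneration

config = MBartConfig(
    vocab_size=32_000,           # placeholder vocabulary size
    d_model=512,                 # placeholder hidden size
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=8,
    decoder_attention_heads=8,
    encoder_ffn_dim=2048,
    decoder_ffn_dim=2048,
    dropout=0.1,                 # matches the dropout listed under Hyperparameters
)
model = MBartForConditionalGeneration(config)
print(f"{model.num_parameters():,} parameters")
```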
Citation
```bibtex
@article{turker2024vbart,
  title={VBART: The Turkish LLM},
  author={Turker, Meliksah and Ari, Erdi and Han, Aydin},
  journal={arXiv preprint arXiv:2403.01308},
  year={2024}
}
```