# VBART Model Card
VBART is the first large-scale sequence-to-sequence LLM pre-trained from scratch on Turkish corpora, offering high-performance text generation capabilities.
## Quick Start
This README provides detailed information about the VBART model, including its description, training details, and citation information.
## Features
- VBART is the first large-scale sequence-to-sequence LLM pre-trained from scratch on Turkish corpora.
- Capable of conditional text generation tasks such as text summarization, paraphrasing, and title generation after fine-tuning.
- Outperforms its multilingual counterparts despite being smaller in size.
## Installation
No model-specific installation steps are documented; the pre-trained TensorFlow and Safetensors weights are distributed directly in this repository.
## Usage Examples
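The card itself does not ship official code examples, so the snippet below is only a minimal sketch of how the checkpoint could be loaded for masked-span infilling. The repository ID `vngrs-ai/VBART-Medium-Base`, the use of the generic `Auto*` classes, and the mask-token fallback are assumptions rather than documented usage; the downstream tasks listed above require fine-tuning first.

```python
# A minimal loading/inference sketch, not taken from the original card. It assumes the
# weights are published on the Hugging Face Hub under an ID such as
# "vngrs-ai/VBART-Medium-Base" and that the checkpoint loads with the generic
# AutoTokenizer / AutoModelForSeq2SeqLM classes of the `transformers` library.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "vngrs-ai/VBART-Medium-Base"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# The base model was pre-trained with span masking, so without fine-tuning it is only
# suited to infilling masked spans; summarization, paraphrasing, and title generation
# all require task-specific fine-tuning.
mask = tokenizer.mask_token or "<mask>"  # fall back to a common mask literal if unset
text = f"İstanbul, Türkiye'nin en kalabalık {mask} biridir."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```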
## Documentation
### Model Description
VBART is the first sequence-to-sequence LLM pre-trained from scratch on Turkish corpora at large scale. It was pre-trained by VNGRS in February 2023. When fine-tuned, the model is capable of conditional text generation tasks such as text summarization, paraphrasing, and title generation, and it outperforms its multilingual counterparts despite being much smaller.
This repository contains the pre-trained TensorFlow and Safetensors weights of VBART-Medium-Base.
| Property | Details |
|----------|---------|
| Developed by | VNGRS-AI |
| Model Type | Transformer encoder-decoder based on the mBART architecture |
| Language(s) (NLP) | Turkish |
| License | CC BY-NC-SA 4.0 |
| Paper | [arXiv:2403.01308](https://arxiv.org/abs/2403.01308) |
### Training Details
#### Training Data
The base model is pre-trained on [vngrs-web-corpus](https://huggingface.co/datasets/vngrs-ai/vngrs-web-corpus), which is curated by cleaning and filtering the Turkish portions of the [OSCAR-2201](https://huggingface.co/datasets/oscar-corpus/OSCAR-2201) and mC4 datasets. Both consist of documents of unstructured web-crawl data; more information can be found on their respective dataset pages. The data is filtered using a set of heuristics and rules explained in the appendix of our paper.
#### Limitations
This model is the pre-trained base model and is capable of masked language modeling. Its purpose is to serve as the base model to be fine-tuned for downstream tasks.
#### Training Procedure
Pre-trained for a total of 63B tokens.
Hardware:
- GPUs: 8 x Nvidia A100-80GB
Software:
- TensorFlow
Hyperparameters:
Pretraining
- Training regime: fp16 mixed precision
- Training objective: Span masking (mask-span lengths sampled from a Poisson distribution with λ = 3.5, masking 30% of tokens); a toy sketch of this noising step follows this list
- Optimizer: Adam (β1 = 0.9, β2 = 0.98, ε = 1e-6)
- Scheduler: Custom scheduler from the original Transformer paper (20,000 warm-up steps)
- Dropout: 0.1
- Initial learning rate: 5e-6
- Training tokens: 63B
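To make the pre-training objective above more concrete, here is a toy sketch of BART-style span masking with the stated Poisson(λ = 3.5) span lengths and a 30% masking budget. It is an illustration under those assumptions, not the authors' actual noising code, whose exact sampling details may differ.

```python
# A toy sketch of the span-masking objective described above; this is not the authors'
# pre-training code. Span lengths are drawn from Poisson(λ = 3.5) and spans are chosen
# until roughly 30% of the tokens are covered, with each span replaced by a single mask
# token (BART-style span infilling). The <mask> literal is illustrative only.
import numpy as np

def span_mask(tokens, mask_token="<mask>", mask_ratio=0.3, poisson_lambda=3.5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(tokens)
    budget = int(round(mask_ratio * n))  # number of original tokens to cover with masks
    is_masked = [False] * n
    covered = 0
    while covered < budget:
        length = min(max(1, int(rng.poisson(poisson_lambda))), budget - covered)
        start = int(rng.integers(0, n - length + 1))
        if any(is_masked[start:start + length]):
            continue  # simple sketch: skip spans that overlap an already chosen one
        for i in range(start, start + length):
            is_masked[i] = True
        covered += length
    out, i = [], 0
    while i < n:
        if is_masked[i]:
            out.append(mask_token)  # collapse the whole masked span into one mask token
            while i < n and is_masked[i]:
                i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out

print(span_mask("Ankara Türkiye'nin başkenti ve ikinci en kalabalık şehridir".split()))
```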
## Technical Details
VBART is pre-trained on Turkish corpora using a Transformer encoder-decoder architecture based on mBART. The pre-training run covers 63B tokens of carefully curated and filtered web data and uses span masking as its objective together with a custom learning-rate schedule, which together account for its strong performance on conditional text generation tasks after fine-tuning.
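As an illustration of the scheduler referenced above, the sketch below implements the learning-rate formula from the original Transformer paper ("Attention Is All You Need") with the card's 20,000 warm-up steps. The model dimension is an assumed value for this example, and how VBART's reported 5e-6 initial learning rate enters the schedule is not specified on the card.

```python
# Learning-rate schedule from the original Transformer paper, shown only to illustrate
# the "custom scheduler ... (20,000 warm-up steps)" noted above. d_model = 1024 is an
# assumed value; VBART's exact scheduler variant is not documented in this card.
def transformer_lr(step: int, d_model: int = 1024, warmup_steps: int = 20_000) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 5_000, 20_000, 100_000):
    print(s, transformer_lr(s))
```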
## License
The VBART model is released under the CC BY-NC-SA 4.0 license.
## Citation
```bibtex
@article{turker2024vbart,
  title={VBART: The Turkish LLM},
  author={Turker, Meliksah and Ari, Erdi and Han, Aydin},
  journal={arXiv preprint arXiv:2403.01308},
  year={2024}
}
```