Vietnamese Llama2-7B-40GB Model
This project retrains a Vietnamese tokenizer and performs continual pretraining on the Llama2-chat-7B model. The new tokenizer significantly improves the encoding of Vietnamese text, and the model is trained on a 40.5 GB mixed Vietnamese-English dataset.
Quick Start
For usage and other considerations, please refer to the Llama 2 documentation.
Features
- Improved Tokenizer: We employed SentencePiece to retrain a Vietnamese tokenizer with a 20K vocabulary. Merged with the original Llama2 vocabulary, it reduces the number of tokens needed to encode Vietnamese text by 50% compared to ChatGPT and by approximately 70% compared to the original Llama2.
- Continual Pretraining: We conducted single-epoch continual pretraining on the Llama2-chat-7B model using a 40.5 GB mixed dataset.
Installation
No installation steps are provided in the original document.
Usage Examples
No code examples are provided in the original document.
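The release provides a LoRA adapter rather than ready-to-run code, so the following is a minimal inference sketch, assuming the standard Hugging Face stack (torch, transformers, peft, sentencepiece, accelerate); the adapter path below is a placeholder, not an official repository id.

```python
# Minimal inference sketch (assumed workflow, not the authors' official code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-2-7b-chat-hf"       # base model named in the LoRA config below
ADAPTER_ID = "path/to/vietnamese-llama2-lora"   # placeholder for the released LoRA part

# The merged Vietnamese tokenizer differs from the original Llama2 tokenizer,
# so it should be loaded from the adapter/model repository.
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)

model = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    torch_dtype=torch.bfloat16,  # matches the BFloat16 training regime
    device_map="auto",
)
# embed_tokens and lm_head were extended during training (see modules_to_save),
# so the base embeddings must be resized before attaching the adapter.
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, ADAPTER_ID)
model.eval()

prompt = "Hà Nội là thủ đô của nước nào?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because the base release has not undergone supervised fine-tuning (see the Important Note below), raw completions may be of limited quality.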
Documentation
Tokenizer Retraining
We used SentencePiece to retrain a Vietnamese tokenizer with a vocabulary size of 20K, without applying Vietnamese word segmentation. We then merged this vocabulary with the original Llama2 vocabulary, removing duplicate tokens.
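As an illustration of this step (not the authors' actual script), a retrain-and-merge sketch with SentencePiece could look as follows; the corpus path, model prefix, tokenizer file names, and training options are assumptions.

```python
# Hedged sketch: retrain a 20K Vietnamese tokenizer and merge it into Llama2's.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# 1) Train a 20K-vocab Vietnamese tokenizer (no word segmentation applied).
spm.SentencePieceTrainer.train(
    input="vietnamese_corpus.txt",   # hypothetical corpus file
    model_prefix="vi_sp",
    vocab_size=20000,
    model_type="bpe",                # assumption; Llama2 uses a BPE-style model
    character_coverage=1.0,
)

# 2) Merge the new pieces into the original Llama2 tokenizer, skipping duplicates.
llama_model = sp_pb2.ModelProto()
with open("llama2_tokenizer.model", "rb") as f:   # hypothetical path to Llama2's tokenizer.model
    llama_model.ParseFromString(f.read())
vi_model = sp_pb2.ModelProto()
with open("vi_sp.model", "rb") as f:
    vi_model.ParseFromString(f.read())

existing = {p.piece for p in llama_model.pieces}
for p in vi_model.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0.0
        llama_model.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_model.SerializeToString())
```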
Continual Pretraining
We carried out single-epoch continual pretraining on the Llama2-chat-7B model. The mixed dataset (40.5 GB) includes the following corpora (a mixture-sampling sketch follows the list):
- 19 GB NewsCorpus
- 1.1 GB Vietnamese Wikipedia
- 1.6 GB Vietnamese books
- 4.5 GB Vietnamese legal documents (crawled from thuvienphapluat and processed by ourselves)
- 2.1 GB Vietnamese legal text (from C4-vi)
- 1.1 GB English books (sub-sampled from pg19)
- 1.1 GB English Wikipedia (sub-sampled from the 20220301.en Wikipedia dump)
- 10 GB English text (sub-sampled from C4-en)
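For illustration only, such a mixture could be sampled with the Hugging Face `datasets` library in proportion to the sizes listed above; the data files below are placeholders, and this is not the authors' pipeline.

```python
# Hedged sketch: interleave the corpora with probabilities proportional to size.
from datasets import load_dataset, interleave_datasets

# Corpus sizes in GB, taken from the list above; they sum to 40.5 GB.
sizes_gb = {
    "news_corpus": 19.0,
    "vi_wikipedia": 1.1,
    "vi_books": 1.6,
    "vi_legal_documents": 4.5,
    "vi_legal_c4": 2.1,
    "en_books_pg19": 1.1,
    "en_wikipedia": 1.1,
    "en_c4": 10.0,
}
total_gb = sum(sizes_gb.values())
probabilities = [size / total_gb for size in sizes_gb.values()]

# Placeholder JSONL files, one per corpus, each with a "text" field.
sources = [
    load_dataset("json", data_files=f"data/{name}.jsonl", split="train", streaming=True)
    for name in sizes_gb
]

# Sample from each corpus in proportion to its share of the 40.5 GB mixture.
mixed = interleave_datasets(sources, probabilities=probabilities, seed=42)
```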
Training Environment
We trained the model on a DGX A100 system, using four A100 GPUs for 10 days (about 1,000 GPU hours).
Hyperparameters
- Training Regime: BFloat16 mixed precision
- LoRA Config:

```json
{
  "base_model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
  "bias": "none",
  "enable_lora": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "lora_alpha": 32.0,
  "lora_dropout": 0.05,
  "merge_weights": false,
  "modules_to_save": [
    "embed_tokens",
    "lm_head"
  ],
  "peft_type": "LORA",
  "r": 8,
  "target_modules": [
    "q_proj",
    "v_proj",
    "k_proj",
    "o_proj",
    "gate_proj",
    "down_proj",
    "up_proj"
  ],
  "task_type": "CAUSAL_LM"
}
```
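For reference, the configuration above maps roughly onto peft's `LoraConfig` as sketched below; whether the authors instantiated it exactly this way is an assumption (fields such as `enable_lora` and `merge_weights` appear to come from an older peft version and have no current equivalent).

```python
# Hedged sketch: recreate the LoRA setup above with current peft.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],
    modules_to_save=["embed_tokens", "lm_head"],  # train the resized embeddings fully
)

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()
```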
LoRA Part
We also provide the LoRA part so that you can integrate it with the original Llama2-chat-7b yourself, as sketched below.
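A merging sketch (an assumption of the typical peft workflow, not the authors' script) that produces standalone merged weights could look like this; the adapter and output paths are placeholders.

```python
# Hedged sketch: merge the released LoRA part into the base Llama2-chat-7b weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

ADAPTER_ID = "path/to/lora-part"  # placeholder; ships the merged Vietnamese tokenizer

tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.bfloat16
)
base.resize_token_embeddings(len(tokenizer))  # account for the extended vocabulary
merged = PeftModel.from_pretrained(base, ADAPTER_ID).merge_and_unload()

merged.save_pretrained("vietnamese-llama2-7b-40gb")
tokenizer.save_pretrained("vietnamese-llama2-7b-40gb")
```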
Training Loss

Technical Details
The model was trained on a DGX A100 system with four A100 GPUs for about 1,000 GPU hours. The hyperparameters and the dataset composition were designed to improve the model's performance on both Vietnamese and English text.
License
The license is "other". This project is built upon Meta's Llama-2 model, so you must strictly adhere to Llama-2's open-source license agreement when using this model. If you incorporate third-party code, please ensure compliance with the relevant open-source license agreements.
Information Table

| Property | Details |
|----------|---------|
| Model Type | Vietnamese Llama2-7B-40GB |
| Training Data | 40.5 GB mixed dataset including NewsCorpus, Vietnamese Wikipedia, Vietnamese books, Vietnamese legal documents, English books, English Wikipedia, and English text |
Important Note

This model requires further supervised fine-tuning (SFT) to be used in practice!
Usage Tip

The content generated by the model may be influenced by various factors, such as the computation method, random sampling, and potential inaccuracies introduced by quantization. Consequently, this project offers no guarantee regarding the accuracy of the model's outputs and accepts no responsibility for consequences arising from the use of the model's resources or its output. Developers employing the models from this project for commercial purposes must adhere to local laws and regulations to ensure the compliance of the model's output content. This project is not accountable for any products or services derived from such usage.
Acknowledgments
We extend our gratitude to PHPC - Phenikaa University and NVIDIA for their generous provision of computing resources for model training. Our appreciation also goes out to binhvq and the other authors for their diligent efforts in collecting and preparing the Vietnamese text corpus.
Citation
If this model or dataset is used in your work, please cite our manuscript:
```bibtex
@article{duc2024towards,
  title={Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models},
  author={Nguyen Quang Duc and Le Hai Son and Nguyen Duc Nhan and Nguyen Dich Nhat Minh and Le Thanh Huong and Dinh Viet Sang},
  journal={arXiv preprint arXiv:2403.01616},
  year={2024}
}
```