VBD-LLaMA2-7B-50b-Chat

🚀 VBD-LLaMA2-Chat - a Conversationally-tuned LLaMA2 for Vietnamese
VBD-LLaMA2-Chat is a conversationally-tuned LLaMA2 model designed specifically for the Vietnamese language. It aims to support the community in building Vietnamese Large Language Models by leveraging existing language models and adapting them to Vietnamese, thereby reducing the associated cost.
✨ Features
- Language Adaptation: The pretrained weights of VBD-LLaMA2-7B-Chat are obtained by extending LLaMA2's vocabulary and continuing training on a corpus of 100 billion Vietnamese tokens and 40 billion English tokens, followed by supervised fine-tuning on 2 million Vietnamese samples.
- Cost Efficiency: This approach aims to reduce the hardware, time, and data costs of building LLMs for low-resource languages.
- Versatile Performance: The model performs on par with or better than most comparable models on Vietnamese tasks and is also competent on a range of multiple-choice question-answering benchmarks.
📦 Installation
No dedicated installation steps are documented; the model is loaded directly with the Hugging Face transformers library (see the usage examples below).
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "LR-AI-Labs/vbd-llama2-7B-50b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    # load_in_8bit=True
)
model.eval()

SYS_PROMPT = "A chat between a curious user and an artificial intelligence assistant. "\
    "The assistant gives helpful, detailed, and polite answers to the user's questions."

def response_generate(input_prompt):
    # Tokenize the full prompt and sample a continuation from the model.
    input_ids = tokenizer(input_prompt, return_tensors="pt")
    outputs = model.generate(
        inputs=input_ids["input_ids"].to("cuda"),
        attention_mask=input_ids["attention_mask"].to("cuda"),
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
        max_new_tokens=1024,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )
    # Keep only the text generated after the final "ASSISTANT:" marker.
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    response = response.split("ASSISTANT:")[-1].strip()
    return response
print(response_generate(f"{SYS_PROMPT} USER: Xin chào, bạn là ai? ASSISTANT:"))
# Xin chào, ViVi là một trợ lý trí tuệ nhân tạo có thể trả lời câu hỏi của bạn và trò chuyện với bạn.
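The commented-out load_in_8bit option above hints at quantized loading for smaller GPUs. A minimal sketch, assuming the bitsandbytes package is installed (this requirement and the exact memory savings are assumptions, not stated in the original):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumption: bitsandbytes is installed; 8-bit weights roughly halve GPU memory vs. bfloat16.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "LR-AI-Labs/vbd-llama2-7B-50b-chat",
    quantization_config=quant_config,
    device_map="auto",
)
model_8bit.eval()

The generation code above should work unchanged with the quantized model.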
Advanced Usage
For single-turn chat:
print(response_generate(f"{SYS_PROMPT} USER: Cách để nấu món phở ngon ASSISTANT:"))
print(response_generate(f"{SYS_PROMPT} USER: Viết cho tôi một email xin sếp tăng lương ASSISTANT:"))
# question and chunk_0 ... chunk_4 are placeholders for the user's question and retrieved passages.
print(response_generate(f'''{SYS_PROMPT} USER: Trả lời câu hỏi dựa vào thông tin lấy từ các đoạn văn sau. Nếu không có thông tin trả lời "Thật tiếc, câu hỏi của bạn Vivi không biết." Nếu có câu trả lời, yêu cầu trả ra cả nguồn chứa câu trả lời dạng "[đoạn k]."
Câu hỏi: {question}
Đoạn 0: {chunk_0}
Đoạn 1: {chunk_1}
Đoạn 2: {chunk_2}
Đoạn 3: {chunk_3}
Đoạn 4: {chunk_4} ASSISTANT:'''))
For multi-turn conversational chat:
turns = []
while True:
    user_input = input("USER: ")
    turns.append("USER: " + user_input)
    # Re-send the whole conversation history on every turn.
    bot_output = "ASSISTANT: " + response_generate("%s %s ASSISTANT:" % (SYS_PROMPT, " ".join(turns)))
    print(bot_output)
    turns.append(bot_output)
# USER: Xin chào
# ASSISTANT: Xin chào. Làm thế nào ViVi có thể giúp bạn?
# USER: Anh bị đau đầu quá
# ASSISTANT: Bạn có thể nói với ViVi nhiều hơn về cơn đau đầu của bạn?
# USER: anh bị đau nửa đầu về đêm, sáng dậy không tỉnh táo
# ASSISTANT: Điều quan trọng là bạn nhận thức được rằng cơn đau đầu của bạn không phải là điều hiếm gặp và nó có thể có nhiều nguyên nhân. Một số nguyên nhân phổ biến của cơn đau đầu là căng thẳng, thiếu ngủ, căng thẳng, chế độ ăn uống kém và mất nước.
# Nếu cơn đau đầu của bạn không giảm đi sau khi nghỉ ngơi và nghỉ ngơi
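Because each new turn re-sends the whole history, long conversations can eventually exceed the model's context window. A minimal sketch of trimming the oldest turns before generation; the 3072-token budget is an assumed value chosen to leave room for max_new_tokens, not a documented limit:

MAX_PROMPT_TOKENS = 3072  # assumed budget; LLaMA-2 models typically support a 4096-token context

def build_prompt(turns, sys_prompt=SYS_PROMPT):
    # Drop the oldest USER/ASSISTANT pair until the prompt fits the assumed token budget.
    kept = list(turns)
    while kept:
        prompt = "%s %s ASSISTANT:" % (sys_prompt, " ".join(kept))
        if len(tokenizer(prompt)["input_ids"]) <= MAX_PROMPT_TOKENS:
            return prompt
        kept = kept[2:]
    return "%s ASSISTANT:" % sys_prompt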
📚 Documentation
Model weights
- VBD-LLaMA2-7B-50b: A snapshot of the pretrained model after approximately 40B Vietnamese tokens and 16B English tokens (~50B tokens in total).
- VBD-LLaMA2-7B-50b-Chat: A snapshot demonstrating the efficacy of the proposed methodology. The base model is pretrained on 40B Vietnamese tokens and 16B English tokens and then supervised fine-tuned on 2 million samples.
Pre-training Proposal
We propose continued pretraining of existing large language models (such as LLaMA, Bloom, MPT, Falcon) with 3/7/13 billion parameters for Vietnamese and English. The steps include:
- Start with an English/multilingual large language model (e.g., https://huggingface.co/meta-llama/Llama-2-7b-hf).
- Rebuild the BPE-based tokenizer by preserving the original tokens and incorporating Vietnamese syllables.
- Transfer knowledge in the latent space by fine-tuning the added latent space while freezing the original latent space, using En-Vi and Vi-En translation tasks (see the sketch after this list).
- Fine-tune with self-supervised learning (SSL) on 40B English tokens and 100B Vietnamese tokens of unsupervised corpora in the new latent space, using a hybrid training strategy to enhance zero-shot/few-shot capabilities.
- The training time is roughly 8k GPU hours for the 3B model (about 44 days on a DGX node with 8 A100 40GB GPUs) and 16k GPU hours for the 7B model (about 84 days on the same hardware).
- Periodically evaluate the model to observe improvements and the possibility of finishing training early.
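The tokenizer-extension and latent-space-freezing steps above can be illustrated with standard transformers/PyTorch primitives. This is a hypothetical sketch, not the authors' training code: the added syllable list is a placeholder, and only the input-embedding side is shown.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

# Placeholder Vietnamese syllables; the real extension would add the full syllable inventory.
new_tokens = ["tiếng", "việt", "ngôn", "ngữ"]
tokenizer.add_tokens(new_tokens)

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))

def zero_original_rows(grad, cutoff=old_vocab_size):
    # Freeze the original embedding rows: only the newly added rows receive gradient updates.
    grad = grad.clone()
    grad[:cutoff] = 0
    return grad

model.get_input_embeddings().weight.register_hook(zero_original_rows)
# The output embedding (lm_head) can be handled the same way before training on
# En-Vi / Vi-En translation pairs and the Vietnamese/English SSL corpora.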
Supervised Fine-Tuning (SFT)
VBD-LLaMA2-7B-50b-Chat is fine-tuned on 2 million conversational samples, aiming to enable more applications of LLMs in conversational systems.
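The exact SFT data format is not published. Below is a minimal sketch of flattening one conversation into the SYS_PROMPT/USER/ASSISTANT layout used in the usage examples above; the (role, text) pair representation is an assumption.

def format_sample(system_prompt, turns):
    # turns: list of (role, text) pairs with role in {"USER", "ASSISTANT"}.
    parts = [system_prompt.strip()]
    for role, text in turns:
        parts.append("%s: %s" % (role, text.strip()))
    return " ".join(parts)

sample = format_sample(
    SYS_PROMPT,
    [("USER", "Xin chào, bạn là ai?"),
     ("ASSISTANT", "Xin chào, ViVi là một trợ lý trí tuệ nhân tạo.")],
)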
Evaluation
We evaluated our model via peer comparison on multiple publicly available datasets using [@hieunguyen1053's fork of lm-evaluation-harness](https://github.com/hieunguyen1053/lm-evaluation-harness) and combined the results with those provided by the authors of VinaLLaMA.
Model | Model size | arc_vi (acc) | hellaswag_vi (acc) | mmlu_vi (acc) | truthfulqa_vi (acc) | Average
---|---|---|---|---|---|---
URA-LLaMA-13B | ~13B | 0.3752 | 0.4830 | 0.3973 | 0.4574 | 0.4282
BLOOMZ-7B | ~7B | 0.3205 | 0.4930 | 0.3975 | 0.4523 | 0.4158
PhoGPT-7B5-Instruct | ~7.5B | 0.2470 | 0.2578 | 0.2413 | 0.4759 | 0.3055
SeaLLM-7B-chat | ~7B | 0.3607 | 0.5112 | 0.3339 | 0.4948 | 0.4252
Vietcuna-7b-v3 | ~7B | 0.3419 | 0.4939 | 0.3354 | 0.4807 | 0.4130
VinaLLaMA-2.7B-chat | ~2.7B | 0.3273 | 0.4814 | 0.3051 | 0.4972 | 0.4028
VinaLLaMA-7B-chat | ~7B | 0.4239 | 0.5407 | 0.3932 | 0.5251 | 0.4707
VBD-LLaMA2-7B-50b | ~7B | 0.3222 | 0.5195 | 0.2964 | 0.4614 | 0.3999
VBD-LLaMA2-7B-50b-Chat | ~7B | 0.3585 | 0.5207 | 0.3444 | 0.5179 | 0.4354
Table 1. Benchmark on Vietnamese datasets
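The Vietnamese scores in Table 1 can in principle be reproduced with the linked fork of lm-evaluation-harness. A hedged sketch using the harness's Python entry point, assuming the fork keeps upstream's lm_eval.evaluator.simple_evaluate API and registers the task names shown above (the few-shot setting is not stated in the original):

from lm_eval import evaluator

# Assumptions: the fork exposes the upstream simple_evaluate() API and the
# arc_vi / hellaswag_vi / mmlu_vi / truthfulqa_vi tasks; num_fewshot is a guess.
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=LR-AI-Labs/vbd-llama2-7B-50b-chat",
    tasks=["arc_vi", "hellaswag_vi", "mmlu_vi", "truthfulqa_vi"],
    num_fewshot=0,
)
print(results["results"])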
Organization | Model | Model size | ARC (acc) | HellaSwag (acc) | LAMBADA (perplexity) | MMLU (acc)
---|---|---|---|---|---|---
VLSP | hoa-7b | ~7B | 0.2722 | 0.4867 | 18.53 | -
BK Lab | LLaMA-2-BK | ~7B | 0.4164 | 0.7216 | 5.010 | -
ViLM | vietcuna-7b-v3 | ~7B | 0.3976 | 0.6309 | 7.125 | -
BigScience | Bloomz-T0 | ~7B | 0.436 | 0.6401 | 6.542 | 0.3785
TII | Falcon-7B-Instruct | ~7B | 0.4258 | 0.6976 | 7.463 | 0.2584
MosaicML | MPT-7B-Chat | ~7B | 0.4258 | 0.7438 | 5.797 | 0.3762
Meta | LLaMA-2-Chat | ~7B | 0.442 | 0.7547 | 3.968 | 0.4832
AISingapore | Sealion7b | ~7B | 0.3422 | 0.6705 | 6.715 | 0.268
VBD | VBD-LLaMA2-7B-50b-Chat | ~7B | 0.4556 | 0.7384 | 4.645 | 0.4558
Table 2. Benchmark on English datasets
The model also has results on the VMLU datasets.
Pretraining loss
🔧 Technical Details
The model's pretrained weights are obtained through continued self-supervised learning (SSL) after extending LLaMA2's vocabulary, on a corpus of 100B Vietnamese and 40B English tokens. Supervised fine-tuning (SFT) is then conducted on an internal SFT dataset. The training process involves knowledge transfer between the original and added latent spaces and a hybrid training strategy to enhance the model's zero-shot/few-shot capabilities.
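A quick way to inspect the vocabulary extension described above is to compare how the base LLaMA-2 tokenizer and the released tokenizer segment the same Vietnamese sentence. A small illustration with an arbitrary example sentence; the expected reduction in token count is an assumption, not a reported figure:

from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=False)
vbd_tok = AutoTokenizer.from_pretrained("LR-AI-Labs/vbd-llama2-7B-50b-chat", use_fast=False)

text = "Trí tuệ nhân tạo đang thay đổi cách chúng ta làm việc."
print("base LLaMA-2 tokens:", len(base_tok.tokenize(text)))
print("VBD-LLaMA2 tokens:  ", len(vbd_tok.tokenize(text)))
# An extended Vietnamese vocabulary is expected to produce fewer tokens per sentence.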
📄 License
By using our released weights, you agree to and comply with the terms and conditions specified in Meta's LLaMA-2 license.
⚠️ Important Note
Disclaimer 1: The VBD-LLaMA family is an effort by VinBigData to support and promote research on LLMs in Vietnam. This model is not related to ViGPT/ViViChat or any other product operating at VinBigData.
Disclaimer 2: While we have made considerable efforts to minimize misleading, inaccurate, and harmful content generation, it's important to acknowledge that our released model carries inherent risks. We strongly recommend utilizing this model exclusively within a closely supervised environment and/or conducting additional testing, red teaming, and alignment procedures. The utilization of this model must adhere to and comply with local governance and regulations. The authors of this model shall not be held liable for any claims, damages, or other liabilities arising from the use of the released weights.

