VBD-LLaMA2-7B-50b-Chat

🚀 VBD-LLaMA2-Chat - a Conversationally-tuned LLaMA2 for Vietnamese
VBD-LLaMA2-Chat is a conversationally-tuned LLaMA2 model designed specifically for the Vietnamese language. It aims to support the community in building Vietnamese Large Language Models by leveraging existing language models and adapting them to Vietnamese, thereby reducing the associated cost.
✨ Features
- Language Adaptation: The pretrained weights of VBD-LLaMA2-7B-Chat are obtained by extending LLaMA2's vocabulary and continuing training on a corpus of 100 billion Vietnamese tokens and 40 billion English tokens, followed by supervised fine-tuning on 2 million Vietnamese samples.
- Cost Efficiency: This approach aims to reduce the hardware, time, and data costs of building LLMs for low-resource languages.
- Versatile Performance: The model performs on par with or better than most comparable models on Vietnamese tasks and is also competent on a range of multiple-choice question-answering benchmarks.
📦 Installation
No dedicated installation steps are documented; the model is loaded directly with the Hugging Face transformers library (see the usage examples below).
💻 Usage Examples
Basic Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "LR-AI-Labs/vbd-llama2-7B-50b-chat"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    # load_in_8bit=True
)
model.eval()

SYS_PROMPT = "A chat between a curious user and an artificial intelligence assistant. "\
    "The assistant gives helpful, detailed, and polite answers to the user's questions."

def response_generate(input_prompt):
    # Tokenize the full prompt and sample a continuation from the model.
    input_ids = tokenizer(input_prompt, return_tensors="pt")
    outputs = model.generate(
        inputs=input_ids["input_ids"].to("cuda"),
        attention_mask=input_ids["attention_mask"].to("cuda"),
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.9,
        max_new_tokens=1024,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )
    # Keep only the text generated after the final "ASSISTANT:" marker.
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    response = response.split("ASSISTANT:")[-1].strip()
    return response
print(response_generate(f"{SYS_PROMPT} USER: Xin chào, bạn là ai? ASSISTANT:"))
# Xin chào, ViVi là một trợ lý trí tuệ nhân tạo có thể trả lời câu hỏi của bạn và trò chuyện với bạn.
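The commented-out load_in_8bit option above hints at quantized loading for smaller GPUs. A minimal sketch, assuming the bitsandbytes package is installed (this requirement and the exact memory savings are assumptions, not stated in the original):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumption: bitsandbytes is installed; 8-bit weights roughly halve GPU memory vs. bfloat16.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(
    "LR-AI-Labs/vbd-llama2-7B-50b-chat",
    quantization_config=quant_config,
    device_map="auto",
)
model_8bit.eval()

The generation code above should work unchanged with the quantized model.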
Advanced Usage
For single-turn chat:
print(response_generate(f"{SYS_PROMPT} USER: Cách để nấu món phở ngon ASSISTANT:"))
print(response_generate(f"{SYS_PROMPT} USER: Viết cho tôi một email xin sếp tăng lương ASSISTANT:"))
# question and chunk_0 ... chunk_4 are placeholders for the user's question and retrieved passages.
print(response_generate(f'''{SYS_PROMPT} USER: Trả lời câu hỏi dựa vào thông tin lấy từ các đoạn văn sau. Nếu không có thông tin trả lời "Thật tiếc, câu hỏi của bạn Vivi không biết." Nếu có câu trả lời, yêu cầu trả ra cả nguồn chứa câu trả lời dạng "[đoạn k]."
Câu hỏi: {question}
Đoạn 0: {chunk_0}
Đoạn 1: {chunk_1}
Đoạn 2: {chunk_2}
Đoạn 3: {chunk_3}
Đoạn 4: {chunk_4} ASSISTANT:'''))
For multi-turn conversational chat:
turns = []
while True:
    user_input = input("USER: ")
    turns.append("USER: " + user_input)
    # Re-send the whole conversation history on every turn.
    bot_output = "ASSISTANT: " + response_generate("%s %s ASSISTANT:" % (SYS_PROMPT, " ".join(turns)))
    print(bot_output)
    turns.append(bot_output)
# USER: Xin chào
# ASSISTANT: Xin chào. Làm thế nào ViVi có thể giúp bạn?
# USER: Anh bị đau đầu quá
# ASSISTANT: Bạn có thể nói với ViVi nhiều hơn về cơn đau đầu của bạn?
# USER: anh bị đau nửa đầu về đêm, sáng dậy không tỉnh táo
# ASSISTANT: Điều quan trọng là bạn nhận thức được rằng cơn đau đầu của bạn không phải là điều hiếm gặp và nó có thể có nhiều nguyên nhân. Một số nguyên nhân phổ biến của cơn đau đầu là căng thẳng, thiếu ngủ, căng thẳng, chế độ ăn uống kém và mất nước.
# Nếu cơn đau đầu của bạn không giảm đi sau khi nghỉ ngơi và nghỉ ngơi
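Because each new turn re-sends the whole history, long conversations can eventually exceed the model's context window. A minimal sketch of trimming the oldest turns before generation; the 3072-token budget is an assumed value chosen to leave room for max_new_tokens, not a documented limit:

MAX_PROMPT_TOKENS = 3072  # assumed budget; LLaMA-2 models typically support a 4096-token context

def build_prompt(turns, sys_prompt=SYS_PROMPT):
    # Drop the oldest USER/ASSISTANT pair until the prompt fits the assumed token budget.
    kept = list(turns)
    while kept:
        prompt = "%s %s ASSISTANT:" % (sys_prompt, " ".join(kept))
        if len(tokenizer(prompt)["input_ids"]) <= MAX_PROMPT_TOKENS:
            return prompt
        kept = kept[2:]
    return "%s ASSISTANT:" % sys_prompt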
📚 Documentation
Model weights
- VBD-LLaMA2-7B-50b: A snapshot of the pretrained model after approximately 40B Vietnamese tokens and 16B English tokens (~50B tokens in total).
- VBD-LLaMA2-7B-50b-Chat: A snapshot demonstrating the efficacy of the proposed methodology. The base model is pretrained on 40B Vietnamese tokens and 16B English tokens and then supervised fine-tuned on 2 million samples.
Pre-training Proposal
We propose continued pretraining of existing large language models (such as LLaMA, Bloom, MPT, Falcon) with 3/7/13 billion parameters for Vietnamese and English. The steps include:
- Start with an English/multilingual large language model (e.g., https://huggingface.co/meta-llama/Llama-2-7b-hf).
- Rebuild the BPE-based tokenizer by preserving the original tokens and incorporating Vietnamese syllables.
- Transfer knowledge in the latent space by fine-tuning the added latent space while freezing the original latent space, using En-Vi and Vi-En translation tasks (see the sketch after this list).
- Fine-tune with self-supervised learning (SSL) on 40B English tokens and 100B Vietnamese tokens of unsupervised corpora in the new latent space, using a hybrid training strategy to enhance zero-shot/few-shot capabilities.
- The training time is roughly 8k GPU hours for the 3B model (about 44 days on a DGX node with 8 A100 40GB GPUs) and 16k GPU hours for the 7B model (about 84 days on the same hardware).
- Periodically evaluate the model to observe improvements and the possibility of finishing training early.
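The tokenizer-extension and latent-space-freezing steps above can be illustrated with standard transformers/PyTorch primitives. This is a hypothetical sketch, not the authors' training code: the added syllable list is a placeholder, and only the input-embedding side is shown.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_name, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)

# Placeholder Vietnamese syllables; the real extension would add the full syllable inventory.
new_tokens = ["tiếng", "việt", "ngôn", "ngữ"]
tokenizer.add_tokens(new_tokens)

old_vocab_size = model.get_input_embeddings().weight.shape[0]
model.resize_token_embeddings(len(tokenizer))

def zero_original_rows(grad, cutoff=old_vocab_size):
    # Freeze the original embedding rows: only the newly added rows receive gradient updates.
    grad = grad.clone()
    grad[:cutoff] = 0
    return grad

model.get_input_embeddings().weight.register_hook(zero_original_rows)
# The output embedding (lm_head) can be handled the same way before training on
# En-Vi / Vi-En translation pairs and the Vietnamese/English SSL corpora.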
Supervised Fine-Tuning (SFT)
VBD-LLaMA2-7B-50b-Chat is fine-tuned on 2 million conversational samples, aiming to enable more applications of LLMs in conversational systems.
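The exact SFT data format is not published. Below is a minimal sketch of flattening one conversation into the SYS_PROMPT/USER/ASSISTANT layout used in the usage examples above; the (role, text) pair representation is an assumption.

def format_sample(system_prompt, turns):
    # turns: list of (role, text) pairs with role in {"USER", "ASSISTANT"}.
    parts = [system_prompt.strip()]
    for role, text in turns:
        parts.append("%s: %s" % (role, text.strip()))
    return " ".join(parts)

sample = format_sample(
    SYS_PROMPT,
    [("USER", "Xin chào, bạn là ai?"),
     ("ASSISTANT", "Xin chào, ViVi là một trợ lý trí tuệ nhân tạo.")],
)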
Evaluation
We evaluated our model via peer comparison on multiple publicly available datasets using [@hieunguyen1053's fork of lm-evaluation-harness](https://github.com/hieunguyen1053/lm-evaluation-harness) and combined the results with those provided by the authors of VinaLLaMA.
Model | Model size | arc_vi (acc) | hellaswag_vi (acc) | mmlu_vi (acc) | truthfulqa_vi (acc) | Average
---|---|---|---|---|---|---
URA-LLaMA-13B | ~13B | 0.3752 | 0.4830 | 0.3973 | 0.4574 | 0.4282
BLOOMZ-7B | ~7B | 0.3205 | 0.4930 | 0.3975 | 0.4523 | 0.4158
PhoGPT-7B5-Instruct | ~7.5B | 0.2470 | 0.2578 | 0.2413 | 0.4759 | 0.3055
SeaLLM-7B-chat | ~7B | 0.3607 | 0.5112 | 0.3339 | 0.4948 | 0.4252
Vietcuna-7b-v3 | ~7B | 0.3419 | 0.4939 | 0.3354 | 0.4807 | 0.4130
VinaLLaMA-2.7B-chat | ~2.7B | 0.3273 | 0.4814 | 0.3051 | 0.4972 | 0.4028
VinaLLaMA-7B-chat | ~7B | 0.4239 | 0.5407 | 0.3932 | 0.5251 | 0.4707
VBD-LLaMA2-7B-50b | ~7B | 0.3222 | 0.5195 | 0.2964 | 0.4614 | 0.3999
VBD-LLaMA2-7B-50b-Chat | ~7B | 0.3585 | 0.5207 | 0.3444 | 0.5179 | 0.4354
Table 1. Benchmark on Vietnamese datasets
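The Vietnamese scores in Table 1 can in principle be reproduced with the linked fork of lm-evaluation-harness. A hedged sketch using the harness's Python entry point, assuming the fork keeps upstream's lm_eval.evaluator.simple_evaluate API and registers the task names shown above (the few-shot setting is not stated in the original):

from lm_eval import evaluator

# Assumptions: the fork exposes the upstream simple_evaluate() API and the
# arc_vi / hellaswag_vi / mmlu_vi / truthfulqa_vi tasks; num_fewshot is a guess.
results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=LR-AI-Labs/vbd-llama2-7B-50b-chat",
    tasks=["arc_vi", "hellaswag_vi", "mmlu_vi", "truthfulqa_vi"],
    num_fewshot=0,
)
print(results["results"])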
Organization | Model | Model size | ARC (acc) | HellaSwag (acc) | LAMBADA (perplexity) | MMLU (acc)
---|---|---|---|---|---|---
VLSP | hoa-7b | ~7B | 0.2722 | 0.4867 | 18.53 | -
BK Lab | LLaMA-2-BK | ~7B | 0.4164 | 0.7216 | 5.010 | -
ViLM | vietcuna-7b-v3 | ~7B | 0.3976 | 0.6309 | 7.125 | -
BigScience | Bloomz-T0 | ~7B | 0.436 | 0.6401 | 6.542 | 0.3785
TII | Falcon-7B-Instruct | ~7B | 0.4258 | 0.6976 | 7.463 | 0.2584
MosaicML | MPT-7B-Chat | ~7B | 0.4258 | 0.7438 | 5.797 | 0.3762
Meta | LLaMA-2-Chat | ~7B | 0.442 | 0.7547 | 3.968 | 0.4832
AISingapore | Sealion7b | ~7B | 0.3422 | 0.6705 | 6.715 | 0.268
VBD | VBD-LLaMA2-7B-50b-Chat | ~7B | 0.4556 | 0.7384 | 4.645 | 0.4558
Table 2. Benchmark on English datasets
The model also has results on the VMLU datasets.
Pretraining loss
🔧 Technical Details
The model's pretrained weights are obtained through continued self-supervised learning (SSL) after extending LLaMA2's vocabulary, on a corpus of 100B Vietnamese and 40B English tokens. Supervised fine-tuning (SFT) is then conducted on an internal SFT dataset. The training process involves knowledge transfer between the original and added latent spaces and a hybrid training strategy to enhance the model's zero-shot/few-shot capabilities.
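A quick way to inspect the vocabulary extension described above is to compare how the base LLaMA-2 tokenizer and the released tokenizer segment the same Vietnamese sentence. A small illustration with an arbitrary example sentence; the expected reduction in token count is an assumption, not a reported figure:

from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_fast=False)
vbd_tok = AutoTokenizer.from_pretrained("LR-AI-Labs/vbd-llama2-7B-50b-chat", use_fast=False)

text = "Trí tuệ nhân tạo đang thay đổi cách chúng ta làm việc."
print("base LLaMA-2 tokens:", len(base_tok.tokenize(text)))
print("VBD-LLaMA2 tokens:  ", len(vbd_tok.tokenize(text)))
# An extended Vietnamese vocabulary is expected to produce fewer tokens per sentence.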
📄 License
By using our released weights, you agree to and comply with the terms and conditions specified in Meta's LLaMA-2 license.
⚠️ Important Note
Disclaimer 1: The VBD-LLaMA family is an effort by VinBigData to support and promote research on LLMs in Vietnam. This model is not related to ViGPT/ViViChat or any other product operating at VinBigData.
Disclaimer 2: While we have made considerable efforts to minimize misleading, inaccurate, and harmful content generation, it's important to acknowledge that our released model carries inherent risks. We strongly recommend utilizing this model exclusively within a closely supervised environment and/or conducting additional testing, red teaming, and alignment procedures. The utilization of this model must adhere to and comply with local governance and regulations. The authors of this model shall not be held liable for any claims, damages, or other liabilities arising from the use of the released weights.

