🚀 Phi2-Chinese-0.2B: Train Your Own Small Chinese Phi2 Model from Scratch
This is an experimental project that open-sources the code and model weights. The pre-training data is limited. For a better-performing small Chinese model, refer to the project ChatLM-mini-Chinese.
Github Repository: Phi2-mini-Chinese
🚀 Quick Start
Project Overview
This project focuses on training a small Chinese Phi2 model from scratch. It covers multiple steps, including data cleaning, tokenizer training, causal language model (CLM) pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) optimization.
✨ Features
- Open Source: Both the code and the model weights are open-sourced.
- Multiple Training Steps: Covers data cleaning, tokenizer training, CLM pre-training, SFT, and RLHF optimization.
- Easy to Use: Provides clear code examples for model usage.
📦 Installation
No dedicated installation steps are provided. The usage example below depends only on the transformers and torch packages (e.g. pip install torch transformers).
💻 Usage Examples
Basic Usage
The following code demonstrates how to use the trained model to generate text:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained('charent/Phi2-Chinese-0.2B')
model = AutoModelForCausalLM.from_pretrained('charent/Phi2-Chinese-0.2B').to(device)

txt = '感冒了要怎么办?'
prompt = f"##提问:\n{txt}\n##回答:\n"

gen_conf = GenerationConfig(
    num_beams=1,
    do_sample=False,
    max_length=320,
    max_new_tokens=256,
    no_repeat_ngram_size=4,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

tokend = tokenizer.encode_plus(text=prompt)
input_ids = torch.LongTensor([tokend.input_ids]).to(device)
attention_mask = torch.LongTensor([tokend.attention_mask]).to(device)

outputs = model.generate(
    inputs=input_ids,
    attention_mask=attention_mask,
    generation_config=gen_conf,
)

outs = tokenizer.decode(outputs[0].cpu().numpy(), clean_up_tokenization_spaces=True, skip_special_tokens=True)
print(outs)
```
Example Output
##提问:
感冒了要怎么办?
##回答:
感冒是由病毒引起的,感冒一般由病毒引起,以下是一些常见感冒的方法:
- 洗手,特别是在接触其他人或物品后。
- 咳嗽或打喷嚏时用纸巾或手肘遮住口鼻。
- 用手触摸口鼻,特别是喉咙和鼻子。
- 如果咳嗽或打喷嚏,可以用纸巾或手绢来遮住口鼻,但要远离其他人。
- 如果你感冒了,最好不要触摸自己的眼睛、鼻子和嘴巴。
- 在感冒期间,最好保持充足的水分和休息,以缓解身体的疲劳。
- 如果您已经感冒了,可以喝一些温水或盐水来补充体液。
- 另外,如果感冒了,建议及时就医。
📚 Documentation
1. ⚗️ Data Cleaning
Code: dataset.ipynb.
Data cleaning includes adding a period at the end of a sentence, converting traditional Chinese to simplified Chinese, converting full-width characters to half-width characters, and removing repeated punctuation marks. For more details, refer to the project ChatLM-mini-Chinese.
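The following is a minimal sketch of this kind of cleaning using only Python's standard library (the actual dataset.ipynb may differ): NFKC normalization stands in for the full-width-to-half-width conversion, and traditional-to-simplified conversion is omitted here because it needs an external tool such as OpenCC.
```python
import re
import unicodedata

def clean_line(line: str) -> str:
    # Full-width to half-width: NFKC maps full-width ASCII forms (e.g. "!" -> "!").
    line = unicodedata.normalize("NFKC", line).strip()
    # Collapse runs of repeated punctuation into a single mark.
    line = re.sub(r"([,。!?、;:,.!?;:])\1+", r"\1", line)
    # Add a period at the end of the sentence if terminal punctuation is missing.
    if line and line[-1] not in "。!?.!?":
        line += "。"
    return line

print(clean_line("天气真好!!!你说呢"))  # -> 天气真好!你说呢。
```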
2. 🗨️ Tokenizer Training
Code: tokeinzer.ipynb.
This project uses a byte-level BPE tokenizer. Training code is provided for both char-level and byte-level tokenizers. After training, check whether the tokenizer's vocabulary contains common special symbols such as \t and \n: try encoding and then decoding a text that contains special characters and see whether it is restored exactly. If not, add the missing characters with the add_tokens function. Note that len(tokenizer) returns the full vocabulary size, while tokenizer.vocab_size does not count the characters added through add_tokens.
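A minimal sketch of this round-trip check, here using the released tokenizer as a stand-in for your own newly trained one:
```python
from transformers import AutoTokenizer

# Replace this path with the directory of your own trained tokenizer.
tokenizer = AutoTokenizer.from_pretrained('charent/Phi2-Chinese-0.2B')

text = "第一行\t第二列\n第二行"
ids = tokenizer.encode(text, add_special_tokens=False)
restored = tokenizer.decode(ids, skip_special_tokens=True)

if restored != text:
    # Round trip failed: add the missing special characters to the vocabulary.
    tokenizer.add_tokens(["\t", "\n"])

# len(tokenizer) counts tokens added via add_tokens; vocab_size does not.
print(len(tokenizer), tokenizer.vocab_size)
```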
3. ⛏️ CLM (Causal Language Model) Pre-training
Code: pretrain.ipynb.
Unsupervised pre-training is performed on a large amount of text, mainly drawn from the open-source BELLE dataset. Each sample in the dataset should be a single sentence; overly long sentences can be truncated into multiple samples. During CLM pre-training, the model's input and output are the same, and the output is shifted by one position when computing the cross-entropy loss, so that each position predicts the next token. The EOS and BOS special tokens can be omitted during pre-training.
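For reference, here is a minimal sketch of the shift-by-one cross-entropy loss described above (pretrain.ipynb may implement this differently, e.g. by passing labels directly to the Hugging Face model):
```python
import torch
import torch.nn.functional as F

def clm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len).
    # Position i predicts token i + 1, so drop the last logit and the first label.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```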
4. ⚒️ SFT Instruction Fine-tuning
Code: sft.ipynb.
The open-source BELLE dataset is mainly used; thanks to the BELLE project. The data format for SFT training is as follows, where [EOS] denotes the end-of-sentence special token appended to the answer:
text = f"##提问:\n{example['instruction']}\n##回答:\n{example['output']}[EOS]"
The model ignores the part before "##回答:" (including "##回答:" itself) when calculating the loss. Remember to add the EOS special token to indicate the end of a sentence; otherwise, the model will not know when to stop during decoding. The BOS token for the start of a sentence is optional.
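A minimal sketch of building one SFT training sample under these rules, assuming the usual Hugging Face convention of marking ignored label positions with -100 (sft.ipynb may differ in detail):
```python
def build_sft_sample(example: dict, tokenizer) -> dict:
    # Everything up to and including "##回答:\n" is prompt; the loss is masked there.
    prompt = f"##提问:\n{example['instruction']}\n##回答:\n"
    answer = f"{example['output']}{tokenizer.eos_token}"  # EOS marks the end

    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False).input_ids

    input_ids = prompt_ids + answer_ids
    labels = [-100] * len(prompt_ids) + answer_ids  # loss computed only on the answer
    return {"input_ids": input_ids, "labels": labels}
```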
5. 📝 RLHF Optimization
This project uses the DPO optimization method.
Code: dpo.ipynb.
Fine-tune the SFT model according to personal preferences. The dataset needs three columns: prompt, chosen, and rejected. Some of the rejected data can be generated from the initial model of the SFT stage; if the similarity between a generated rejected response and its chosen response is above 0.9, the sample is discarded. The DPO process requires two models: one being trained and one serving as a reference. They are initially the same model, but the reference model does not participate in parameter updates.
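A minimal sketch of this two-model setup, assuming the trl library's DPOTrainer (the exact arguments vary between trl versions, and the checkpoint and data paths below are hypothetical, so treat this as a sketch rather than the project's exact code):
```python
from copy import deepcopy
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer

# Hypothetical paths: point these at your own SFT checkpoint and DPO data file.
sft_path = "./model_save/sft"
tokenizer = AutoTokenizer.from_pretrained(sft_path)
model = AutoModelForCausalLM.from_pretrained(sft_path)

# Frozen copy of the SFT model, used only as the DPO reference.
ref_model = deepcopy(model)
ref_model.requires_grad_(False)

# The dataset must have three columns: prompt, chosen, rejected.
dpo_dataset = load_dataset("json", data_files="dpo_data.json")["train"]

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```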
6. 📑 Model Usage
The model weights are available in the Hugging Face repository: Phi2-Chinese-0.2B.
🔧 Technical Details
- Model Training: The training pipeline involves multiple steps, including data cleaning, tokenizer training, CLM pre-training, SFT, and RLHF (DPO) optimization. Each step has its own requirements and considerations.
- Data Requirements: Different datasets are used at different training stages, and each stage expects a specific format. For example, the CLM pre-training dataset requires each sample to be a single sentence, and the SFT dataset follows the prompt/response text format shown above.
- Memory Requirements: Tokenizer training is memory-intensive. Training a byte-level tokenizer on 100 million characters requires at least 32 GB of memory, and training a char-level tokenizer on 650 million characters also requires at least 32 GB of memory.
📄 License
This project is licensed under the Apache-2.0 license.
🎓 Citation
If you find this project helpful, please consider citing it:
@misc{Charent2023,
    author = {Charent Chen},
    title = {A small Chinese causal language model with 0.2B parameters base on Phi2},
    year = {2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/charent/Phi2-mini-Chinese}},
}
🤔 Other Notes
The project does not assume any risks or responsibilities arising from data security or public-opinion issues, or from any misuse, dissemination, or improper use of the open-source model and code.
| Property | Details |
|----------|---------|
| Model Type | Phi2-Chinese-0.2B |
| Training Data | BelleGroup/train_1M_CN |
| Library Name | transformers |
| Pipeline Tag | text-generation |