🚀 Phi2-Chinese-0.2B: Train Your Own Small Chinese Phi2 Model from Scratch
This is an experimental project that open-sources the code and model weights. The pre-training data is limited. For a better-performing small Chinese model, refer to the project ChatLM-mini-Chinese.
Github Repository: Phi2-mini-Chinese
🚀 Quick Start
Project Overview
This project focuses on training a small Chinese Phi2 model from scratch. It covers multiple steps, including data cleaning, tokenizer training, causal language model (CLM) pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF) optimization.
✨ Features
- Open Source: Both the code and the model weights are open-sourced.
- Multiple Training Steps: Covers data cleaning, tokenizer training, CLM pre-training, SFT, and RLHF optimization.
- Easy to Use: Provides clear code examples for model usage.
📦 Installation
No dedicated installation steps are provided. The usage example below depends only on the transformers and torch packages (e.g. pip install torch transformers).
💻 Usage Examples
Basic Usage
The following code demonstrates how to use the trained model to generate text:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained('charent/Phi2-Chinese-0.2B')
model = AutoModelForCausalLM.from_pretrained('charent/Phi2-Chinese-0.2B').to(device)

txt = '感冒了要怎么办?'
prompt = f"##提问:\n{txt}\n##回答:\n"

gen_conf = GenerationConfig(
    num_beams=1,
    do_sample=False,
    max_length=320,
    max_new_tokens=256,
    no_repeat_ngram_size=4,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

tokend = tokenizer.encode_plus(text=prompt)
input_ids = torch.LongTensor([tokend.input_ids]).to(device)
attention_mask = torch.LongTensor([tokend.attention_mask]).to(device)

outputs = model.generate(
    inputs=input_ids,
    attention_mask=attention_mask,
    generation_config=gen_conf,
)

outs = tokenizer.decode(outputs[0].cpu().numpy(), clean_up_tokenization_spaces=True, skip_special_tokens=True)
print(outs)
```
Example Output
##提问:
感冒了要怎么办?
##回答:
感冒是由病毒引起的,感冒一般由病毒引起,以下是一些常见感冒的方法:
- 洗手,特别是在接触其他人或物品后。
- 咳嗽或打喷嚏时用纸巾或手肘遮住口鼻。
- 用手触摸口鼻,特别是喉咙和鼻子。
- 如果咳嗽或打喷嚏,可以用纸巾或手绢来遮住口鼻,但要远离其他人。
- 如果你感冒了,最好不要触摸自己的眼睛、鼻子和嘴巴。
- 在感冒期间,最好保持充足的水分和休息,以缓解身体的疲劳。
- 如果您已经感冒了,可以喝一些温水或盐水来补充体液。
- 另外,如果感冒了,建议及时就医。
📚 Documentation
1. ⚗️ Data Cleaning
Code: dataset.ipynb.
Data cleaning includes adding a period at the end of a sentence, converting traditional Chinese to simplified Chinese, converting full-width characters to half-width characters, and removing repeated punctuation marks. For more details, refer to the project ChatLM-mini-Chinese.
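The following is a minimal sketch of this kind of cleaning using only Python's standard library (the actual dataset.ipynb may differ): NFKC normalization stands in for the full-width-to-half-width conversion, and traditional-to-simplified conversion is omitted here because it needs an external tool such as OpenCC.
```python
import re
import unicodedata

def clean_line(line: str) -> str:
    # Full-width to half-width: NFKC maps full-width ASCII forms (e.g. "!" -> "!").
    line = unicodedata.normalize("NFKC", line).strip()
    # Collapse runs of repeated punctuation into a single mark.
    line = re.sub(r"([,。!?、;:,.!?;:])\1+", r"\1", line)
    # Add a period at the end of the sentence if terminal punctuation is missing.
    if line and line[-1] not in "。!?.!?":
        line += "。"
    return line

print(clean_line("天气真好!!!你说呢"))  # -> 天气真好!你说呢。
```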
2. 🗨️ Tokenizer Training
Code: tokeinzer.ipynb.
This project uses a byte-level BPE tokenizer. Training code is provided for both char-level and byte-level tokenizers. After training, check whether the tokenizer's vocabulary contains common special symbols such as \t and \n: try encoding and then decoding a text that contains special characters and see whether it is restored exactly. If not, add the missing characters with the add_tokens function. Note that len(tokenizer) returns the full vocabulary size, while tokenizer.vocab_size does not count the characters added through add_tokens.
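A minimal sketch of this round-trip check, here using the released tokenizer as a stand-in for your own newly trained one:
```python
from transformers import AutoTokenizer

# Replace this path with the directory of your own trained tokenizer.
tokenizer = AutoTokenizer.from_pretrained('charent/Phi2-Chinese-0.2B')

text = "第一行\t第二列\n第二行"
ids = tokenizer.encode(text, add_special_tokens=False)
restored = tokenizer.decode(ids, skip_special_tokens=True)

if restored != text:
    # Round trip failed: add the missing special characters to the vocabulary.
    tokenizer.add_tokens(["\t", "\n"])

# len(tokenizer) counts tokens added via add_tokens; vocab_size does not.
print(len(tokenizer), tokenizer.vocab_size)
```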
3. ⛏️ CLM (Causal Language Model) Pre-training
Code: pretrain.ipynb.
Unsupervised pre-training is performed on a large amount of text, mainly drawn from the open-source BELLE dataset. Each sample in the dataset should be a single sentence; overly long sentences can be truncated into multiple samples. During CLM pre-training, the model's input and output are the same, and the output is shifted by one position when computing the cross-entropy loss, so that each position predicts the next token. The EOS and BOS special tokens can be omitted during pre-training.
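For reference, here is a minimal sketch of the shift-by-one cross-entropy loss described above (pretrain.ipynb may implement this differently, e.g. by passing labels directly to the Hugging Face model):
```python
import torch
import torch.nn.functional as F

def clm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len).
    # Position i predicts token i + 1, so drop the last logit and the first label.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```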
4. ⚒️ SFT Instruction Fine-tuning
Code: sft.ipynb.
The open-source BELLE dataset is mainly used; thanks to the BELLE project. The data format for SFT training is as follows, where [EOS] denotes the end-of-sentence special token appended to the answer:
text = f"##提问:\n{example['instruction']}\n##回答:\n{example['output']}[EOS]"
The model ignores the part before "##回答:" (including "##回答:" itself) when calculating the loss. Remember to add the EOS special token to indicate the end of a sentence; otherwise, the model will not know when to stop during decoding. The BOS token for the start of a sentence is optional.
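A minimal sketch of building one SFT training sample under these rules, assuming the usual Hugging Face convention of marking ignored label positions with -100 (sft.ipynb may differ in detail):
```python
def build_sft_sample(example: dict, tokenizer) -> dict:
    # Everything up to and including "##回答:\n" is prompt; the loss is masked there.
    prompt = f"##提问:\n{example['instruction']}\n##回答:\n"
    answer = f"{example['output']}{tokenizer.eos_token}"  # EOS marks the end

    prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False).input_ids

    input_ids = prompt_ids + answer_ids
    labels = [-100] * len(prompt_ids) + answer_ids  # loss computed only on the answer
    return {"input_ids": input_ids, "labels": labels}
```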
5. 📝 RLHF Optimization
This project uses the DPO optimization method.
Code: dpo.ipynb.
Fine-tune the SFT model according to personal preferences. The dataset needs three columns: prompt, chosen, and rejected. Some of the rejected data can be generated from the initial model of the SFT stage; if the similarity between a generated rejected response and its chosen response is above 0.9, the sample is discarded. The DPO process requires two models: one being trained and one serving as a reference. They are initially the same model, but the reference model does not participate in parameter updates.
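A minimal sketch of this two-model setup, assuming the trl library's DPOTrainer (the exact arguments vary between trl versions, and the checkpoint and data paths below are hypothetical, so treat this as a sketch rather than the project's exact code):
```python
from copy import deepcopy
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from trl import DPOTrainer

# Hypothetical paths: point these at your own SFT checkpoint and DPO data file.
sft_path = "./model_save/sft"
tokenizer = AutoTokenizer.from_pretrained(sft_path)
model = AutoModelForCausalLM.from_pretrained(sft_path)

# Frozen copy of the SFT model, used only as the DPO reference.
ref_model = deepcopy(model)
ref_model.requires_grad_(False)

# The dataset must have three columns: prompt, chosen, rejected.
dpo_dataset = load_dataset("json", data_files="dpo_data.json")["train"]

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    train_dataset=dpo_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```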
6. 📑 Model Usage
The model weights are available in the Hugging Face repository: Phi2-Chinese-0.2B.
🔧 Technical Details
- Model Training: The training pipeline involves multiple steps, including data cleaning, tokenizer training, CLM pre-training, SFT, and RLHF (DPO) optimization. Each step has its own requirements and considerations.
- Data Requirements: Different datasets are used at different training stages, and each stage expects a specific format. For example, the CLM pre-training dataset requires each sample to be a single sentence, and the SFT dataset follows the prompt/response text format shown above.
- Memory Requirements: Tokenizer training is memory-intensive. Training a byte-level tokenizer on 100 million characters requires at least 32 GB of memory, and training a char-level tokenizer on 650 million characters also requires at least 32 GB of memory.
📄 License
This project is licensed under the Apache-2.0 license.
🎓 Citation
If you find this project helpful, please consider citing it:
@misc{Charent2023,
    author = {Charent Chen},
    title = {A small Chinese causal language model with 0.2B parameters base on Phi2},
    year = {2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/charent/Phi2-mini-Chinese}},
}
🤔 Other Notes
The project does not assume any risks or responsibilities arising from data security or public-opinion issues, or from any misuse, dissemination, or improper use of the open-source model and code.
| Property | Details |
|----------|---------|
| Model Type | Phi2-Chinese-0.2B |
| Training Data | BelleGroup/train_1M_CN |
| Library Name | transformers |
| Pipeline Tag | text-generation |