🚀 MiniLLM Project
This project aims to build a small-parameter LLM, going through the four stages of pretraining -> instruction fine-tuning -> reward model -> reinforcement learning, to produce, at a controllable cost, a chat model capable of simple conversation.
🚀 Quick Start
Environment Installation
pip install bert4torch==0.4.9.post2 # If not found, specify -i https://pypi.org/simple
Script Explanation
# To prevent the terminal from closing, you can use nohup, tmux, or screen to start the script.
# e.g. nohup torchrun --standalone --nproc_per_node=4 pretrain.py --name baby > nohup.log &
# Pretraining
cd pretrain
torchrun --standalone --nproc_per_node=4 pretrain.py # Some users report that DDP training crashes partway through; if so, set `export NCCL_IB_DISABLE=1`.
# Pretraining Inference (Command-line Chat)
cd pretrain
python infer.py # python infer_transformers.py
# Instruction Fine-tuning Training
cd sft
python sft.py
# Instruction Fine-tuning Inference (Command-line Chat)
cd sft
python infer.py # python infer_transformers.py
# Convert the checkpoint to a format that can be run by transformers
cd docs
python convert.py
✨ Features
- Efficient Training Framework: Uses the bert4torch training framework, with concise and efficient code.
- Inference Compatibility: Trained checkpoints can be used directly for inference with the transformers package.
- Memory Optimization: Optimizes memory usage during training.
- Reproducibility: Provides complete training logs for reproduction and comparison.
Note: The model trained in this experiment currently only has simple chat capabilities (limited by the size of the corpus, model scale, and the size and quality of the SFT corpus) and is not capable of answering complex questions.
📦 Installation
The installation steps are included in the "Quick Start" section. Please follow the instructions there to install the necessary environment and run the scripts.
💻 Usage Examples
Basic Usage - Pretraining Model Inference
# Depending on your network, you may need the following two lines to route downloads through a Hugging Face mirror
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from transformers import AutoTokenizer, LlamaForCausalLM
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-L12_H1024_A8-WithWudao'  # or: 'Tongjilibo/MiniLLM-L12_H1024_A8-NoWudao'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)
query = '王鹏是一名'  # Chinese prompt meaning "Wang Peng is a ...", left open for the model to continue
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
print(response)
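The call model.generate(inputs) above runs with the model's default generation settings. If longer or sampled continuations are wanted, standard transformers generation arguments can be passed instead; the values below are illustrative choices, not settings taken from this repository.

output_ids = model.generate(inputs, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.8)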
Basic Usage - Instruction Fine-tuning Model Inference
# Depending on your network, you may need the following two lines to route downloads through a Hugging Face mirror
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from transformers import AutoTokenizer, LlamaForCausalLM
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)
query = '你好'  # Chinese for "Hello"
query = f'<human>{query}<robot>'  # wrap the query in the prompt template used during SFT
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)[len(query):]
print(response)
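For command-line use, the same template can be wrapped in a simple loop. This is a minimal sketch that reuses the tokenizer and model loaded above and keeps no conversation history; the repository's sft/infer.py may build prompts differently, and max_new_tokens here is an illustrative choice.

    # Minimal single-turn chat loop (sketch; not the repository's infer.py)
    while True:
        user_input = input('User: ').strip()
        if user_input in ('exit', 'quit'):
            break
        prompt = f'<human>{user_input}<robot>'
        inputs = tokenizer.encode(prompt, return_tensors='pt', add_special_tokens=False).to(device)
        output_ids = model.generate(inputs, max_new_tokens=256)
        response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)[len(prompt):]
        print('Assistant:', response)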
📚 Documentation
Update History
- 20240316: Initial submission, including the pretrained models MiniLLM-L12_H1024_A8-NoWudao and MiniLLM-L12_H1024_A8-WithWudao, and the SFT model MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca.
Pretraining
Pretraining Corpus (from baby-llama2-chinese)
Chinese Pretraining Corpus | Description |
---|---|
Wiki Chinese Encyclopedia | Data from Chinese Wikipedia |
BaiduBaiKe (Extraction code: bwvb) | Data from Chinese BaiduBaiKe |
C4_zh: part1 (Extraction code: zv4r); C4_zh: part2 (Extraction code: sb83); C4_zh: part3 (Extraction code: l89d) | C4 is one of the largest available language datasets, collecting over 156 billion tokens from over 365 million domains on the Internet. C4_zh is a part of it. |
WuDaoCorpora | 200G of open-source Chinese Wudao data |
shibing624/medical | A part of the pretraining data in the medical field from shibing624 |
The project has open-sourced the pretraining corpus processed by the ChatGLM2-6B tokenizer, with a total data volume of 63.4 billion Tokens. The link is as follows: Corpus (Extraction code: 6unr).
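For reference, the sketch below shows one way a raw text corpus can be tokenized with the ChatGLM2-6B tokenizer and packed into token ids. It is an assumed preprocessing flow (plain-text input with one document per line, ids packed into a flat uint16 binary, the convention used in baby-llama2-chinese-style projects); the scripts actually used to produce the released corpus may differ.

    # Sketch of corpus tokenization with the ChatGLM2-6B tokenizer (assumed flow, not the repo's exact script)
    import numpy as np
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('THUDM/chatglm2-6b', trust_remote_code=True)

    def pack_corpus(txt_path, bin_path):
        ids = []
        with open(txt_path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                ids.extend(tokenizer.encode(line, add_special_tokens=False))
                # a document-separator/EOS id would normally be appended here, following the repo's convention
        # the ChatGLM2 vocabulary (~65k) fits into uint16
        np.array(ids, dtype=np.uint16).tofile(bin_path)

    pack_corpus('corpus.txt', 'corpus.bin')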
Pretraining Weights
Pretraining Weights | Pretraining Corpus | Download Address |
---|---|---|
MiniLLM-L12_H1024_A8-NoWudao | (14 billion Tokens) Wiki Chinese Encyclopedia, BaiduBaiKe, shibing624/medical, C4_zh | Baidu Netdisk, HuggingFace |
MiniLLM-L12_H1024_A8-WithWudao | (64 billion Tokens) Wiki Chinese Encyclopedia, BaiduBaiKe, shibing624/medical, C4_zh, WuDaoCorpora | Baidu Netdisk, HuggingFace |
Pretraining Process
- Training Parameter Configuration and Training Duration (an illustrative sketch of these settings follows the table)
Weights | Pretraining Settings | Hardware Usage and Training Duration |
---|---|---|
MiniLLM-L12_H1024_A8-NoWudao | 14 billion Tokens; btz=32*4gpu; lr=3e-4; warmup_steps=5000 | 4×A800(80G), about 60G per card, taking 20 hours |
MiniLLM-L12_H1024_A8-WithWudao | 64 billion Tokens; btz=32*4gpu; lr=1.5e-4; warmup_steps=5000 | ✅ 4×A800(80G), about 60G per card, taking 3.79 days ✅ 2×4090 in the baby-llama2 project, taking 26 days ✅ In personal tests, with a single card and btz=8, the GPU usage is about 17G, and the duration is unknown (gradient accumulation can be used to further reduce the usage). |
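Read the settings above as: btz=32 per GPU on 4 GPUs gives an effective batch of 128 sequences per optimizer step, with the learning rate warmed up over the first 5000 steps. The sketch below is plain PyTorch and only illustrates the warmup and batching arithmetic; it is not the repository's bert4torch training code, and it assumes linear warmup to the peak rate.

    # Illustrative warmup schedule (plain PyTorch sketch, not the repo's bert4torch trainer)
    import torch
    from torch.optim.lr_scheduler import LambdaLR

    param = torch.nn.Parameter(torch.zeros(1))       # dummy parameter standing in for the model
    optimizer = torch.optim.AdamW([param], lr=3e-4)  # lr=1.5e-4 was used for the WithWudao run

    warmup_steps = 5000
    scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

    # effective batch size: btz=32 per GPU * 4 GPUs = 128 sequences per optimizer step;
    # gradient accumulation over k micro-batches keeps this while cutting per-GPU memory
    for step in range(5):
        optimizer.step()      # in real training, backward() on the LM loss precedes this
        scheduler.step()
        print(step, scheduler.get_last_lr())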
Pretraining Model Invocation
The code example is provided in the "Usage Examples" section.
Pretraining Continuation Effect
- MiniLLM-L12_H1024_A8-NoWudao
User: Xiaoming has excellent academic performance and good health. He is a
Continuation: Xiaoming has excellent academic performance and good health. He is an outstanding student at Dalian University of Technology. Since his freshman year in college, he has been very fond of sports, with top-notch grades, and has participated in many competitions. In the school's sports activities, he ranked among the top and was awarded the title of "Outstanding Athlete" by Dalian University of Technology.
In 2006, in his sophomore year of high school, Xiaoming's classmate Yang Fan came to Dalian University of Technology. He was determined to change his fate. Every day after school, he would run to the playground to practice basketball. By around 3 pm, Xiaoming's basketball skills began to show. His basketball moves were agile and powerful. Moreover, he often used his free time to practice jump shots.
User: Before my bed, the moonlight is so bright. I wonder if it's frost on the ground. I raise my head to look at the bright moon.
Continuation: Before my bed, the moonlight is so bright. I wonder if it's frost on the ground. I raise my head to look at the bright moon. Lowering my head, I miss my hometown...
When I read these poems, I also feel very kind and familiar.
- MiniLLM-L12_H1024_A8-WithWudao
User: Xiaoming has excellent academic performance and good health. He is a
Continuation: Xiaoming has excellent academic performance and good health. He is an outstanding teacher. He is honest and responsible. In class, he can listen attentively, think actively, and actively raise his hand to answer questions in every class. After class, he actively asks to help the teacher, carefully completes his homework, and finishes the homework assigned by the teacher on time, which is well-received by his classmates.
User: Before my bed, the moonlight is so bright. I wonder if it's frost on the ground. I raise my head to look at the bright moon.
Continuation: "Before my bed, the moonlight is so bright. I wonder if it's frost on the ground. I raise my head to look at the bright moon. Lowering my head, I miss my hometown." This poem not only describes the warmth and tranquility of the Mid-Autumn night but also expresses the author's deep affection and longing for his hometown. The phrase "The moon rises above the willow branches" describes what is seen under the moon. "I want to pour my feelings into the lute" refers to the desire to express farewell feelings; "I raise my head to look at the bright moon" describes the scene of the moon hanging high in the sky, shining brightly with stars on the Mid-Autumn night; "Lowering my head, I miss my hometown" expresses the deep affection and longing for the hometown.
In terms of writing techniques, this poem mainly uses symbolism.
Instruction Fine-tuning
Instruction Fine-tuning Corpus (Selected Available Datasets)
Dataset Name | Introduction |
---|---|
shibing624/alpaca-zh | Self-instruct data generated with GPT-4 following the Alpaca approach, about 50,000 entries (see the loading sketch after this table) |
BelleGroup/Belle-0.5M-cn | About 500,000 Chinese instruction entries generated by the BELLE project |
BelleGroup/Belle-1M-cn | About 1 million Chinese instruction entries generated by the BELLE project |
BelleGroup/Belle-school_math_0.25M | Belle's open 0.25M math instruction dataset |
BelleGroup/Belle-multiturn_chat_0.8M | Belle's open 0.8M multi-turn task dialogue dataset |
YeungNLP/firefly-train-1.1M | Data for 23 common Chinese NLP tasks in Firefly, plus manually constructed data on Chinese culture such as couplets, poem writing, classical Chinese translation, prose, and Jin Yong's novels. Several instruction templates were written by hand for each task to ensure data quality and richness; about 1.15 million entries in total. |
fnlp/moss-002-sft-data | The multi-turn dialogue data used by MOSS-002, covering three aspects of usefulness, faithfulness, and harmlessness, including about 570,000 English dialogues and 590,000 Chinese dialogues generated by text-davinci-003. |
fnlp/moss-003-sft-data | The multi-turn dialogue data used by moss-moon-003-sft, constructed based on about 100,000 user input data collected during the internal testing phase of MOSS-002 and gpt-3.5-turbo. Compared with moss-002-sft-data, moss-003-sft-data is more in line with the real user intention distribution, containing more fine-grained usefulness category labels, more extensive harmlessness data, and longer dialogue turns, with about 1.1 million dialogue data. |
shareAI/CodeChat | Mainly contains corpus samples related to logical reasoning, code Q&A, and code generation. |
shareAI/ShareGPT-Chinese-English-90k | A high-quality Chinese-English parallel bilingual human-machine Q&A dataset covering real and complex user questions. |
deepctrl/deepctrl-sft-data | The SFT dataset of the Jiangshu large model is a high-quality dataset carefully collected and organized by Jiangshu Technology, including a Chinese dataset with 10 million entries and an English dataset with 2 million entries. |
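As an illustration of how such a dataset can be turned into SFT samples, the sketch below loads shibing624/alpaca-zh with the datasets library and applies the same <human>...<robot> template used at inference time. The field names instruction/input/output are the standard Alpaca layout and are assumed here; the repository's sft/sft.py may build samples differently.

    # Sketch: build <human>...<robot> SFT samples from shibing624/alpaca-zh
    # (assumes the standard Alpaca fields instruction/input/output; not the repo's exact sft.py logic)
    from datasets import load_dataset

    dataset = load_dataset('shibing624/alpaca-zh', split='train')

    def to_sample(example):
        # fold the optional `input` field into the question
        question = example['instruction']
        if example.get('input'):
            question = f"{question}\n{example['input']}"
        return {'text': f"<human>{question}<robot>{example['output']}"}

    sft_samples = dataset.map(to_sample)
    print(sft_samples[0]['text'])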
Instruction Fine-tuning Weights
Instruction Fine-tuning Weights | Corpus | Download Address |
---|---|---|
MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca | shibing624/alpaca-zh | Baidu Netdisk, HuggingFace |
Instruction Fine-tuning Training Process
- Training Parameter Configuration and Training Duration
Weights | Fine-tuning Settings | Hardware Usage and Training Duration |
---|---|---|
MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca | shibing624/alpaca-zh dataset; btz=8; lr=2e-5; 5 epochs | Single 4090, about 17G of GPU memory, taking 45 minutes |
Instruction Fine-tuning Model Invocation
The code example is provided in the "Usage Examples" section.
Instruction Fine-tuning Chat Effect
- MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca
User: Hello
Assistant: Hello! Nice to meet you.
User: What can you do?
Assistant: Hello! I'm an AI assistant. Is there anything I can do for you?
User: How to make scrambled eggs with tomatoes?
Assistant: If you want to make scrambled eggs with tomatoes, the following steps may help you:
1. Prepare ingredients: Prepare eggs, salt, coriander, shredded carrots, shredded cucumbers, etc.
2. Wash the eggs: Break the eggs into a mixing bowl to clean their surfaces.
3. Prepare materials: Prepare a pot, pour an appropriate amount of water, and pour the egg liquid into the pot.
4. Boil the eggs: Fry the eggs in a soup pot or pan until their surfaces turn golden and they are cooked.
5. Stir-fry the vegetables: Add an appropriate amount of oil to the pot, stir-fry the shredded carrots and shredded cucumbers (according to personal preference) for a few times until the carrots are cooked.
6. Add the eggs: Take out the eggs from the pot and pour them into the pot.
7. Season: Pour the stir-fried eggs into the pot and stir-fry them with the vegetables for a few times.
Note: If you want to stir-fry the vegetables to have a richer flavor, you can add an appropriate amount of salt and coriander to enhance the flavor.
📄 License
This project is licensed under the Apache-2.0 license.

