🚀 MiniLLM Project
This project aims to build a small-parameter LLM, going through the four stages of pretraining -> instruction fine-tuning -> reward model -> reinforcement learning, to produce, at a controllable cost, a chat model capable of simple conversation.
🚀 Quick Start
Environment Installation
pip install bert4torch==0.4.9.post2 # If not found, specify -i https://pypi.org/simple
Script Explanation
# To prevent the terminal from closing, you can use nohup, tmux, or screen to start the script.
# e.g. nohup torchrun --standalone --nproc_per_node=4 pretrain.py --name baby > nohup.log &
# Pretraining
cd pretrain
torchrun --standalone --nproc_per_node=4 pretrain.py # Some users report that DDP training crashes partway through; if so, set `export NCCL_IB_DISABLE=1`.
# Pretraining Inference (Command-line Chat)
cd pretrain
python infer.py # python infer_transformers.py
# Instruction Fine-tuning Training
cd sft
python sft.py
# Instruction Fine-tuning Inference (Command-line Chat)
cd sft
python infer.py # python infer_transformers.py
# Convert the checkpoint to a format that can be run by transformers
cd docs
python convert.py
✨ Features
- Efficient Training Framework: Uses the bert4torch training framework, with concise and efficient code.
- Inference Compatibility: Trained checkpoints can be used directly for inference with the transformers package.
- Memory Optimization: Optimizes memory usage during training.
- Reproducibility: Provides complete training logs for reproduction and comparison.
Note: The model trained in this experiment currently only has simple chat capabilities (limited by the size of the corpus, model scale, and the size and quality of the SFT corpus) and is not capable of answering complex questions.
📦 Installation
The installation steps are included in the "Quick Start" section. Please follow the instructions there to install the necessary environment and run the scripts.
💻 Usage Examples
Basic Usage - Pretraining Model Inference
# Depending on your network, you may need the following two lines to route downloads through a Hugging Face mirror
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from transformers import AutoTokenizer, LlamaForCausalLM
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-L12_H1024_A8-WithWudao'  # or: 'Tongjilibo/MiniLLM-L12_H1024_A8-NoWudao'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)
query = '王鹏是一名'  # Chinese prompt meaning "Wang Peng is a ...", left open for the model to continue
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
print(response)
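The call model.generate(inputs) above runs with the model's default generation settings. If longer or sampled continuations are wanted, standard transformers generation arguments can be passed instead; the values below are illustrative choices, not settings taken from this repository.

output_ids = model.generate(inputs, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.8)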
Basic Usage - Instruction Fine-tuning Model Inference
# Depending on your network, you may need the following two lines to route downloads through a Hugging Face mirror
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from transformers import AutoTokenizer, LlamaForCausalLM
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)
query = '你好'  # Chinese for "Hello"
query = f'<human>{query}<robot>'  # wrap the query in the prompt template used during SFT
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)[len(query):]
print(response)
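For command-line use, the same template can be wrapped in a simple loop. This is a minimal sketch that reuses the tokenizer and model loaded above and keeps no conversation history; the repository's sft/infer.py may build prompts differently, and max_new_tokens here is an illustrative choice.

    # Minimal single-turn chat loop (sketch; not the repository's infer.py)
    while True:
        user_input = input('User: ').strip()
        if user_input in ('exit', 'quit'):
            break
        prompt = f'<human>{user_input}<robot>'
        inputs = tokenizer.encode(prompt, return_tensors='pt', add_special_tokens=False).to(device)
        output_ids = model.generate(inputs, max_new_tokens=256)
        response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)[len(prompt):]
        print('Assistant:', response)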
📚 Documentation
Update History
- 20240316: Initial submission, including the pretrained models MiniLLM-L12_H1024_A8-NoWudao and MiniLLM-L12_H1024_A8-WithWudao, and the SFT model MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca.
Pretraining
Pretraining Corpus (from baby-llama2-chinese)
Chinese Pretraining Corpus | Description |
---|---|
Wiki Chinese Encyclopedia | Data from Chinese Wikipedia |
BaiduBaiKe (Extraction code: bwvb) | Data from Chinese BaiduBaiKe |
C4_zh: part1 (Extraction code: zv4r); C4_zh: part2 (Extraction code: sb83); C4_zh: part3 (Extraction code: l89d) | C4 is one of the largest available language datasets, collecting over 156 billion tokens from over 365 million domains on the Internet. C4_zh is a part of it. |
WuDaoCorpora | 200G of open-source Chinese Wudao data |
shibing624/medical | A part of the pretraining data in the medical field from shibing624 |
The project has open-sourced the pretraining corpus processed by the ChatGLM2-6B tokenizer, with a total data volume of 63.4 billion Tokens. The link is as follows: Corpus (Extraction code: 6unr).
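For reference, the sketch below shows one way a raw text corpus can be tokenized with the ChatGLM2-6B tokenizer and packed into token ids. It is an assumed preprocessing flow (plain-text input with one document per line, ids packed into a flat uint16 binary, the convention used in baby-llama2-chinese-style projects); the scripts actually used to produce the released corpus may differ.

    # Sketch of corpus tokenization with the ChatGLM2-6B tokenizer (assumed flow, not the repo's exact script)
    import numpy as np
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained('THUDM/chatglm2-6b', trust_remote_code=True)

    def pack_corpus(txt_path, bin_path):
        ids = []
        with open(txt_path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                ids.extend(tokenizer.encode(line, add_special_tokens=False))
                # a document-separator/EOS id would normally be appended here, following the repo's convention
        # the ChatGLM2 vocabulary (~65k) fits into uint16
        np.array(ids, dtype=np.uint16).tofile(bin_path)

    pack_corpus('corpus.txt', 'corpus.bin')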
Pretraining Weights
Pretraining Weights | Pretraining Corpus | Download Address |
---|---|---|
MiniLLM-L12_H1024_A8-NoWudao | (14 billion Tokens) Wiki Chinese Encyclopedia, BaiduBaiKe, shibing624/medical, C4_zh | Baidu Netdisk, HuggingFace |
MiniLLM-L12_H1024_A8-WithWudao | (64 billion Tokens) Wiki Chinese Encyclopedia, BaiduBaiKe, shibing624/medical, C4_zh, WuDaoCorpora | Baidu Netdisk, HuggingFace |
Pretraining Process
- Training Parameter Configuration and Training Duration (an illustrative sketch of these settings follows the table)
Weights | Pretraining Settings | Hardware Usage and Training Duration |
---|---|---|
MiniLLM-L12_H1024_A8-NoWudao | 14 billion Tokens; btz=32*4gpu; lr=3e-4; warmup_steps=5000 | 4×A800(80G), about 60G per card, taking 20 hours |
MiniLLM-L12_H1024_A8-WithWudao | 64 billion Tokens; btz=32*4gpu; lr=1.5e-4; warmup_steps=5000 | ✅ 4×A800(80G), about 60G per card, taking 3.79 days ✅ 2×4090 in the baby-llama2 project, taking 26 days ✅ In personal tests, with a single card and btz=8, the GPU usage is about 17G, and the duration is unknown (gradient accumulation can be used to further reduce the usage). |
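Read the settings above as: btz=32 per GPU on 4 GPUs gives an effective batch of 128 sequences per optimizer step, with the learning rate warmed up over the first 5000 steps. The sketch below is plain PyTorch and only illustrates the warmup and batching arithmetic; it is not the repository's bert4torch training code, and it assumes linear warmup to the peak rate.

    # Illustrative warmup schedule (plain PyTorch sketch, not the repo's bert4torch trainer)
    import torch
    from torch.optim.lr_scheduler import LambdaLR

    param = torch.nn.Parameter(torch.zeros(1))       # dummy parameter standing in for the model
    optimizer = torch.optim.AdamW([param], lr=3e-4)  # lr=1.5e-4 was used for the WithWudao run

    warmup_steps = 5000
    scheduler = LambdaLR(optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))

    # effective batch size: btz=32 per GPU * 4 GPUs = 128 sequences per optimizer step;
    # gradient accumulation over k micro-batches keeps this while cutting per-GPU memory
    for step in range(5):
        optimizer.step()      # in real training, backward() on the LM loss precedes this
        scheduler.step()
        print(step, scheduler.get_last_lr())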
Pretraining Model Invocation
The code example is provided in the "Usage Examples" section.
Pretraining Continuation Effect
- MiniLLM-L12_H1024_A8-NoWudao
User: Xiaoming has excellent academic performance and good health. He is a
Continuation: Xiaoming has excellent academic performance and good health. He is an outstanding student at Dalian University of Technology. Since his freshman year in college, he has been very fond of sports, with top-notch grades, and has participated in many competitions. In the school's sports activities, he ranked among the top and was awarded the title of "Outstanding Athlete" by Dalian University of Technology.
In 2006, in his sophomore year of high school, Xiaoming's classmate Yang Fan came to Dalian University of Technology. He was determined to change his fate. Every day after school, he would run to the playground to practice basketball. By around 3 pm, Xiaoming's basketball skills began to show. His basketball moves were agile and powerful. Moreover, he often used his free time to practice jump shots.
User: Before my bed, the moonlight is so bright. I wonder if it's frost on the ground. I raise my head to look at the bright moon.
Continuation: Before my bed, the moonlight is so bright. I wonder if it's frost on the ground. I raise my head to look at the bright moon. Lowering my head, I miss my hometown...
When I read these poems, I also feel very kind and familiar.
- MiniLLM-L12_H1024_A8-WithWudao
User: Xiaoming has excellent academic performance and good health. He is a
Continuation: Xiaoming has excellent academic performance and good health. He is an outstanding teacher. He is honest and responsible. In class, he can listen attentively, think actively, and actively raise his hand to answer questions in every class. After class, he actively asks to help the teacher, carefully completes his homework, and finishes the homework assigned by the teacher on time, which is well-received by his classmates.
User: Before my bed, the moonlight is so bright. I wonder if it's frost on the ground. I raise my head to look at the bright moon.
Continuation: "Before my bed, the moonlight is so bright. I wonder if it's frost on the ground. I raise my head to look at the bright moon. Lowering my head, I miss my hometown." This poem not only describes the warmth and tranquility of the Mid-Autumn night but also expresses the author's deep affection and longing for his hometown. The phrase "The moon rises above the willow branches" describes what is seen under the moon. "I want to pour my feelings into the lute" refers to the desire to express farewell feelings; "I raise my head to look at the bright moon" describes the scene of the moon hanging high in the sky, shining brightly with stars on the Mid-Autumn night; "Lowering my head, I miss my hometown" expresses the deep affection and longing for the hometown.
In terms of writing techniques, this poem mainly uses symbolism.
Instruction Fine-tuning
Instruction Fine-tuning Corpus (Selected Available Datasets)
Dataset Name | Introduction |
---|---|
shibing624/alpaca-zh | Self-instruct data generated with GPT-4 following the Alpaca approach, about 50,000 entries (see the loading sketch after this table) |
BelleGroup/Belle-0.5M-cn | About 500,000 Chinese instruction entries generated by the BELLE project |
BelleGroup/Belle-1M-cn | About 1 million Chinese instruction entries generated by the BELLE project |
BelleGroup/Belle-school_math_0.25M | Belle's open 0.25M math instruction dataset |
BelleGroup/Belle-multiturn_chat_0.8M | Belle's open 0.8M multi-turn task dialogue dataset |
YeungNLP/firefly-train-1.1M | Data for 23 common Chinese NLP tasks in Firefly, plus manually constructed data on Chinese culture such as couplets, poem writing, classical Chinese translation, prose, and Jin Yong's novels. Several instruction templates were written by hand for each task to ensure data quality and richness; about 1.15 million entries in total. |
fnlp/moss-002-sft-data | The multi-turn dialogue data used by MOSS-002, covering three aspects of usefulness, faithfulness, and harmlessness, including about 570,000 English dialogues and 590,000 Chinese dialogues generated by text-davinci-003. |
fnlp/moss-003-sft-data | The multi-turn dialogue data used by moss-moon-003-sft, constructed based on about 100,000 user input data collected during the internal testing phase of MOSS-002 and gpt-3.5-turbo. Compared with moss-002-sft-data, moss-003-sft-data is more in line with the real user intention distribution, containing more fine-grained usefulness category labels, more extensive harmlessness data, and longer dialogue turns, with about 1.1 million dialogue data. |
shareAI/CodeChat | Mainly contains corpus samples related to logical reasoning, code Q&A, and code generation. |
shareAI/ShareGPT-Chinese-English-90k | A high-quality Chinese-English parallel bilingual human-machine Q&A dataset covering real and complex user questions. |
deepctrl/deepctrl-sft-data | The SFT dataset of the Jiangshu large model is a high-quality dataset carefully collected and organized by Jiangshu Technology, including a Chinese dataset with 10 million entries and an English dataset with 2 million entries. |
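As an illustration of how such a dataset can be turned into SFT samples, the sketch below loads shibing624/alpaca-zh with the datasets library and applies the same <human>...<robot> template used at inference time. The field names instruction/input/output are the standard Alpaca layout and are assumed here; the repository's sft/sft.py may build samples differently.

    # Sketch: build <human>...<robot> SFT samples from shibing624/alpaca-zh
    # (assumes the standard Alpaca fields instruction/input/output; not the repo's exact sft.py logic)
    from datasets import load_dataset

    dataset = load_dataset('shibing624/alpaca-zh', split='train')

    def to_sample(example):
        # fold the optional `input` field into the question
        question = example['instruction']
        if example.get('input'):
            question = f"{question}\n{example['input']}"
        return {'text': f"<human>{question}<robot>{example['output']}"}

    sft_samples = dataset.map(to_sample)
    print(sft_samples[0]['text'])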
Instruction Fine-tuning Weights
Instruction Fine-tuning Weights | Corpus | Download Address |
---|---|---|
MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca | shibing624/alpaca-zh | Baidu Netdisk, HuggingFace |
Instruction Fine-tuning Training Process
- Training Parameter Configuration and Training Duration
Weights | Fine-tuning Settings | Hardware Usage and Training Duration |
---|---|---|
MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca | shibing624/alpaca-zh dataset; btz=8; lr=2e-5; 5 epochs | Single 4090, about 17G of GPU memory, taking 45 minutes |
Instruction Fine-tuning Model Invocation
The code example is provided in the "Usage Examples" section.
Instruction Fine-tuning Chat Effect
- MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca
User: Hello
Assistant: Hello! Nice to meet you.
User: What can you do?
Assistant: Hello! I'm an AI assistant. Is there anything I can do for you?
User: How to make scrambled eggs with tomatoes?
Assistant: If you want to make scrambled eggs with tomatoes, the following steps may help you:
1. Prepare ingredients: Prepare eggs, salt, coriander, shredded carrots, shredded cucumbers, etc.
2. Wash the eggs: Break the eggs into a mixing bowl to clean their surfaces.
3. Prepare materials: Prepare a pot, pour an appropriate amount of water, and pour the egg liquid into the pot.
4. Boil the eggs: Fry the eggs in a soup pot or pan until their surfaces turn golden and they are cooked.
5. Stir-fry the vegetables: Add an appropriate amount of oil to the pot, stir-fry the shredded carrots and shredded cucumbers (according to personal preference) for a few times until the carrots are cooked.
6. Add the eggs: Take out the eggs from the pot and pour them into the pot.
7. Season: Pour the stir-fried eggs into the pot and stir-fry them with the vegetables for a few times.
Note: If you want to stir-fry the vegetables to have a richer flavor, you can add an appropriate amount of salt and coriander to enhance the flavor.
📄 License
This project is licensed under the Apache-2.0 license.

