MiniLLM 0.2B Base
MiniLLM is a lightweight language model project that implements the full pipeline from pre-training → instruction fine-tuning → reward modeling → reinforcement learning, building a chat model with basic conversational capabilities economically and efficiently.
Downloads: 41
Release Time: 3/16/2024
Model Overview
This project is dedicated to creating a lightweight language model using the bert4torch training framework, with concise and efficient code. The trained model can directly integrate with the transformers inference ecosystem. The current experimental model only has basic conversational functionality.
Model Features
Lightweight and efficient
Uses the bert4torch training framework with concise and efficient code, optimizing GPU memory usage during training
Strong compatibility
The trained model can directly integrate with the transformers inference ecosystem
Full-process implementation
Complete implementation of the entire process from pre-training → instruction fine-tuning → reward modeling → reinforcement learning
Model Capabilities
Chinese text generation
Basic conversation
Text continuation
Use Cases
Education
Learning assistant
Helps students answer basic learning questions
Can generate explanations and examples of basic learning content
Entertainment
Simple chat
Engages in daily conversation
Capable of basic greetings and simple topic discussions
🚀 MiniLLM
This project aims to build a small-parameter LLM at a controllable cost, going through the four stages of pretraining → instruction fine-tuning → reward modeling → reinforcement learning to arrive at a chat model capable of simple conversation.
🚀 Quick Start
Environment Installation
pip install bert4torch==0.4.9.post2 # If not found, specify -i https://pypi.org/simple
Script Explanation
# To prevent the terminal from closing, you can use nohup, tmux, or screen to start the process
# eg. nohup torchrun --standalone --nproc_per_node=4 pretrain.py --name baby > nohup.log&
# Pretraining
cd pretrain
torchrun --standalone --nproc_per_node=4 pretrain.py # If DDP training crashes, set `export NCCL_IB_DISABLE=1`
# Pretraining inference (command-line chat)
cd pretrain
python infer.py # or: python infer_transformers.py
# Instruction fine-tuning training
cd sft
python sft.py
# Instruction fine-tuning inference (command-line chat)
cd sft
python infer.py # or: python infer_transformers.py
# Convert ckpt to a format that can be run by transformers
cd docs
python convert.py
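As a quick sanity check after running convert.py, the converted checkpoint can be loaded directly with transformers. The sketch below is hedged: the local directory `docs/converted` is an assumed output path for illustration, not one documented by the project.

```python
# Hedged sanity check: load a locally converted checkpoint with transformers.
# 'docs/converted' is an assumed output directory; use whatever path convert.py writes to.
from transformers import AutoTokenizer, LlamaForCausalLM

ckpt_dir = 'docs/converted'
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(ckpt_dir)
print(model.config)  # confirm a LLaMA-style config was produced by the conversion
```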
✨ Features
- Training Framework: Uses the bert4torch training framework, with concise and efficient code.
- Inference Compatibility: The trained checkpoints can be used directly for inference with the transformers package.
- Memory Optimization: Optimizes memory usage during training.
- Reproducibility: Provides complete training logs for reproduction and comparison.
Disclaimer: The model trained in this experiment currently only has simple chat functions (limited by the size of the corpus, model scale, and the size and quality of the SFT corpus) and does not have the ability to answer complex questions.
📚 Documentation
Update History
- 20240316: Initial submission, including the pretrained models MiniLLM-L12_H1024_A8-NoWudao and MiniLLM-L12_H1024_A8-WithWudao, and the SFT model MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca.
Pretraining
Pretraining Corpus (from baby-llama2-chinese)
| Chinese Pretraining Corpus | Description |
|---|---|
| Wiki Chinese Encyclopedia | Data from Chinese Wikipedia |
| BaiduBaiKe (extraction code: bwvb) | Data from Chinese Baidu Baike |
| C4_zh: part1 (extraction code: zv4r); C4_zh: part2 (extraction code: sb83); C4_zh: part3 (extraction code: l89d) | C4 is one of the largest available language datasets, collecting over 156 billion tokens from over 365 million domains on the Internet; C4_zh is a part of it. |
| WuDaoCorpora | 200 GB of open-source Chinese data from WuDao |
| shibing624/medical | Part of the medical-domain pretraining data from shibing624 |
The project has open-sourced the pretraining corpus processed with the ChatGLM2-6B tokenizer, totaling 63.4 billion tokens. Download link: Corpus (extraction code: 6unr).
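For illustration, here is a minimal, hedged sketch of how raw text could be tokenized with the ChatGLM2-6B tokenizer and packed into a flat binary file of token IDs, a common pretraining data layout. The output path, uint16 dtype, and EOS separator are assumptions, not necessarily what the project's preprocessing scripts do.

```python
# Hedged sketch: tokenize raw Chinese text with the ChatGLM2-6B tokenizer and pack
# the token IDs into one flat binary file. Output path, uint16 dtype, and the EOS
# separator are illustrative assumptions, not the project's documented pipeline.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('THUDM/chatglm2-6b', trust_remote_code=True)

def tokenize_corpus(lines, out_path='corpus.bin'):
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ids.extend(tokenizer.encode(line, add_special_tokens=False))
        if tokenizer.eos_token_id is not None:   # separate documents with EOS if available
            ids.append(tokenizer.eos_token_id)
    np.array(ids, dtype=np.uint16).tofile(out_path)  # ChatGLM2's ~65k vocab fits in uint16
    return len(ids)

# Example: n_tokens = tokenize_corpus(open('wiki_zh.txt', encoding='utf-8'))
```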
Pretraining Weights
| Pretraining Weights | Pretraining Corpus | Download Address |
|---|---|---|
| MiniLLM-L12_H1024_A8-NoWudao | (14 billion tokens) Wiki Chinese Encyclopedia, BaiduBaiKe, shibing624/medical, C4_zh | Baidu Netdisk, HuggingFace |
| MiniLLM-L12_H1024_A8-WithWudao | (64 billion tokens) Wiki Chinese Encyclopedia, BaiduBaiKe, shibing624/medical, C4_zh, WuDaoCorpora | Baidu Netdisk, HuggingFace |
Pretraining Process
- Training Parameter Configuration and Training Duration
| Weights | Pretraining Settings | Hardware Usage and Training Duration |
|---|---|---|
| MiniLLM-L12_H1024_A8-NoWudao | 14 billion tokens; btz = 32 * 4 GPUs; lr = 3e-4; warmup_steps = 5000 | 4 × A800 (80 GB), about 60 GB per GPU, 20 hours |
| MiniLLM-L12_H1024_A8-WithWudao | 64 billion tokens; btz = 32 * 4 GPUs; lr = 1.5e-4; warmup_steps = 5000 | ✅ 4 × A800 (80 GB), about 60 GB per GPU, 3.79 days; ✅ 2 × 4090 in the baby-llama2 project, 26 days; ✅ in a personal test with a single GPU and btz = 8, GPU usage is about 17 GB and the duration is unknown (gradient accumulation can further reduce usage) |
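To make the settings above concrete, here is a small helper for the effective batch size and the implied linear warmup schedule. The sequence length of 1024 and the constant-after-warmup behavior are assumptions used only for illustration, since neither is stated in the table.

```python
# Hedged helper for the table above: effective batch size and a linear-warmup
# learning-rate schedule. seq_len=1024 and constant LR after warmup are assumptions.
def effective_batch(per_gpu_btz=32, n_gpus=4, grad_accum=1, seq_len=1024):
    samples = per_gpu_btz * n_gpus * grad_accum
    return samples, samples * seq_len            # samples/step, tokens/step

def warmup_lr(step, base_lr=3e-4, warmup_steps=5000):
    return base_lr * min(1.0, step / warmup_steps)

samples, tokens = effective_batch()
print(samples, tokens)    # 128 samples/step, 131072 tokens/step
print(warmup_lr(2500))    # 0.00015, i.e. halfway through warmup
```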
Pretraining Model Call
# Optional: set an HF mirror via the following two lines, depending on your network situation
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from transformers import AutoTokenizer, LlamaForCausalLM
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-L12_H1024_A8-WithWudao' # 'Tongjilibo/MiniLLM-L12_H1024_A8-NoWudao'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)
query = 'Wang Peng is a'
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
print(response)
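Called with no extra arguments, `generate` uses greedy decoding and a short default length budget. Below is a hedged variant with explicit generation settings; the specific values are illustrative rather than recommendations from the project.

```python
# Hedged variant of the generate call above with explicit settings.
# The parameter values are illustrative only.
output_ids = model.generate(
    inputs,
    max_new_tokens=128,      # length budget for the continuation
    do_sample=True,          # sample instead of greedy decoding
    top_p=0.9,
    temperature=0.8,
    repetition_penalty=1.1,
)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
print(response)
```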
Pretraining Continuation Effect
- MiniLLM-L12_H1024_A8-NoWudao
User: Xiaoming has excellent academic performance and good health. He is a
Continuation: Xiaoming has excellent academic performance and good health. He is an outstanding student at Dalian University of Technology. Since his freshman year in college, he has been very obsessed with sports, with top-notch grades. He has also participated in many competitions and ranked among the top in the school's sports activities. He was awarded the title of "Outstanding Athlete" by Dalian University of Technology.
In 2006, in his sophomore year of high school, Xiaoming's classmate Yang Fan came to Dalian University of Technology. He was determined to change his fate. Every day after school, he would run to the playground to practice basketball. By around 3 pm, Xiaoming's basketball skills began to show. His basketball movements were agile and powerful. Moreover, he often used his free time to practice jump shots.
User: Before my bed a pool of light, I wonder if it's frost on the ground. Looking up, I find the moon bright.
Continuation: Before my bed a pool of light, I wonder if it's frost on the ground. Looking up, I find the moon bright. Bowing my head, I'm lost in homesickness..."
When I read these verses, I also feel very kind and familiar!
- MiniLLM-L12_H1024_A8-WithWudao
User: Xiaoming has excellent academic performance and good health. He is a
Continuation: Xiaoming has excellent academic performance and good health. He is an excellent teacher. He is honest and responsible. In class, he can listen carefully, think actively, and raise his hand to answer questions in every class. After class, he actively offers to help the teacher, does his homework carefully, and completes the homework assigned by the teacher on time, which is well-received by his classmates.
User: Before my bed a pool of light, I wonder if it's frost on the ground. Looking up, I find the moon bright.
Continuation: "Before my bed a pool of light, I wonder if it's frost on the ground. Looking up, I find the moon bright. Bowing my head, I'm lost in homesickness." This poem not only describes the warmth and tranquility of the Mid-Autumn night but also expresses the author's deep attachment to his hometown and homesickness. The phrase "The moon rises above the willow branches" describes what is seen under the moon. "I want to pour my heart out to the zither" refers to the desire to express farewell feelings; "Looking up at the bright moon" describes the scene of the moon hanging high in the sky, the bright moon shining, and the stars twinkling on the Mid-Autumn night; "Bowing my head, I'm lost in homesickness" expresses the deep affection for the hometown.
In terms of writing techniques, this poem mainly uses symbolism.
Instruction Fine-Tuning
Instruction Fine-Tuning Corpus (Selected Available Datasets)
| Dataset Name | Introduction |
|---|---|
| shibing624/alpaca-zh | Self-instruct data generated with GPT-4 following the Alpaca method, about 50,000 entries |
| BelleGroup/Belle-0.5M-cn | About 500,000 Chinese instruction entries generated by the BELLE project |
| BelleGroup/Belle-1M-cn | About 1 million Chinese instruction entries generated by the BELLE project |
| BelleGroup/Belle-school_math_0.25M | Belle's open 0.25M math instruction dataset |
| BelleGroup/Belle-multiturn_chat_0.8M | Belle's open 0.8M multi-turn task dialogue dataset |
| YeungNLP/firefly-train-1.1M | Data for 23 common Chinese NLP tasks in Firefly, plus many manually constructed samples related to Chinese culture, such as couplets, poem writing, classical Chinese translation, prose, and Jin Yong's novels; several instruction templates were written manually for each task to ensure data quality and richness, with a data volume of 1.15 million |
| fnlp/moss-002-sft-data | Multi-turn dialogue data used by MOSS-002, covering helpfulness, faithfulness, and harmlessness, including about 570,000 English and 590,000 Chinese dialogues generated by text-davinci-003 |
| fnlp/moss-003-sft-data | Multi-turn dialogue data used by moss-moon-003-sft, constructed from about 100,000 user inputs collected during the MOSS-002 internal testing phase and gpt-3.5-turbo; compared with moss-002-sft-data, it better matches the real user intention distribution, with more fine-grained helpfulness labels, broader harmlessness coverage, and longer dialogues, about 1.1 million dialogues |
| shareAI/CodeChat | Mainly contains samples related to logical reasoning, code Q&A, and code generation |
| shareAI/ShareGPT-Chinese-English-90k | High-quality Chinese-English parallel bilingual human-machine Q&A dataset covering user questions in real and complex scenarios |
| deepctrl/deepctrl-sft-data | The SFT dataset of the Jiangshu large model, carefully collected and organized by Jiangshu Technology, including a Chinese dataset with 10 million entries and an English dataset with 2 million entries |
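The datasets above are instruction/response pairs. As a hedged sketch, one shibing624/alpaca-zh record might be flattened into the `<human>...<robot>...` layout used by the inference code later in this document; the field names follow the Alpaca convention, and the exact template and concatenation used by sft.py may differ.

```python
# Hedged sketch: turn an Alpaca-style record into a single training string using the
# <human>/<robot> markers from the inference template shown later in this document.
# The exact template used by sft.py may differ.
def build_sample(record):
    instruction = record['instruction']
    if record.get('input'):                       # optional extra context field
        instruction = f"{instruction}\n{record['input']}"
    prompt = f"<human>{instruction}<robot>"
    return prompt, prompt + record['output']

prompt, full_text = build_sample({
    'instruction': 'Translate the following sentence into English.',
    'input': '今天天气很好。',
    'output': 'The weather is nice today.',
})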
Instruction Fine-Tuning Weights
| Instruction Fine-Tuning Weights | Corpus | Download Address |
|---|---|---|
| MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca | shibing624/alpaca-zh | Baidu Netdisk, HuggingFace |
Instruction Fine-Tuning Training Process
- Training Parameter Configuration and Training Duration
| Weights | Fine-Tuning Settings | Hardware Usage and Training Duration |
|---|---|---|
| MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca | shibing624/alpaca-zh dataset; btz = 8; lr = 2e-5; 5 epochs | Single 4090 GPU, 17 GB of GPU memory, 45 minutes |
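For context on what the fine-tuning step typically optimizes, here is a hedged sketch of response-only loss masking, where prompt tokens are set to -100 so the cross-entropy loss covers only the assistant's reply. This is a common SFT recipe and is not guaranteed to match what sft.py actually does.

```python
# Hedged sketch of a common SFT recipe: mask prompt tokens with -100 so that only
# the response tokens contribute to the cross-entropy loss. May differ from sft.py.
def encode_sft_example(tokenizer, prompt, response, max_len=512):
    prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
    response_ids = tokenizer.encode(response, add_special_tokens=False)
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]
    return {'input_ids': input_ids, 'labels': labels}
```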
Instruction Fine-Tuning Model Call
# Optional: set an HF mirror via the following two lines, depending on your network situation
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from transformers import AutoTokenizer, LlamaForCausalLM
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)
query = 'Hello'
query = f'<human>{query}<robot>'  # wrap the query in the project's <human>/<robot> chat template
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)[len(query):]
print(response)
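Building on the single-turn call above, here is a hedged sketch of a small command-line chat loop. Concatenating previous turns into the prompt this way is an assumption about the training format, so multi-turn quality may degrade.

```python
# Hedged sketch: minimal command-line chat loop reusing the <human>/<robot> template.
# Carrying history by simple concatenation is an assumption, not a documented feature.
history = ''
while True:
    user = input('User: ').strip()
    if user in ('exit', 'quit'):
        break
    prompt = f'{history}<human>{user}<robot>'
    inputs = tokenizer.encode(prompt, return_tensors='pt', add_special_tokens=False).to(device)
    output_ids = model.generate(inputs, max_new_tokens=128)
    text = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
    reply = text[len(prompt):]
    history = prompt + reply
    print('Assistant:', reply)
```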
Instruction Fine-Tuning Chat Effect
- MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca
User: Hello
Assistant: Hello! Nice to meet you.
User: What can you do?
Assistant: Hello! I'm an AI assistant. Is there anything I can do for you?
User: How to make scrambled eggs with tomatoes?
Assistant: If you want to make scrambled eggs with tomatoes, the following steps may help you:
1. Prepare ingredients: Prepare eggs, salt, coriander, shredded carrots, shredded cucumbers, etc.
2. Wash the eggs: Break the eggs into a mixing bowl to clean their surfaces.
3. Prepare materials: Prepare a pot, pour an appropriate amount of water, and pour the egg liquid into the pot.
4. Cook the eggs: Fry the eggs in a soup pot or pan until the surface turns golden brown and they are cooked through.
5. Stir-fry the vegetables: Add an appropriate amount of oil to the pot, add the shredded carrots and shredded cucumbers according to your preference, and stir-fry a few times until the carrots are cooked.
6. Add the eggs: Take out the eggs from the pot and pour them into the pot.
7. Season: Pour the stir-fried eggs into the pot and stir-fry them with the vegetables a few times.
Note: To make the vegetables taste richer, you can add an appropriate amount of salt and coriander to enhance the flavor.
📄 License
This project is licensed under the Apache-2.0 license.