MiniLLM 0.2B Base
MiniLLM is a lightweight language model project that implements the full pipeline from pre-training → instruction fine-tuning → reward modeling → reinforcement learning, building a chat model with basic conversational capabilities economically and efficiently.
Downloads: 41
Release Time: 3/16/2024
Model Overview
This project is dedicated to creating a lightweight language model using the bert4torch training framework, with concise and efficient code. The trained model can directly integrate with the transformers inference ecosystem. The current experimental model only has basic conversational functionality.
Model Features
Lightweight and efficient
Uses the bert4torch training framework with concise and efficient code, optimizing GPU memory usage during training
Strong compatibility
The trained model can directly integrate with the transformers inference ecosystem
Full-process implementation
Complete implementation of the entire process from pre-training → instruction fine-tuning → reward modeling → reinforcement learning
Model Capabilities
Chinese text generation
Basic conversation
Text continuation
Use Cases
Education
Learning assistant
Helps students answer basic learning questions
Can generate explanations and examples of basic learning content
Entertainment
Simple chat
Engages in daily conversation
Capable of basic greetings and simple topic discussions
🚀 MiniLLM
This project aims to build a small-parameter LLM at a controllable cost, going through the four stages of pretraining → instruction fine-tuning → reward modeling → reinforcement learning to arrive at a chat model capable of simple conversation.
🚀 Quick Start
Environment Installation
pip install bert4torch==0.4.9.post2 # If not found, specify -i https://pypi.org/simple
Script Explanation
# To prevent the terminal from closing, you can use nohup, tmux, or screen to start the process
# eg. nohup torchrun --standalone --nproc_per_node=4 pretrain.py --name baby > nohup.log&
# Pretraining
cd pretrain
torchrun --standalone --nproc_per_node=4 pretrain.py # If DDP training crashes, set `export NCCL_IB_DISABLE=1`
# Pretraining inference (command-line chat)
cd pretrain
python infer.py # or: python infer_transformers.py
# Instruction fine-tuning training
cd sft
python sft.py
# Instruction fine-tuning inference (command-line chat)
cd sft
python infer.py # or: python infer_transformers.py
# Convert ckpt to a format that can be run by transformers
cd docs
python convert.py
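As a quick sanity check after running convert.py, the converted checkpoint can be loaded directly with transformers. The sketch below is hedged: the local directory `docs/converted` is an assumed output path for illustration, not one documented by the project.

```python
# Hedged sanity check: load a locally converted checkpoint with transformers.
# 'docs/converted' is an assumed output directory; use whatever path convert.py writes to.
from transformers import AutoTokenizer, LlamaForCausalLM

ckpt_dir = 'docs/converted'
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(ckpt_dir)
print(model.config)  # confirm a LLaMA-style config was produced by the conversion
```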
✨ Features
- Training Framework: Uses the bert4torch training framework, with concise and efficient code.
- Inference Compatibility: The trained checkpoints can be used directly for inference with the transformers package.
- Memory Optimization: Optimizes memory usage during training.
- Reproducibility: Provides complete training logs for reproduction and comparison.
Disclaimer: The model trained in this experiment currently only has simple chat functions (limited by the size of the corpus, model scale, and the size and quality of the SFT corpus) and does not have the ability to answer complex questions.
📚 Documentation
Update History
- 20240316: Initial submission, including the pretrained models MiniLLM-L12_H1024_A8-NoWudao and MiniLLM-L12_H1024_A8-WithWudao, and the SFT model MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca.
Pretraining
Pretraining Corpus (from baby-llama2-chinese)
| Chinese Pretraining Corpus | Description |
|---|---|
| Wiki Chinese Encyclopedia | Data from Chinese Wikipedia |
| BaiduBaiKe (extraction code: bwvb) | Data from Chinese Baidu Baike |
| C4_zh: part1 (extraction code: zv4r); C4_zh: part2 (extraction code: sb83); C4_zh: part3 (extraction code: l89d) | C4 is one of the largest available language datasets, collecting over 156 billion tokens from over 365 million domains on the Internet; C4_zh is a part of it. |
| WuDaoCorpora | 200 GB of open-source Chinese data from WuDao |
| shibing624/medical | Part of the medical-domain pretraining data from shibing624 |
The project has open-sourced the pretraining corpus processed with the ChatGLM2-6B tokenizer, totaling 63.4 billion tokens. Download link: Corpus (extraction code: 6unr).
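For illustration, here is a minimal, hedged sketch of how raw text could be tokenized with the ChatGLM2-6B tokenizer and packed into a flat binary file of token IDs, a common pretraining data layout. The output path, uint16 dtype, and EOS separator are assumptions, not necessarily what the project's preprocessing scripts do.

```python
# Hedged sketch: tokenize raw Chinese text with the ChatGLM2-6B tokenizer and pack
# the token IDs into one flat binary file. Output path, uint16 dtype, and the EOS
# separator are illustrative assumptions, not the project's documented pipeline.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('THUDM/chatglm2-6b', trust_remote_code=True)

def tokenize_corpus(lines, out_path='corpus.bin'):
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ids.extend(tokenizer.encode(line, add_special_tokens=False))
        if tokenizer.eos_token_id is not None:   # separate documents with EOS if available
            ids.append(tokenizer.eos_token_id)
    np.array(ids, dtype=np.uint16).tofile(out_path)  # ChatGLM2's ~65k vocab fits in uint16
    return len(ids)

# Example: n_tokens = tokenize_corpus(open('wiki_zh.txt', encoding='utf-8'))
```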
Pretraining Weights
| Pretraining Weights | Pretraining Corpus | Download Address |
|---|---|---|
| MiniLLM-L12_H1024_A8-NoWudao | (14 billion tokens) Wiki Chinese Encyclopedia, BaiduBaiKe, shibing624/medical, C4_zh | Baidu Netdisk, HuggingFace |
| MiniLLM-L12_H1024_A8-WithWudao | (64 billion tokens) Wiki Chinese Encyclopedia, BaiduBaiKe, shibing624/medical, C4_zh, WuDaoCorpora | Baidu Netdisk, HuggingFace |
Pretraining Process
- Training Parameter Configuration and Training Duration
| Weights | Pretraining Settings | Hardware Usage and Training Duration |
|---|---|---|
| MiniLLM-L12_H1024_A8-NoWudao | 14 billion tokens; btz = 32 * 4 GPUs; lr = 3e-4; warmup_steps = 5000 | 4 × A800 (80 GB), about 60 GB per GPU, 20 hours |
| MiniLLM-L12_H1024_A8-WithWudao | 64 billion tokens; btz = 32 * 4 GPUs; lr = 1.5e-4; warmup_steps = 5000 | ✅ 4 × A800 (80 GB), about 60 GB per GPU, 3.79 days; ✅ 2 × 4090 in the baby-llama2 project, 26 days; ✅ in a personal test with a single GPU and btz = 8, GPU usage is about 17 GB and the duration is unknown (gradient accumulation can further reduce usage) |
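To make the settings above concrete, here is a small helper for the effective batch size and the implied linear warmup schedule. The sequence length of 1024 and the constant-after-warmup behavior are assumptions used only for illustration, since neither is stated in the table.

```python
# Hedged helper for the table above: effective batch size and a linear-warmup
# learning-rate schedule. seq_len=1024 and constant LR after warmup are assumptions.
def effective_batch(per_gpu_btz=32, n_gpus=4, grad_accum=1, seq_len=1024):
    samples = per_gpu_btz * n_gpus * grad_accum
    return samples, samples * seq_len            # samples/step, tokens/step

def warmup_lr(step, base_lr=3e-4, warmup_steps=5000):
    return base_lr * min(1.0, step / warmup_steps)

samples, tokens = effective_batch()
print(samples, tokens)    # 128 samples/step, 131072 tokens/step
print(warmup_lr(2500))    # 0.00015, i.e. halfway through warmup
```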
Pretraining Model Call
# Optional: set an HF mirror via the following two lines, depending on your network situation
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from transformers import AutoTokenizer, LlamaForCausalLM
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-L12_H1024_A8-WithWudao' # 'Tongjilibo/MiniLLM-L12_H1024_A8-NoWudao'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)
query = 'Wang Peng is a'
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
print(response)
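Called with no extra arguments, `generate` uses greedy decoding and a short default length budget. Below is a hedged variant with explicit generation settings; the specific values are illustrative rather than recommendations from the project.

```python
# Hedged variant of the generate call above with explicit settings.
# The parameter values are illustrative only.
output_ids = model.generate(
    inputs,
    max_new_tokens=128,      # length budget for the continuation
    do_sample=True,          # sample instead of greedy decoding
    top_p=0.9,
    temperature=0.8,
    repetition_penalty=1.1,
)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
print(response)
```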
Pretraining Continuation Effect
- MiniLLM-L12_H1024_A8-NoWudao
User: Xiaoming has excellent academic performance and good health. He is a
Continuation: Xiaoming has excellent academic performance and good health. He is an outstanding student at Dalian University of Technology. Since his freshman year in college, he has been very obsessed with sports, with top-notch grades. He has also participated in many competitions and ranked among the top in the school's sports activities. He was awarded the title of "Outstanding Athlete" by Dalian University of Technology.
In 2006, in his sophomore year of high school, Xiaoming's classmate Yang Fan came to Dalian University of Technology. He was determined to change his fate. Every day after school, he would run to the playground to practice basketball. By around 3 pm, Xiaoming's basketball skills began to show. His basketball movements were agile and powerful. Moreover, he often used his free time to practice jump shots.
User: Before my bed a pool of light, I wonder if it's frost on the ground. Looking up, I find the moon bright.
Continuation: Before my bed a pool of light, I wonder if it's frost on the ground. Looking up, I find the moon bright. Bowing my head, I'm lost in homesickness..."
When I read these verses, I also feel very kind and familiar!
- MiniLLM-L12_H1024_A8-WithWudao
User: Xiaoming has excellent academic performance and good health. He is a
Continuation: Xiaoming has excellent academic performance and good health. He is an excellent teacher. He is honest and responsible. In class, he can listen carefully, think actively, and raise his hand to answer questions in every class. After class, he actively offers to help the teacher, does his homework carefully, and completes the homework assigned by the teacher on time, which is well-received by his classmates.
User: Before my bed a pool of light, I wonder if it's frost on the ground. Looking up, I find the moon bright.
Continuation: "Before my bed a pool of light, I wonder if it's frost on the ground. Looking up, I find the moon bright. Bowing my head, I'm lost in homesickness." This poem not only describes the warmth and tranquility of the Mid-Autumn night but also expresses the author's deep attachment to his hometown and homesickness. The phrase "The moon rises above the willow branches" describes what is seen under the moon. "I want to pour my heart out to the zither" refers to the desire to express farewell feelings; "Looking up at the bright moon" describes the scene of the moon hanging high in the sky, the bright moon shining, and the stars twinkling on the Mid-Autumn night; "Bowing my head, I'm lost in homesickness" expresses the deep affection for the hometown.
In terms of writing techniques, this poem mainly uses symbolism.
Instruction Fine-Tuning
Instruction Fine-Tuning Corpus (Selected Available Datasets)
| Dataset Name | Introduction |
|---|---|
| shibing624/alpaca-zh | Self-instruct data generated with GPT-4 following the Alpaca method, about 50,000 entries |
| BelleGroup/Belle-0.5M-cn | About 500,000 Chinese instruction entries generated by the BELLE project |
| BelleGroup/Belle-1M-cn | About 1 million Chinese instruction entries generated by the BELLE project |
| BelleGroup/Belle-school_math_0.25M | Belle's open 0.25M math instruction dataset |
| BelleGroup/Belle-multiturn_chat_0.8M | Belle's open 0.8M multi-turn task dialogue dataset |
| YeungNLP/firefly-train-1.1M | Data for 23 common Chinese NLP tasks in Firefly, plus many manually constructed samples related to Chinese culture, such as couplets, poem writing, classical Chinese translation, prose, and Jin Yong's novels; several instruction templates were written manually for each task to ensure data quality and richness, with a data volume of 1.15 million |
| fnlp/moss-002-sft-data | Multi-turn dialogue data used by MOSS-002, covering helpfulness, faithfulness, and harmlessness, including about 570,000 English and 590,000 Chinese dialogues generated by text-davinci-003 |
| fnlp/moss-003-sft-data | Multi-turn dialogue data used by moss-moon-003-sft, constructed from about 100,000 user inputs collected during the MOSS-002 internal testing phase and gpt-3.5-turbo; compared with moss-002-sft-data, it better matches the real user intention distribution, with more fine-grained helpfulness labels, broader harmlessness coverage, and longer dialogues, about 1.1 million dialogues |
| shareAI/CodeChat | Mainly contains samples related to logical reasoning, code Q&A, and code generation |
| shareAI/ShareGPT-Chinese-English-90k | High-quality Chinese-English parallel bilingual human-machine Q&A dataset covering user questions in real and complex scenarios |
| deepctrl/deepctrl-sft-data | The SFT dataset of the Jiangshu large model, carefully collected and organized by Jiangshu Technology, including a Chinese dataset with 10 million entries and an English dataset with 2 million entries |
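The datasets above are instruction/response pairs. As a hedged sketch, one shibing624/alpaca-zh record might be flattened into the `<human>...<robot>...` layout used by the inference code later in this document; the field names follow the Alpaca convention, and the exact template and concatenation used by sft.py may differ.

```python
# Hedged sketch: turn an Alpaca-style record into a single training string using the
# <human>/<robot> markers from the inference template shown later in this document.
# The exact template used by sft.py may differ.
def build_sample(record):
    instruction = record['instruction']
    if record.get('input'):                       # optional extra context field
        instruction = f"{instruction}\n{record['input']}"
    prompt = f"<human>{instruction}<robot>"
    return prompt, prompt + record['output']

prompt, full_text = build_sample({
    'instruction': 'Translate the following sentence into English.',
    'input': '今天天气很好。',
    'output': 'The weather is nice today.',
})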
Instruction Fine-Tuning Weights
| Instruction Fine-Tuning Weights | Corpus | Download Address |
|---|---|---|
| MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca | shibing624/alpaca-zh | Baidu Netdisk, HuggingFace |
Instruction Fine-Tuning Training Process
- Training Parameter Configuration and Training Duration
| Weights | Fine-Tuning Settings | Hardware Usage and Training Duration |
|---|---|---|
| MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca | shibing624/alpaca-zh dataset; btz = 8; lr = 2e-5; 5 epochs | Single 4090 GPU, 17 GB of GPU memory, 45 minutes |
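For context on what the fine-tuning step typically optimizes, here is a hedged sketch of response-only loss masking, where prompt tokens are set to -100 so the cross-entropy loss covers only the assistant's reply. This is a common SFT recipe and is not guaranteed to match what sft.py actually does.

```python
# Hedged sketch of a common SFT recipe: mask prompt tokens with -100 so that only
# the response tokens contribute to the cross-entropy loss. May differ from sft.py.
def encode_sft_example(tokenizer, prompt, response, max_len=512):
    prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
    response_ids = tokenizer.encode(response, add_special_tokens=False)
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]
    return {'input_ids': input_ids, 'labels': labels}
```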
Instruction Fine-Tuning Model Call
# Optional: set an HF mirror via the following two lines, depending on your network situation
import os
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
from transformers import AutoTokenizer, LlamaForCausalLM
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = 'Tongjilibo/MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = LlamaForCausalLM.from_pretrained(model_name).to(device)
query = 'Hello'
query = f'<human>{query}<robot>'  # wrap the query in the project's <human>/<robot> chat template
inputs = tokenizer.encode(query, return_tensors='pt', add_special_tokens=False).to(device)
output_ids = model.generate(inputs)
response = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)[len(query):]
print(response)
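Building on the single-turn call above, here is a hedged sketch of a small command-line chat loop. Concatenating previous turns into the prompt this way is an assumption about the training format, so multi-turn quality may degrade.

```python
# Hedged sketch: minimal command-line chat loop reusing the <human>/<robot> template.
# Carrying history by simple concatenation is an assumption, not a documented feature.
history = ''
while True:
    user = input('User: ').strip()
    if user in ('exit', 'quit'):
        break
    prompt = f'{history}<human>{user}<robot>'
    inputs = tokenizer.encode(prompt, return_tensors='pt', add_special_tokens=False).to(device)
    output_ids = model.generate(inputs, max_new_tokens=128)
    text = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
    reply = text[len(prompt):]
    history = prompt + reply
    print('Assistant:', reply)
```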
Instruction Fine-Tuning Chat Effect
- MiniLLM-L12_H1024_A8-WithWudao-SFT_Alpaca
User: Hello
Assistant: Hello! Nice to meet you.
User: What can you do?
Assistant: Hello! I'm an AI assistant. Is there anything I can do for you?
User: How to make scrambled eggs with tomatoes?
Assistant: If you want to make scrambled eggs with tomatoes, the following steps may help you:
1. Prepare ingredients: Prepare eggs, salt, coriander, shredded carrots, shredded cucumbers, etc.
2. Wash the eggs: Break the eggs into a mixing bowl to clean their surfaces.
3. Prepare materials: Prepare a pot, pour an appropriate amount of water, and pour the egg liquid into the pot.
4. Cook the eggs: Fry the eggs in a soup pot or pan until the surface turns golden brown and they are cooked through.
5. Stir-fry the vegetables: Add an appropriate amount of oil to the pot, add the shredded carrots and shredded cucumbers according to your preference, and stir-fry a few times until the carrots are cooked.
6. Add the eggs: Take out the eggs from the pot and pour them into the pot.
7. Season: Pour the stir-fried eggs into the pot and stir-fry them with the vegetables a few times.
Note: To make the vegetables taste richer, you can add an appropriate amount of salt and coriander to enhance the flavor.
📄 License
This project is licensed under the Apache-2.0 license.