Baichuan-7B
Baichuan-7B is an open-source large-scale pretrained model developed by Baichuan Intelligent Technology. It is based on the Transformer architecture and trained on approximately 1.2 trillion tokens. The model supports both Chinese and English, has a context window of 4096 tokens, and achieves the best performance among models of its size on authoritative Chinese and English benchmarks.
Quick Start
If you wish to use Baichuan-7B (for inference, finetuning, etc.), we recommend using the accompanying code in the official Baichuan-7B repository.
Features
- Among models of the same size, Baichuan-7B achieves the current state-of-the-art (SOTA) level, as evidenced by the MMLU results below.
- Baichuan-7B is trained on proprietary bilingual Chinese-English corpora, is optimized for Chinese, and achieves SOTA performance on C-Eval.
- Unlike LLaMA, whose license completely prohibits commercial use, Baichuan-7B employs a more permissive open-source license that allows commercial use.
Usage Examples
Basic Usage
The following performs 1-shot inference with Baichuan-7B: given the title of a work, the model must produce its author. The correct output is "夜雨寄北->李商隐" (the poem "夜雨寄北" was written by Li Shangyin).
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)

# One exemplar ("登鹳雀楼" -> 王之涣) demonstrates the work->author mapping;
# the model is then asked to complete the author of "夜雨寄北".
inputs = tokenizer('登鹳雀楼->王之涣\n夜雨寄北->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```
Advanced Usage
The following performs the same 1-shot inference task in English, with the correct output being "One Hundred Years of Solitude->Gabriel Garcia Marquez".
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", device_map="auto", trust_remote_code=True)

# One exemplar (Hamlet -> Shakespeare) demonstrates the work->author mapping;
# the model is then asked to complete the author of "One Hundred Years of Solitude".
inputs = tokenizer('Hamlet->Shakespeare\nOne Hundred Years of Solitude->', return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
```
Documentation
Model Description
- Developed by: Baichuan Intelligent Technology
- Email: opensource@baichuan-inc.com
- Language(s) (NLP): Chinese/English
- License: Baichuan-7B License
Model Architecture
The overall model is based on the standard Transformer structure, and we have adopted the same model design as LLaMA:
- Position Embedding: We use rotary position embedding (RoPE), the position-encoding scheme adopted by most current models, which has excellent extrapolation capabilities.
- Feedforward Layer: We use SwiGLU. The feed-forward hidden size is 8/3 times the model dimension (8/3 × 4096 ≈ 10923), rounded up to a multiple of 256, giving 11008.
- Layer Normalization: Pre-normalization based on RMSNorm; a minimal sketch of this and the SwiGLU block follows below.
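The following is a minimal PyTorch sketch of the RMSNorm and SwiGLU feed-forward blocks described above, using the dimensions quoted in this section. The class and variable names are ours, for illustration only; the authoritative implementation ships with the model code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm: no mean-centering and no bias term."""
    def __init__(self, dim: int = 4096, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Divide by the RMS of the activations, then apply a learned scale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """LLaMA-style gated feed-forward: down(silu(gate(x)) * up(x))."""
    def __init__(self, d_model: int = 4096, d_ff: int = 11008):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```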
The specific parameters are as follows:
| Hyperparameter | Value |
|---|---|
| n_parameters | 7000559616 |
| n_layers | 32 |
| n_heads | 32 |
| d_model | 4096 |
| vocab size | 64000 |
| sequence length | 4096 |
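These shapes are consistent with the stated parameter count. Below is a quick back-of-the-envelope check, assuming LLaMA-style bias-free linear layers and untied input/output embedding matrices (our assumption, not stated in the table):

```python
d_model, n_layers, vocab, d_ff = 4096, 32, 64000, 11008

embed      = vocab * d_model        # input token embeddings
attn       = 4 * d_model * d_model  # Q, K, V, O projections (per layer)
ffn        = 3 * d_model * d_ff     # gate, up, down projections (per layer)
norms      = 2 * d_model            # two RMSNorm weight vectors (per layer)
lm_head    = vocab * d_model        # output projection (untied)
final_norm = d_model                # RMSNorm before the output head

total = embed + n_layers * (attn + ffn + norms) + final_norm + lm_head
print(total)  # 7000559616, exactly the n_parameters above
```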
Uses
Downstream Use
We have also open-sourced the training code that accompanies this model, allowing for efficient finetuning for downstream tasks. For more details, please refer to Baichuan-7B.
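For orientation only, a generic causal-LM finetuning loop with the Hugging Face Trainer might look like the sketch below. This is not the official training code (which lives in the Baichuan-7B repository); the dataset file, sequence length, and hyperparameters are placeholders.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("baichuan-inc/Baichuan-7B", trust_remote_code=True)

# The collator pads batches, so make sure a pad token exists.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Placeholder corpus; replace "train.txt" with your downstream-task data.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=2e-5, bf16=True),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```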
Out-of-Scope Use
Production use without adequate assessment of risks and mitigation; any use cases which may be considered irresponsible or harmful.
Bias, Risks, and Limitations
Baichuan-7B can produce factually incorrect output, and should not be relied on to produce factually accurate information. Baichuan-7B was trained on various public datasets. While great efforts have been taken to clean the pretraining data, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Training Details
For specific training settings, please refer to Baichuan-7B.
Evaluation
Chinese Evaluation
C-Eval
The C-Eval dataset is a comprehensive Chinese evaluation suite for foundation models, covering 52 disciplines and four difficulty levels. We used its dev split as the source of few-shot exemplars and ran a 5-shot test on the test split.
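For concreteness, the sketch below shows one common way to run this kind of few-shot multiple-choice evaluation: build a prompt from dev-set exemplars and score each answer letter by its next-token logit. The prompt template and the field names `question`, `choices`, and `answer` are hypothetical; the exact template and scoring behind the numbers below are determined by the evaluation code.

```python
import torch

def build_prompt(shots, question, choices):
    """Assemble a 5-shot multiple-choice prompt from dev-set exemplars.
    `shots` is a list of dicts with hypothetical 'question', 'choices'
    (letter -> text), and 'answer' fields."""
    blocks = []
    for ex in shots[:5] + [{"question": question, "choices": choices, "answer": ""}]:
        lines = [ex["question"]]
        lines += [f"{k}. {v}" for k, v in ex["choices"].items()]
        lines.append(f"Answer: {ex['answer']}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks).rstrip()

@torch.no_grad()
def pick_answer(model, tokenizer, prompt, letters=("A", "B", "C", "D")):
    """Choose the letter whose token has the highest next-token logit."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_logits = model(**inputs).logits[0, -1]
    scores = {l: next_logits[tokenizer(l, add_special_tokens=False).input_ids[-1]].item()
              for l in letters}
    return max(scores, key=scores.get)
```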
| Model (5-shot) | Average | Avg(Hard) | STEM | Social Sciences | Humanities | Others |
|---|---|---|---|---|---|---|
| GPT-4 | 68.7 | 54.9 | 67.1 | 77.6 | 64.5 | 67.8 |
| ChatGPT | 54.4 | 41.4 | 52.9 | 61.8 | 50.9 | 53.6 |
| Claude-v1.3 | 54.2 | 39.0 | 51.9 | 61.7 | 52.1 | 53.7 |
| Claude-instant-v1.0 | 45.9 | 35.5 | 43.1 | 53.8 | 44.2 | 45.4 |
| moss-moon-003-base (16B) | 27.4 | 24.5 | 27.0 | 29.1 | 27.2 | 26.9 |
| Ziya-LLaMA-13B-pretrain | 30.2 | 22.7 | 27.7 | 34.4 | 32.0 | 28.9 |
| LLaMA-7B-hf | 27.1 | 25.9 | 27.1 | 26.8 | 27.9 | 26.3 |
| ChatGLM-6B | 34.5 | 23.1 | 30.4 | 39.6 | 37.4 | 34.5 |
| Falcon-7B | 25.8 | 24.3 | 25.8 | 26.0 | 25.8 | 25.6 |
| Open-LLaMA-v2-pretrain (7B) | 24.0 | 22.5 | 23.1 | 25.3 | 25.2 | 23.2 |
| TigerBot-7B-base | 25.7 | 27.0 | 27.3 | 24.7 | 23.4 | 26.1 |
| Aquila-7B* | 25.5 | 25.2 | 25.6 | 24.6 | 25.2 | 26.6 |
| BLOOM-7B | 22.8 | 20.2 | 21.8 | 23.3 | 23.9 | 23.3 |
| BLOOMZ-7B | 35.7 | 25.8 | 31.3 | 43.5 | 36.6 | 35.6 |
| Baichuan-7B | 42.8 | 31.5 | 38.2 | 52.0 | 46.2 | 39.3 |
Gaokao
Gaokao is a dataset built from Chinese college entrance examination questions, designed to assess the language understanding and logical reasoning abilities of large language models. We retained only the single-choice questions and ran a unified 5-shot test on all models.
The following are the test results:
| Model | Average |
|---|---|
| Open-LLaMA-v2-pretrain | 21.41 |
| Ziya-LLaMA-13B-pretrain | 23.17 |
| Falcon-7B | 23.98 |
| TigerBot-7B-base | 25.94 |
| LLaMA-7B | 27.81 |
| ChatGLM-6B | 21.41 |
| BLOOM-7B | 26.96 |
| BLOOMZ-7B | 28.72 |
| Aquila-7B* | 24.39 |
| Baichuan-7B | 36.24 |
AGIEval
AGIEval aims to evaluate a model's general abilities on cognition- and problem-solving-related tasks. We retained only the four-option single-choice questions, split them randomly, and ran a unified 5-shot test on all models.
| Model | Average |
|---|---|
| Open-LLaMA-v2-pretrain | 23.49 |
| Ziya-LLaMA-13B-pretrain | 27.64 |
| Falcon-7B | 27.18 |
| TigerBot-7B-base | 25.19 |
| LLaMA-7B | 28.17 |
| ChatGLM-6B | 23.49 |
| BLOOM-7B | 26.55 |
| BLOOMZ-7B | 30.27 |
| Aquila-7B* | 25.58 |
| Baichuan-7B | 34.44 |
* The Aquila results are taken from the official ZhiYuan (BAAI) website and are for reference only.
English Leaderboard
In addition to Chinese, we also tested the model's performance in English.
MMLU
MMLU is an English evaluation dataset that includes 57 multiple-choice tasks, covering elementary mathematics, American history, computer science, law, etc. The difficulty ranges from high school level to expert level, making it a mainstream LLM evaluation dataset.
We adopted the open-source evaluation scheme, and the final 5-shot results are as follows:
| Model | Humanities | Social Sciences | STEM | Other | Average |
|---|---|---|---|---|---|
| LLaMA-7B<sup>2</sup> | 34.0 | 38.3 | 30.5 | 38.1 | 35.1 |
| Falcon-7B<sup>1</sup> | - | - | - | - | 35.0 |
| mpt-7B<sup>1</sup> | - | - | - | - | 35.6 |
| ChatGLM-6B<sup>0</sup> | 35.4 | 41.0 | 31.3 | 40.5 | 36.9 |
| BLOOM 7B<sup>0</sup> | 25.0 | 24.4 | 26.5 | 26.4 | 25.5 |
| BLOOMZ 7B<sup>0</sup> | 31.3 | 42.1 | 34.4 | 39.0 | 36.1 |
| moss-moon-003-base (16B)<sup>0</sup> | 24.2 | 22.8 | 22.4 | 24.4 | 23.6 |
| moss-moon-003-sft (16B)<sup>0</sup> | 30.5 | 33.8 | 29.3 | 34.4 | 31.9 |
| Baichuan-7B<sup>0</sup> | 38.4 | 48.9 | 35.6 | 48.1 | 42.3 |
The superscript in the Model column indicates the source of the results:
- 0: reimplemented
- 1: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- 2: https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu