
Colossal-LLaMA-2-7B
The Colossal-AI team has released the open-source model Colossal-LLaMA-2-7B-base, which is based on LLaMA-2. The model handles both Chinese and English and has shown strong performance on the relevant evaluations.
🚀 Quick Start
To load the Colossal-LLaMA-2-7B-base model using Transformers, you can use the following code:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base", device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base", trust_remote_code=True)

# Example prompt: the first half of a classic Chinese verse; the model is asked to continue it.
input = "明月松间照，\n\n->\n\n"
inputs = tokenizer(input, return_tensors='pt')
inputs = inputs.to('cuda:0')
pred = model.generate(**inputs,
                      max_new_tokens=512,
                      do_sample=True,
                      temperature=0.3,
                      top_k=50,
                      top_p=0.95,
                      num_return_sequences=1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(input):])
```
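The snippet above samples from the model. If you prefer deterministic output (greedy search, which is also the generation config used in the evaluation below), you can reuse the loaded model and tokenizer with a minimal variation:

```python
# Greedy decoding: disable sampling so the most likely token is chosen at every step.
pred = model.generate(**inputs,
                      max_new_tokens=512,
                      do_sample=False,
                      num_return_sequences=1)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)[len(input):])
```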
⨠Features
- Open-source: The Colossal-LLaMA-2-7B-base model is fully open-source, benefiting community research and development.
- Multilingual support: It supports both Chinese and English, with a context window of 4,096 tokens.
- Cost-effective: With a pre-training cost of less than $1,000, it achieves results comparable to models that cost millions of dollars to pre-train from scratch.
- Strong performance: It performs well on standard Chinese and English benchmarks such as C-Eval and MMLU.
💻 Usage Examples
Basic Usage
The basic inference example is the same as the snippet shown in the Quick Start section above.
📚 Documentation
Performance Evaluation
We conducted comprehensive evaluations on 4 datasets and compared our Colossal-Llama-2-7b-base model with various models.
- We use 5-shot for MMLU and calculate scores based on the logits of the first predicted token.
- We use 5-shot for CMMLU and calculate scores based on the logits of the first predicted token.
- We use 5-shot for AGIEval and only calculate scores for 4-choice questions, using a combined metric of exact match and the logits of the first predicted token: if either the exact match or the first-token logits is correct, the model receives the score.
- We use 0-shot for GAOKAO-Bench and only calculate scores for 4-choice questions based on the logits of the first predicted token.
- The generation config for all datasets is greedy search.
- We also provide C-Eval scores taken from its latest leaderboard or from the model's official repository.
| Property | Details |
|---|---|
| Model Type | Colossal-LLaMA-2-7B-base |
| Training Data | Approximately 8.5 billion tokens |
| Evaluation Datasets | MMLU, CMMLU, AGIEval, GAOKAO-Bench, C-Eval |
| Model | Backbone | Tokens Consumed | MMLU (5-shot) | CMMLU (5-shot) | AGIEval (5-shot) | GAOKAO (0-shot) | C-Eval (5-shot) |
|---|---|---|---|---|---|---|---|
| Baichuan-7B | - | 1.2T | 42.32 (42.30) | 44.53 (44.02) | 38.72 | 36.74 | 42.80 |
| Baichuan2-7B-Base | - | 2.6T | 46.97 (54.16) | 57.67 (57.07) | 45.76 | 52.60 | 54.00 |
| ChatGLM-6B | - | 1.0T | 39.67 (40.63) | 41.17 (-) | 40.10 | 36.53 | 38.90 |
| ChatGLM2-6B | - | 1.4T | 44.74 (45.46) | 49.40 (-) | 46.36 | 45.49 | 51.70 |
| InternLM-7B | - | - | 46.70 (51.00) | 52.00 (-) | 44.77 | 61.64 | 52.80 |
| Qwen-7B (original) | - | 2.2T | 54.29 (56.70) | 56.03 (58.80) | 52.47 | 56.42 | 59.60 |
| Qwen-7B | - | 2.4T | 58.33 (58.20) | 62.54 (62.20) | 64.34 | 74.05 | 63.50 |
| Llama-2-7B | - | 2.0T | 44.47 (45.30) | 32.97 (-) | 32.60 | 25.46 | - |
| Linly-AI/Chinese-LLaMA-2-7B-hf | Llama-2-7B | 1.0T | 37.43 | 29.92 | 32.00 | 27.57 | - |
| wenge-research/yayi-7b-llama2 | Llama-2-7B | - | 38.56 | 31.52 | 30.99 | 25.95 | - |
| ziqingyang/chinese-llama-2-7b | Llama-2-7B | - | 33.86 | 34.69 | 34.52 | 25.18 | 34.2 |
| TigerResearch/tigerbot-7b-base | Llama-2-7B | 0.3T | 43.73 | 42.04 | 37.64 | 30.61 | - |
| LinkSoul/Chinese-Llama-2-7b | Llama-2-7B | - | 48.41 | 38.31 | 38.45 | 27.72 | - |
| FlagAlpha/Atom-7B | Llama-2-7B | 0.1T | 49.96 | 41.10 | 39.83 | 33.00 | - |
| Colossal-LLaMA-2-7b-base | Llama-2-7B | 0.0085T | 53.06 | 49.89 | 51.48 | 58.82 | 50.20 |
The scores in parentheses correspond to the scores reported in the model's official repository.
We use zero-shot for the ChatGLM models.
To evaluate Qwen-7B on MMLU, the prompt ends with "xxx Answer:" (no space after the colon), and we calculate the logits over " A", " B", " C" and " D" (each with a leading space). Both the original and updated versions of Qwen-7B tend to be much more deterministic than other models; for example, the logit for " A" can be -inf, in which case its softmax probability is exactly 0. For all other models and datasets, we calculate the logits over "A", "B", "C" and "D".
For more details on the evaluation methods and on reproducing the results, please refer to ColossalEval.
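As a rough illustration of the first-token-logit scoring described above, here is a simplified sketch (not the ColossalEval implementation; the model name and the prompt are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model name and prompt; any causal LM evaluated this way would work the same.
name = "hpcai-tech/Colossal-LLaMA-2-7b-base"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", trust_remote_code=True)

prompt = "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    first_token_logits = model(**inputs).logits[0, -1]  # logits of the first predicted token

# Score each option by the logit of its letter token; for Qwen-7B the leading-space
# variants " A" .. " D" would be used instead, as noted above.
options = ["A", "B", "C", "D"]
option_ids = [tokenizer(o, add_special_tokens=False).input_ids[-1] for o in options]
scores = {o: first_token_logits[i].item() for o, i in zip(options, option_ids)}
print(scores, "->", max(scores, key=scores.get))
```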
🔧 Technical Details
Data
Large language models such as LLaMA-2 have been trained on a heterogeneous blend of high-quality datasets, yielding promising outcomes. Enhancing LLaMA-2's performance on Chinese corpora while preserving its proficiency in English hinges on two pivotal factors: the composition of the dataset, which encompasses both English and Chinese content, and the quality of each constituent dataset.
A figure in the original model card illustrates the data processing pipeline used for Colossal-LLaMA-2.
Important: We will open-source our data-processing toolkit soon, stay tuned!
Tokenizer
The original LLaMA-2 vocabulary comprises fewer than a thousand Chinese characters and is therefore inadequate for encoding Chinese text effectively. In addition, the resulting fallback to byte tokens makes it harder for the transformer to capture the semantic nuances of Chinese characters.
To address these issues, we extend the LLaMA-2 vocabulary from 32,000 to 69,104 tokens. To adapt the LLaMA-2 model to the Colossal-LLaMA-2 tokenizer, we initialize each new word embedding with the mean of the original LLaMA-2 embeddings and then append these new rows to the end of the original embedding matrices.
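A minimal sketch of this mean-initialization step, assuming a Hugging Face LLaMA-2 checkpoint and a hypothetical path to the extended tokenizer (an illustration, not the Colossal-AI training code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Base LLaMA-2 checkpoint and a hypothetical path to the extended (69,104-entry) tokenizer.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
extended_tokenizer = AutoTokenizer.from_pretrained("path/to/extended-tokenizer")

old_vocab_size = model.get_input_embeddings().weight.shape[0]   # 32,000 for LLaMA-2
model.resize_token_embeddings(len(extended_tokenizer))          # grows both embedding matrices

with torch.no_grad():
    in_emb = model.get_input_embeddings().weight
    out_emb = model.get_output_embeddings().weight
    # Initialize every appended row with the mean of the original LLaMA-2 embeddings.
    in_emb[old_vocab_size:] = in_emb[:old_vocab_size].mean(dim=0, keepdim=True)
    out_emb[old_vocab_size:] = out_emb[:old_vocab_size].mean(dim=0, keepdim=True)
```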
Advantages of extending vocabulary size:
- Improve the compression rate of string sequence encoding.
- Enhance the integrity of information.
- Enable encoded sequences to contain more valuable information, thereby theoretically enhancing the ability for chapter-level encoding.
Disadvantages of a large vocabulary size under low-resource settings:
- Many tokens may remain unused or poorly learned, because the limited training dataset cannot effectively cover an excessive number of tokens.
- Excessive vocabulary expansion increases the number of embedding-related parameters, resulting in higher memory usage, which in turn reduces training efficiency (see the rough calculation after this list).
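To make the memory point concrete, here is a rough back-of-the-envelope calculation; it assumes the 4,096-dimensional hidden size of a 7B LLaMA-2 model and counts both the input embedding and the output (LM-head) matrix:

```python
hidden_size = 4096                     # hidden size of a 7B LLaMA-2 model (assumption)
for vocab_size in (32_000, 69_104, 151_643):
    params = 2 * vocab_size * hidden_size   # input embedding matrix + output (LM-head) matrix
    print(f"vocab {vocab_size:>7,}: ~{params / 1e6:.0f}M embedding-related parameters")
# -> roughly 262M, 566M and 1,242M parameters respectively
```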
To balance both considerations, we finally set our vocabulary size to 69,104. The following table presents a comparison of various models at the 7B level.
| Model | Vocabulary Size | Compression Rate | Average Length of Samples (token-level) |
|---|---|---|---|
| Colossal-LLaMA-2 | 69,104 | 0.659 | 73.682 |
| LLaMA-2-7B | 32,000 | 1.205 | 134.689 |
| Atom-7B | 65,000 | 0.634 | 70.915 |
| Baichuan-7B | 64,000 | 0.678 | 75.857 |
| Baichuan2-7B-base | 125,696 | 0.570 | 63.761 |
| ChatGLM2-6B | 64,789 | 0.645 | 72.178 |
| InternLM-7B | 103,168 | 0.566 | 63.349 |
| Qwen-7B | 151,643 | 0.578 | 64.703 |
| Tigerbot-7B-base | 60,515 | 0.630 | 70.515 |
| Yayi-7B-llama2 | 32,005 | 1.214 | 135.689 |
| Chinese-llama-2-7b | 55,296 | 0.668 | 74.690 |
| Chinese-Falcon-7B | 90,046 | 0.669 | 74.858 |
| LinkSoul-Chinese-Llama-2-7b | 40,076 | 0.958 | 107.089 |
| Ziya-LLaMA-13B-v1.1 | 39,410 | 0.958 | 107.074 |
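The card does not spell out how the compression rate and average sample length are computed. Purely as an illustration, one common convention is the number of tokens produced per character of raw text (lower is better); a minimal sketch under that assumption:

```python
from transformers import AutoTokenizer

def tokens_per_char(tokenizer_name: str, texts: list[str]) -> float:
    """Assumed proxy for 'compression rate': tokens emitted per character of raw text (lower is better)."""
    tok = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
    n_tokens = sum(len(tok(t, add_special_tokens=False).input_ids) for t in texts)
    n_chars = sum(len(t) for t in texts)
    return n_tokens / n_chars

# Tiny mixed Chinese/English sample purely for demonstration.
sample = ["明月松间照，清泉石上流。", "The quick brown fox jumps over the lazy dog."]
print(tokens_per_char("hpcai-tech/Colossal-LLaMA-2-7b-base", sample))
print(tokens_per_char("meta-llama/Llama-2-7b-hf", sample))
```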
Training Logs
Training logs (loss curves) from our experiment are shown in the original model card.
Training Strategy
Multi-stage Training
To enhance the model's performance and harness the full potential of the original LLaMA-2, we developed a multi-stage training strategy designed to systematically unlock the model's capabilities over a series of stages.
We therefore divide the training process into three stages:
- Large-scale pre-training stage (conducted by the original LLaMA-2): this initial stage establishes the model's foundational capabilities from the ground up and requires a substantial dataset of no less than 1 trillion tokens.
- Chinese knowledge injection stage: in this stage, we introduce Chinese knowledge into the model, which requires a high-quality dataset rich in comprehensive knowledge relevant to the Chinese language.
- Knowledge replay stage: knowledge is replayed through a question-answering (QA) mechanism, covering both the Chinese and English domains (summarized in the illustrative sketch after this list).
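The three stages can be summarized as a simple schedule; the sketch below is only an illustrative restatement of the strategy described above, with descriptive placeholder labels rather than the actual training corpora:

```python
# Illustrative summary of the three-stage strategy; labels are descriptive, not actual dataset names.
training_stages = [
    {"stage": 1, "name": "large-scale pre-training",
     "data": "heterogeneous general corpus, no less than 1T tokens",
     "note": "already performed by the original LLaMA-2"},
    {"stage": 2, "name": "Chinese knowledge injection",
     "data": "high-quality, knowledge-rich Chinese corpus",
     "note": "continual pre-training from the LLaMA-2 weights with the extended tokenizer"},
    {"stage": 3, "name": "knowledge replay",
     "data": "Chinese and English question-answering (QA) data",
     "note": "replays knowledge from both language domains"},
]
for s in training_stages:
    print(f"Stage {s['stage']}: {s['name']} | data: {s['data']} | {s['note']}")
```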
📄 License
This model is licensed under the LLaMA-2 license and the Apache License 2.0 without any additional commercial use restrictions.

