🚀 ProSparse-LLaMA-2-7B
ProSparse-LLaMA-2-7B is a fine-tuned large language model that applies the ProSparse method to achieve high activation sparsity while maintaining performance comparable to the original model, yielding significant inference acceleration.
✨ Features
- High Activation Sparsity: The ProSparse method pushes activation sparsity to 89.32% for ProSparse-7B, higher than many existing models.
- Comparable Performance: Maintains performance comparable to the original Swish-activated LLaMA2 models on various benchmarks.
- Inference Acceleration: Demonstrates practical speed-up on both the PowerInfer framework and custom sparse GPU operators.
📦 Installation
The README does not provide specific installation steps, so this section is skipped.
💻 Usage Examples
The README does not provide code examples for using the model, so this section is skipped.
📚 Documentation
Introduction
The utilization of activation sparsity is a promising method for accelerating inference in large language models (LLMs). However, most recent mainstream LLMs adopt non-sparse activation functions. This work introduces ProSparse, a method that pushes LLMs toward higher activation sparsity while maintaining comparable performance. Applying it to different models yields high-sparsity ReLU-activated models whose performance is comparable to the original versions.
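For concreteness, activation sparsity here means the fraction of zero entries in the intermediate activations of each FFN. The snippet below is a minimal sketch of how such a ratio can be measured for a ReLU-gated FFN; the module, shapes, and names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ReLUGatedFFN(nn.Module):
    """Toy ReLU-gated FFN, a stand-in for a LLaMA-style FFN (illustrative only)."""
    def __init__(self, d_model: int = 4096, d_ff: int = 11008):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        inter = torch.relu(self.gate(x)) * self.up(x)  # intermediate activations
        return self.down(inter), inter

ffn = ReLUGatedFFN()
x = torch.randn(8, 4096)
_, inter = ffn(x)
sparsity = (inter == 0).float().mean().item()  # fraction of exactly-zero entries
print(f"activation sparsity: {sparsity:.2%}")
```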
Training Dataset
The 7B model is trained on about 34.60 billion tokens within 16,500 steps. The training data consists of two categories:
- Language modeling datasets: StarCoder, Wikipedia, Pile, and other collected datasets.
- Instruction tuning datasets: UltraChat, P3 (multiple-choice QA), PAQ, Unnatural Instructions, Flan, Super-Natural Instructions, and other collected datasets.
ProSparse: Training Methodology
The training process of ProSparse consists of three steps:
- Activation Function Substitution: Replace the activation function of the FFNs with ReLU and apply continual training.
- Progressive Sparsity Regularization: Jointly optimize the model on the conventional next-token prediction loss and an $L_1$ regularization loss on the FFN activations. The regularization factor $\lambda$ is increased progressively over multiple stages.
- Activation Threshold Shifting: Replace ReLU with FATReLU, a ReLU variant with a positive threshold, to further boost sparsity (a minimal sketch of FATReLU and the regularized objective follows this list).
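A minimal sketch of the two ingredients named above, FATReLU and the $L_1$-regularized objective; the class and function names are illustrative, not the released training code.

```python
import torch
import torch.nn as nn

class FATReLU(nn.Module):
    """ReLU variant with a positive threshold: values not exceeding it are zeroed."""
    def __init__(self, threshold: float = 0.01):
        super().__init__()
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.where(x > self.threshold, x, torch.zeros_like(x))

def prosparse_objective(lm_loss, ffn_activations, lam):
    """Next-token prediction loss plus an L1 penalty on the FFN intermediate
    activations; the factor `lam` is raised stage by stage during training."""
    l1_term = sum(act.abs().mean() for act in ffn_activations)
    return lm_loss + lam * l1_term
```

During training, `lam` would follow the staged schedule given in the table below.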
The 7B model is trained on 8 A100 GPUs. The learning rate follows a cosine schedule with a peak of 3e-5. The hyper-parameters for each stage (the regularization factor $\lambda_i$ and the accumulated training steps $T_i$) are as follows:
| Step Number $i$ | $\lambda_i$ | $T_i$ | Accumulated Tokens (B) |
| :---: | :---: | :---: | :---: |
| 0 | 0 | 5,000 | 10.49 |
| 1 | 5e-3 | 6,000 | 12.58 |
| 2 | 5e-2 | 10,000 | 20.97 |
| 3 | 5e-2 | 12,000 | 25.17 |
| 4 | 2e-1 | 16,000 | 33.55 |
| 5 | 2e-1 | 16,500 | 34.60 |
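The accumulated-token column is consistent with a constant budget of roughly 2.1M tokens per step; the quick check below assumes 512 sequences of 4,096 tokens per step, which is an inference from the numbers rather than a stated hyper-parameter.

```python
TOKENS_PER_STEP = 512 * 4096  # assumed: 512 sequences x 4,096 tokens = 2,097,152

for steps in (5_000, 6_000, 10_000, 12_000, 16_000, 16_500):
    print(f"{steps:>6} steps -> {steps * TOKENS_PER_STEP / 1e9:.2f}B tokens")
# Prints 10.49, 12.58, 20.97, 25.17, 33.55, and 34.60, matching the table above.
```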
Evaluation Results
The evaluation results on various benchmarks demonstrate the advantage of ProSparse: it is the only method that achieves both high sparsity and performance comparable to the original Swish-activated LLaMA2. The evaluation is based on the UltraEval framework. In the table below, an asterisk (*) marks the ProSparse version without activation threshold shifting.
| Setting | Average Sparsity | Average Performance | Code Generation | Commonsense Reasoning | Reading Comprehension | GSM8K | MMLU | BBH | AGI Eval |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
| ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
| ProSparse-7B* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
| ProSparse-7B | 89.32 | 38.46 | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
| LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
| ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
| ProSparse-13B* | 87.97 | 45.07 | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
| ProSparse-13B | 88.80 | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
| MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
| ProSparse-1B* | 86.25 | 44.72 | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
| ProSparse-1B | 87.89 | 44.72 | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |
Evaluation Issues with LM-Eval
The results can be replicated with UltraEval. Abnormal results with [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness) may be caused by the absence of the cls token `<s>`. A quick fix is shown in the following code:
```python
for _, context_enc, continuation_enc in chunk:
    # Ensure the context starts with the <s> (BOS) token, whose id is 1 for LLaMA.
    assert len(context_enc) > 0
    if context_enc[0] != 1:
        context_enc = [1] + context_enc
    assert len(continuation_enc) > 0
    assert len(continuation_enc) <= self.max_length
```
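To see why the patch checks for token id 1: LLaMA-2 tokenizers prepend the `<s>` (BOS) token, whose id is 1, when encoding text. A quick way to confirm this with the tokenizer from this repository (any LLaMA-2 tokenizer behaves the same):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("SparseLLM/prosparse-llama-2-7b")
ids = tok("Activation sparsity is useful.")["input_ids"]
print(ids[0], tok.convert_ids_to_tokens([ids[0]]))  # expected: 1 ['<s>']
```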
Adapting vLLM to ProSparse LLaMA models
- Replace the file vllm/model_executor/models/llama.py in the original vLLM with this [file](https://github.com/Raincleared-Song/DejaVu_predictor/blob/main/llama.py).
- Replace the contents of the original [config.json](https://huggingface.co/SparseLLM/prosparse-llama-2-7b/blob/main/config.json) with this [file](https://github.com/Raincleared-Song/DejaVu_predictor/blob/main/config.json).
- Set the environment variable ACT_INFO. To test the version without activation threshold shifting, use `export ACT_INFO=relu`. To test the version with activation threshold shifting, use `export ACT_INFO=fatrelu_0.01`. A combined usage sketch is given below.
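Putting the three steps together, here is a hedged usage sketch. It assumes the patched `llama.py` reads `ACT_INFO` when the model is built; the model path and prompt are placeholders.

```python
import os

# Select the activation variant before vLLM constructs the model.
os.environ["ACT_INFO"] = "fatrelu_0.01"  # or "relu" for no activation threshold shifting

from vllm import LLM, SamplingParams

# Local path or hub id of ProSparse-LLaMA-2-7B with the replaced config.json.
llm = LLM(model="SparseLLM/prosparse-llama-2-7b", dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is activation sparsity?"], params)
print(outputs[0].outputs[0].text)
```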
Inference Acceleration Effects
- PowerInfer: Utilize PowerInfer, a state-of-the-art acceleration framework, and report activation recall, predicted sparsity, and the number of tokens generated per second. The GGUF files and activation predictors for ProSparse-7B are available at [ProSparse-LLaMA-2-7B-GGUF](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-gguf) and [ProSparse-LLaMA-2-7B-Predictor](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-predictor).
- Sparse GPU Operators: Two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) are implemented for faster accurate inference. They speed up two key steps of a gated FFN, reported as S2 and S3 in the table below (a schematic sketch of where these steps sit follows the table).
| Setting | Average Sparsity | Activation Recall | Predicted Sparsity | PowerInfer Speed (tokens/s) | Speedup to Dense | S2 Time | Speedup to Dense | S3 Time | Speedup to Dense |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Dense-7B | - | - | - | 3.67 | 1.00 | 90.55 | 1.00 | 82.92 | 1.00 |
| ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 3.10 | 67.12 | 1.35 | 63.00 | 1.32 |
| ProSparse-7B* | 88.11 | 93.46 | 79.39 | 14.77 | 4.03 | 50.32 | 1.80 | 47.72 | 1.74 |
| ProSparse-7B | 89.32 | 93.42 | 80.24 | 15.17 | 4.13 | 49.03 | 1.85 | 46.44 | 1.79 |
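As a rough illustration of where the two accelerated steps sit, the sketch below marks the gating/up-projection step (S2) and the down-projection step (S3) in a schematic gated FFN. The decomposition is the usual one for gated FFNs and is an assumption here; the module is dense reference code, not the released sparse operators.

```python
import torch
import torch.nn as nn

class GatedFFN(nn.Module):
    """Schematic gated FFN showing what the sparse operators can skip."""
    def __init__(self, d_model: int = 4096, d_ff: int = 11008, threshold: float = 0.01):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.threshold = threshold

    def forward(self, x):
        # S1: gating scores through a FATReLU; most entries become exactly zero.
        s = self.w_gate(x)
        s = torch.where(s > self.threshold, s, torch.zeros_like(s))
        # S2: up projection fused with the gate. A sparse operator can skip the
        #     output channels of w_up whose gate value is zero.
        inter = s * self.w_up(x)
        # S3: down projection. A sparse operator can skip the input channels of
        #     w_down that receive zero entries of `inter`.
        return self.w_down(inter)
```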
🔧 Technical Details
The README provides detailed technical information about the ProSparse method, including the training process (activation function substitution, progressive sparsity regularization, activation threshold shifting), the hyper-parameters, and the implementation of sparse GPU operators.
📄 License
The model uses the llama2 license.