🚀 ProSparse-LLaMA-2-7B
ProSparse-LLaMA-2-7B is a fine-tuned large language model that applies the ProSparse method to achieve high activation sparsity while maintaining performance comparable to the original model, yielding significant inference acceleration.
✨ Features
- High Activation Sparsity: The ProSparse method pushes activation sparsity to 89.32% for ProSparse-7B, higher than many existing models.
- Comparable Performance: Maintains performance comparable to the original Swish-activated LLaMA2 models on various benchmarks.
- Inference Acceleration: Demonstrates practical speed-up on both the PowerInfer framework and custom sparse GPU operators.
📦 Installation
The README does not provide specific installation steps, so this section is skipped.
💻 Usage Examples
The README does not provide code examples for using the model, so this section is skipped.
📚 Documentation
Introduction
The utilization of activation sparsity is a promising method for accelerating inference in large language models (LLMs). However, most recent mainstream LLMs adopt non-sparse activation functions. This work introduces ProSparse, a method that pushes LLMs toward higher activation sparsity while maintaining comparable performance. Applying it to different models yields high-sparsity ReLU-activated models whose performance is comparable to the original versions.
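For concreteness, activation sparsity here means the fraction of zero entries in the intermediate activations of each FFN. The snippet below is a minimal sketch of how such a ratio can be measured for a ReLU-gated FFN; the module, shapes, and names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ReLUGatedFFN(nn.Module):
    """Toy ReLU-gated FFN, a stand-in for a LLaMA-style FFN (illustrative only)."""
    def __init__(self, d_model: int = 4096, d_ff: int = 11008):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        inter = torch.relu(self.gate(x)) * self.up(x)  # intermediate activations
        return self.down(inter), inter

ffn = ReLUGatedFFN()
x = torch.randn(8, 4096)
_, inter = ffn(x)
sparsity = (inter == 0).float().mean().item()  # fraction of exactly-zero entries
print(f"activation sparsity: {sparsity:.2%}")
```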
Training Dataset
The 7B model is trained on about 34.60 billion tokens within 16,500 steps. The training data consists of two categories:
- Language modeling datasets: StarCoder, Wikipedia, Pile, and other collected datasets.
- Instruction tuning datasets: UltraChat, P3 (multiple-choice QA), PAQ, Unnatural Instructions, Flan, Super-Natural Instructions, and other collected datasets.
ProSparse: Training Methodology
The training process of ProSparse consists of three steps:
- Activation Function Substitution: Replace the activation function of the FFNs with ReLU and apply continual training.
- Progressive Sparsity Regularization: Jointly optimize the model on the conventional next-token prediction loss and an $L_1$ regularization loss on the FFN activations. The regularization factor $\lambda$ is increased progressively over multiple stages.
- Activation Threshold Shifting: Replace ReLU with FATReLU, a ReLU variant with a positive threshold, to further boost sparsity (a minimal sketch of FATReLU and the regularized objective follows this list).
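A minimal sketch of the two ingredients named above, FATReLU and the $L_1$-regularized objective; the class and function names are illustrative, not the released training code.

```python
import torch
import torch.nn as nn

class FATReLU(nn.Module):
    """ReLU variant with a positive threshold: values not exceeding it are zeroed."""
    def __init__(self, threshold: float = 0.01):
        super().__init__()
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.where(x > self.threshold, x, torch.zeros_like(x))

def prosparse_objective(lm_loss, ffn_activations, lam):
    """Next-token prediction loss plus an L1 penalty on the FFN intermediate
    activations; the factor `lam` is raised stage by stage during training."""
    l1_term = sum(act.abs().mean() for act in ffn_activations)
    return lm_loss + lam * l1_term
```

During training, `lam` would follow the staged schedule given in the table below.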
The 7B model is trained on 8 A100 GPUs. The learning rate follows a cosine schedule with a peak of 3e-5. The hyper-parameters for each stage (the regularization factor $\lambda_i$ and the accumulated training steps $T_i$) are as follows:
| Step Number $i$ | $\lambda_i$ | $T_i$ | Accumulated Tokens (B) |
| :---: | :---: | :---: | :---: |
| 0 | 0 | 5,000 | 10.49 |
| 1 | 5e-3 | 6,000 | 12.58 |
| 2 | 5e-2 | 10,000 | 20.97 |
| 3 | 5e-2 | 12,000 | 25.17 |
| 4 | 2e-1 | 16,000 | 33.55 |
| 5 | 2e-1 | 16,500 | 34.60 |
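The accumulated-token column is consistent with a constant budget of roughly 2.1M tokens per step; the quick check below assumes 512 sequences of 4,096 tokens per step, which is an inference from the numbers rather than a stated hyper-parameter.

```python
TOKENS_PER_STEP = 512 * 4096  # assumed: 512 sequences x 4,096 tokens = 2,097,152

for steps in (5_000, 6_000, 10_000, 12_000, 16_000, 16_500):
    print(f"{steps:>6} steps -> {steps * TOKENS_PER_STEP / 1e9:.2f}B tokens")
# Prints 10.49, 12.58, 20.97, 25.17, 33.55, and 34.60, matching the table above.
```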
Evaluation Results
The evaluation results on various benchmarks demonstrate the advantage of ProSparse: it is the only method that achieves both high sparsity and performance comparable to the original Swish-activated LLaMA2. The evaluation is based on the UltraEval framework. In the table below, an asterisk (*) marks the ProSparse version without activation threshold shifting.
| Setting | Average Sparsity | Average Performance | Code Generation | Commonsense Reasoning | Reading Comprehension | GSM8K | MMLU | BBH | AGI Eval |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LLaMA2-7B | - | 37.96 | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 |
| ReluLLaMA-7B | 66.98 | 37.62 | 15.85 | 69.64 | 70.54 | 5.84 | 38.64 | 35.07 | 27.73 |
| ProSparse-7B* | 88.11 | 38.31 | 19.47 | 66.29 | 63.33 | 12.74 | 45.21 | 33.59 | 27.55 |
| ProSparse-7B | 89.32 | 38.46 | 19.42 | 66.27 | 63.50 | 12.13 | 45.48 | 34.99 | 27.46 |
| LLaMA2-13B | - | 44.06 | 20.19 | 72.58 | 71.55 | 22.21 | 54.69 | 37.89 | 29.33 |
| ReluLLaMA-13B | 71.56 | 42.74 | 20.19 | 70.44 | 73.29 | 18.50 | 50.58 | 37.97 | 28.22 |
| ProSparse-13B* | 87.97 | 45.07 | 29.03 | 69.75 | 67.54 | 25.40 | 54.78 | 40.20 | 28.76 |
| ProSparse-13B | 88.80 | 44.90 | 28.42 | 69.76 | 66.91 | 26.31 | 54.35 | 39.90 | 28.67 |
| MiniCPM-1B | - | 44.44 | 36.85 | 63.67 | 60.90 | 35.48 | 50.44 | 35.03 | 28.71 |
| ProSparse-1B* | 86.25 | 44.72 | 41.38 | 64.55 | 60.69 | 34.72 | 49.36 | 34.04 | 28.27 |
| ProSparse-1B | 87.89 | 44.72 | 42.04 | 64.37 | 60.73 | 34.57 | 49.51 | 34.08 | 27.77 |
Evaluation Issues with LM-Eval
The results can be replicated with UltraEval. Abnormal results with [LM-Eval](https://github.com/EleutherAI/lm-evaluation-harness) may be caused by the absence of the cls token `<s>`. A quick fix is shown in the following code:
```python
for _, context_enc, continuation_enc in chunk:
    # Ensure the context starts with the <s> (BOS) token, whose id is 1 for LLaMA.
    assert len(context_enc) > 0
    if context_enc[0] != 1:
        context_enc = [1] + context_enc
    assert len(continuation_enc) > 0
    assert len(continuation_enc) <= self.max_length
```
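To see why the patch checks for token id 1: LLaMA-2 tokenizers prepend the `<s>` (BOS) token, whose id is 1, when encoding text. A quick way to confirm this with the tokenizer from this repository (any LLaMA-2 tokenizer behaves the same):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("SparseLLM/prosparse-llama-2-7b")
ids = tok("Activation sparsity is useful.")["input_ids"]
print(ids[0], tok.convert_ids_to_tokens([ids[0]]))  # expected: 1 ['<s>']
```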
Adapting vLLM to ProSparse LLaMA models
- Replace the file vllm/model_executor/models/llama.py in the original vLLM with this [file](https://github.com/Raincleared-Song/DejaVu_predictor/blob/main/llama.py).
- Replace the contents of the original [config.json](https://huggingface.co/SparseLLM/prosparse-llama-2-7b/blob/main/config.json) with this [file](https://github.com/Raincleared-Song/DejaVu_predictor/blob/main/config.json).
- Set the environment variable ACT_INFO. To test the version without activation threshold shifting, use `export ACT_INFO=relu`. To test the version with activation threshold shifting, use `export ACT_INFO=fatrelu_0.01`. A combined usage sketch is given below.
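Putting the three steps together, here is a hedged usage sketch. It assumes the patched `llama.py` reads `ACT_INFO` when the model is built; the model path and prompt are placeholders.

```python
import os

# Select the activation variant before vLLM constructs the model.
os.environ["ACT_INFO"] = "fatrelu_0.01"  # or "relu" for no activation threshold shifting

from vllm import LLM, SamplingParams

# Local path or hub id of ProSparse-LLaMA-2-7B with the replaced config.json.
llm = LLM(model="SparseLLM/prosparse-llama-2-7b", dtype="float16")
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What is activation sparsity?"], params)
print(outputs[0].outputs[0].text)
```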
Inference Acceleration Effects
- PowerInfer: Utilize PowerInfer, a state-of-the-art acceleration framework, and report activation recall, predicted sparsity, and the number of tokens generated per second. The GGUF files and activation predictors for ProSparse-7B are available at [ProSparse-LLaMA-2-7B-GGUF](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-gguf) and [ProSparse-LLaMA-2-7B-Predictor](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-predictor).
- Sparse GPU Operators: Two sparse GPU [operators](https://github.com/Raincleared-Song/sparse_gpu_operator) are implemented for faster accurate inference. They speed up two key steps of a gated FFN, reported as S2 and S3 in the table below (a schematic sketch of where these steps sit follows the table).
| Setting | Average Sparsity | Activation Recall | Predicted Sparsity | PowerInfer Speed (tokens/s) | Speedup to Dense | S2 Time | Speedup to Dense | S3 Time | Speedup to Dense |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Dense-7B | - | - | - | 3.67 | 1.00 | 90.55 | 1.00 | 82.92 | 1.00 |
| ReluLLaMA-7B | 66.98 | 90.89 | 58.95 | 11.37 | 3.10 | 67.12 | 1.35 | 63.00 | 1.32 |
| ProSparse-7B* | 88.11 | 93.46 | 79.39 | 14.77 | 4.03 | 50.32 | 1.80 | 47.72 | 1.74 |
| ProSparse-7B | 89.32 | 93.42 | 80.24 | 15.17 | 4.13 | 49.03 | 1.85 | 46.44 | 1.79 |
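As a rough illustration of where the two accelerated steps sit, the sketch below marks the gating/up-projection step (S2) and the down-projection step (S3) in a schematic gated FFN. The decomposition is the usual one for gated FFNs and is an assumption here; the module is dense reference code, not the released sparse operators.

```python
import torch
import torch.nn as nn

class GatedFFN(nn.Module):
    """Schematic gated FFN showing what the sparse operators can skip."""
    def __init__(self, d_model: int = 4096, d_ff: int = 11008, threshold: float = 0.01):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.threshold = threshold

    def forward(self, x):
        # S1: gating scores through a FATReLU; most entries become exactly zero.
        s = self.w_gate(x)
        s = torch.where(s > self.threshold, s, torch.zeros_like(s))
        # S2: up projection fused with the gate. A sparse operator can skip the
        #     output channels of w_up whose gate value is zero.
        inter = s * self.w_up(x)
        # S3: down projection. A sparse operator can skip the input channels of
        #     w_down that receive zero entries of `inter`.
        return self.w_down(inter)
```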
🔧 Technical Details
The README provides detailed technical information about the ProSparse method, including the training process (activation function substitution, progressive sparsity regularization, activation threshold shifting), the hyper-parameters, and the implementation of sparse GPU operators.
📄 License
The model uses the llama2 license.