DeepSeek-Coder-V2-Lite-Instruct-FP8: Open-Source Code Model Optimized for Efficient Inference, for Commercial and Research Use in English

Developed by RedHatAI

FP8 quantized version of DeepSeek-Coder-V2-Lite-Instruct, suitable for commercial and research use in English, optimized for inference efficiency.

Large Language Model

Transformers

Open Source License: Other

Tags: #FP8 Quantization #Code Generation #vLLM Optimization

Downloads 11.29k

Release Time: 7/17/2024

Model Overview

This model is a quantized version of DeepSeek-Coder-V2-Lite-Instruct, optimized with FP8 weight and activation quantization, suitable for assistant-like chat scenarios.

Model Features

FP8 Quantization

Weights and activations quantized to FP8 data type, reducing disk size and GPU memory requirements by approximately 50%.

Efficient Inference

Compatible with vLLM >= 0.5.2 for efficient inference, optimizing inference speed.

High Accuracy

Scores an average of 79.60 on the HumanEval+ benchmark, comparable to the 79.33 of the non-quantized model.

Model Capabilities

Text Generation

Code Generation

Chat Assistant

Use Cases

Commercial and Research

Code Generation Assistant

Helps developers generate code snippets, improving development efficiency.

Achieved an average score of 79.60 on the HumanEval+ benchmark.

Chatbot

Suitable for assistant-like chat scenarios, providing natural language interaction.

🚀 DeepSeek-Coder-V2-Lite-Instruct-FP8

A quantized version of DeepSeek-Coder-V2-Lite-Instruct, optimized for efficient inference with vLLM.

🚀 Quick Start

This model can be deployed efficiently using the vLLM backend. See the "Usage Examples" section for a complete example.

✨ Features

Quantization Optimization: The weights and activations of the model are quantized to FP8 data type, reducing the disk size and GPU memory requirements by approximately 50%.
Efficient Inference: Ready for inference with vLLM >= 0.5.2, and vLLM also supports OpenAI-compatible serving.
High Performance: Achieves an average score of 79.60 on the HumanEval+ benchmark, on par with the unquantized model (79.33).

📦 Installation

The model card gives no dedicated installation steps. The only stated requirement is vLLM >= 0.5.2.

💻 Usage Examples

Basic Usage

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

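from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8"

# Sampling settings: moderate randomness, at most 256 generated tokens.
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# The tokenizer is only used here to render the chat template into a prompt string.
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# tokenize=False returns the formatted prompt as text rather than token ids.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the FP8 checkpoint; the context length is capped at 4096 tokens.
llm = LLM(model=model_id, trust_remote_code=True, max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)

# One prompt in, one completion out: take the first candidate of the first request.
generated_text = outputs[0].outputs[0].text
print(generated_text)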

Advanced Usage

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
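As a rough illustration of the OpenAI-compatible route, the sketch below sends one chat request to a locally running vLLM server using only the Python standard library. The server command in the comment, the host and port (`localhost:8000`), and the helper names `build_chat_request` and `chat` are assumptions made for this example and do not come from the model card.

```python
import json
import urllib.request

# Assumption: a vLLM OpenAI-compatible server was started separately, e.g.
#   python -m vllm.entrypoints.openai.api_server \
#       --model neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8 \
#       --trust-remote-code --max-model-len 4096
# and is listening on vLLM's default port 8000.
BASE_URL = "http://localhost:8000/v1"
MODEL_ID = "neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8"


def build_chat_request(user_prompt, max_tokens=256, temperature=0.6, top_p=0.9):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a Python function that reverses a string."))` would print the model's reply. The sampling values simply mirror the offline example and can be changed per request.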

📚 Documentation

Model Overview

Property	Details
Model Architecture	DeepSeek-Coder-V2-Lite-Instruct. Input: Text; Output: Text
Model Optimizations	Weight quantization: FP8; Activation quantization: FP8
Intended Use Cases	Intended for commercial and research use in English. Similar to Meta-Llama-3-8B-Instruct, this model is intended for assistant-like chat.
Out-of-scope	Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
Release Date	7/18/2024
Version	1.0
License(s)	deepseek-license
Model Developers	Neural Magic

This is a quantized version of DeepSeek-Coder-V2-Lite-Instruct. It achieves an average score of 79.60 on the HumanEval+ benchmark, whereas the unquantized model achieves 79.33.

Model Optimizations

This model was obtained by quantizing the weights and activations of DeepSeek-Coder-V2-Lite-Instruct to FP8 data type, ready for inference with vLLM >= 0.5.2. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
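The 50% figure follows directly from halving the bits per parameter. The sketch below works through that arithmetic. The parameter count of roughly 16 billion is an assumed round figure for illustration and is not stated in this card, and `weight_footprint_gb` is a hypothetical helper.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~16 billion total parameters (round figure, not from this card).
NUM_PARAMS = 16e9

fp16_gb = weight_footprint_gb(NUM_PARAMS, 16)
fp8_gb = weight_footprint_gb(NUM_PARAMS, 8)
reduction = 1 - fp8_gb / fp16_gb

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"FP8 weights:    {fp8_gb:.1f} GB")
print(f"Reduction:      {reduction:.0%}")
```

This treats every parameter as quantized, so it is an upper bound on the saving. Layers left at their original precision keep their 16-bit size, which is why the reduction is described as approximate.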

Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scale per tensor maps between the original values and the FP8 representations of the quantized weights and activations. AutoFP8 is used for quantization, with 512 sequences of UltraChat as calibration data.
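To make symmetric per-tensor quantization concrete, here is a minimal pure-Python simulation. It assumes the FP8 format is E4M3 (3 mantissa bits, largest finite value 448), which the card does not state explicitly. The functions `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are hypothetical helpers written for this illustration and are not AutoFP8's implementation.

```python
import math

FP8_E4M3_MAX = 448.0  # assumption: E4M3 format, largest finite value


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    # Exponents below -6 fall in the subnormal range, which shares one step size.
    exponent = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits -> 8 steps per power of two
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale shared by the whole tensor."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map FP8 values back to the original range with the same scale."""
    return [q * scale for q in quantized]


values = [0.5, -1.0, 0.25, 2.0]
q, scale = quantize_per_tensor(values)
restored = dequantize(q, scale)
print("scale:", scale)
print("quantized:", q)
print("restored:", restored)
```

The scale is chosen so that the largest magnitude in the tensor lands on the largest FP8 value. Because the mapping is symmetric, no zero point is needed, and a single multiplication recovers an approximation of the original values.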

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example in the "Usage Examples" section. vLLM also supports OpenAI-compatible serving. See the documentation for more details.

Creation

This model was created by applying AutoFP8 with calibration samples from UltraChat, keeping the expert gates at their original precision, as presented in the code snippet below. Although AutoFP8 was used for this particular model, Neural Magic is transitioning to llm-compressor, which supports several quantization schemes and models that AutoFP8 does not.

from datasets import load_dataset
from transformers import AutoTokenizer

from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"
quantized_model_dir = "DeepSeek-Coder-V2-Lite-Instruct-FP8"

# Calibration sequences are truncated to 4096 tokens; EOS doubles as the padding token.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True, model_max_length=4096)
tokenizer.pad_token = tokenizer.eos_token

# Take 512 UltraChat conversations and render each with the model's chat template.
ds = load_dataset("mgoin/ultrachat_2k", split="train_sft").select(range(512))
examples = [tokenizer.apply_chat_template(batch["messages"], tokenize=False) for batch in ds]
examples = tokenizer(examples, padding=True, truncation=True, return_tensors="pt").to("cuda")

# Static activation scheme: activation scales are fixed from the calibration data.
# Modules whose names match ignore_patterns are left unquantized.
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",
    ignore_patterns=["re:.*lm_head"]
)

model = AutoFP8ForCausalLM.from_pretrained(
    pretrained_model_dir, quantize_config=quantize_config
)
# Run calibration, quantize, and write the FP8 checkpoint to disk.
model.quantize(examples)
model.save_quantized(quantized_model_dir)
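The `ignore_patterns` entry above decides which modules are skipped during quantization. The sketch below shows one plausible reading of that convention: a `re:` prefix marks a regular expression, anything else is an exact name. This is an assumption about the matching rules, not AutoFP8's actual code, and both `is_ignored` and the module names are hypothetical.

```python
import re


def is_ignored(module_name, ignore_patterns):
    """Return True if the module name matches any ignore pattern."""
    for pattern in ignore_patterns:
        if pattern.startswith("re:"):
            # Regex pattern: strip the prefix and search the module name.
            if re.search(pattern[len("re:"):], module_name):
                return True
        elif pattern == module_name:
            # Plain pattern: exact name match only.
            return True
    return False


ignore_patterns = ["re:.*lm_head"]

# Hypothetical module names, for illustration only.
module_names = [
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]

for name in module_names:
    status = "kept at original precision" if is_ignored(name, ignore_patterns) else "quantized to FP8"
    print(f"{name}: {status}")
```

Under this reading, additional layers could be excluded by appending further patterns to the list.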

Evaluation

The model was evaluated on the HumanEval+ benchmark with the Neural Magic fork of the EvalPlus implementation of HumanEval+ and the vLLM engine, using the following commands:

python codegen/generate.py --model neuralmagic/DeepSeek-Coder-V2-Lite-Instruct-FP8 --temperature 0.2 --n_samples 50 --resume --root ~ --dataset humaneval
python evalplus/sanitize.py ~/humaneval/neuralmagic--DeepSeek-Coder-V2-Lite-Instruct-FP8_vllm_temp_0.2
evalplus.evaluate --dataset humaneval --samples ~/humaneval/neuralmagic--DeepSeek-Coder-V2-Lite-Instruct-FP8_vllm_temp_0.2-sanitized
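The commands above draw 50 samples per problem, from which pass@1 and pass@10 are reported. Assuming the evaluation uses the standard unbiased pass@k estimator (the card does not say so explicitly), the per-problem score can be sketched as follows. `pass_at_k` is a hypothetical helper, not part of the EvalPlus command line.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k), the probability that at least one of
    k samples chosen without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 50 samples for one problem, 10 of which pass the tests.
print("pass@1 :", pass_at_k(50, 10, 1))
print("pass@10:", pass_at_k(50, 10, 10))
```

The benchmark score is then the mean of this value over all problems, which is why pass@10 is never lower than pass@1 for the same set of samples.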

Accuracy

Benchmark	DeepSeek-Coder-V2-Lite-Instruct	DeepSeek-Coder-V2-Lite-Instruct-FP8 (this model)	Recovery
base pass@1	80.8	79.3	98.14%
base pass@10	83.4	84.6	101.4%
base+extra pass@1	75.8	74.9	98.81%
base+extra pass@10	77.3	79.6	102.9%
Average	79.33	79.60	100.3%
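The Recovery column is the FP8 score expressed as a percentage of the unquantized score. The short sketch below recomputes it from the four benchmark rows of the table. The `scores` dictionary and the `recovery` helper are names introduced for this example.

```python
# Scores copied from the accuracy table: (unquantized, FP8).
scores = {
    "base pass@1": (80.8, 79.3),
    "base pass@10": (83.4, 84.6),
    "base+extra pass@1": (75.8, 74.9),
    "base+extra pass@10": (77.3, 79.6),
}


def recovery(baseline: float, quantized: float) -> float:
    """FP8 score as a percentage of the unquantized score."""
    return 100.0 * quantized / baseline


for name, (baseline, quantized) in scores.items():
    print(f"{name}: {recovery(baseline, quantized):.2f}%")

# The Average row is the mean of each column; its recovery is the ratio of those means.
baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
quantized_avg = sum(q for _, q in scores.values()) / len(scores)
print(f"Average: {baseline_avg:.2f} vs {quantized_avg:.2f}, "
      f"recovery {recovery(baseline_avg, quantized_avg):.2f}%")
```

The recomputed values agree with the table to within the last printed digit. A recovery above 100% on the pass@10 rows means the FP8 model scored slightly higher than the baseline on those metrics.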

📄 License

This model is released under the deepseek-license.
