Neural-Chat-7B-V3-1 Open-Source Large Language Model - Supports Multiple Language Tasks, Free to Deploy

Neural Chat 7b V3 1

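Developed by Intel

A 7-billion-parameter large language model fine-tuned from Mistral-7B on Intel Gaudi 2 processors and aligned with DPO, suitable for a variety of language tasks

Large language model

Transformers

Language: English. License: Apache-2.0. Tags: #Intel Gaudi2 optimized, #DPO-aligned chat, #multi-task language generation

Downloads: 3,019

Released: 11/14/2023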

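Model Overview

This model is a large language model fine-tuned from Mistral-7B on the Open-Orca/SlimOrca dataset and aligned with Direct Preference Optimization (DPO). It is suitable for text generation and other language tasks.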

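Model Features

DPO alignment

Aligned with Direct Preference Optimization (DPO) to improve generation quality

Intel Gaudi2 acceleration

Fine-tuned on Intel Gaudi 2 processors to make use of the hardware's performance

Multi-task support

Performs well across language-understanding benchmarks such as ARC and HellaSwag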

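Model Capabilities

Text generation

Math problem solving

Language understanding

Dialogue systems

Use Cases

Education

Math problem solving

Generates step-by-step solutions to math problems

Correctly solves basic arithmetic problems such as 100 + 520 + 60

General conversation

Conversational assistant

Answers informational queries as a chat assistant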

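🚀 Neural-Chat-7B-v3-1 Model

Neural-Chat-7B-v3-1 is a 7-billion-parameter large language model fine-tuned on Intel Gaudi 2 processors. It is trained on open-source datasets, performs well on standard benchmarks, and supports a variety of language-related tasks.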

🚀 Quick Start

This model is a 7-billion-parameter LLM fine-tuned on Intel Gaudi 2 processors from mistralai/Mistral-7B-v0.1 using the open-source dataset Open-Orca/SlimOrca. It was then aligned with the Direct Preference Optimization (DPO) method using Intel/orca_dpo_pairs. For more information, see the Medium article "The Practice of Supervised Fine-tuning and Direct Preference Optimization on Intel Gaudi2".
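
As a rough illustration of what DPO alignment optimizes, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. The function name `dpo_loss`, the value `beta=0.1`, and the example log-probabilities are illustrative assumptions, not values taken from the NeuralChat training recipe.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin))).

    Each argument is the summed log-probability of a full response under the
    policy being trained or the frozen reference model. beta is hypothetical.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-logits))


# Policy identical to the reference: no preference learned yet, loss is log(2)
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy favors the chosen response more strongly than the reference does
loss_after_shift = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(loss_at_start, loss_after_shift)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model, which is the behavior the preference pairs are meant to induce.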


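✨ Key Features

Model Information

Property	Details
Model authors - Company	Intel. The NeuralChat team, with members from DCAI/AISE/AIPT. Core team members: Kaokao Lv, Liang Lv, Chang Wang, Wenxin Zhang, Xuhui Ren, and Haihao Shen.
Date	October 2023
Version	v3-1
Model type	7-billion-parameter large language model
Paper or other resources	Medium blog
License	Apache 2.0
Questions or comments	Community tab and Intel DevHub Discord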

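Intended Use

Intended use	Description
Primary intended uses	You can use the fine-tuned model for several language-related tasks. See the LLM Leaderboard for how this model performs.
Primary intended users	Anyone running inference on language-related tasks.
Out-of-scope uses	In most cases, this model will need to be fine-tuned for your particular task. The model should not be used to intentionally create hostile or alienating environments for people.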

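📦 Installation Guide

Training Hyperparameters

The following hyperparameters were used during training:

Learning rate: 1e-04
Train batch size: 1
Eval batch size: 2
Random seed: 42
Distributed type: multi-HPU
Number of devices: 8
Gradient accumulation steps: 8
Total train batch size: 64
Total eval batch size: 8
Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
Learning rate scheduler type: cosine
Learning rate scheduler warmup ratio: 0.03
Number of epochs: 2.0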
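
To make the relationship between these values concrete, the sketch below derives the total train batch size and traces a cosine schedule with a 3% warmup. The schedule shape (linear warmup, then cosine decay to zero) and the step count `TOTAL_STEPS` are assumptions for illustration; the source only states the scheduler type and warmup ratio.

```python
import math

# Values from the hyperparameter list above
LEARNING_RATE = 1e-4
PER_DEVICE_TRAIN_BATCH = 1
NUM_DEVICES = 8
GRAD_ACCUM_STEPS = 8
WARMUP_RATIO = 0.03

# Samples contributing to each optimizer update
total_train_batch = PER_DEVICE_TRAIN_BATCH * NUM_DEVICES * GRAD_ACCUM_STEPS


def cosine_lr(step, total_steps, peak_lr=LEARNING_RATE, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 1000  # hypothetical; the real step count depends on dataset size
print("total train batch size:", total_train_batch)
for step in (0, 15, 30, 500, 1000):
    print(step, cosine_lr(step, TOTAL_STEPS))
```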

Reproducing the Model

Sample code for reproducing the model: GitHub sample code. The following steps reproduce the model build:

git clone https://github.com/intel/intel-extension-for-transformers.git
cd intel-extension-for-transformers

docker build --no-cache ./ --target hpu --build-arg REPO=https://github.com/intel/intel-extension-for-transformers.git --build-arg ITREX_VER=main -f ./intel_extension_for_transformers/neural_chat/docker/Dockerfile -t chatbot_finetuning:latest

docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host chatbot_finetuning:latest

# After entering the docker container
cd examples/finetuning/finetune_neuralchat_v3

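We select the latest pretrained model, mistralai/Mistral-7B-v0.1, and the open-source dataset Open-Orca/SlimOrca for the experiment.

The script below launches training with DeepSpeed ZeRO-2 on 8 Gaudi2 cards. In finetune_neuralchat_v3.py, the default settings for Gaudi2 are use_habana=True, use_lazy_mode=True, device="hpu". To run on NVIDIA GPUs instead, set them to use_habana=False, use_lazy_mode=False, device="auto".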

deepspeed --include localhost:0,1,2,3,4,5,6,7 \
    --master_port 29501 \
    finetune_neuralchat_v3.py

Merge the LoRA weights:

python apply_lora.py \
    --base-model-path mistralai/Mistral-7B-v0.1 \
    --lora-model-path finetuned_model/ \
    --output-path finetuned_model_lora
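
Conceptually, merging LoRA weights folds the low-rank update back into each base weight matrix, W' = W + (alpha / r) * B * A. The sketch below shows that arithmetic on tiny hand-written matrices using plain Python lists. It is a generic illustration of the technique, not a description of what apply_lora.py does internally; the names `merge_lora`, `lora_alpha`, and the example values are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base_weight, lora_b, lora_a, lora_alpha, rank):
    """Return W + (lora_alpha / rank) * (B @ A) as a new matrix."""
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base_weight, delta)]


# Toy example: a 2x2 base weight with a rank-1 adapter (hypothetical values)
W = [[1.0, 0.0],
     [0.0, 1.0]]
B = [[1.0],
     [2.0]]          # shape (2, rank)
A = [[0.5, -0.5]]    # shape (rank, 2)

merged = merge_lora(W, B, A, lora_alpha=2, rank=1)
print(merged)
```

After merging, the adapter matrices are no longer needed at inference time, since the merged matrix has the same shape as the base weight.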

💻 Usage Examples

Basic Usage

FP32 Inference with Transformers

import transformers

model_name = 'Intel/neural-chat-7b-v3-1'
model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

def generate_response(system_input, user_input):

    # Format the input using the provided template
    prompt = f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"

    # Tokenize and encode the prompt
    inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=False)

    # Generate a response
    outputs = model.generate(inputs, max_length=1000, num_return_sequences=1)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract only the assistant's response
    return response.split("### Assistant:\n")[-1]


# Example usage
system_input = "You are a math expert assistant. Your mission is to help users understand and solve various math problems. You should provide step-by-step solutions, explain reasonings and give the correct answer."
user_input = "calculate 100 + 520 + 60"
response = generate_response(system_input, user_input)
print(response)

# Expected response
"""
To calculate the sum of 100, 520, and 60, we will follow these steps:

1. Add the first two numbers: 100 + 520
2. Add the result from step 1 to the third number: (100 + 520) + 60

Step 1: Add 100 and 520
100 + 520 = 620

Step 2: Add the result from step 1 to the third number (60)
(620) + 60 = 680

So, the sum of 100, 520, and 60 is 680.
"""
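
The prompt template and the reply extraction in the example above can be exercised without loading the model. The sketch below isolates both steps; the helper names `build_prompt` and `extract_reply` and the simulated output string are illustrative assumptions.

```python
def build_prompt(system_input: str, user_input: str) -> str:
    """Format a single-turn prompt with the template used above."""
    return f"### System:\n{system_input}\n### User:\n{user_input}\n### Assistant:\n"


def extract_reply(generated_text: str) -> str:
    """Return only the text after the last assistant marker."""
    return generated_text.split("### Assistant:\n")[-1]


prompt = build_prompt("You are a math expert assistant.", "calculate 100 + 520 + 60")
# Simulated decoder output: the decoded text echoes the prompt before the reply
fake_output = prompt + "The sum is 680."
print(extract_reply(fake_output))
```

Splitting on the final marker matters because the decoded sequence contains the prompt itself, so the reply is whatever follows the last "### Assistant:" line.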

BF16 Inference with Intel Extension for Transformers and Intel Extension for PyTorch

from transformers import AutoTokenizer, TextStreamer
import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
import intel_extension_for_pytorch as ipex

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = ipex.optimize(model.eval(), dtype=torch.bfloat16, inplace=True, level="O1", auto_kernel_selection=True)

outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

INT4 Inference with Transformers and Intel Extension for Transformers

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
model_name = "Intel/neural-chat-7b-v3-1"

# For int8, set weight_dtype="int8"
config = WeightOnlyQuantConfig(compute_dtype="bf16", weight_dtype="int4")
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

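Advanced Usage

Testing Whether the Model Can Be Quantized

The following code block can be used to determine whether a PyTorch model is a candidate for quantization. Note that the Intel Extension for PyTorch approach uses optimum ipex, which is pre-release and needs further testing.

To install the dependencies, first install Intel Extension for PyTorch, then pip install each of the following:

torch
optimum.intel
optimum[ipex]
transformers

Intel Extension for PyTorch method:

In this case, we test whether neural-chat-7b-v3-1 can be quantized, and the test reports the change in model size. For example, when the base type is specified as torch.bfloat16 but load_in_4bit=True is also specified, only the weights are quantized, and the model test produces the following output:

model_quantize_internal: model size = 27625.02 MB
model_quantize_internal: quant size = 4330.80 MB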
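
The two sizes in that log translate into a compression ratio and a rough per-weight cost. The sketch below does the arithmetic. It assumes the reported "model size" corresponds to 32-bit weights, which the log itself does not state.

```python
# Sizes reported by model_quantize_internal in the log above
original_mb = 27625.02
quantized_mb = 4330.80

compression_ratio = original_mb / quantized_mb

# Assumption: the "model size" line reflects 32-bit weights, so each weight
# costs 32 bits before quantization.
ASSUMED_ORIGINAL_BITS = 32
effective_bits_per_weight = ASSUMED_ORIGINAL_BITS * quantized_mb / original_mb

print(f"compression ratio: {compression_ratio:.2f}x")
print(f"effective bits per weight: {effective_bits_per_weight:.2f}")
```

Under that assumption the effective cost comes out slightly above 4 bits per weight, plausibly because quantization scales and any unquantized tensors are stored alongside the 4-bit weights.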

This code should be run from within a Python script, for example ipex_test.py:

import os

import torch
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

typ = torch.bfloat16
result = {typ: "failed"}
outputs = None  # stays None if loading or generation fails
try:
    # load_in_4bit=True with a bfloat16 base type quantizes only the weights
    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, torch_dtype=typ)
    outputs = model.generate(inputs, max_new_tokens=20)
    result[typ] = f"passed, {os.stat(model.bin_file).st_size}"
except Exception as exc:
    # Record the reason instead of silently swallowing it
    result[typ] = f"failed: {exc}"

print("\n\nResults of quantizing: ")
# Scan the captured log for the quantization size report
with open("output.log", "r") as fp:
    for line in fp:
        if "model_quantize_internal" in line:
            print(line)

print("\n\nExecution results ")
for k, v in result.items():
    print(k, v)

print("\n\nModel Output: ")
# Only decode when generation succeeded
if outputs is not None:
    print(tokenizer.decode(outputs[0], skip_special_tokens=True).strip())

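Run the code from a bash terminal as follows:

python ipex_test.py 2>&1 | tee output.log

The entire output is captured in output.log. The script also prints a summary that indicates whether quantization passed or failed, together with the model output for the given prompt.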

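📚 Detailed Documentation

Factors

Factor	Description
Datasets	More details about the dataset and annotations can be found at Open-Orca/SlimOrca and in the associated paper at https://arxiv.org/abs/2306.02707.
Input effects	The model's performance can vary with the input. In this case, the prompt provided can drastically change the language model's prediction.
Training environment	The model was trained on Intel Gaudi 2 processors (8 cards).
Hardware and software effects	Deploying the model on other hardware and software will change model performance. The model evaluation factors are from the Hugging Face LLM leaderboard: ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K, and DROP (see the quantitative analysis below).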

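Metrics

Metric	Description
Model performance measures	The model's performance was evaluated against other LLMs according to the measures on the LLM leaderboard. These measures have become the standard for LLM performance.
Decision thresholds	No decision thresholds were used.
Approaches to uncertainty and variability	-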

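Training and Evaluation Data

Data type	Description
Datasets	The training data are from Open-Orca/SlimOrca. There is no contamination from the GSM8k test set, as it is not part of the Open-Orca/SlimOrca dataset.
Motivation	-
Preprocessing	-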

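🔧 Technical Details

Quantitative Analysis

The model was submitted to the LLM Leaderboard. Detailed submission results are available at https://huggingface.co/datasets/open-llm-leaderboard/details_Intel__neural-chat-7b-v3-1 . The metrics below show that the model's average score (59.06) improves on both Mistral-7B-v0.1 (50.32) and neural-chat-7b-v3 (57.31), although some individual benchmarks, such as ARC and DROP, are lower than for neural-chat-7b-v3.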

Model	Average ⬆️	ARC (25-shot) ⬆️	HellaSwag (10-shot) ⬆️	MMLU (5-shot) ⬆️	TruthfulQA (MC) (0-shot) ⬆️	Winogrande (5-shot)	GSM8K (5-shot)	DROP (3-shot)
mistralai/Mistral-7B-v0.1	50.32	59.58	83.31	64.16	42.15	78.37	18.12	6.14
Intel/neural-chat-7b-v3	57.31	67.15	83.29	62.26	58.77	78.06	1.21	50.43
Intel/neural-chat-7b-v3-1	59.06	66.21	83.64	62.37	59.65	78.14	19.56	43.84
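
To see where the gains and losses fall, the sketch below computes per-benchmark differences between Intel/neural-chat-7b-v3-1 and each of the other two models. The scores are copied from the table above; the helper name `deltas` is an illustrative assumption.

```python
BENCHMARKS = ["ARC", "HellaSwag", "MMLU", "TruthfulQA", "Winogrande", "GSM8K", "DROP"]

# Scores copied from the table above
SCORES = {
    "mistralai/Mistral-7B-v0.1": [59.58, 83.31, 64.16, 42.15, 78.37, 18.12, 6.14],
    "Intel/neural-chat-7b-v3": [67.15, 83.29, 62.26, 58.77, 78.06, 1.21, 50.43],
    "Intel/neural-chat-7b-v3-1": [66.21, 83.64, 62.37, 59.65, 78.14, 19.56, 43.84],
}


def deltas(model, baseline):
    """Per-benchmark score difference (model minus baseline), in points."""
    return {
        name: round(m - b, 2)
        for name, m, b in zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    }


for baseline in ("mistralai/Mistral-7B-v0.1", "Intel/neural-chat-7b-v3"):
    print(f"v3-1 vs {baseline}:")
    for name, diff in deltas("Intel/neural-chat-7b-v3-1", baseline).items():
        print(f"  {name:<11}{diff:+.2f}")
```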