SpeechGPT-7B-cm Open-Source AI Model - A Cross-Modal Conversational Assistant Supporting Speech and Text Interaction

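SpeechGPT-7B-cm

Developed by fnlp

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities. It can perceive and generate multimodal content and supports interaction through both speech and text.

Text-to-Audio

Transformers

#CrossModalDialogue #SpeechLanguageModel #MultimodalInstructionFollowing

Downloads: 47

Released: 9/14/2023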

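Model Overview

Using discrete speech representations and a three-stage training strategy (modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning), SpeechGPT aligns speech with text and can handle a variety of cross-modal tasks.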

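Model Features

Cross-modal conversational ability

Accepts speech or text as input and produces speech or text as output, enabling cross-modal interaction.

Three-stage training strategy

Builds the model's capabilities step by step through modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.

Large-scale speech instruction dataset

The SpeechInstruct dataset was built for training and contains cross-modal instructions and chain-of-modality instructions.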

Model Capabilities

Speech recognition

Speech synthesis

Cross-modal dialogue

Text generation

Multimodal instruction following

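Use Cases

Personal assistant

Spoken question answering

Ask questions by voice to get answers

Provides accurate responses in speech or text

Education

Language learning

Helps learners practice English listening and speaking

Provides voice interaction and pronunciation feedback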

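🚀 SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multimodal content following human instructions. Using discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. We then apply a three-stage training strategy: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. Experimental results show that SpeechGPT follows multimodal human instructions well and highlight the potential of handling multiple modalities with a single model.

Demos of SpeechGPT are available on the project page. As the demos show, SpeechGPT has strong cross-modal instruction-following and spoken-dialogue abilities. It can act as a talking encyclopedia, a personal assistant, a chat partner, a poet, a psychologist, or an educational assistant.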

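🚀 Quick Start

The sections below describe SpeechGPT in more detail and explain how to install and use it.

Figure: SpeechGPT's capabilities across multiple cross-modal tasks.

Figure: Left: the SpeechInstruct construction process. Right: the SpeechGPT model architecture.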

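✨ Key Features

Intrinsic cross-modal conversational ability: perceives and generates multimodal content following human instructions.
Large-scale cross-modal speech instruction dataset: the SpeechInstruct dataset was built to train the model.
Three-stage training strategy: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.
Strong instruction following and dialogue: can take on many roles, such as an encyclopedia or a personal assistant.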

📦 Installation

Clone the repository

git clone https://github.com/0nutation/SpeechGPT
cd SpeechGPT

Create a virtual environment

conda create --name SpeechGPT python=3.8
conda activate SpeechGPT

Install dependencies

pip install -r requirements.txt

Download Models and Data

Download the models

To chat with SpeechGPT, download both SpeechGPT-7B-cm and SpeechGPT-7B-com to your local machine.

Download the mHuBERT model

s2u_dir="utils/speech2unit"
cd ${s2u_dir}
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3.pt
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin

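Download the unit vocoder

# vocoder_dir is a relative path; return to the directory you started from before running these commands.
vocoder_dir="utils/vocoder/"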
cd ${vocoder_dir}
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -O config.json
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -O vocoder.pt
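
Before running inference, it can help to confirm that all four downloaded files ended up where the commands above put them. The following is a minimal sketch, not part of the SpeechGPT repository: the helper name missing_assets is hypothetical, and it assumes both directories live under utils/ relative to the directory you pass in.

```python
from pathlib import Path

# File names are taken from the wget commands above; the directory layout
# (both folders under utils/) is an assumption of this sketch.
EXPECTED_FILES = {
    "utils/speech2unit": [
        "mhubert_base_vp_en_es_fr_it3.pt",
        "mhubert_base_vp_en_es_fr_it3_L11_km1000.bin",
    ],
    "utils/vocoder": ["config.json", "vocoder.pt"],
}


def missing_assets(root="."):
    """Return the expected asset paths (relative to root) that do not exist."""
    root = Path(root)
    missing = []
    for directory, names in EXPECTED_FILES.items():
        for name in names:
            relative = Path(directory) / name
            if not (root / relative).is_file():
                missing.append(relative.as_posix())
    return missing


if __name__ == "__main__":
    for path in missing_assets():
        print(f"missing: {path}")
```

Running the script from the directory that contains utils/ prints one line per missing file and prints nothing when everything is in place.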

💻 Usage Examples

CLI inference

python3 speechgpt/src/infer/cli_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output"

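Notes: For speech input, provide the path to the audio file. For automatic speech recognition (ASR) or text-to-speech (TTS) tasks, the audio path or text must be preceded by "this is input: "; otherwise the model may misinterpret the input.

Speech responses are saved as .wav files, and detailed responses are saved to a JSON file. The paths to these files are shown in the response.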
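
The prefix rule for ASR and TTS prompts can be captured in a small helper. This is an illustrative sketch only: build_task_prompt is a hypothetical name and is not part of the SpeechGPT CLI. It simply joins an instruction and its payload (an audio path or a sentence) in the same shape as the ASR and TTS examples in this section.

```python
# The marker text comes from the note above; the helper itself is hypothetical.
INPUT_MARKER = "this is input: "


def build_task_prompt(instruction, payload):
    """Join an instruction and its payload (audio path or text) with the marker."""
    instruction = instruction.strip().rstrip(",.")
    return f"{instruction}, {INPUT_MARKER}{payload.strip()}"


print(build_task_prompt("Recognize this speech", "prompts/1.wav"))
# Recognize this speech, this is input: prompts/1.wav
print(build_task_prompt("Read this sentence aloud", "Today is a sunny day."))
# Read this sentence aloud, this is input: Today is a sunny day.
```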

Text dialogue example

Please talk with SpeechGPT:
Who is Lebron James?
Response:
   Lebron James is an American professional basketball player for the Los Angeles Lakers of the National Basketball Association (NBA). He is considered one of the greatest basketball players of all time and is known for his athleticism, scoring ability, and leadership skills. He is a four-time NBA MVP, a 14-time NBA All-Star, a 13-time All-NBA selection, and a two-time Olympic gold medalist.
Response JSON is saved in output/responses.json

Spoken dialogue example

Please talk with SpeechGPT:
prompts/0.wav
Transcript:   What are the main causes of climate change?
Text response:  The main causes of climate change are human activities such as burning fossil fuels, deforestation, and agricultural practices. These activities release greenhouse gases, like carbon dioxide and Methane, into the atmosphere which trap heat and cause the Earth's temperature to rise.
Speech response is saved in output/wav/answer_0.wav
Response JSON is saved in output/responses.json

Automatic speech recognition (ASR) example

Please talk with SpeechGPT:
Recognize this speech, this is input: prompts/1.wav
Response:
   today is a sunny day.
Response JSON is saved in output/responses.json

Text-to-speech (TTS) example

Please talk with SpeechGPT:
Read this sentence aloud, this is input: Today is a sunny day.
Response:
   <sosp> <661> <987> <520> <982> <681> <982> <681> <982> <681> <982> <681> <982> <189> <63> <662> <79> <868> <220> <196> <166> <549> <822> <89> <194> <633> <14> <855> <183> <609> <389> <771> <865> <641> <124> <362> <734> <742> <98> <519> <26> <204> <280> <668> <167> <104> <650> <179> <961> <428> <950> <82> <165> <196> <166> <549> <822> <89> <194> <458> <726> <603> <819> <651> <133> <651> <133> <186> <133> <186> <133> <186> <511> <186> <511> <eosp> 
Speech response is saved in output/wav/answer_1.wav
Response JSON is saved in output/responses.json
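
The TTS response above is a sequence of discrete speech units wrapped in <sosp> and <eosp>. If you want to inspect or post-process such a response, the unit IDs can be extracted with a regular expression. The sketch below is a hedged illustration; parse_units is a hypothetical helper, not a function from the SpeechGPT codebase.

```python
import re

# Matches tokens such as <661>; <sosp> and <eosp> contain no digits and are skipped.
UNIT_PATTERN = re.compile(r"<(\d+)>")


def parse_units(response):
    """Extract the integer unit IDs found between <sosp> and <eosp>."""
    start = response.find("<sosp>")
    end = response.find("<eosp>")
    if start == -1 or end == -1 or end < start:
        raise ValueError("response does not contain a <sosp> ... <eosp> span")
    return [int(unit) for unit in UNIT_PATTERN.findall(response[start:end])]


# First four units of the TTS response shown above.
sample = "<sosp> <661> <987> <520> <982> <eosp>"
print(parse_units(sample))  # [661, 987, 520, 982]
```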

Gradio Web UI

python3 speechgpt/src/infer/web_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output/"

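📚 Detailed Documentation

Open-source list

Models

SpeechGPT-7B-ma: The model obtained after stage 1, modality-adaptation pre-training. It is initialized from LLaMA-7B and further pre-trained on LibriLight speech units.
SpeechGPT-7B-cm: The model obtained after stage 2, cross-modal instruction fine-tuning. It is initialized from SpeechGPT-7B-ma and further fine-tuned on the SpeechInstruct cross-modal instruction set. It is a strong foundation model that aligns speech and text.
SpeechGPT-7B-com: The model obtained after stage 3, chain-of-modality instruction fine-tuning. It is initialized from SpeechGPT-7B-cm and LoRA fine-tuned on the SpeechInstruct chain-of-modality instruction set. It is an adapter for SpeechGPT-7B-cm intended for spoken dialogue.

Datasets

SpeechInstruct-cross-modal: The cross-modal instruction set, about 9 million unit-text data pairs tokenized by mHuBERT from large-scale English automatic speech recognition (ASR) datasets.
SpeechInstruct-chain-of-modality: Chain-of-thought-style instructions in four input-output formats: speech instruction to speech response, speech instruction to text response, text instruction to speech response, and text instruction to text response.

SpeechInstruct-cross-modal data format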

[
    {
        "prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University.  SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
        "plain_text": "[Human]: Try to speak out this sentence, please. This is input: The alchemist rode in front, with the falcon on his shoulder.<eoh> [SpeechGPT]: <sosp><661><588><604><157><596><499><596><106><596><189><63><189><665><991><162><202><393><946><327><905><907><597><660><351><557><794><788><59><754><12><977><877><333><873><835><67><940><118><686><613><169><72><644><553><535><935><101><741><384><173><894><787><380><787><196><555><721><944><250><56><812><222><915><143><390><479><330><435><647><246><650><816><325><506><686><208><613><417><755><193><411><452><111><735><6><735><63><665><644><991><535><271><333><196><918><29><202><393><946><734><390><479><330><776><167><761><907><597><660><351><557><794><75><788><15><366><896><627><168><654><659><177><183><609><710><187><493><361><470><821><59><56><198><912><742><840><431><531><76><668><576><803><791><380><660><325><801><549><366><377><164><309><584><605><193><71><39><eosp><eoa> "
    }
]
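
Each record in this format packs one human turn and one SpeechGPT reply into the plain_text field, delimited by [Human]:, <eoh>, [SpeechGPT]:, and <eoa>. The sketch below shows one way to separate the two turns. It is an illustration under assumptions: split_turns is a hypothetical helper, and the sample record is a shortened copy of the entry above (prefix and unit sequence truncated).

```python
import re

# Layout inferred from the sample record: "[Human]: ...<eoh> [SpeechGPT]: ...<eoa>".
TURN_PATTERN = re.compile(
    r"\[Human\]:\s*(?P<human>.*?)<eoh>\s*\[SpeechGPT\]:\s*(?P<reply>.*?)<eoa>",
    re.DOTALL,
)


def split_turns(plain_text):
    """Return (human_turn, speechgpt_reply) from a cross-modal plain_text field."""
    match = TURN_PATTERN.search(plain_text)
    if match is None:
        raise ValueError("plain_text does not follow the expected layout")
    return match.group("human").strip(), match.group("reply").strip()


# Shortened copy of the record above; the real unit sequence is much longer.
record = {
    "prefix": "You are an AI assistant whose name is SpeechGPT.\n",
    "plain_text": (
        "[Human]: Try to speak out this sentence, please. This is input: "
        "The alchemist rode in front, with the falcon on his shoulder.<eoh> "
        "[SpeechGPT]: <sosp><661><588><604><eosp><eoa> "
    ),
}

human, reply = split_turns(record["plain_text"])
print(human)
print(reply)
```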

SpeechInstruct-chain-of-modality data format

[
    {
        "prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University.  SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
        "plain_text": "[Human]: <sosp><661><987><511><732><951><997><111><982><189><63><665><991><535><101><741><173><945><944><503><641><124><565><734><870><290><978><833><238><761><907><430><901><185><403><557><244><583><788><663><969><896><627><143><515><663><969><660><691><251><412><260><41><740><677><253><380><382><268><506><876><417><755><16><819><80><651><80><651><80><987><588><eosp><eoh>. [SpeechGPT]: What is a bad term for poop?; [ta] A bad term for poop is excrement. It is usually used as a polite way to refer to fecal waste.; [ua] <sosp><497><63><264><644><710><823><565><577><154><331><384><173><945><29><244><326><583><728><576><663><969><896><627><143><38><515><663><24><382><251><676><412><260><41><740><677><253><382><268><876><233><878><609><389><771><865><641><124><878><609><423><384><879><487><219><522><589><337><126><119><663><748><12><671><877><377><385><902><819><619><842><419><997><829><111><666><42><277><63><665><644><389><771><685><437><641><124><258><436><139><340><11><59><518><56><948><86><258><436><139><340><347><376><940><118><944><878><173><641><124><362><734><179><961><931><878><609><423><384><879><219><522><866><337><243><935><101><741><822><89><194><630><86><555><105><79><868><220><156><824><998><870><390><422><330><776><663><969><523><105><79><799><220><357><390><479><422><330><776><485><165><86><501><119><716><205><521><787><935><101><741><89><194><664><835><67><940><118><613><417><755><902><415><772><497><eosp><eoa>."
    }
]
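
In the chain-of-modality sample, the SpeechGPT turn contains three parts separated by semicolons: a transcript of the spoken instruction, a part tagged [ta], and a part tagged [ua]. Reading [ta] as the text answer and [ua] as the unit (speech) answer is an interpretation of the sample, not something the page states. Under that assumption, a hypothetical helper such as parse_chain_of_modality could split the reply like this (the sample string is a shortened copy of the record above):

```python
import re

# Field names (transcript, text_answer, unit_answer) are this sketch's own labels.
COM_PATTERN = re.compile(
    r"\[SpeechGPT\]:\s*(?P<transcript>.*?);\s*\[ta\]\s*(?P<text_answer>.*?)"
    r";\s*\[ua\]\s*(?P<unit_answer>.*?)<eoa>",
    re.DOTALL,
)


def parse_chain_of_modality(plain_text):
    """Split the SpeechGPT turn into transcript, text answer, and unit answer."""
    match = COM_PATTERN.search(plain_text)
    if match is None:
        raise ValueError("plain_text does not follow the expected layout")
    return {name: value.strip() for name, value in match.groupdict().items()}


# Shortened copy of the record above; unit sequences are truncated.
sample = (
    "[Human]: <sosp><661><987><511><eosp><eoh>. "
    "[SpeechGPT]: What is a bad term for poop?; "
    "[ta] A bad term for poop is excrement. It is usually used as a polite way "
    "to refer to fecal waste.; [ua] <sosp><497><63><264><eosp><eoa>."
)

parts = parse_chain_of_modality(sample)
for name, value in parts.items():
    print(f"{name}: {value}")
```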

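Train SpeechGPT

Stage 1: Modality-adaptation pre-training

First, use mHuBERT to discretize the LibriLight dataset and obtain the discrete unit sequences used for stage 1 training. See Speech2unit for the data processing method.

Second, split the discrete units into a training set and a development set, and save them to data/stage1/train.txt and data/stage1/dev.txt in the following format, one sequence per line: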

<sosp><189><247><922><991><821><258><485><974><284><466><969><523><196><202><881><331><822><853><432><32><742><98><519><26><204><280><576><384><879><901><555><944><366><641><124><362><734><156><824><462><761><907><430><81><597><716><205><521><470><821><677><355><483><641><124><243><290><978><82><620><915><470><821><576><384><466><398><212><455><931><579><969><778><45><914><445><469><576><803><6><803><791><377><506><835><67><940><613><417><755><237><224><452><121><736><eosp>
<sosp><300><189><63><6><665><991><881><331><6><384><879><945><29><244><583><874><655><837><81><627><545><124><337><850><412><213><260><41><740><797><211><488><961><428><6><196><555><944><873><32><683><700><955><812><328><915><166><250><56><903><86><233><479><330><776><167><104><764><259><921><366><663><432><431><531><976><314><822><89><664><377><611><479><417><eosp>
<sosp><189><735><991><39><565><734><32><742><98><519><26><204><280><668><576><803><791><660><555><233><787><101><741><466><969><219><107><459><491><556><384><733><219><501><445><137><910><523><793><50><981><230><534><321><948><86><116><281><62><462><104><70><918><743><15><212><455><143><836><173><944><958><390><422><66><776><258><436><139><663><432><742><98><519><589><243><126><260><41><444><6><655><764><969><219><727><85><297><700><362><493><6><493><361><393><946><6><470><821><246><655><837><81><969><916><584><819><544><452><158><452><736><eosp>
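
The split into train.txt and dev.txt can be scripted. The sketch below is one possible approach, not the authors' procedure: the function name write_stage1_split, the 5% development fraction, and the fixed shuffle seed are all assumptions, since the page does not say how the split was made.

```python
import random
from pathlib import Path


def write_stage1_split(sequences, out_dir="data/stage1", dev_fraction=0.05, seed=0):
    """Shuffle unit sequences and write train.txt / dev.txt, one sequence per line.

    dev_fraction and seed are illustrative defaults, not values from the source.
    Returns (number of training lines, number of development lines).
    """
    cleaned = [s.strip() for s in sequences if s.strip()]
    for sequence in cleaned:
        if not (sequence.startswith("<sosp>") and sequence.endswith("<eosp>")):
            raise ValueError(f"not a <sosp>...<eosp> sequence: {sequence[:40]}")

    shuffled = cleaned[:]
    random.Random(seed).shuffle(shuffled)

    # Keep at least one development line when there is more than one sequence.
    n_dev = max(1, round(len(shuffled) * dev_fraction)) if len(shuffled) > 1 else 0
    dev, train = shuffled[:n_dev], shuffled[n_dev:]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "train.txt").write_text("".join(line + "\n" for line in train), encoding="utf-8")
    (out / "dev.txt").write_text("".join(line + "\n" for line in dev), encoding="utf-8")
    return len(train), len(dev)
```

For example, calling write_stage1_split(lines) on a list of unit strings produced by the speech-to-unit step writes both files under data/stage1/ and returns the two line counts.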

Third, download LLaMA 7B (HuggingFace format) to llama/hf/7B.

You can now start stage 1 training. For distributed training, you must set NNODE, NODE_RANK, MASTER_ADDR, and MASTER_PORT to the correct values.

bash scripts/ma_pretrain.sh ${NNODE} ${NODE_RANK} ${MASTER_ADDR} ${MASTER_PORT}
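
The four positional arguments are easy to get wrong when launching on several machines. The following sketch builds and sanity-checks the command before it is run. It rests on assumptions: build_train_command is a hypothetical helper, the single-node values (1 node, rank 0, 127.0.0.1, port 29500) are illustrative, and the checks follow the usual meaning of these names in distributed launchers rather than anything verified against the script itself.

```python
import shlex


def build_train_command(script, nnode, node_rank, master_addr, master_port):
    """Return the argv list for a training script after basic sanity checks."""
    if nnode < 1:
        raise ValueError("nnode must be at least 1")
    if not 0 <= node_rank < nnode:
        raise ValueError("node_rank must satisfy 0 <= node_rank < nnode")
    if not 0 < master_port < 65536:
        raise ValueError("master_port must be a valid TCP port")
    return ["bash", script, str(nnode), str(node_rank), master_addr, str(master_port)]


# Illustrative single-node values; adjust them for a real cluster.
command = build_train_command("scripts/ma_pretrain.sh", 1, 0, "127.0.0.1", 29500)
print(shlex.join(command))  # bash scripts/ma_pretrain.sh 1 0 127.0.0.1 29500
```

The returned list can be passed to subprocess.run once the values match your cluster. The same helper applies to the stage 2 and stage 3 scripts, which take the same four arguments.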

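Stage 2: Cross-modal instruction fine-tuning

Download the SpeechInstruct cross-modal instruction set to data/stage2/.

To skip stage 1 training, download SpeechGPT-7B-ma to output/stage1/.

You can now start stage 2 training. For distributed training, you must set NNODE, NODE_RANK, MASTER_ADDR, and MASTER_PORT to the correct values.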

bash scripts/cm_sft.sh ${NNODE} ${NODE_RANK} ${MASTER_ADDR} ${MASTER_PORT}

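Stage 3: Chain-of-modality instruction fine-tuning

Download the SpeechInstruct chain-of-modality instruction set to data/stage3/.

To skip stage 1 and stage 2 training, download SpeechGPT-7B-cm to output/stage2/.

You can now start stage 3 training. For distributed training, you must set NNODE, NODE_RANK, MASTER_ADDR, and MASTER_PORT to the correct values.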

bash scripts/com_sft.sh ${NNODE} ${NODE_RANK} ${MASTER_ADDR} ${MASTER_PORT}

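Fine-tune SpeechGPT

SpeechGPT-7B-cm is a foundation model with strong speech-text alignment. We encourage fine-tuning SpeechGPT starting from this model.

Step 1: Prepare your data in the format used by the SpeechInstruct cross-modal instruction set.

Step 2: Download SpeechGPT-7B-cm to your local machine.

Step 3: In scripts/cm_sft.sh, set METAROOT, DATAROOT, and OUTROOT to your own values, then run the script. For LoRA fine-tuning, set METAROOT, DATAROOT, and OUTROOT in scripts/com_sft.sh instead and run that script.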
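
For step 1, a record in the cross-modal format can be assembled from a text sentence and its unit IDs. The sketch below mirrors the layout of the SpeechInstruct-cross-modal sample on this page, with the prefix copied verbatim from it. The helper names units_to_string and make_tts_record are hypothetical, and the unit IDs in the usage line are placeholders rather than real mHuBERT output.

```python
import json

# Prefix copied verbatim from the SpeechInstruct-cross-modal sample record.
PREFIX = (
    "You are an AI assistant whose name is SpeechGPT.\n"
    "- SpeechGPT is a intrinsic cross-modal conversational language model that is "
    "developed by Fudan University.  SpeechGPT can understand and communicate "
    "fluently with human through speech or text chosen by the user.\n"
    "- It can perceive cross-modal inputs and generate cross-modal outputs.\n"
)


def units_to_string(units):
    """Render unit IDs as <sosp><id>...<eosp>, matching the sample record."""
    return "<sosp>" + "".join(f"<{unit}>" for unit in units) + "<eosp>"


def make_tts_record(instruction, text, units):
    """Build one text-to-speech style record in the cross-modal layout."""
    plain_text = (
        f"[Human]: {instruction} This is input: {text}<eoh> "
        f"[SpeechGPT]: {units_to_string(units)}<eoa> "
    )
    return {"prefix": PREFIX, "plain_text": plain_text}


# Placeholder unit IDs; real ones come from the speech-to-unit step.
records = [make_tts_record("Try to speak out this sentence, please.", "Hi.", [661, 588])]
print(json.dumps(records, indent=4))
```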

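🔧 Technical Details

SpeechGPT uses discrete speech representations to build SpeechInstruct, a large-scale cross-modal speech instruction dataset, and is trained with a three-stage strategy: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. This strategy enables the model to better handle multimodal tasks and follow multimodal human instructions.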

📄 License

No license information is explicitly stated for this project.

Acknowledgements

MOSS: We use moss-sft-002-data.
stanford_alpaca: The codebase we build on.

Citation

If you find SpeechGPT useful for your research and applications, please cite it using the following BibTeX entry:

@misc{zhang2023speechgpt,
      title={SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities}, 
      author={Dong Zhang and Shimin Li and Xin Zhang and Jun Zhan and Pengyu Wang and Yaqian Zhou and Xipeng Qiu},
      year={2023},
      eprint={2305.11000},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}