SpeechGPT-7B-cm Open-Source AI Model: A Cross-Modal Conversational Assistant for Speech and Text Interaction

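SpeechGPT-7B-cm

Developed by fnlp

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities. It can perceive and generate multimodal content and supports interaction through both speech and text.

Text-to-Audio

Transformers

#Cross-modal dialogue #Speech language model #Multimodal instruction following

Downloads: 47

Release date: 9/14/2023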

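Model Overview

SpeechGPT aligns speech and text by using discrete speech representations and a three-stage training strategy (modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning), which lets it handle a range of cross-modal tasks.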

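Model Features

Cross-modal conversational ability

Accepts both speech and text as input and produces both as output, enabling cross-modal interaction.

Three-stage training strategy

Trains the model in three successive stages: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.

Large-scale speech instruction dataset

Built on the SpeechInstruct dataset, which contains cross-modal instructions and chain-of-modality instructions.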

Model Capabilities

Speech recognition

Speech synthesis

Cross-modal dialogue

Text generation

Multimodal instruction following

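Use Cases

Personal assistant

Spoken question answering

Ask questions by voice to get informational answers

Provides accurate speech or text responses

Education

Language learning

Helps learners practice English listening and speaking

Provides spoken interaction and pronunciation feedback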

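🚀 SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multimodal content by following human instructions. Using discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. We then apply a three-stage training strategy: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. Experimental results show that SpeechGPT follows multimodal human instructions well, highlighting the potential of handling multiple modalities with a single model.

You can view SpeechGPT demos on the project page. As the demos show, SpeechGPT has strong cross-modal instruction-following and spoken dialogue abilities. It can act as a talking encyclopedia, a personal assistant, a chat partner, a poet, a psychologist, or an educational assistant.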

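🚀 Quick Start

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities that perceives and generates multimodal content by following human instructions. The sections below describe the model and explain how to install, run, and train it.

Figure: SpeechGPT's capabilities across multiple cross-modal tasks.

Figure: Left: the SpeechInstruct construction process. Right: the SpeechGPT model structure.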

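✨ Key Features

Intrinsic cross-modal conversational ability: perceives and generates multimodal content by following human instructions.
Large-scale cross-modal speech instruction dataset: the SpeechInstruct dataset was built for training the model.
Three-stage training strategy: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.
Strong instruction following and dialogue: can take on roles such as an encyclopedia or a personal assistant.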

📦 Installation

Clone the repository

git clone https://github.com/0nutation/SpeechGPT
cd SpeechGPT

Create a virtual environment

conda create --name SpeechGPT python=3.8
conda activate SpeechGPT

Install dependencies

pip install -r requirements.txt

Download models and data

Download the models

To talk with SpeechGPT, download SpeechGPT-7B-cm and SpeechGPT-7B-com to your local machine.
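One way to fetch both checkpoints is with the `huggingface_hub` library. The sketch below is a minimal example and not part of the project's own tooling: the repo IDs are assumptions inferred from the developer name (fnlp) and the model names on this page, and `download_models` and the `models/` target directory are hypothetical names chosen for illustration. Verify the repo IDs before running it.

```python
from pathlib import Path

# Assumed Hugging Face repo IDs, inferred from the developer name "fnlp"
# and the model names on this page; verify them before use.
MODEL_REPOS = {
    "SpeechGPT-7B-cm": "fnlp/SpeechGPT-7B-cm",
    "SpeechGPT-7B-com": "fnlp/SpeechGPT-7B-com",
}


def download_models(target_root="models"):
    """Download both checkpoints and return {model name: local directory}."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dirs = {}
    for name, repo_id in MODEL_REPOS.items():
        local_dir = Path(target_root) / name
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
        local_dirs[name] = str(local_dir)
    return local_dirs
```

Calling `download_models()` returns the local directories, which can then be passed as the model path and the LoRA weights path when running inference.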

Download the mHuBERT model

s2u_dir="utils/speech2unit"
cd "${s2u_dir}"
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3.pt
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin
cd -  # return to the repository root

Download the unit vocoder

vocoder_dir="utils/vocoder/"
cd "${vocoder_dir}"
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -O config.json
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -O vocoder.pt
cd -  # return to the repository root
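Before running inference, it can help to confirm that the four downloaded files are where the later commands expect them. The following is a small sketch of such a check. It assumes both directories live under `utils/` in the repository root, and `missing_assets` is a hypothetical helper, not something the project provides.

```python
from pathlib import Path

# Files the download steps are expected to produce, relative to the repo root.
EXPECTED_ASSETS = [
    "utils/speech2unit/mhubert_base_vp_en_es_fr_it3.pt",
    "utils/speech2unit/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin",
    "utils/vocoder/config.json",
    "utils/vocoder/vocoder.pt",
]


def missing_assets(repo_root="."):
    """Return the expected asset paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_ASSETS if not (root / rel).is_file()]
```

Running `missing_assets()` from the repository root returns an empty list once every file is in place.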

💻 Usage Examples

CLI inference

python3 speechgpt/src/infer/cli_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output"

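Notes: For speech input, provide the path to the audio file. For automatic speech recognition (ASR) or text-to-speech (TTS) tasks, you must prefix the speech or text with "this is input: "; otherwise the input may be misrecognized.

Speech responses are saved as .wav files, and detailed responses are saved to a JSON file. The paths to these files are shown in the response.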
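The prefix rule can be wrapped in a small helper so that ASR and TTS prompts are always well formed. This is an illustrative sketch only: `build_task_prompt` is a hypothetical function, and the instruction wording is taken from the ASR and TTS examples on this page.

```python
INPUT_MARKER = "this is input: "  # prefix required for ASR and TTS prompts


def build_task_prompt(instruction, payload):
    """Join an instruction and its input (audio path or text) with the marker."""
    return f"{instruction.rstrip(' ,.')}, {INPUT_MARKER}{payload}"


asr_prompt = build_task_prompt("Recognize this speech", "prompts/1.wav")
tts_prompt = build_task_prompt("Read this sentence aloud", "Today is a sunny day.")
print(asr_prompt)  # Recognize this speech, this is input: prompts/1.wav
print(tts_prompt)  # Read this sentence aloud, this is input: Today is a sunny day.
```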

Text dialogue example

Please talk with SpeechGPT:
Who is Lebron James?
Response:
   Lebron James is an American professional basketball player for the Los Angeles Lakers of the National Basketball Association (NBA). He is considered one of the greatest basketball players of all time and is known for his athleticism, scoring ability, and leadership skills. He is a four-time NBA MVP, a 14-time NBA All-Star, a 13-time All-NBA selection, and a two-time Olympic gold medalist.
Response JSON is saved in output/responses.json

Spoken dialogue example

Please talk with SpeechGPT:
prompts/0.wav
Transcript:   What are the main causes of climate change?
Text response:  The main causes of climate change are human activities such as burning fossil fuels, deforestation, and agricultural practices. These activities release greenhouse gases, like carbon dioxide and Methane, into the atmosphere which trap heat and cause the Earth's temperature to rise.
Speech response is saved in output/wav/answer_0.wav
Response JSON is saved in output/responses.json

Automatic speech recognition (ASR) example

Please talk with SpeechGPT:
Recognize this speech, this is input: prompts/1.wav
Response:
   today is a sunny day.
Response JSON is saved in output/responses.json

Text-to-speech (TTS) example

Please talk with SpeechGPT:
Read this sentence aloud, this is input: Today is a sunny day.
Response:
   <sosp> <661> <987> <520> <982> <681> <982> <681> <982> <681> <982> <681> <982> <189> <63> <662> <79> <868> <220> <196> <166> <549> <822> <89> <194> <633> <14> <855> <183> <609> <389> <771> <865> <641> <124> <362> <734> <742> <98> <519> <26> <204> <280> <668> <167> <104> <650> <179> <961> <428> <950> <82> <165> <196> <166> <549> <822> <89> <194> <458> <726> <603> <819> <651> <133> <651> <133> <186> <133> <186> <133> <186> <511> <186> <511> <eosp> 
Speech response is saved in output/wav/answer_1.wav
Response JSON is saved in output/responses.json
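The TTS response is a sequence of discrete speech units wrapped in `<sosp>` and `<eosp>` markers, which the vocoder turns into audio. To inspect such a response programmatically, the units can be pulled out as integers. The parser below is a minimal sketch; `parse_units` is a hypothetical helper, and the short input is a truncated copy of the response shown in the TTS example.

```python
import re

UNIT_PATTERN = re.compile(r"<(\d+)>")


def parse_units(response):
    """Extract the integer unit IDs between <sosp> and <eosp>."""
    start = response.find("<sosp>")
    end = response.find("<eosp>")
    if start == -1 or end == -1 or end < start:
        raise ValueError("response does not contain a <sosp>...<eosp> span")
    return [int(u) for u in UNIT_PATTERN.findall(response[start:end])]


units = parse_units("<sosp> <661> <987> <520> <982> <eosp>")
print(units)  # [661, 987, 520, 982]
```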

Gradio web UI

python3 speechgpt/src/infer/web_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output/"

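📚 Documentation

Open-source list

Models

SpeechGPT-7B-ma: the model obtained after stage 1, modality-adaptation pre-training. It is initialized from LLaMA-7B and further pre-trained on LibriLight speech units.
SpeechGPT-7B-cm: the model obtained after stage 2, cross-modal instruction fine-tuning. It is initialized from SpeechGPT-7B-ma and further fine-tuned on the SpeechInstruct cross-modal instruction set. It is a strong foundation model that aligns speech and text.
SpeechGPT-7B-com: the model obtained after stage 3, chain-of-modality instruction fine-tuning. It is initialized from SpeechGPT-7B-cm and fine-tuned with LoRA on the SpeechInstruct chain-of-modality instruction set. It is an adapter model that equips SpeechGPT-7B-cm for spoken dialogue.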

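Datasets

SpeechInstruct-cross-modal: the cross-modal instruction set, about 9 million unit-text data pairs tokenized by mHuBERT from large-scale English ASR datasets.
SpeechInstruct-chain-of-modality: chain-of-thought style instructions in four input-output formats: speech instruction to speech response, speech instruction to text response, text instruction to speech response, and text instruction to text response.

SpeechInstruct-cross-modal data format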

[
    {
        "prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University.  SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
        "plain_text": "[Human]: Try to speak out this sentence, please. This is input: The alchemist rode in front, with the falcon on his shoulder.<eoh> [SpeechGPT]: <sosp><661><588><604><157><596><499><596><106><596><189><63><189><665><991><162><202><393><946><327><905><907><597><660><351><557><794><788><59><754><12><977><877><333><873><835><67><940><118><686><613><169><72><644><553><535><935><101><741><384><173><894><787><380><787><196><555><721><944><250><56><812><222><915><143><390><479><330><435><647><246><650><816><325><506><686><208><613><417><755><193><411><452><111><735><6><735><63><665><644><991><535><271><333><196><918><29><202><393><946><734><390><479><330><776><167><761><907><597><660><351><557><794><75><788><15><366><896><627><168><654><659><177><183><609><710><187><493><361><470><821><59><56><198><912><742><840><431><531><76><668><576><803><791><380><660><325><801><549><366><377><164><309><584><605><193><71><39><eosp><eoa> "
    }
]
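Each cross-modal record stores the whole exchange in one `plain_text` string, with `<eoh>` closing the human turn and `<eoa>` closing the model turn. The sketch below shows one way to split that string back into its two turns. `split_cross_modal` is a hypothetical helper, the regular expression assumes every record follows the layout of the sample above, and the record used here is a shortened, made-up example.

```python
import json
import re


def split_cross_modal(plain_text):
    """Split a cross-modal `plain_text` into (human turn, model turn)."""
    match = re.fullmatch(
        r"\[Human\]: (?P<human>.*?)<eoh> \[SpeechGPT\]: (?P<reply>.*?)<eoa>\s*",
        plain_text,
        flags=re.DOTALL,
    )
    if match is None:
        raise ValueError("unexpected plain_text layout")
    return match.group("human"), match.group("reply")


# Shortened, made-up record in the same layout as the sample above.
record = json.loads(
    '{"prefix": "You are an AI assistant whose name is SpeechGPT.\\n",'
    ' "plain_text": "[Human]: Try to speak out this sentence, please.'
    ' This is input: Hello.<eoh> [SpeechGPT]: <sosp><661><588><eosp><eoa> "}'
)
human, reply = split_cross_modal(record["plain_text"])
print(human)  # Try to speak out this sentence, please. This is input: Hello.
print(reply)  # <sosp><661><588><eosp>
```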

SpeechInstruct-chain-of-modality data format

[
    {
        "prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University.  SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
        "plain_text": "[Human]: <sosp><661><987><511><732><951><997><111><982><189><63><665><991><535><101><741><173><945><944><503><641><124><565><734><870><290><978><833><238><761><907><430><901><185><403><557><244><583><788><663><969><896><627><143><515><663><969><660><691><251><412><260><41><740><677><253><380><382><268><506><876><417><755><16><819><80><651><80><651><80><987><588><eosp><eoh>. [SpeechGPT]: What is a bad term for poop?; [ta] A bad term for poop is excrement. It is usually used as a polite way to refer to fecal waste.; [ua] <sosp><497><63><264><644><710><823><565><577><154><331><384><173><945><29><244><326><583><728><576><663><969><896><627><143><38><515><663><24><382><251><676><412><260><41><740><677><253><382><268><876><233><878><609><389><771><865><641><124><878><609><423><384><879><487><219><522><589><337><126><119><663><748><12><671><877><377><385><902><819><619><842><419><997><829><111><666><42><277><63><665><644><389><771><685><437><641><124><258><436><139><340><11><59><518><56><948><86><258><436><139><340><347><376><940><118><944><878><173><641><124><362><734><179><961><931><878><609><423><384><879><219><522><866><337><243><935><101><741><822><89><194><630><86><555><105><79><868><220><156><824><998><870><390><422><330><776><663><969><523><105><79><799><220><357><390><479><422><330><776><485><165><86><501><119><716><205><521><787><935><101><741><89><194><664><835><67><940><118><613><417><755><902><415><772><497><eosp><eoa>."
    }
]
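In the sample above, the model turn chains three parts: a transcript of the spoken instruction, a text answer after `[ta]`, and a unit answer after `[ua]`. The parser below is a sketch that separates them. Reading `[ta]` as "text answer" and `[ua]` as "unit answer" is an interpretation of the sample rather than something this page states, `parse_chain_reply` is a hypothetical helper, and the pattern covers only the speech-instruction, speech-response layout shown here. The other three formats may need different patterns.

```python
import re

CHAIN_PATTERN = re.compile(
    r"\[SpeechGPT\]: (?P<transcript>.*?); \[ta\] (?P<text_answer>.*?); "
    r"\[ua\] (?P<unit_answer><sosp>.*?<eosp>)<eoa>",
    flags=re.DOTALL,
)


def parse_chain_reply(plain_text):
    """Return the transcript, text answer, and unit answer of one record."""
    match = CHAIN_PATTERN.search(plain_text)
    if match is None:
        raise ValueError("no chain-of-modality reply found")
    return match.groupdict()


# Shortened, made-up record in the same layout as the sample above.
sample = (
    "[Human]: <sosp><661><987><eosp><eoh>. "
    "[SpeechGPT]: What is the capital of France?; "
    "[ta] The capital of France is Paris.; "
    "[ua] <sosp><497><63><264><eosp><eoa>."
)
parts = parse_chain_reply(sample)
print(parts["transcript"])   # What is the capital of France?
print(parts["text_answer"])  # The capital of France is Paris.
print(parts["unit_answer"])  # <sosp><497><63><264><eosp>
```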

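Train SpeechGPT

Stage 1: Modality-adaptation pre-training

First, discretize the LibriLight dataset with mHuBERT to obtain the discrete unit sequences used for stage-1 training. You can refer to the data processing methods in Speech2unit.

Second, split the discrete units into a training set and a development set, and save them to data/stage1/train.txt and data/stage1/dev.txt in the following format: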

<sosp><189><247><922><991><821><258><485><974><284><466><969><523><196><202><881><331><822><853><432><32><742><98><519><26><204><280><576><384><879><901><555><944><366><641><124><362><734><156><824><462><761><907><430><81><597><716><205><521><470><821><677><355><483><641><124><243><290><978><82><620><915><470><821><576><384><466><398><212><455><931><579><969><778><45><914><445><469><576><803><6><803><791><377><506><835><67><940><613><417><755><237><224><452><121><736><eosp>
<sosp><300><189><63><6><665><991><881><331><6><384><879><945><29><244><583><874><655><837><81><627><545><124><337><850><412><213><260><41><740><797><211><488><961><428><6><196><555><944><873><32><683><700><955><812><328><915><166><250><56><903><86><233><479><330><776><167><104><764><259><921><366><663><432><431><531><976><314><822><89><664><377><611><479><417><eosp>
<sosp><189><735><991><39><565><734><32><742><98><519><26><204><280><668><576><803><791><660><555><233><787><101><741><466><969><219><107><459><491><556><384><733><219><501><445><137><910><523><793><50><981><230><534><321><948><86><116><281><62><462><104><70><918><743><15><212><455><143><836><173><944><958><390><422><66><776><258><436><139><663><432><742><98><519><589><243><126><260><41><444><6><655><764><969><219><727><85><297><700><362><493><6><493><361><393><946><6><470><821><246><655><837><81><969><916><584><819><544><452><158><452><736><eosp>
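Assuming the unit sequences are already available as lists of integers, the split and the one-sequence-per-line format shown above could be produced as follows. This is a sketch, not the project's own preprocessing: `format_units` and `write_split` are hypothetical helpers, and the 1% development ratio and fixed shuffle seed are arbitrary choices, since this page does not specify how to split.

```python
import random
from pathlib import Path


def format_units(units):
    """Render unit IDs as one stage-1 training line."""
    return "<sosp>" + "".join(f"<{u}>" for u in units) + "<eosp>"


def write_split(sequences, out_dir="data/stage1", dev_ratio=0.01, seed=0):
    """Shuffle sequences, hold out a dev set, and write train.txt and dev.txt."""
    lines = [format_units(seq) for seq in sequences]
    random.Random(seed).shuffle(lines)
    # Keep at least one dev line when there are two or more sequences.
    n_dev = max(1, int(len(lines) * dev_ratio)) if len(lines) > 1 else 0
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    def dump(name, rows):
        (out / name).write_text("".join(row + "\n" for row in rows), encoding="utf-8")

    dump("dev.txt", lines[:n_dev])
    dump("train.txt", lines[n_dev:])
    return len(lines) - n_dev, n_dev


print(format_units([189, 247, 922]))  # <sosp><189><247><922><eosp>
```

`write_split(sequences)` returns the number of training and development lines it wrote.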

Third, download LLaMA 7B (HuggingFace) to llama/hf/7B.

You can now start stage-1 training. For distributed training, you must specify the correct values for NNODE, NODE_RANK, MASTER_ADDR, and MASTER_PORT.

bash scripts/ma_pretrain.sh ${NNODE} ${NODE_RANK} ${MASTER_ADDR} ${MASTER_PORT}
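The training scripts take their four arguments positionally, so it is easy to pass them in the wrong order. The sketch below builds the argument list explicitly. `stage_command` and `run_stage` are hypothetical helpers, and the single-node defaults (1 node, rank 0, 127.0.0.1, port 29500) are assumptions that follow common PyTorch distributed conventions rather than values documented by the project.

```python
import subprocess


def stage_command(script, nnode=1, node_rank=0,
                  master_addr="127.0.0.1", master_port=29500):
    """Build the argument list for one of the training scripts."""
    if not 0 <= node_rank < nnode:
        raise ValueError("node_rank must be in [0, nnode)")
    return ["bash", script, str(nnode), str(node_rank), master_addr, str(master_port)]


def run_stage(script, **kwargs):
    """Launch a training script from the repository root and wait for it."""
    subprocess.run(stage_command(script, **kwargs), check=True)


cmd = stage_command("scripts/ma_pretrain.sh")
print(" ".join(cmd))  # bash scripts/ma_pretrain.sh 1 0 127.0.0.1 29500
```

For a multi-node run, call `stage_command` once per node with the same `nnode`, `master_addr`, and `master_port`, and a distinct `node_rank` on each node.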

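Stage 2: Cross-modal instruction fine-tuning

Download the SpeechInstruct cross-modal instruction set to data/stage2/.

To skip stage-1 training, download SpeechGPT-7B-ma to output/stage1/.

You can now start stage-2 training. For distributed training, you must specify the correct values for NNODE, NODE_RANK, MASTER_ADDR, and MASTER_PORT.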

bash scripts/cm_sft.sh ${NNODE} ${NODE_RANK} ${MASTER_ADDR} ${MASTER_PORT}

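Stage 3: Chain-of-modality instruction fine-tuning

Download the SpeechInstruct chain-of-modality instruction set to data/stage3/.

To skip stage-1 and stage-2 training, download SpeechGPT-7B-cm to output/stage2/.

You can now start stage-3 training. For distributed training, you must specify the correct values for NNODE, NODE_RANK, MASTER_ADDR, and MASTER_PORT.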

bash scripts/com_sft.sh ${NNODE} ${NODE_RANK} ${MASTER_ADDR} ${MASTER_PORT}

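Fine-tune SpeechGPT

SpeechGPT-7B-cm is a foundation model with strong speech-text alignment. We encourage fine-tuning SpeechGPT from this model.

Step 1: Prepare your data in the format used by the SpeechInstruct cross-modal instruction set.

Step 2: Download SpeechGPT-7B-cm to your local machine.

Step 3: Set the METAROOT, DATAROOT, and OUTROOT parameters in scripts/cm_sft.sh to your own values, then run the script. For LoRA fine-tuning, set the METAROOT, DATAROOT, and OUTROOT parameters in scripts/com_sft.sh and run that script instead.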
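For Step 1, a record in the cross-modal format pairs the shared system prefix with a single `plain_text` string. The sketch below assembles one text-to-speech style record. `make_tts_record` is a hypothetical helper, the prefix is copied verbatim from the SpeechInstruct samples on this page, and the unit IDs are placeholders rather than the output of a real mHuBERT encoding.

```python
import json

# System prefix copied verbatim from the SpeechInstruct samples.
PREFIX = (
    "You are an AI assistant whose name is SpeechGPT.\n"
    "- SpeechGPT is a intrinsic cross-modal conversational language model that is "
    "developed by Fudan University.  SpeechGPT can understand and communicate "
    "fluently with human through speech or text chosen by the user.\n"
    "- It can perceive cross-modal inputs and generate cross-modal outputs.\n"
)


def make_tts_record(text, units, instruction="Try to speak out this sentence, please."):
    """Build one cross-modal record mapping a sentence to its speech units."""
    unit_str = "<sosp>" + "".join(f"<{u}>" for u in units) + "<eosp>"
    plain_text = (
        f"[Human]: {instruction} This is input: {text}<eoh> "
        f"[SpeechGPT]: {unit_str}<eoa> "
    )
    return {"prefix": PREFIX, "plain_text": plain_text}


# Placeholder unit IDs; real ones come from discretizing audio with mHuBERT.
records = [make_tts_record("Today is a sunny day.", [661, 987, 520])]
print(json.dumps(records, indent=4, ensure_ascii=False))
```

Writing `records` to a JSON file gives a dataset in the same list-of-objects layout as the sample.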

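🔧 Technical Details

SpeechGPT uses discrete speech representations to build SpeechInstruct, a large-scale cross-modal speech instruction dataset, and is trained with a three-stage strategy: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. This strategy helps the model handle multimodal tasks and follow multimodal human instructions.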

📄 License

This project does not explicitly state its license.

Acknowledgements

MOSS: we use moss-sft-002-data.
stanford_alpaca: the codebase we built upon.

Citation

If you find SpeechGPT useful for your research and applications, please cite it using the following BibTeX:

@misc{zhang2023speechgpt,
      title={SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities}, 
      author={Dong Zhang and Shimin Li and Xin Zhang and Jun Zhan and Pengyu Wang and Yaqian Zhou and Xipeng Qiu},
      year={2023},
      eprint={2305.11000},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}