SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
SpeechGPT is a large language model with intrinsic cross-modal conversational abilities. It can perceive and generate multi-modal content according to human instructions. It is trained on a large-scale cross-modal speech instruction dataset using a three-stage training strategy, and it shows strong multi-modal instruction-following capabilities.
Quick Start
SpeechGPT demos are available on our project page. You can try its cross-modal instruction-following and spoken dialogue capabilities through these demos.
⨠Features
- Intrinsic Cross - Modal Conversational Abilities: SpeechGPT can perceive and generate multi - model content following human instructions, serving as a talking encyclopedia, personal assistant, chat partner, poet, psychologist, and educational assistant.
- Large - Scale Cross - Modal Speech Instruction Dataset: With discrete speech representations, the SpeechInstruct dataset is constructed, which is crucial for cross - modal training.
- Three - Stage Training Strategy: It includes modality - adaptation pre - training, cross - modal instruction fine - tuning, and chain - of - modality instruction fine - tuning, enabling the model to handle multiple modalities effectively.
Installation
Prerequisites
- Clone the repository:
git clone https://github.com/0nutation/SpeechGPT
cd SpeechGPT
- Create a conda environment:
conda create --name SpeechGPT python=3.8
conda activate SpeechGPT
- Install dependencies:
pip install -r requirements.txt
Download Models and Data
- Download SpeechGPT models: Download [SpeechGPT-7B-cm](https://huggingface.co/fnlp/SpeechGPT-7B-cm) and [SpeechGPT-7B-com](https://huggingface.co/fnlp/SpeechGPT-7B-com) locally.
- Download mHuBERT model:
s2u_dir="utils/speech2unit"
cd ${s2u_dir}
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3.pt
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin
- Download unit - vocoder:
vocoder_dir="utils/vocoder/"
cd ${vocoder_dir}
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -O config.json
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -O vocoder.pt
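Before running inference, it can help to confirm that the downloads landed where the later commands expect them. The following is a minimal sketch, not part of the SpeechGPT repository: the file names come from the `wget` commands above, while the helper `missing_files` and the assumption that both directories are relative to the repository root are illustrative.

```python
from pathlib import Path

# File names are taken from the download commands above. The directory
# layout (relative to the repository root) is an assumption.
EXPECTED_FILES = {
    "utils/speech2unit": [
        "mhubert_base_vp_en_es_fr_it3.pt",
        "mhubert_base_vp_en_es_fr_it3_L11_km1000.bin",
    ],
    "utils/vocoder": ["config.json", "vocoder.pt"],
}


def missing_files(repo_root="."):
    """Return the expected files that are not present under repo_root."""
    root = Path(repo_root)
    return [
        str(Path(directory) / name)
        for directory, names in EXPECTED_FILES.items()
        for name in names
        if not (root / directory / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for path in missing:
            print("  " + path)
    else:
        print("All model files found.")
```

Run it from the repository root. Note that the download commands change directory, so if they are run in a single shell session, return to the repository root between the two download steps.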
Usage Examples
CLI Inference
python3 speechgpt/src/infer/cli_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output"
Notes:
- For speech input, provide the path to the audio file. For ASR or TTS tasks, prefix the speech or text with `this is input:` to avoid incorrect recognition.
- The speech response will be saved as a `.wav` file, and detailed responses will be saved in a JSON file.
Textual dialogue example
Please talk with SpeechGPT:
Who is Lebron James?
Response:
Lebron James is an American professional basketball player for the Los Angeles Lakers of the National Basketball Association (NBA). He is considered one of the greatest basketball players of all time and is known for his athleticism, scoring ability, and leadership skills. He is a four-time NBA MVP, a 14-time NBA All-Star, a 13-time All-NBA selection, and a two-time Olympic gold medalist.
Response json is saved in output/responses.json
Spoken dialogue example
Please talk with SpeechGPT:
prompts/0.wav
Transcript: What are the main causes of climate change?
Text response: The main causes of climate change are human activities such as burning fossil fuels, deforestation, and agricultural practices. These activities release greenhouse gases, like carbon dioxide and Methane, into the atmosphere which trap heat and cause the Earth's temperature to rise.
Speech repsonse is saved in output/wav/answer_0.wav
Response json is saved in output/responses.json
ASR example
Please talk with SpeechGPT:
Recognize this speech, this is input: prompts/1.wav
Response:
today is a sunny day.
Response json is saved in output/responses.json
TTS example
Please talk with SpeechGPT:
Read this sentence aloud, this is input: Today is a sunny day.
Response:
<sosp> <661> <987> <520> <982> <681> <982> <681> <982> <681> <982> <681> <982> <189> <63> <662> <79> <868> <220> <196> <166> <549> <822> <89> <194> <633> <14> <855> <183> <609> <389> <771> <865> <641> <124> <362> <734> <742> <98> <519> <26> <204> <280> <668> <167> <104> <650> <179> <961> <428> <950> <82> <165> <196> <166> <549> <822> <89> <194> <458> <726> <603> <819> <651> <133> <651> <133> <186> <133> <186> <133> <186> <511> <186> <511> <eosp>
Speech repsonse is saved in output/wav/answer_1.wav
Response json is saved in output/responses.json
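The TTS response above is a sequence of discrete speech units wrapped in `<sosp>` and `<eosp>` markers. As a rough illustration of how such a string could be turned into integer unit IDs for inspection, here is a small sketch. The function name `parse_units` is hypothetical and is not part of the SpeechGPT code base.

```python
import re

# Matches unit tokens such as <661>; the markers <sosp> and <eosp> contain
# no digits, so they are not matched.
UNIT_PATTERN = re.compile(r"<(\d+)>")


def parse_units(response):
    """Extract the integer unit IDs found between <sosp> and <eosp>."""
    start = response.find("<sosp>")
    if start == -1:
        return []
    end = response.find("<eosp>", start)
    if end == -1:
        return []
    return [int(unit) for unit in UNIT_PATTERN.findall(response[start:end])]


print(parse_units("<sosp> <661> <987> <520> <eosp>"))  # [661, 987, 520]
```

The same function works on the space-free form used in the dataset examples, since it only looks for digit tokens between the two markers.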
Gradio Web UI
python3 speechgpt/src/infer/web_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output/"
Technical Details
Model Training
Stage 1: Modality-adaptation Pre-training
First, use mHuBERT to discretize the LibriLight dataset to obtain discrete unit sequences for stage1 training. Refer to the data processing methods in Speech2unit. Then, divide the discrete units into a training set and a development set, and save them in a specific form.
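The split into training and development sets could look like the sketch below. It assumes the discretized data is available as a list of unit-sequence strings, one per utterance. The function `split_units`, the 1% development fraction, and the fixed seed are illustrative choices, not values from the SpeechGPT repository, and the on-disk format expected by the training scripts is not covered here.

```python
import random


def split_units(sequences, dev_fraction=0.01, seed=0):
    """Shuffle unit sequences and split them into (train, dev) lists."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(sequences)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    dev_size = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]


if __name__ == "__main__":
    # Toy input standing in for real mHuBERT unit sequences.
    toy = ["<sosp><%d><eosp>" % i for i in range(200)]
    train, dev = split_units(toy)
    print(len(train), len(dev))
```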
Release
- [2023/9/15] Released SpeechGPT code and checkpoints and SpeechInstruct dataset.
- [2023/9/1] Proposed SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models. Released the code and checkpoints of SpeechTokenizer. Check out the paper, demo, and GitHub.
- [2023/5/18] Released SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. Proposed SpeechGPT, the first multi-modal LLM capable of perceiving and generating multi-modal content following multi-modal human instructions. Check out the paper and demo.
Open-source list
Models
- [SpeechGPT-7B-ma](https://huggingface.co/fnlp/SpeechGPT-7B-ma): The model obtained after the first-stage modality-adaptation pre-training, initialized with LLaMA-7B and further pre-trained on LibriLight speech units.
- [SpeechGPT-7B-cm](https://huggingface.co/fnlp/SpeechGPT-7B-cm): The model obtained after the second-stage cross-modal instruction fine-tuning, initialized with SpeechGPT-7B-ma and further fine-tuned on the SpeechInstruct Cross-Modal Instruction set. It is a powerful foundational model that aligns speech and text.
- [SpeechGPT-7B-com](https://huggingface.co/fnlp/SpeechGPT-7B-com): The model obtained after the third-stage chain-of-modality instruction fine-tuning, initialized with SpeechGPT-7B-cm and further fine-tuned with LoRA on the SpeechInstruct Chain-of-Modality Instruction set. It is an adapter model for SpeechGPT-7B-cm, intended for spoken dialogue.
Datasets
- SpeechInstruct-cross-modal: The cross-modal instruction set, about 9 million unit-text data pairs tokenized by mHuBERT from large-scale English ASR datasets.
- SpeechInstruct-chain-of-modality: The chain-of-thought style instructions for four input-output formats, namely Speech Instruction-Speech Response, Speech Instruction-Text Response, Text Instruction-Speech Response, and Text Instruction-Text Response.
SpeechInstruct-cross-modal data format:
[
{
"prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University. SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
"plain_text": "[Human]: Try to speak out this sentence, please. This is input: The alchemist rode in front, with the falcon on his shoulder.<eoh> [SpeechGPT]: <sosp><661><588><604><157><596><499><596><106><596><189><63><189><665><991><162><202><393><946><327><905><907><597><660><351><557><794><788><59><754><12><977><877><333><873><835><67><940><118><686><613><169><72><644><553><535><935><101><741><384><173><894><787><380><787><196><555><721><944><250><56><812><222><915><143><390><479><330><435><647><246><650><816><325><506><686><208><613><417><755><193><411><452><111><735><6><735><63><665><644><991><535><271><333><196><918><29><202><393><946><734><390><479><330><776><167><761><907><597><660><351><557><794><75><788><15><366><896><627><168><654><659><177><183><609><710><187><493><361><470><821><59><56><198><912><742><840><431><531><76><668><576><803><791><380><660><325><801><549><366><377><164><309><584><605><193><71><39><eosp><eoa> "
}
]
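In the record above, the `plain_text` field holds one human turn ending in `<eoh>` and one SpeechGPT turn ending in `<eoa>`. The sketch below shows one way to separate the two turns when inspecting the data. The helper `split_turns` is hypothetical, and the sample string is an abbreviated version of the record above with most units removed.

```python
def split_turns(plain_text):
    """Split a plain_text field into (human_turn, speechgpt_turn)."""
    human_part, marker, model_part = plain_text.partition("<eoh>")
    if not marker:
        raise ValueError("missing <eoh> marker")
    human = human_part.replace("[Human]:", "", 1).strip()
    # Keep only the text before the end-of-answer marker.
    model = model_part.split("<eoa>", 1)[0]
    # Some records put a period between <eoh> and the speaker tag.
    model = model.lstrip(" .").replace("[SpeechGPT]:", "", 1).strip()
    return human, model


# Abbreviated sample; the real record contains a much longer unit sequence.
sample = (
    "[Human]: Try to speak out this sentence, please. This is input: "
    "The alchemist rode in front, with the falcon on his shoulder.<eoh> "
    "[SpeechGPT]: <sosp><661><588><604><eosp><eoa> "
)
human, model = split_turns(sample)
print(human)
print(model)
```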
SpeechInstruct-chain-of-modality data format:
[
{
"prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University. SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
"plain_text": "[Human]: <sosp><661><987><511><732><951><997><111><982><189><63><665><991><535><101><741><173><945><944><503><641><124><565><734><870><290><978><833><238><761><907><430><901><185><403><557><244><583><788><663><969><896><627><143><515><663><969><660><691><251><412><260><41><740><677><253><380><382><268><506><876><417><755><16><819><80><651><80><651><80><987><588><eosp><eoh>. [SpeechGPT]: What is a bad term for poop?; [ta] A bad term for poop is excrement. It is usually used as a polite way to refer to fecal waste.; [ua] <sosp><497><63><264><644><710><823><565><577><154><331><384><173><945><29><244><326><583><728><576><663><969><896><627><143><38><515><663><24><382><251><676><412><260><41><740><677><253><382><268><876><233><878><609><389><771><865><641><124><878><609><423><384><879><487><219><522><589><337><126><119><663><748><12><671><877><377><385><902><819><619><842><419><997><829><111><666><42><277><63><665><644><389><771><685><437><641><124><258><436><139><340><11><59><518><56><948><86><258><436><139><340><347><376><940><118><944><878><173><641><124><362><734><179><961><931><878><609><423><384><879><219><522><866><337><243><935><101><741><822><89><194><630><86><555><105><79><868><220><156><824><998><870><390><422><330><776><663><969><523><105><79><799><220><357><390><479><422><330><776><485><165><86><501><119><716><205><521><787><935><101><741><89><194><664><835><67><940><118><613><417><755><902><415><772><497><eosp><eoa>."
}
]
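In the chain-of-modality record above, the SpeechGPT turn appears to contain three parts separated by `; [ta]` and `; [ua]`: the text of the spoken instruction, a text answer, and the answer as speech units. The sketch below splits a response along those markers. Reading `[ta]` and `[ua]` as text answer and unit answer is an interpretation of the example, the helper `parse_chain_of_modality` is hypothetical, and only the speech-in, speech-out format shown above is handled.

```python
def parse_chain_of_modality(plain_text):
    """Split the SpeechGPT turn into transcript, text answer and unit answer."""
    _, tag, response = plain_text.partition("[SpeechGPT]:")
    if not tag:
        raise ValueError("missing [SpeechGPT]: tag")
    # Drop the end-of-answer marker and anything after it.
    response = response.split("<eoa>", 1)[0]
    transcript, ta_marker, rest = response.partition("; [ta]")
    text_answer, ua_marker, unit_answer = rest.partition("; [ua]")
    if not ta_marker or not ua_marker:
        raise ValueError("expected both '; [ta]' and '; [ua]' markers")
    return {
        "transcript": transcript.strip(),
        "text_answer": text_answer.strip(),
        "unit_answer": unit_answer.strip(),
    }


# Abbreviated sample; the real record contains much longer unit sequences.
sample = (
    "[Human]: <sosp><661><987><eosp><eoh>. [SpeechGPT]: "
    "What are the main causes of climate change?; "
    "[ta] Mostly human activities.; [ua] <sosp><497><63><eosp><eoa>."
)
for key, value in parse_chain_of_modality(sample).items():
    print(key + ": " + value)
```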









