🚀 SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
SpeechGPT is a large language model with intrinsic cross-modal conversational abilities: it can perceive and generate multi-modal content according to human instructions.
🚀 Quick Start
SpeechGPT demos are available on our project page. As the demos show, SpeechGPT has strong cross-modal instruction-following and spoken dialogue abilities. It can act as a talking encyclopedia, a personal assistant, a chat partner, a poet, a psychologist, or an educational assistant.
✨ Features
SpeechGPT is a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content following human instructions.
- Dataset Construction: Using discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset.
- Training Strategy: We employ a three-stage training strategy consisting of modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.
- Experimental Results: The experimental results show that SpeechGPT can follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model.
SpeechGPT's capabilities to tackle multiple cross-modal tasks
Left: SpeechInstruct construction process. Right: SpeechGPT model structure
📦 Installation
git clone https://github.com/0nutation/SpeechGPT
cd SpeechGPT
conda create --name SpeechGPT python=3.8
conda activate SpeechGPT
pip install -r requirements.txt
💻 Usage Examples
Download
To talk with SpeechGPT, download [SpeechGPT-7B-cm](https://huggingface.co/fnlp/SpeechGPT-7B-cm) and [SpeechGPT-7B-com](https://huggingface.co/fnlp/SpeechGPT-7B-com) locally.
Download the mHuBERT model to utils/speech2unit/. Please see Speech2unit for details.
s2u_dir="utils/speech2unit"
cd ${s2u_dir}
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3.pt
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin
cd -  # return to the repository root so later relative paths resolve
Download the unit-vocoder to utils/vocoder/. Please see vocoder for details.
vocoder_dir="utils/vocoder/"
cd ${vocoder_dir}
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -O config.json
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -O vocoder.pt
cd -  # return to the repository root before running inference
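Before starting inference, it can help to confirm that the downloads above landed where the commands put them. The following is a minimal sketch, not part of the SpeechGPT repository: `EXPECTED_FILES` and `missing_files` are hypothetical names, and the file list is taken only from the download commands above.

```python
from pathlib import Path

# File names copied from the wget commands above; the helper itself is
# a hypothetical convenience, not part of the SpeechGPT code base.
EXPECTED_FILES = {
    "utils/speech2unit": [
        "mhubert_base_vp_en_es_fr_it3.pt",
        "mhubert_base_vp_en_es_fr_it3_L11_km1000.bin",
    ],
    "utils/vocoder": ["config.json", "vocoder.pt"],
}


def missing_files(repo_root="."):
    """Return the expected files that are not present under repo_root."""
    root = Path(repo_root)
    return [
        str(Path(directory) / name)
        for directory, names in EXPECTED_FILES.items()
        for name in names
        if not (root / directory / name).is_file()
    ]


# Run from the repository root; prints the missing paths, if any.
print(missing_files() or "All expected files are present.")
```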
CLI Inference
python3 speechgpt/src/infer/cli_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output"
Notes
For speech input, provide the path to the audio file. For ASR or TTS tasks, you must prefix the speech or text with "this is input: "; otherwise, the input may be recognized incorrectly.
The speech response is saved to a .wav file, and detailed responses are saved to a JSON file. The paths to these files are indicated in the response.
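The prefix rule above can be captured in a small helper when prompts are generated programmatically. This is an illustrative sketch only: `build_prompt` and `INPUT_MARKER` are hypothetical names, and the comma-separated layout simply mirrors the ASR and TTS examples in this README.

```python
# Marker text taken from the note above; everything else is illustrative.
INPUT_MARKER = "this is input: "


def build_prompt(instruction, payload):
    """Join an instruction and its input (text or an audio path) with the marker."""
    return f"{instruction.rstrip(' ,.')}, {INPUT_MARKER}{payload}"


print(build_prompt("Recognize this speech", "prompts/1.wav"))
# Recognize this speech, this is input: prompts/1.wav
```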
Here are some examples of talking with SpeechGPT:
Basic Usage
Textual dialogue example
Please talk with SpeechGPT:
Who is Lebron James?
Response:
Lebron James is an American professional basketball player for the Los Angeles Lakers of the National Basketball Association (NBA). He is considered one of the greatest basketball players of all time and is known for his athleticism, scoring ability, and leadership skills. He is a four-time NBA MVP, a 14-time NBA All-Star, a 13-time All-NBA selection, and a two-time Olympic gold medalist.
Response json is saved in output/responses.json
Spoken dialogue example
Please talk with SpeechGPT:
prompts/0.wav
Transcript: What are the main causes of climate change?
Text response: The main causes of climate change are human activities such as burning fossil fuels, deforestation, and agricultural practices. These activities release greenhouse gases, like carbon dioxide and Methane, into the atmosphere which trap heat and cause the Earth's temperature to rise.
Speech repsonse is saved in output/wav/answer_0.wav
Response json is saved in output/responses.json
Advanced Usage
ASR example
Please talk with SpeechGPT:
Recognize this speech, this is input: prompts/1.wav
Response:
today is a sunny day.
Response json is saved in output/responses.json
TTS example
Please talk with SpeechGPT:
Read this sentence aloud, this is input: Today is a sunny day.
Response:
<sosp> <661> <987> <520> <982> <681> <982> <681> <982> <681> <982> <681> <982> <189> <63> <662> <79> <868> <220> <196> <166> <549> <822> <89> <194> <633> <14> <855> <183> <609> <389> <771> <865> <641> <124> <362> <734> <742> <98> <519> <26> <204> <280> <668> <167> <104> <650> <179> <961> <428> <950> <82> <165> <196> <166> <549> <822> <89> <194> <458> <726> <603> <819> <651> <133> <651> <133> <186> <133> <186> <133> <186> <511> <186> <511> <eosp>
Speech repsonse is saved in output/wav/answer_1.wav
Response json is saved in output/responses.json
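The TTS response above is a sequence of discrete unit tokens wrapped in `<sosp>` and `<eosp>`. If you want the unit IDs as integers (for example, to inspect them or to pass them to your own vocoder code), a minimal parsing sketch could look like this. `parse_units` is a hypothetical helper and assumes only the token layout shown in the example.

```python
import re

# Matches tokens such as <661>; <sosp> and <eosp> contain no digits and are skipped.
UNIT_PATTERN = re.compile(r"<(\d+)>")


def parse_units(response):
    """Extract the integer unit IDs found between <sosp> and <eosp>."""
    start = response.find("<sosp>")
    end = response.find("<eosp>")
    if start == -1 or end == -1 or end < start:
        raise ValueError("response does not contain a <sosp> ... <eosp> span")
    return [int(unit) for unit in UNIT_PATTERN.findall(response[start:end])]


print(parse_units("<sosp> <661> <987> <520> <eosp>"))  # [661, 987, 520]
```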
Gradio Web UI
python3 speechgpt/src/infer/web_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output/"
📚 Documentation
Release
- [2023/9/15] We released the SpeechGPT code and checkpoints and the SpeechInstruct dataset.
- [2023/9/1] We proposed SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models. We released the code and checkpoints of SpeechTokenizer. Check out the paper, demo, and GitHub repository.
- [2023/5/18] We released SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. We propose SpeechGPT, the first multi-modal LLM capable of perceiving and generating multi-modal content following multi-modal human instructions. Check out the paper and demo.
Open-source list
Models
- [SpeechGPT-7B-ma](https://huggingface.co/fnlp/SpeechGPT-7B-ma): The model obtained after the first-stage modality-adaptation pre-training. It was initialized with LLaMA-7B and further pre-trained on LibriLight speech units.
- [SpeechGPT-7B-cm](https://huggingface.co/fnlp/SpeechGPT-7B-cm): The model obtained after the second-stage cross-modal instruction fine-tuning. It was initialized with SpeechGPT-7B-ma and further fine-tuned on the SpeechInstruct Cross-Modal Instruction set. This is a powerful foundation model that aligns speech and text.
- [SpeechGPT-7B-com](https://huggingface.co/fnlp/SpeechGPT-7B-com): The model obtained after the third-stage chain-of-modality instruction fine-tuning. It was initialized with SpeechGPT-7B-cm and further LoRA-fine-tuned on the SpeechInstruct Chain-of-Modality Instruction set. This is an adapter model of SpeechGPT-7B-cm for spoken dialogue.
Datasets
- SpeechInstruct-cross-modal: The cross-modal instruction set, about 9 million unit-text data pairs tokenized by mHuBERT from large-scale English ASR datasets.
- SpeechInstruct-chain-of-modality: The chain-of-thought style instructions for four input-output formats, namely Speech Instruction-Speech Response, Speech Instruction-Text Response, Text Instruction-Speech Response, and Text Instruction-Text Response.
SpeechInstruct-cross-modal data format:
[
{
"prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University. SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
"plain_text": "[Human]: Try to speak out this sentence, please. This is input: The alchemist rode in front, with the falcon on his shoulder.<eoh> [SpeechGPT]: <sosp><661><588><604><157><596><499><596><106><596><189><63><189><665><991><162><202><393><946><327><905><907><597><660><351><557><794><788><59><754><12><977><877><333><873><835><67><940><118><686><613><169><72><644><553><535><935><101><741><384><173><894><787><380><787><196><555><721><944><250><56><812><222><915><143><390><479><330><435><647><246><650><816><325><506><686><208><613><417><755><193><411><452><111><735><6><735><63><665><644><991><535><271><333><196><918><29><202><393><946><734><390><479><330><776><167><761><907><597><660><351><557><794><75><788><15><366><896><627><168><654><659><177><183><609><710><187><493><361><470><821><59><56><198><912><742><840><431><531><76><668><576><803><791><380><660><325><801><549><366><377><164><309><584><605><193><71><39><eosp><eoa> "
},
]
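Each cross-modal record stores one dialogue turn in `plain_text`, with `<eoh>` closing the human part and `<eoa>` closing the SpeechGPT part. As a rough sketch of how such a record could be split for inspection, assuming only the markers visible in the sample above (`split_turn` is a hypothetical helper, not a function from the repository):

```python
import re

# Pattern built from the markers in the sample record: [Human]: ... <eoh>
# followed by [SpeechGPT]: ... <eoa>. The optional "." after <eoh> covers
# the chain-of-modality variant of the same layout.
TURN_PATTERN = re.compile(
    r"\[Human\]:\s*(?P<human>.*?)<eoh>\.?\s*\[SpeechGPT\]:\s*(?P<reply>.*?)<eoa>",
    re.DOTALL,
)


def split_turn(plain_text):
    """Return (human_part, speechgpt_part) from one plain_text field."""
    match = TURN_PATTERN.search(plain_text)
    if match is None:
        raise ValueError("plain_text does not follow the expected turn layout")
    return match.group("human").strip(), match.group("reply").strip()


sample = (
    "[Human]: Try to speak out this sentence, please. This is input: "
    "The alchemist rode in front.<eoh> [SpeechGPT]: <sosp><661><588><eosp><eoa> "
)
human, reply = split_turn(sample)
print(human)
print(reply)  # <sosp><661><588><eosp>
```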
SpeechInstruct-chain-of-modality data format:
[
{
"prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University. SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
"plain_text": "[Human]: <sosp><661><987><511><732><951><997><111><982><189><63><665><991><535><101><741><173><945><944><503><641><124><565><734><870><290><978><833><238><761><907><430><901><185><403><557><244><583><788><663><969><896><627><143><515><663><969><660><691><251><412><260><41><740><677><253><380><382><268><506><876><417><755><16><819><80><651><80><651><80><987><588><eosp><eoh>. [SpeechGPT]: What is a bad term for poop?; [ta] A bad term for poop is excrement. It is usually used as a polite way to refer to fecal waste.; [ua] <sosp><497><63><264><644><710><823><565><577><154><331><384><173><945><29><244><326><583><728><576><663><969><896><627><143><38><515><663><24><382><251><676><412><260><41><740><677><253><382><268><876><233><878><609><389><771><865><641><124><878><609><423><384><879><487><219><522><589><337><126><119><663><748><12><671><877><377><385><902><819><619><842><419><997><829><111><666><42><277><63><665><644><389><771><685><437><641><124><258><436><139><340><11><59><518><56><948><86><258><436><139><340><347><376><940><118><944><878><173><641><124><362><734><179><961><931><878><609><423><384><879><219><522><866><337><243><935><101><741><822><89><194><630><86><555><105><79><868><220><156><824><998><870><390><422><330><776><663><969><523><105><79><799><220><357><390><479><422><330><776><485><165><86><501><119><716><205><521><787><935><101><741><89><194><664><835><67><940><118><613><417><755><902><415><772><497><eosp><eoa>."
},
]
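In the chain-of-modality sample, the SpeechGPT reply appears to contain three parts separated by `; [ta] ` and `; [ua] `: a transcript of the spoken instruction, a text answer, and a unit-sequence answer. The sketch below splits a reply along those separators. Reading `[ta]` as "text answer" and `[ua]` as "unit answer" is an assumption based on the sample, and `split_chain_of_modality` is a hypothetical helper.

```python
def split_chain_of_modality(reply):
    """Split a reply into transcript, text answer, and unit answer.

    Assumes the layout "<transcript>; [ta] <text>; [ua] <units><eoa>"
    seen in the sample record; other layouts raise ValueError.
    """
    transcript, sep, rest = reply.partition("; [ta] ")
    if not sep:
        raise ValueError("missing '; [ta] ' separator")
    text_answer, sep, unit_answer = rest.partition("; [ua] ")
    if not sep:
        raise ValueError("missing '; [ua] ' separator")
    # Drop the end-of-answer marker and anything after it.
    unit_answer = unit_answer.split("<eoa>")[0].strip()
    return {
        "transcript": transcript.strip(),
        "text_answer": text_answer.strip(),
        "unit_answer": unit_answer,
    }


parts = split_chain_of_modality(
    "What is a bad term for poop?; [ta] A bad term for poop is excrement.; "
    "[ua] <sosp><497><63><eosp><eoa>."
)
print(parts["text_answer"])  # A bad term for poop is excrement.
```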
🔧 Technical Details
Train SpeechGPT
Stage 1: Modality-adaptation Pre-training
First, use mHuBERT to discretize the LibriLight dataset into discrete unit sequences for stage 1 training. You can refer to the data processing methods in Speech2unit.
Second, divide the discrete units into a training set and a development set, and save them in the following form...
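The on-disk format for the two sets is not reproduced in this section, so the sketch below covers only the split itself. It is a generic shuffled hold-out split and does not come from the SpeechGPT training code: `split_train_dev`, the 1% development fraction, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def split_train_dev(sequences, dev_fraction=0.01, seed=0):
    """Shuffle unit sequences and hold out a fraction as the development set.

    dev_fraction and seed are illustrative defaults, not values from the
    SpeechGPT training recipe.
    """
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)  # deterministic for a given seed
    dev_size = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]


# Toy unit sequences standing in for mHuBERT output.
toy = [f"<sosp><{i}><{i + 1}><eosp>" for i in range(100)]
train, dev = split_train_dev(toy, dev_fraction=0.1)
print(len(train), len(dev))  # 90 10
```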









