SpeechGPT-7B-ma Open-Source Large Model - Supports Cross-Modal Dialogue and Generates Multi-Modal Content with Instruction Perception

Developed by fnlp

SpeechGPT is a large language model with intrinsic cross-modal dialogue capabilities, capable of perceiving and generating multimodal content based on human instructions.

Text-to-Audio

Transformers

#Cross-modal Dialogue #Speech Language Model #Multimodal Instruction Following

Release Time: 9/14/2023

Model Overview

SpeechGPT constructs a cross-modal speech instruction dataset through discrete speech representations, employs a three-stage training strategy, and demonstrates excellent multimodal human instruction following capabilities.

Model Features

Cross-modal Dialogue Capability

Capable of processing both speech and text input/output, enabling true cross-modal interaction

Three-stage Training Strategy

Adopts a three-stage training approach: modality adaptation pre-training, cross-modal instruction fine-tuning, and modality chain instruction fine-tuning

Large-scale Speech Instruction Dataset

Constructed the SpeechInstruct dataset containing approximately 9 million unit-text pairs

Model Capabilities

Speech recognition

Speech synthesis

Cross-modal dialogue

Text generation

Instruction following

Use Cases

Personal Assistant

Voice Q&A

Obtain information answers through voice questions

Can accurately understand questions and generate speech or text responses

Education

Language Learning

Help learners practice English listening and speaking skills

Provides interactive voice learning experience

🚀 SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities. It can perceive and generate multi-modal content according to human instructions. Built on a large-scale cross-modal speech instruction dataset and a three-stage training strategy, it shows strong multi-modal instruction-following capabilities.

🚀 Quick Start

SpeechGPT demos are available on the project page, where you can try its cross-modal instruction-following and spoken dialogue capabilities.

✨ Features

Intrinsic Cross-Modal Conversational Abilities: SpeechGPT can perceive and generate multi-modal content following human instructions, serving as a talking encyclopedia, personal assistant, chat partner, poet, psychologist, and educational assistant.
Large-Scale Cross-Modal Speech Instruction Dataset: The SpeechInstruct dataset is built from discrete speech representations and is the basis for cross-modal training.
Three-Stage Training Strategy: Training proceeds through modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning, enabling the model to handle multiple modalities effectively.

📦 Installation

Prerequisites

Clone the repository:

git clone https://github.com/0nutation/SpeechGPT
cd SpeechGPT

Create a conda environment:

conda create --name SpeechGPT python=3.8
conda activate SpeechGPT

Install dependencies:

pip install -r requirements.txt

Download Models and Data

Download SpeechGPT models: Download [SpeechGPT-7B-cm](https://huggingface.co/fnlp/SpeechGPT-7B-cm) and [SpeechGPT-7B-com](https://huggingface.co/fnlp/SpeechGPT-7B-com) locally.
Download mHuBERT model:

s2u_dir="utils/speech2unit"
cd ${s2u_dir}
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3.pt
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin
cd -  # return to the previous directory so the next relative path resolves

Download unit-vocoder:

vocoder_dir="utils/vocoder/"
cd ${vocoder_dir}
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -O config.json
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -O vocoder.pt
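Before running inference it can help to confirm that the four downloaded files landed where the commands above put them. The following is a minimal sketch, not part of the SpeechGPT repository: the file names come from the wget commands above, while the helper name `missing_files` and the assumption that both directories sit under the current working directory are illustrative.

```python
from pathlib import Path

# File names are taken from the wget commands above.
# The directory layout relative to the repository root is an assumption.
EXPECTED_FILES = {
    "utils/speech2unit": [
        "mhubert_base_vp_en_es_fr_it3.pt",
        "mhubert_base_vp_en_es_fr_it3_L11_km1000.bin",
    ],
    "utils/vocoder": ["config.json", "vocoder.pt"],
}


def missing_files(repo_root, expected=EXPECTED_FILES):
    """Return relative paths of expected files that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for subdir, names in expected.items():
        for name in names:
            path = root / subdir / name
            # An interrupted download can leave a zero-byte file behind.
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(f"{subdir}/{name}")
    return missing


if __name__ == "__main__":
    for rel_path in missing_files("."):
        print(f"missing: {rel_path}")
```

Running the script from the repository root prints one line per missing file and nothing when all four are present.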

💻 Usage Examples

CLI Inference

python3 speechgpt/src/infer/cli_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output"

Notes:

For speech input, provide the path to the audio file. For ASR or TTS tasks, put "this is input:" directly before the audio path or the text to avoid incorrect recognition.
The speech response will be saved as a .wav file, and detailed responses will be saved in a JSON file.
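The "this is input:" convention from the notes above can be wrapped in a small helper when prompts are generated programmatically. This is an illustrative sketch only: `build_task_prompt` is a hypothetical name, and the comma-separated layout simply mirrors the ASR and TTS examples shown on this page.

```python
def build_task_prompt(instruction: str, payload: str) -> str:
    """Join an instruction and its input with the 'this is input:' marker.

    `payload` is an audio file path for ASR or the sentence to read for TTS.
    """
    # Drop trailing punctuation so the marker is joined with a single comma.
    instruction = instruction.strip().rstrip(",.")
    return f"{instruction}, this is input: {payload.strip()}"


if __name__ == "__main__":
    print(build_task_prompt("Recognize this speech", "prompts/1.wav"))
    print(build_task_prompt("Read this sentence aloud", "Today is a sunny day."))
```

The two calls reproduce the prompt lines used in the ASR and TTS examples.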

Textual dialogue example

Please talk with SpeechGPT:
Who is Lebron James?
Response:
   Lebron James is an American professional basketball player for the Los Angeles Lakers of the National Basketball Association (NBA). He is considered one of the greatest basketball players of all time and is known for his athleticism, scoring ability, and leadership skills. He is a four-time NBA MVP, a 14-time NBA All-Star, a 13-time All-NBA selection, and a two-time Olympic gold medalist.
Response json is saved in output/responses.json

Spoken dialogue example

Please talk with SpeechGPT:
prompts/0.wav
Transcript:   What are the main causes of climate change?
Text response:  The main causes of climate change are human activities such as burning fossil fuels, deforestation, and agricultural practices. These activities release greenhouse gases, like carbon dioxide and Methane, into the atmosphere which trap heat and cause the Earth's temperature to rise.
Speech repsonse is saved in output/wav/answer_0.wav
Response json is saved in output/responses.json

ASR example

Please talk with SpeechGPT:
Recognize this speech, this is input: prompts/1.wav
Response:
   today is a sunny day.
Response json is saved in output/responses.json

TTS example

Please talk with SpeechGPT:
Read this sentence aloud, this is input: Today is a sunny day.
Response:
   <sosp> <661> <987> <520> <982> <681> <982> <681> <982> <681> <982> <681> <982> <189> <63> <662> <79> <868> <220> <196> <166> <549> <822> <89> <194> <633> <14> <855> <183> <609> <389> <771> <865> <641> <124> <362> <734> <742> <98> <519> <26> <204> <280> <668> <167> <104> <650> <179> <961> <428> <950> <82> <165> <196> <166> <549> <822> <89> <194> <458> <726> <603> <819> <651> <133> <651> <133> <186> <133> <186> <133> <186> <511> <186> <511> <eosp> 
Speech repsonse is saved in output/wav/answer_1.wav
Response json is saved in output/responses.json
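The TTS response above is a sequence of discrete unit tokens wrapped in `<sosp>` and `<eosp>`. The sketch below shows one way to turn such a string into a list of integer unit IDs and back. The helper names `parse_units` and `format_units` are hypothetical and the code is not taken from the SpeechGPT repository; it only relies on the token layout visible in the example.

```python
import re
from typing import List

# Matches numeric unit tokens such as <661>; <sosp> and <eosp> do not match.
UNIT_RE = re.compile(r"<(\d+)>")


def parse_units(response: str) -> List[int]:
    """Extract the integer unit IDs between <sosp> and <eosp>."""
    start = response.find("<sosp>")
    end = response.find("<eosp>")
    if start == -1 or end == -1 or end < start:
        raise ValueError("expected a <sosp> ... <eosp> span")
    return [int(unit) for unit in UNIT_RE.findall(response[start:end])]


def format_units(units: List[int]) -> str:
    """Render unit IDs in the compact form used by the dataset samples."""
    return "<sosp>" + "".join(f"<{unit}>" for unit in units) + "<eosp>"


if __name__ == "__main__":
    units = parse_units("<sosp> <661> <987> <520> <eosp>")
    print(units)                # [661, 987, 520]
    print(format_units(units))  # <sosp><661><987><520><eosp>
```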

Gradio Web UI

python3 speechgpt/src/infer/web_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output/"

🔧 Technical Details

Model Training

Stage 1: Modality-adaptation Pre-training

First, use mHuBERT to discretize the LibriLight dataset to obtain discrete unit sequences for stage1 training. Refer to the data processing methods in Speech2unit. Then, divide the discrete units into a training set and a development set, and save them in a specific form.
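The train/development split described above could look roughly like the following. This is a sketch under stated assumptions, not the project's actual preprocessing: it assumes each utterance has already been converted to one unit-sequence string, and the 1% development fraction, the fixed seed, the one-sequence-per-line output, and the helper names `split_train_dev` and `write_lines` are all illustrative choices.

```python
import random
from typing import List, Sequence, Tuple


def split_train_dev(
    sequences: Sequence[str], dev_fraction: float = 0.01, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle unit sequences deterministically and split off a dev set."""
    if not 0.0 < dev_fraction < 1.0:
        raise ValueError("dev_fraction must be between 0 and 1")
    if len(sequences) < 2:
        raise ValueError("need at least two sequences to split")
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one sequence in the development set.
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


def write_lines(path: str, sequences: Sequence[str]) -> None:
    """Write one unit sequence per line (assumed output layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for sequence in sequences:
            handle.write(sequence + "\n")


if __name__ == "__main__":
    # Placeholder sequences; real ones would come from the mHuBERT discretizer.
    demo = [f"<sosp><{i}><{i + 1}><eosp>" for i in range(200)]
    train, dev = split_train_dev(demo)
    print(len(train), len(dev))
```

Using a seeded `random.Random` instance keeps the split reproducible across runs without touching the global random state.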

📄 License

No license information is provided in the original document.

Release

[2023/9/15] Released SpeechGPT code and checkpoints and SpeechInstruct dataset.
[2023/9/1] Proposed SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models. Released the code and checkpoints of SpeechTokenizer. Check out the paper, demo, and GitHub repository.
[2023/5/18] Released SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. Proposed SpeechGPT, the first multi-modal LLM capable of perceiving and generating multi-modal content following multi-modal human instructions. Check out the paper and demo.

Open-source list

Models

[SpeechGPT-7B-ma](https://huggingface.co/fnlp/SpeechGPT-7B-ma): The model obtained after the first-stage modality-adaptation pre-training, initialized with LLaMA-7B and further pre-trained on LibriLight speech units.
[SpeechGPT-7B-cm](https://huggingface.co/fnlp/SpeechGPT-7B-cm): The model obtained after the second-stage cross-modal instruction fine-tuning, initialized with SpeechGPT-7B-ma and further fine-tuned on the SpeechInstruct Cross-Modal Instruction set. It is a powerful foundational model that aligns speech and text.
[SpeechGPT-7B-com](https://huggingface.co/fnlp/SpeechGPT-7B-com): The model obtained after the third-stage chain-of-modality instruction fine-tuning, initialized with SpeechGPT-7B-cm and further LoRA fine-tuned on the SpeechInstruct Chain-of-Modality Instruction set. It is an adapter model of SpeechGPT-7B-cm for spoken dialogue.

Datasets

SpeechInstruct-cross-modal: The cross-modal instruction set, about 9 million unit-text data pairs tokenized by mHuBERT from large-scale English ASR datasets.
SpeechInstruct-chain-of-modality: Chain-of-thought style instructions for four input-output formats, namely Speech Instruction-Speech Response, Speech Instruction-Text Response, Text Instruction-Speech Response, and Text Instruction-Text Response.

SpeechInstruct-cross-modal data format:

[
    {
        "prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University.  SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
        "plain_text": "[Human]: Try to speak out this sentence, please. This is input: The alchemist rode in front, with the falcon on his shoulder.<eoh> [SpeechGPT]: <sosp><661><588><604><157><596><499><596><106><596><189><63><189><665><991><162><202><393><946><327><905><907><597><660><351><557><794><788><59><754><12><977><877><333><873><835><67><940><118><686><613><169><72><644><553><535><935><101><741><384><173><894><787><380><787><196><555><721><944><250><56><812><222><915><143><390><479><330><435><647><246><650><816><325><506><686><208><613><417><755><193><411><452><111><735><6><735><63><665><644><991><535><271><333><196><918><29><202><393><946><734><390><479><330><776><167><761><907><597><660><351><557><794><75><788><15><366><896><627><168><654><659><177><183><609><710><187><493><361><470><821><59><56><198><912><742><840><431><531><76><668><576><803><791><380><660><325><801><549><366><377><164><309><584><605><193><71><39><eosp><eoa> "
    },
]
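Each record's `plain_text` field holds one turn: the human side ends with `<eoh>` and the model side ends with `<eoa>`. The sketch below splits such a string into its two halves. It is an illustration based only on the sample shown above; the function name `split_turn` and the regular expression are assumptions, not code from the SpeechGPT repository.

```python
import re
from typing import Dict

# Human text runs up to <eoh>; the reply runs up to <eoa>.
# An optional "." after <eoh> is tolerated because some samples include one.
TURN_RE = re.compile(
    r"\[Human\]:\s*(?P<human>.*?)<eoh>\.?\s*"
    r"\[SpeechGPT\]:\s*(?P<reply>.*?)<eoa>",
    re.DOTALL,
)


def split_turn(plain_text: str) -> Dict[str, str]:
    """Split a plain_text field into the human part and the model reply."""
    match = TURN_RE.search(plain_text)
    if match is None:
        raise ValueError("plain_text does not follow the expected layout")
    return {
        "human": match.group("human").strip(),
        "reply": match.group("reply").strip(),
    }


if __name__ == "__main__":
    sample = (
        "[Human]: Try to speak out this sentence, please. "
        "This is input: Hello there.<eoh> "
        "[SpeechGPT]: <sosp><661><588><604><eosp><eoa> "
    )
    turn = split_turn(sample)
    print(turn["human"])
    print(turn["reply"])
```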

SpeechInstruct-chain-of-modality data format:

[
    {
        "prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University.  SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
        "plain_text": "[Human]: <sosp><661><987><511><732><951><997><111><982><189><63><665><991><535><101><741><173><945><944><503><641><124><565><734><870><290><978><833><238><761><907><430><901><185><403><557><244><583><788><663><969><896><627><143><515><663><969><660><691><251><412><260><41><740><677><253><380><382><268><506><876><417><755><16><819><80><651><80><651><80><987><588><eosp><eoh>. [SpeechGPT]: What is a bad term for poop?; [ta] A bad term for poop is excrement. It is usually used as a polite way to refer to fecal waste.; [ua] <sosp><497><63><264><644><710><823><565><577><154><331><384><173><945><29><244><326><583><728><576><663><969><896><627><143><38><515><663><24><382><251><676><412><260><41><740><677><253><382><268><876><233><878><609><389><771><865><641><124><878><609><423><384><879><487><219><522><589><337><126><119><663><748><12><671><877><377><385><902><819><619><842><419><997><829><111><666><42><277><63><665><644><389><771><685><437><641><124><258><436><139><340><11><59><518><56><948><86><258><436><139><340><347><376><940><118><944><878><173><641><124><362><734><179><961><931><878><609><423><384><879><219><522><866><337><243><935><101><741><822><89><194><630><86><555><105><79><868><220><156><824><998><870><390><422><330><776><663><969><523><105><79><799><220><357><390><479><422><330><776><485><165><86><501><119><716><205><521><787><935><101><741><89><194><664><835><67><940><118><613><417><755><902><415><772><497><eosp><eoa>."
    },
]
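In the chain-of-modality sample above, the model's reply has three parts separated by semicolons: the transcript of the spoken instruction, a segment marked `[ta]`, and a segment marked `[ua]`. Reading `[ta]` as the text answer and `[ua]` as the unit (speech) answer is an interpretation of the sample, not something this page states. The sketch below extracts the three parts for the speech-in, speech-out shape shown; other input-output formats may be laid out differently, and the name `parse_chain_reply` is hypothetical.

```python
import re
from typing import Dict

# Layout assumed from the sample:
#   [SpeechGPT]: <transcript>; [ta] <text answer>; [ua] <sosp>...<eosp><eoa>
CHAIN_RE = re.compile(
    r"\[SpeechGPT\]:\s*(?P<transcript>.*?);\s*"
    r"\[ta\]\s*(?P<text_answer>.*?);\s*"
    r"\[ua\]\s*(?P<unit_answer><sosp>.*?<eosp>)",
    re.DOTALL,
)


def parse_chain_reply(plain_text: str) -> Dict[str, str]:
    """Extract transcript, text answer, and unit answer from one sample."""
    match = CHAIN_RE.search(plain_text)
    if match is None:
        raise ValueError("reply does not follow the assumed layout")
    return {key: value.strip() for key, value in match.groupdict().items()}


if __name__ == "__main__":
    sample = (
        "[Human]: <sosp><661><987><511><eosp><eoh>. "
        "[SpeechGPT]: What is the capital of France?; "
        "[ta] The capital of France is Paris.; "
        "[ua] <sosp><497><63><264><eosp><eoa>."
    )
    for key, value in parse_chain_reply(sample).items():
        print(f"{key}: {value}")
```

The lazy matches stop at the first `; [ta]` and `; [ua]` markers, so semicolons inside the transcript or the text answer do not break the split.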
