SpeechGPT-7B-cm: an open-source AI model and cross-modal conversational assistant that supports speech and text interaction

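Developed by fnlp

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities. It can perceive and generate multimodal content and supports interaction through both speech and text.

Text-to-Audio

Transformers

#cross-modal conversation #speech language model #multimodal instruction following

Downloads: 47

Release date: 9/14/2023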

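Model Overview

SpeechGPT aligns speech and text through discrete speech representations and a three-stage training strategy (modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning), which lets it handle a variety of cross-modal tasks.

Model Features

Cross-modal conversational ability

Handles both speech and text as input and output, enabling cross-modal interaction.

Three-stage training strategy

Model performance is improved step by step through three stages: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.

Large-scale speech instruction dataset

The SpeechInstruct dataset was built for training; it contains cross-modal instructions and chain-of-modality instructions.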

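Model Capabilities

Speech recognition

Speech synthesis

Cross-modal conversation

Text generation

Multimodal instruction following

Use Cases

Personal assistant

Spoken Q&A

Ask a question by voice and get an informative answer

Provides accurate speech or text responses

Education

Language learning

Helps learners practice English listening and speaking

Provides spoken interaction and pronunciation feedback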

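🚀 SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

🚀 Quick Start

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multimodal content following human instructions. Using discrete speech representations, the authors first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. They then apply a three-stage training strategy: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. Experimental results show that SpeechGPT follows multimodal human instructions well and demonstrate the potential of handling multiple modalities with a single model.
Demos of SpeechGPT are available on the project page. As the demos show, SpeechGPT has strong cross-modal instruction-following and spoken dialogue abilities. It can act as a talking encyclopedia, personal assistant, chat partner, poet, psychologist, or educational assistant.

SpeechGPT's ability to handle multiple cross-modal tasks

Left: the SpeechInstruct construction process. Right: the SpeechGPT model architecture.

✨ Key Features

Intrinsic cross-modal conversational abilities: perceives and generates multimodal content following human instructions.
SpeechInstruct, a large-scale cross-modal speech instruction dataset, was constructed for training.
A three-stage training strategy: modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning.
Follows multimodal human instructions well and shows the potential of handling multiple modalities with a single model.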

📦 Installation

Clone the code

git clone https://github.com/0nutation/SpeechGPT
cd SpeechGPT
conda create --name SpeechGPT python=3.8
conda activate SpeechGPT
pip install -r requirements.txt

Download the models

To talk with SpeechGPT, download SpeechGPT-7B-cm and SpeechGPT-7B-com locally.

Download the mHuBERT model to utils/speech2unit/. See Speech2unit for details. The path is relative, so run the commands from the repository root.

s2u_dir="utils/speech2unit"
cd ${s2u_dir}
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3.pt
wget https://dl.fbaipublicfiles.com/hubert/mhubert_base_vp_en_es_fr_it3_L11_km1000.bin

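Download the unit vocoder to utils/vocoder/. See vocoder for details. The path is relative to the repository root, so return there first if you changed into the speech2unit directory above.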

vocoder_dir="utils/vocoder/"
cd ${vocoder_dir}
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -O config.json
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -O vocoder.pt
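
Before running inference it can help to confirm that all four checkpoint files landed where the commands above put them. The following is a minimal sketch, not part of the SpeechGPT repository: the file names come from the wget commands, while the helper name `missing_files` and the assumption that both directories sit under the repository root are illustrative.

```python
from pathlib import Path

# File names are taken from the wget commands above. The directory layout
# (relative to the repository root) is an assumption of this sketch.
EXPECTED_FILES = {
    "utils/speech2unit": [
        "mhubert_base_vp_en_es_fr_it3.pt",
        "mhubert_base_vp_en_es_fr_it3_L11_km1000.bin",
    ],
    "utils/vocoder": ["config.json", "vocoder.pt"],
}


def missing_files(repo_root="."):
    """Return the expected checkpoint paths that do not exist under repo_root."""
    root = Path(repo_root)
    missing = []
    for directory, names in EXPECTED_FILES.items():
        for name in names:
            path = root / directory / name
            if not path.is_file():
                missing.append(str(path))
    return missing


if __name__ == "__main__":
    # Print one line per missing file; print nothing if everything is in place.
    for path in missing_files():
        print(f"missing: {path}")
```

Run it from the repository root; an empty output means all four files were found.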

💻 Usage Examples

Basic usage

CLI inference

python3 speechgpt/src/infer/cli_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output"

Notes: For speech input, you can provide the path to an audio file. For ASR or TTS tasks, you must prefix the speech or text with "this is input: "; otherwise the input may be misrecognized.
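
The input marker is easy to mistype, so a small helper can assemble prompts consistently. This is only a sketch: `build_prompt` and `INPUT_MARKER` are hypothetical names, and the joining rule (instruction, one space, marker, payload) is inferred from the ASR and TTS examples in this document rather than taken from the SpeechGPT code.

```python
# Marker text as described in the note above.
INPUT_MARKER = "this is input: "


def build_prompt(instruction, payload):
    """Join an instruction and its input (text or an audio file path).

    The payload is placed directly after the marker so the CLI can tell
    the instruction apart from the content to recognize or read aloud.
    """
    return f"{instruction.rstrip()} {INPUT_MARKER}{payload}"


if __name__ == "__main__":
    print(build_prompt("Recognize this speech,", "prompts/1.wav"))
    print(build_prompt("Read this sentence aloud,", "Today is a sunny day."))
```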

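Speech responses are saved as .wav files, and detailed responses are saved in a JSON file. The paths to these files are printed in the response.

The following are examples of conversations with SpeechGPT.

Text dialogue example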

Please talk with SpeechGPT:
Who is Lebron James?
Response:
   Lebron James is an American professional basketball player for the Los Angeles Lakers of the National Basketball Association (NBA). He is considered one of the greatest basketball players of all time and is known for his athleticism, scoring ability, and leadership skills. He is a four-time NBA MVP, a 14-time NBA All-Star, a 13-time All-NBA selection, and a two-time Olympic gold medalist.
Response json is saved in output/responses.json

Spoken dialogue example

Please talk with SpeechGPT:
prompts/0.wav
Transcript:   What are the main causes of climate change?
Text response:  The main causes of climate change are human activities such as burning fossil fuels, deforestation, and agricultural practices. These activities release greenhouse gases, like carbon dioxide and Methane, into the atmosphere which trap heat and cause the Earth's temperature to rise.
Speech repsonse is saved in output/wav/answer_0.wav
Response json is saved in output/responses.json

ASR example

Please talk with SpeechGPT:
Recognize this speech, this is input: prompts/1.wav
Response:
   today is a sunny day.
Response json is saved in output/responses.json

TTS example

Please talk with SpeechGPT:
Read this sentence aloud, this is input: Today is a sunny day.
Response:
   <sosp> <661> <987> <520> <982> <681> <982> <681> <982> <681> <982> <681> <982> <189> <63> <662> <79> <868> <220> <196> <166> <549> <822> <89> <194> <633> <14> <855> <183> <609> <389> <771> <865> <641> <124> <362> <734> <742> <98> <519> <26> <204> <280> <668> <167> <104> <650> <179> <961> <428> <950> <82> <165> <196> <166> <549> <822> <89> <194> <458> <726> <603> <819> <651> <133> <651> <133> <186> <133> <186> <133> <186> <511> <186> <511> <eosp> 
Speech repsonse is saved in output/wav/answer_1.wav
Response json is saved in output/responses.json
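
The TTS response above is a sequence of discrete speech units wrapped in `<sosp>` and `<eosp>`. If you want to inspect or post-process those units, they can be pulled out with a regular expression. This is a minimal sketch; `extract_units` is a hypothetical helper, not a function from the SpeechGPT repository.

```python
import re

# Matches a single unit token such as <661>.
UNIT_PATTERN = re.compile(r"<(\d+)>")


def extract_units(response):
    """Return the integer unit IDs found between <sosp> and <eosp>.

    Returns an empty list when the response holds no complete unit span.
    """
    start = response.find("<sosp>")
    if start == -1:
        return []
    end = response.find("<eosp>", start)
    if end == -1:
        return []
    return [int(unit) for unit in UNIT_PATTERN.findall(response[start:end])]


if __name__ == "__main__":
    print(extract_units("<sosp> <661> <987> <520> <eosp>"))
```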

Gradio Web UI

python3 speechgpt/src/infer/web_infer.py \
--model-name-or-path "path/to/SpeechGPT-7B-cm" \
--lora-weights "path/to/SpeechGPT-7B-com" \
--s2u-dir "${s2u_dir}" \
--vocoder-dir "${vocoder_dir}" \
--output-dir "output/"

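Advanced usage

Training SpeechGPT

Stage 1: Modality-adaptation pre-training

First, use mHuBERT to discretize the LibriLight dataset and obtain the discrete unit sequences for stage 1 training. See Speech2unit for the data processing method.

Second, split the discrete units into a training set and a development set, and save them to data/stage1/train.txt and data/stage1/dev.txt in the following format: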

<sosp><189><247><922><991><821><258><485><974><284><466><969><523><196><202><881><331><822><853><432><32><742><98><519><26><204><280><576><384><879><901><555><944><366><641><124><362><734><156><824><462><761><907><430><81><597><716><205><521><470><821><677><355><483><641><124><243><290><978><82><620><915><470><821><576><384><466><398><212><455><931><579><969><778><45><914><445><469><576><803><6><803><791><377><506><835><67><940><613><417><755><237><224><452><121><736><eosp>
<sosp><300><189><63><6><665><991><881><331><6><384><879><945><29><244><583><874><655><837><81><627><545><124><337><850><412><213><260><41><740><797><211><488><961><428><6><196><555><944><873><32><683><700><955><812><328><915><166><250><56><903><86><233><479><330><776><167><104><764><259><921><366><663><432><431><531><976><314><822><89><664><377><611><479><417><eosp>
<sosp><189><735><991><39><565><734><32><742><98><519><26><204><280><668><576><803><791><660><555><233><787><101><741><466><969><219><107><459><491><556><384><733><219><501><445><137><910><523><793><50><981><230><534><321><948><86><116><281><62><462><104><70><918><743><15><212><455><143><836><173><944><958><390><422><66><776><258><436><139><663><432><742><98><519><589><243><126><260><41><444><6><655><764><969><219><727><85><297><700><362><493><6><493><361><393><946><6><470><821><246><655><837><81><969><916><584><819><544><452><158><452><736><eosp>
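
The formatting and split described above can be sketched as follows. This assumes you already hold each utterance as a list of integer unit IDs; the function names, the 1% dev ratio, and the fixed seed are illustrative choices, not values taken from the SpeechGPT project.

```python
import random
from pathlib import Path


def format_units(units):
    """Render integer unit IDs as one line: <sosp><189><247>...<eosp>."""
    return "<sosp>" + "".join(f"<{unit}>" for unit in units) + "<eosp>"


def split_train_dev(sequences, dev_ratio=0.01, seed=0):
    """Shuffle the formatted lines and hold out a fraction as the dev set.

    dev_ratio and seed are illustrative defaults (assumptions).
    Returns (train_lines, dev_lines).
    """
    lines = [format_units(seq) for seq in sequences]
    random.Random(seed).shuffle(lines)
    # Keep at least one dev line whenever there is more than one sequence.
    n_dev = max(1, int(len(lines) * dev_ratio)) if len(lines) > 1 else 0
    return lines[n_dev:], lines[:n_dev]


def write_lines(path, lines):
    """Write one sequence per line, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


# Example usage, where unit_sequences is your own list of unit ID lists:
# train, dev = split_train_dev(unit_sequences)
# write_lines("data/stage1/train.txt", train)
# write_lines("data/stage1/dev.txt", dev)
```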

Third, download LLaMA 7B (HuggingFace format) to llama/hf/7B.

You can now start stage 1 training. For distributed training, provide the correct values for NNODE, NODE_RANK, MASTER_ADDR, and MASTER_PORT.

bash scripts/ma_pretrain.sh ${NNODE} ${NODE_RANK} ${MASTER_ADDR} ${MASTER_PORT}

Stage 2: Cross-modal instruction fine-tuning

Download the SpeechInstruct Cross-modal Instruction set from https://huggingface.co/datasets/fnlp/SpeechInstruct/reso (the link is truncated in the source).

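📚 Documentation

Open-source list

Models

SpeechGPT-7B-ma: the model obtained after stage 1 modality-adaptation pre-training. It is initialized from LLaMA-7B and further pre-trained on LibriLight speech units.
SpeechGPT-7B-cm: the model obtained after stage 2 cross-modal instruction fine-tuning. It is initialized from SpeechGPT-7B-ma and further fine-tuned on the SpeechInstruct Cross-Modal Instruction set. It is a strong foundation model that aligns speech and text.
SpeechGPT-7B-com: the model obtained after stage 3 chain-of-modality instruction fine-tuning. It is initialized from SpeechGPT-7B-cm and further fine-tuned with LoRA on the SpeechInstruct Chain-of-Modality Instruction set. It is an adapter model for SpeechGPT-7B-cm used for spoken dialogue.

Datasets

SpeechInstruct-cross-modal: the cross-modal instruction set, containing about 9 million unit-text data pairs tokenized by mHuBERT from large-scale English ASR datasets.
SpeechInstruct-chain-of-modality: chain-of-thought style instructions covering four input-output formats: speech instruction to speech response, speech instruction to text response, text instruction to speech response, and text instruction to text response.

Data format of SpeechInstruct-cross-modal: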

[
    {
        "prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University.  SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
        "plain_text": "[Human]: Try to speak out this sentence, please. This is input: The alchemist rode in front, with the falcon on his shoulder.<eoh> [SpeechGPT]: <sosp><661><588><604><157><596><499><596><106><596><189><63><189><665><991><162><202><393><946><327><905><907><597><660><351><557><794><788><59><754><12><977><877><333><873><835><67><940><118><686><613><169><72><644><553><535><935><101><741><384><173><894><787><380><787><196><555><721><944><250><56><812><222><915><143><390><479><330><435><647><246><650><816><325><506><686><208><613><417><755><193><411><452><111><735><6><735><63><665><644><991><535><271><333><196><918><29><202><393><946><734><390><479><330><776><167><761><907><597><660><351><557><794><75><788><15><366><896><627><168><654><659><177><183><609><710><187><493><361><470><821><59><56><198><912><742><840><431><531><76><668><576><803><791><380><660><325><801><549><366><377><164><309><584><605><193><71><39><eosp><eoa> "
    }
]
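
To work with a record like the one above, the `plain_text` field can be split into the human turn and the model turn. The sketch below assumes every record follows the `[Human]: ... <eoh> [SpeechGPT]: ... <eoa>` layout shown here; reading `<eoh>` and `<eoa>` as end-of-human and end-of-assistant markers is an interpretation, and `split_turn` is a hypothetical helper.

```python
import re

# One human turn followed by one SpeechGPT turn. The optional "." after
# <eoh> covers the variant that appears in the chain-of-modality sample.
TURN_PATTERN = re.compile(
    r"\[Human\]:\s*(.*?)\s*<eoh>\.?\s*\[SpeechGPT\]:\s*(.*?)\s*<eoa>",
    re.DOTALL,
)


def split_turn(plain_text):
    """Return (human_text, assistant_text) from a plain_text field."""
    match = TURN_PATTERN.search(plain_text)
    if match is None:
        raise ValueError("plain_text does not match the expected turn layout")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    sample = (
        "[Human]: Try to speak out this sentence, please. "
        "This is input: Hello.<eoh> [SpeechGPT]: <sosp><661><588><eosp><eoa> "
    )
    human, assistant = split_turn(sample)
    print(human)
    print(assistant)
```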

Data format of SpeechInstruct-chain-of-modality:

[
    {
        "prefix": "You are an AI assistant whose name is SpeechGPT.\n- SpeechGPT is a intrinsic cross-modal conversational language model that is developed by Fudan University.  SpeechGPT can understand and communicate fluently with human through speech or text chosen by the user.\n- It can perceive cross-modal inputs and generate cross-modal outputs.\n",
        "plain_text": "[Human]: <sosp><661><987><511><732><951><997><111><982><189><63><665><991><535><101><741><173><945><944><503><641><124><565><734><870><290><978><833><238><761><907><430><901><185><403><557><244><583><788><663><969><896><627><143><515><663><969><660><691><251><412><260><41><740><677><253><380><382><268><506><876><417><755><16><819><80><651><80><651><80><987><588><eosp><eoh>. [SpeechGPT]: What is a bad term for poop?; [ta] A bad term for poop is excrement. It is usually used as a polite way to refer to fecal waste.; [ua] <sosp><497><63><264><644><710><823><565><577><154><331><384><173><945><29><244><326><583><728><576><663><969><896><627><143><38><515><663><24><382><251><676><412><260><41><740><677><253><382><268><876><233><878><609><389><771><865><641><124><878><609><423><384><879><487><219><522><589><337><126><119><663><748><12><671><877><377><385><902><819><619><842><419><997><829><111><666><42><277><63><665><644><389><771><685><437><641><124><258><436><139><340><11><59><518><56><948><86><258><436><139><340><347><376><940><118><944><878><173><641><124><362><734><179><961><931><878><609><423><384><879><219><522><866><337><243><935><101><741><822><89><194><630><86><555><105><79><868><220><156><824><998><870><390><422><330><776><663><969><523><105><79><799><220><357><390><479><422><330><776><485><165><86><501><119><716><205><521><787><935><101><741><89><194><664><835><67><940><118><613><417><755><902><415><772><497><eosp><eoa>."
    }
]
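
In the sample above, the SpeechGPT turn chains three parts: a transcript of the spoken question, a segment introduced by `[ta]`, and a segment introduced by `[ua]`. The sketch below splits a response of that shape. It assumes the `; [ta] ` and `; [ua] ` separators exactly as they appear in this one sample, reads them as text answer and unit answer (an interpretation, not documented here), and may not fit the other three input-output formats.

```python
def split_chain(response):
    """Split 'transcript; [ta] text answer; [ua] unit answer' into parts.

    `response` is the text after "[SpeechGPT]:" with the trailing <eoa>
    removed. Parts that are absent come back as empty strings.
    """
    transcript, _, rest = response.partition("; [ta] ")
    text_answer, _, unit_answer = rest.partition("; [ua] ")
    return {
        "transcript": transcript.strip(),
        "text_answer": text_answer.strip(),
        "unit_answer": unit_answer.strip(),
    }


if __name__ == "__main__":
    parts = split_chain(
        "What is the capital of France?; [ta] It is Paris.; [ua] <sosp><497><63><eosp>"
    )
    for name, value in parts.items():
        print(f"{name}: {value}")
```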

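🔧 Technical Details

SpeechGPT was developed to give large language models intrinsic cross-modal conversational abilities. First, discrete speech representations are used to construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. The dataset contains paired speech and text and is used to teach the model to perceive cross-modal inputs and generate cross-modal outputs.

Next, a three-stage training strategy is applied. Stage 1 is modality-adaptation pre-training: the model is initialized from LLaMA-7B and further pre-trained on LibriLight speech units. Stage 2 is cross-modal instruction fine-tuning: the model is fine-tuned on the SpeechInstruct Cross-Modal Instruction set to align speech and text. Stage 3 is chain-of-modality instruction fine-tuning: the model is fine-tuned with LoRA on the SpeechInstruct Chain-of-Modality Instruction set to improve its spoken dialogue ability.

Experimental results show that SpeechGPT follows multimodal human instructions well and demonstrate the potential of handling multiple modalities with a single model.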