Mustango Open-Source Multimodal Model - Free, High-Quality Text-Controllable Music Generation

Mustango

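Developed by declare-lab

Mustango is a multimodal large language model designed for controllable music generation. It combines a latent diffusion model (LDM), Flan-T5, and musical features to deliver high-quality text-to-music generation.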

Text-to-Audio

Transformers

License: Apache-2.0 #ControllableMusicGeneration #MultimodalMusicCreation #MusicFeatureFusion

Downloads: 165

Released: 11/15/2023

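Model Overview

Mustango is a text-to-music generation model that combines a latent diffusion model, the Flan-T5 language model, and musical features to produce high-quality, controllable music.

Model Features

Multimodal Fusion

Combines a latent diffusion model with the Flan-T5 language model for high-quality text-to-music conversion

Controllable Generation

Text prompts steer musical attributes such as style, rhythm, and melody

Music-Specific Features

Incorporates music-domain features so that the output is musically coherent

Model Capabilities

Text-to-music generation

Music style control

Melody generation

Rhythm control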

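Use Cases

Music Creation

TV Show Soundtracks

Generate background music that fits the mood of a children's TV show

Produces playful, child-friendly music

Advertising Music

Quickly generate short jingles that match an advertisement's theme

Produces short music clips that suit the advertisement's mood

Content Creation

Video Soundtracks

Automatically generate background music that matches video content

Produces music that complements the video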

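🚀 Mustango: Toward Controllable Text-to-Music Generation

Mustango is a multimodal large language model designed for controllable music generation. It generates music using a latent diffusion model (LDM), Flan-T5, and musical features, and lets users produce music in a specific style from a text prompt.

Demo | Model | Website and Examples | Paper | Dataset

🔥 Live demos are available on Replicate and HuggingFace.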

🚀 Quick Start

Generate music from a text prompt:

import soundfile as sf
from IPython.display import Audio
from mustango import Mustango

# Load the released checkpoint from the Hugging Face Hub
model = Mustango("declare-lab/mustango")

prompt = "This is a new age piece. There is a flute playing the main melody with a lot of staccato notes. The rhythmic background consists of a medium tempo electronic drum beat with percussive elements all over the spectrum. There is a playful atmosphere to the piece. This piece can be used in the soundtrack of a children's TV show or an advertisement jingle."

# Generate the waveform, save it at 16 kHz, and play it back in a notebook.
# A short fixed file name is used because the full prompt exceeds common file-name length limits.
music = model.generate(prompt)
sf.write("mustango_output.wav", music, samplerate=16000)
Audio(data=music, rate=16000)
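
When saving many generations, it can be handy to derive the output file name from the prompt. A full prompt is usually too long for a file name (many filesystems cap names at 255 bytes), so the sketch below shortens it. The helper `prompt_to_filename` is hypothetical and not part of the Mustango package; it keeps the first few words and appends a short hash so that different prompts do not collide.

```python
import hashlib
import re


def prompt_to_filename(prompt: str, max_words: int = 6, ext: str = ".wav") -> str:
    """Build a short, filesystem-safe file name from a text prompt (hypothetical helper)."""
    # Keep only lowercase alphanumeric words from the start of the prompt
    words = re.findall(r"[a-z0-9]+", prompt.lower())[:max_words]
    stem = "_".join(words) or "prompt"
    # A short hash of the full prompt keeps different prompts from colliding
    digest = hashlib.sha1(prompt.encode("utf-8")).hexdigest()[:8]
    return f"{stem}_{digest}{ext}"


name = prompt_to_filename(
    "This is a new age piece. There is a flute playing the main melody."
)
print(name)  # starts with "this_is_a_new_age_piece_" and ends with ".wav"
```

The returned name can be passed to `sf.write` in place of a hand-written file name.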

📦 Installation

git clone https://github.com/AMAAI-Lab/mustango
cd mustango
pip install -r requirements.txt
cd diffusers
pip install -e .

📚 Documentation

Dataset

The MusicBench dataset contains 52k music clips, each paired with a rich, music-specific text description.

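Subjective Evaluation

Model	Dataset	Pre-trained	Overall Match ↑	Chord Match ↑	Tempo Match ↑	Audio Quality ↑	Musicality ↑	Rhythmic Presence and Stability ↑	Harmony and Consonance ↑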
Tango	MusicCaps	✓	4.35	2.75	3.88	3.35	2.83	3.95	3.84
Tango	MusicBench	✓	4.91	3.61	3.86	3.88	3.54	4.01	4.34
Mustango	MusicBench	✓	5.49	5.76	4.98	4.30	4.28	4.65	5.18
Mustango	MusicBench	✗	5.75	6.06	5.11	4.80	4.80	4.75	5.59
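
To read the table more easily, the following sketch computes the per-metric gap between the Mustango row without pre-training and the Tango row trained on MusicBench. The scores are copied from the table above; the metric keys and the `score_gaps` function are illustrative names, not part of the Mustango codebase.

```python
# Scores copied from the subjective evaluation table; higher is better.
METRICS = [
    "overall_match", "chord_match", "tempo_match", "audio_quality",
    "musicality", "rhythm_presence_stability", "harmony_consonance",
]

TANGO_MUSICBENCH = [4.91, 3.61, 3.86, 3.88, 3.54, 4.01, 4.34]
MUSTANGO_NOT_PRETRAINED = [5.75, 6.06, 5.11, 4.80, 4.80, 4.75, 5.59]


def score_gaps(candidate, baseline):
    """Return candidate minus baseline for every metric, rounded to 2 decimals."""
    return {m: round(c - b, 2) for m, c, b in zip(METRICS, candidate, baseline)}


gaps = score_gaps(MUSTANGO_NOT_PRETRAINED, TANGO_MUSICBENCH)
largest = max(gaps, key=gaps.get)

for metric, gap in gaps.items():
    print(f"{metric:28s} {gap:+.2f}")
print("largest gap:", largest)
```

With these numbers the largest gap is in Chord Match, which is consistent with Mustango's focus on controllability.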

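Training

We use the Hugging Face accelerate package for multi-GPU training. Run accelerate config in a terminal and follow the prompts to set up the run configuration.

You can train Mustango on the MusicBench dataset with the following command: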

accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config_munet.json" \
--model_type Mustango --freeze_text_encoder --uncondition_all --uncondition_single \
--drop_sentences --random_pick_text_column --snr_gamma 5

The --model_type flag selects whether the same code trains Mustango or Tango. Note that you must also change --unet_model_config to the matching config: diffusion_model_config_munet for Mustango, or diffusion_model_config for Tango.
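
Because --model_type and --unet_model_config have to change together, it may help to keep the pairing in one place. The sketch below is a hypothetical wrapper, not part of the repository. The Mustango config path is taken from the command above; the Tango path is an assumption that follows the same naming pattern. Only a subset of the flags from the command above is included, for brevity.

```python
import shlex

# Config file per model type. The Mustango path appears in the training command;
# the Tango path is an ASSUMPTION based on the same directory and extension.
UNET_CONFIGS = {
    "Mustango": "configs/diffusion_model_config_munet.json",
    "Tango": "configs/diffusion_model_config.json",
}


def build_train_command(model_type: str) -> list[str]:
    """Return an accelerate launch command with a matching model type and UNet config."""
    if model_type not in UNET_CONFIGS:
        raise ValueError(f"Unknown model type: {model_type!r}")
    return [
        "accelerate", "launch", "train.py",
        "--text_encoder_name=google/flan-t5-large",
        "--scheduler_name=stabilityai/stable-diffusion-2-1",
        f"--unet_model_config={UNET_CONFIGS[model_type]}",
        "--model_type", model_type,
    ]


# Print the command instead of running it, so it can be inspected first
print(shlex.join(build_train_command("Tango")))
```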

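The --uncondition_all, --uncondition_single, and --drop_sentences arguments control the dropout functions described in Section 5.2 of the paper. The --random_pick_text_column argument randomly selects between two input text prompts. For MusicBench, the choice is between the ChatGPT-rephrased captions and the original enhanced MusicCaps prompts, as shown in Figure 1 of the paper.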
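
As a rough illustration of what random prompt selection and sentence dropping can look like, here is a minimal sketch. It is not the repository's implementation: the column names, the dropout probability, and the rule of always keeping at least one sentence are all assumptions made for the example, and the scheme in the paper may differ.

```python
import random


def pick_prompt(row: dict, columns: tuple, rng: random.Random) -> str:
    """Randomly choose one of the available text columns for this sample."""
    return row[rng.choice(columns)]


def drop_sentences(text: str, drop_prob: float, rng: random.Random) -> str:
    """Drop each sentence independently with probability drop_prob, keeping at least one."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    kept = [s for s in sentences if rng.random() >= drop_prob]
    if not kept and sentences:
        # Assumption for this sketch: never return an empty prompt
        kept = [rng.choice(sentences)]
    return (". ".join(kept) + ".") if kept else ""


rng = random.Random(0)
row = {
    # Hypothetical column names and captions, for illustration only
    "rephrased_caption": "A calm flute melody. The tempo is medium. The key is C major.",
    "original_caption": "Flute piece with a medium tempo beat. It sounds playful.",
}
prompt = pick_prompt(row, ("rephrased_caption", "original_caption"), rng)
print(drop_sentences(prompt, drop_prob=0.3, rng=rng))
```

Randomly shortening prompts in this way exposes the model to partial descriptions during training, which is one common motivation for this kind of text dropout.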

When training from scratch on MusicBench, we recommend at least 40 epochs.
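
To get a feel for what 40 epochs means in optimizer steps, the arithmetic can be sketched as follows. The clip count uses the approximate 52k figure quoted for MusicBench; the batch size and GPU count are made-up values for illustration, not the authors' settings.

```python
import math


def total_optimizer_steps(num_clips, epochs, per_device_batch_size,
                          num_gpus, grad_accum_steps=1):
    """Estimate the number of optimizer steps for a multi-GPU training run."""
    effective_batch = per_device_batch_size * num_gpus * grad_accum_steps
    steps_per_epoch = math.ceil(num_clips / effective_batch)
    return steps_per_epoch * epochs


# Hypothetical setup: 4 GPUs with 8 clips each, no gradient accumulation
steps = total_optimizer_steps(52_000, 40, per_device_batch_size=8, num_gpus=4)
print(steps)  # 65000
```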

Model Zoo

We have released the following models:

Mustango Pretrained: https://huggingface.co/declare-lab/mustango-pretrained
Mustango: https://huggingface.co/declare-lab/mustango

Citation

If you find our work useful, please consider citing the following article:

@misc{melechovsky2023mustango,
      title={Mustango: Toward Controllable Text-to-Music Generation}, 
      author={Jan Melechovsky and Zixun Guo and Deepanway Ghosal and Navonil Majumder and Dorien Herremans and Soujanya Poria},
      year={2023},
      eprint={2311.08355},
      archivePrefix={arXiv},
}