Mustango: Toward Controllable Text-to-Music Generation
Mustango is a new addition to the family of multimodal large language models for controlled music generation. It combines a Latent Diffusion Model (LDM), the Flan-T5 text encoder, and musical features to generate music from a text prompt.
Quick Start
Generate music from a text prompt:
import IPython
import soundfile as sf
from mustango import Mustango
model = Mustango("declare-lab/mustango")
prompt = "This is a new age piece. There is a flute playing the main melody with a lot of staccato notes. The rhythmic background consists of a medium tempo electronic drum beat with percussive elements all over the spectrum. There is a playful atmosphere to the piece. This piece can be used in the soundtrack of a children's TV show or an advertisement jingle."
music = model.generate(prompt)
sf.write(f"{prompt}.wav", music, samplerate=16000)
IPython.display.Audio(data=music, rate=16000)
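The example prompt above is longer than 255 characters, which is the file name limit on many file systems, so using the full prompt as a file name may fail. A minimal sketch of one workaround follows; prompt_to_filename is a hypothetical helper written for this example and is not part of the Mustango package.

```python
import hashlib
import re


def prompt_to_filename(prompt: str, max_words: int = 6, ext: str = ".wav") -> str:
    """Build a short, file-system-safe file name from a text prompt.

    Hypothetical helper for illustration; not part of the Mustango package.
    """
    # Keep only lowercase alphanumeric words from the prompt.
    words = re.findall(r"[a-z0-9]+", prompt.lower())
    slug = "_".join(words[:max_words]) or "prompt"
    # A short hash of the full prompt keeps names distinct for prompts
    # that start with the same words.
    digest = hashlib.sha1(prompt.encode("utf-8")).hexdigest()[:8]
    return f"{slug}_{digest}{ext}"


print(prompt_to_filename("This is a new age piece. There is a flute playing the main melody."))
```

The returned name could then be passed to sf.write in place of the prompt-based file name.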
Installation
git clone https://github.com/AMAAI-Lab/mustango
cd mustango
pip install -r requirements.txt
cd diffusers
pip install -e .
Documentation
Datasets
The MusicBench dataset contains 52k music fragments, each with a rich, music-specific text caption.
Subjective Evaluation by Expert Listeners
| Model | Dataset |
|---|---|
| Tango | MusicCaps |
| Tango | MusicBench |
| Mustango | MusicBench |
| Mustango | MusicBench |
Training
We use the accelerate package from Hugging Face for multi-GPU training. Run accelerate config from the terminal and set up your run configuration by answering the questions asked.
You can now train Mustango on the MusicBench dataset using:
accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config_munet.json" \
--model_type Mustango --freeze_text_encoder --uncondition_all --uncondition_single \
--drop_sentences --random_pick_text_column --snr_gamma 5
The --model_type flag selects whether Mustango or Tango is trained, using the same code. Note that you must also set --unet_model_config to the matching config: diffusion_model_config_munet for Mustango, diffusion_model_config for Tango.
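One way to keep the model type and its U-Net config in sync is to derive both from a single lookup. The sketch below is illustrative only: build_train_command is a hypothetical helper, and the configs/ prefix and .json extension for the Tango file are assumptions based on the Mustango path in the training command. The dropout-related flags are omitted for brevity.

```python
# Config names come from the text above. The "configs/" prefix and ".json"
# extension for the Tango entry are assumptions based on the Mustango path.
UNET_CONFIGS = {
    "Mustango": "configs/diffusion_model_config_munet.json",
    "Tango": "configs/diffusion_model_config.json",
}


def build_train_command(model_type: str) -> list[str]:
    """Return an accelerate launch command with a matching U-Net config.

    Hypothetical helper for illustration; not part of the repository.
    """
    if model_type not in UNET_CONFIGS:
        raise ValueError(f"unknown model type: {model_type!r}")
    return [
        "accelerate", "launch", "train.py",
        "--text_encoder_name=google/flan-t5-large",
        "--scheduler_name=stabilityai/stable-diffusion-2-1",
        f"--unet_model_config={UNET_CONFIGS[model_type]}",
        "--model_type", model_type,
        "--freeze_text_encoder",
    ]


print(" ".join(build_train_command("Tango")))
```

The resulting list can be passed to subprocess.run, which avoids shell quoting issues.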
The arguments --uncondition_all, --uncondition_single, and --drop_sentences control the dropout functions described in Section 5.2 of our paper. The --random_pick_text_column argument randomly picks between two input text prompts; in the case of MusicBench, these are the ChatGPT-rephrased captions and the original enhanced MusicCaps prompts, as depicted in Figure 1 of our paper.
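As a rough illustration of the text-side options, the sketch below shows one way to pick randomly between two caption columns and to drop sentences from a prompt. The function names, the 0.5 probabilities, and the rule of always keeping at least one sentence are assumptions made for this example; the actual scheme is the one defined in Section 5.2 of the paper and implemented in train.py.

```python
import random


def pick_caption(original: str, rephrased: str, rng: random.Random) -> str:
    """Pick one of two captions with equal probability (assumed 50/50 split)."""
    return original if rng.random() < 0.5 else rephrased


def drop_sentences(prompt: str, rng: random.Random, drop_prob: float = 0.5) -> str:
    """Drop each sentence independently with probability drop_prob (assumed value)."""
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]
    if not sentences:
        return prompt
    kept = [s for s in sentences if rng.random() >= drop_prob]
    # Assumption: never return an empty prompt; fall back to one random sentence.
    if not kept:
        kept = [rng.choice(sentences)]
    return ". ".join(kept) + "."


rng = random.Random(0)  # fixed seed so the example is repeatable
caption = pick_caption(
    "A flute plays the main melody. The tempo is medium. The mood is playful.",
    "A playful piece led by a flute. It has a medium tempo.",
    rng,
)
print(drop_sentences(caption, rng))
```

Passing an explicit random.Random instance keeps the augmentation reproducible across runs.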
Recommended training time from scratch on MusicBench is at least 40 epochs.
Model Zoo
We have released the following models:
- Mustango Pretrained: https://huggingface.co/declare-lab/mustango-pretrained
- Mustango: https://huggingface.co/declare-lab/mustango
License
This project is licensed under the Apache 2.0 license.
Citation
Please consider citing the following article if you found our work useful:
@misc{melechovsky2023mustango,
title={Mustango: Toward Controllable Text-to-Music Generation},
author={Jan Melechovsky and Zixun Guo and Deepanway Ghosal and Navonil Majumder and Dorien Herremans and Soujanya Poria},
year={2023},
eprint={2311.08355},
archivePrefix={arXiv},
}
Useful Links
A live demo is available on Replicate and Hugging Face.