Mustango: Open-Source Multimodal Model for High-Quality, Text-Controllable Music Generation

Mustango

Developed by declare-lab

Mustango is a novel multimodal large language model specifically designed for controllable music generation, combining Latent Diffusion Model (LDM), Flan-T5, and music features to achieve high-quality text-to-music generation.

Text-to-Audio

Transformers

Open Source License: Apache-2.0

Tags: #Controllable Music Generation #Multimodal Music Composition #Music Feature Fusion

Downloads 165

Release Time : 11/15/2023

Model Overview

Mustango is a text-to-music generation model that delivers high-quality, controllable music composition by combining a Latent Diffusion Model, the Flan-T5 language model, and music features.

Model Features

Multimodal Fusion

Combines Latent Diffusion Model and Flan-T5 language model to achieve high-quality text-to-music conversion.

Controllable Generation

Supports precise control over music style, rhythm, melody, and other features through text prompts.

Professional Music Features

Incorporates professional music features to generate musically coherent works.

Model Capabilities

Text-to-Music Generation

Music Style Control

Melody Generation

Rhythm Control

Use Cases

Music Composition

TV Show Scoring

Generate background music for children's TV shows that matches the scene's atmosphere.

Produces playful and engaging musical pieces.

Advertisement Music

Quickly generate short jingles that align with advertising themes.

Creates short music clips that fit the ad's mood.

Content Creation

Video Scoring

Automatically generate background music that matches video content.

Produces music that harmonizes with the video.

🚀 Mustango: Toward Controllable Text-to-Music Generation

Mustango is an exciting addition to the family of multimodal large language models for controllable music generation. It combines a Latent Diffusion Model (LDM), Flan-T5, and musical features to generate music from text prompts.

🚀 Quick Start

Generate music from a text prompt:

import soundfile as sf
from IPython.display import Audio
from mustango import Mustango

# Load the released checkpoint from the Hugging Face Hub.
model = Mustango("declare-lab/mustango")

prompt = "This is a new age piece. There is a flute playing the main melody with a lot of staccato notes. The rhythmic background consists of a medium tempo electronic drum beat with percussive elements all over the spectrum. There is a playful atmosphere to the piece. This piece can be used in the soundtrack of a children's TV show or an advertisement jingle."

# Generate a waveform from the prompt.
music = model.generate(prompt)

# Save as 16 kHz WAV under a short fixed name; the full prompt is too long
# to serve as a file name on most filesystems.
sf.write("mustango_output.wav", music, samplerate=16000)

# Play the result inline when running in a notebook.
Audio(data=music, rate=16000)
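
When saving several generations, it is convenient to name each file after its prompt, but a full prompt usually exceeds the file name limit of common filesystems (often 255 bytes). The following is a minimal sketch of one way to derive a short, safe name. The helper `prompt_to_filename`, the 60-character cap, and the hash suffix are illustrative assumptions and not part of the Mustango package.

```python
import hashlib
import re


def prompt_to_filename(prompt: str, max_len: int = 60, ext: str = ".wav") -> str:
    """Turn a free-text prompt into a short, filesystem-safe file name.

    Hypothetical helper for illustration; not part of Mustango.
    """
    # Lowercase and collapse every run of non-alphanumerics into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    # A short hash of the full prompt keeps names distinct after truncation.
    digest = hashlib.sha1(prompt.encode("utf-8")).hexdigest()[:8]
    # Truncate the readable part; fall back to a fixed stem if nothing is left.
    slug = slug[:max_len].rstrip("_") or "prompt"
    return f"{slug}_{digest}{ext}"


print(prompt_to_filename("This is a new age piece. There is a flute playing the main melody."))
```

The returned name can then be passed to sf.write in place of a fixed file name.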

📦 Installation

git clone https://github.com/AMAAI-Lab/mustango
cd mustango
pip install -r requirements.txt
cd diffusers
pip install -e .

📚 Documentation

Datasets

The MusicBench dataset contains 52k music fragments, each paired with a rich, music-specific text caption.

Subjective Evaluation by Expert Listeners

Model	Dataset
Tango	MusicCaps
Tango	MusicBench
Mustango	MusicBench
Mustango	MusicBench

Training

We use the accelerate package from Hugging Face for multi-GPU training. Run accelerate config in a terminal and answer the questions it asks to set up your run configuration.

You can now train Mustango on the MusicBench dataset using:

accelerate launch train.py \
--text_encoder_name="google/flan-t5-large" \
--scheduler_name="stabilityai/stable-diffusion-2-1" \
--unet_model_config="configs/diffusion_model_config_munet.json" \
--model_type Mustango --freeze_text_encoder --uncondition_all --uncondition_single \
--drop_sentences --random_pick_text_column --snr_gamma 5

The --model_type flag selects whether Mustango or Tango is trained with the same code. Note that you must also change --unet_model_config to the matching config: diffusion_model_config_munet for Mustango, diffusion_model_config for Tango.
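
Because the model type and the U-Net config must change together, it can help to build the launch command from a single lookup. The sketch below assembles the same arguments as the training command in this section. The names `UNET_CONFIGS` and `build_train_command` are hypothetical, and the full Tango path (the `configs/` prefix and `.json` suffix) is assumed by analogy with the Mustango path.

```python
import shlex

# Mustango path is taken from the training command in this section.
# The Tango path is an assumption: same directory and extension.
UNET_CONFIGS = {
    "Mustango": "configs/diffusion_model_config_munet.json",
    "Tango": "configs/diffusion_model_config.json",
}


def build_train_command(model_type: str) -> list[str]:
    """Assemble the accelerate launch argv with the matching U-Net config."""
    if model_type not in UNET_CONFIGS:
        raise ValueError(f"unknown model type: {model_type!r}")
    return [
        "accelerate", "launch", "train.py",
        "--text_encoder_name=google/flan-t5-large",
        "--scheduler_name=stabilityai/stable-diffusion-2-1",
        f"--unet_model_config={UNET_CONFIGS[model_type]}",
        "--model_type", model_type,
        "--freeze_text_encoder", "--uncondition_all", "--uncondition_single",
        "--drop_sentences", "--random_pick_text_column", "--snr_gamma", "5",
    ]


# Print a shell-ready command line for training Tango.
print(shlex.join(build_train_command("Tango")))
```

Rejecting unknown model types up front avoids launching a multi-GPU job with a mismatched config.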

The arguments --uncondition_all, --uncondition_single, and --drop_sentences control the dropout functions described in Section 5.2 of our paper. The --random_pick_text_column argument randomly picks between two input text prompts; for MusicBench, these are the ChatGPT-rephrased captions and the original enhanced MusicCaps prompts, as depicted in Figure 1 of our paper.
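
To make the prompt-level options more concrete, here is a toy sketch of what randomly picking a text column and dropping sentences could look like. It is not the repository's implementation: the column names (`main_caption`, `alt_caption`), the 0.5 drop probability, and the period-based sentence split are all illustrative assumptions, and the behaviour actually used in training is the one described in Section 5.2 of the paper.

```python
import random


def pick_caption(row: dict, columns: list[str], rng: random.Random) -> str:
    """Randomly choose one of several alternative captions for a sample."""
    return row[rng.choice(columns)]


def drop_sentences(caption: str, drop_prob: float, rng: random.Random) -> str:
    """Drop each sentence independently with probability drop_prob.

    At least one sentence is always kept so the prompt is never empty.
    """
    # Naive split on periods; real captions may need a better splitter.
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    if not sentences:
        return caption
    kept = [s for s in sentences if rng.random() >= drop_prob]
    if not kept:
        kept = [rng.choice(sentences)]
    return ". ".join(kept) + "."


rng = random.Random(0)  # seeded only to make the demo repeatable
row = {  # hypothetical sample with two alternative captions
    "main_caption": "A calm piano piece. The tempo is slow. The key is C major.",
    "alt_caption": "Slow, calm piano music in C major.",
}
caption = pick_caption(row, ["main_caption", "alt_caption"], rng)
print(drop_sentences(caption, drop_prob=0.5, rng=rng))
```

Augmentations of this kind expose the model to shorter and differently worded prompts during training.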

The recommended training time from scratch on MusicBench is at least 40 epochs.

Model Zoo

We have released the following models:

Mustango Pretrained: https://huggingface.co/declare-lab/mustango-pretrained
Mustango: https://huggingface.co/declare-lab/mustango

📄 License

This project is licensed under the Apache 2.0 license.

📚 Citation

Please consider citing the following article if you find our work useful:

@misc{melechovsky2023mustango,
      title={Mustango: Toward Controllable Text-to-Music Generation}, 
      author={Jan Melechovsky and Zixun Guo and Deepanway Ghosal and Navonil Majumder and Dorien Herremans and Soujanya Poria},
      year={2023},
      eprint={2311.08355},
      archivePrefix={arXiv},
}

Useful Links

🔥 Live demo available on Replicate and HuggingFace.
