🚀 MusicGen - Melody - Large 3.3B
MusicGen is a text-to-music model that can generate high-quality music samples based on text descriptions or audio prompts. It simplifies the music generation process and offers a more efficient way to create music with specific characteristics.
🚀 Quick Start
You can try out MusicGen in several ways. To run it locally, follow the steps below.
📦 Installation
- First, install the `audiocraft` library:

```bash
pip install git+https://github.com/facebookresearch/audiocraft.git
```

- Then ensure that `ffmpeg` is installed:

```bash
apt-get install ffmpeg
```
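As a quick sanity check (a minimal sketch; `audiocraft.__version__` is assumed to be exposed by the installed package), you can confirm that the package imports and whether a GPU is visible:

```python
# Sanity check: confirm audiocraft imports and report the compute device.
import torch
import audiocraft

print("audiocraft:", audiocraft.__version__)          # installed version
print("CUDA available:", torch.cuda.is_available())   # generation is much faster on a GPU
```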
💻 Usage Examples
Basic Usage
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-capable checkpoint described by this card.
model = MusicGen.get_pretrained('facebook/musicgen-melody-large')
model.set_generation_params(duration=8)  # generate 8-second clips

descriptions = ['happy rock', 'energetic EDM', 'sad jazz']

# Load a reference melody and condition generation on its chroma,
# broadcasting the single melody across all three descriptions.
melody, sr = torchaudio.load('./assets/bach.mp3')
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Save each sample as {idx}.wav with loudness normalization.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
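Melody conditioning is optional: the same checkpoint also does plain text-to-music generation via `MusicGen.generate`, which takes only the text descriptions (a minimal sketch reusing the `model` loaded above):

```python
# Text-only generation: no reference melody required.
wav = model.generate(['lo-fi hip hop with mellow piano'])
audio_write('lofi', wav[0].cpu(), model.sample_rate, strategy="loudness")
```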
✨ Features
- Single-stage Generation: Unlike some existing methods such as MusicLM, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass.
- Parallel Prediction: By introducing a small delay between the codebooks, MusicGen can predict them in parallel, requiring only 50 auto-regressive steps per second of audio (a toy illustration of the delay pattern follows the model list below).
- Multiple Pre-trained Models: We offer 10 pre-trained models, including different sizes and variants for text-to-music and text+melody-to-music generation.
The pre-trained models are:

- `facebook/musicgen-small`: 300M model, text to music only
- `facebook/musicgen-medium`: 1.5B model, text to music only
- `facebook/musicgen-melody`: 1.5B model, text to music and text+melody to music
- `facebook/musicgen-large`: 3.3B model, text to music only
- `facebook/musicgen-melody-large`: 3.3B model, text to music and text+melody to music
- `facebook/musicgen-stereo-*`: all of the above fine-tuned for stereo generation (small, medium, large, melody, melody large)

All of these are available on the 🤗 Hub.
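The delay pattern behind parallel prediction can be pictured with a toy sketch (illustrative only, not the actual audiocraft implementation): each of the 4 codebooks is shifted one step relative to the previous one, so at decoding step t the model emits codebook k's token for audio frame t - k, and all codebooks are sampled in the same forward pass.

```python
# Toy illustration of the codebook delay pattern (not the real implementation).
# With 4 codebooks and a delay of 1 step between consecutive codebooks, the
# token emitted at step t for codebook k corresponds to audio frame t - k.
NUM_CODEBOOKS = 4

def delayed_schedule(num_steps):
    """Yield, per decoding step, the audio frame each codebook predicts."""
    for t in range(num_steps):
        # None means this codebook has nothing to predict yet.
        yield [t - k if t - k >= 0 else None for k in range(NUM_CODEBOOKS)]

for t, frames in enumerate(delayed_schedule(5)):
    print(f"step {t}: frames per codebook {frames}")
# step 0: frames per codebook [0, None, None, None]
# step 1: frames per codebook [1, 0, None, None]
# step 2: frames per codebook [2, 1, 0, None]
# step 3: frames per codebook [3, 2, 1, 0]
# step 4: frames per codebook [4, 3, 2, 1]
```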
📚 Documentation
Model details
| Property | Details |
|------|------|
| Organization developing the model | The FAIR team of Meta AI |
| Model date | MusicGen was trained between April 2023 and May 2023 |
| Model version | This is version 1 of the model |
| Model type | MusicGen consists of an EnCodec model for audio tokenization and an auto-regressive, transformer-based language model for music modeling. The model comes in different sizes (300M, 1.5B and 3.3B parameters) and two variants: a model trained for the text-to-music generation task and a model trained for melody-guided music generation |
| Paper or resources for more information | More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) |
| Citation details | `@misc{copet2023simple, title={Simple and Controllable Music Generation}, author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez}, year={2023}, eprint={2306.05284}, archivePrefix={arXiv}, primaryClass={cs.SD}}` |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Questions and comments about MusicGen can be sent via the [GitHub repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue |
Intended use
- Primary intended use: The primary use of MusicGen is research on AI-based music generation, including efforts to understand the limitations of generative models and generation of music guided by text or melody, so that machine-learning amateurs can explore the current abilities of such models.
- Primary intended users: Researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models.
- Out-of-scope use cases: The model should not be used on downstream applications without further risk evaluation and mitigation. It should not be used to create or disseminate music that creates hostile or alienating environments for people.
Metrics
- Models performance measures:
- Objective measures: Fréchet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish), Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST), and CLAP score between audio and text embeddings extracted from a pre-trained CLAP model.
- Qualitative studies: Overall quality of the music samples, text relevance to the provided text input, adherence to the melody for melody-guided music generation.
- Decision thresholds: Not applicable.
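The CLAP score in particular is just a cosine similarity between paired audio and text embeddings, averaged over the evaluation set. A minimal sketch of the computation (the tensors here are random stand-ins for embeddings extracted with a pre-trained CLAP model):

```python
# Sketch of the CLAP score: mean cosine similarity between paired audio and
# text embeddings. Random tensors stand in for real CLAP embeddings.
import torch
import torch.nn.functional as F

audio_emb = torch.randn(8, 512)  # stand-in: CLAP embeddings of 8 generated clips
text_emb = torch.randn(8, 512)   # stand-in: CLAP embeddings of the 8 prompts

clap_score = F.cosine_similarity(audio_emb, text_emb, dim=-1).mean()
print(f"CLAP score: {clap_score:.3f}")
```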
Evaluation datasets
The model was evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training datasets
The model was trained on licensed data from the Meta Music Initiative Sound Collection, Shutterstock music collection and the Pond5 music collection.
Evaluation results
| Model | Fréchet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|------|------|------|------|------|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
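Chroma cosine similarity measures how closely the generated audio tracks the reference melody's chroma (pitch-class energy over time). A minimal sketch of the idea, assuming `librosa` is available (frame alignment and the exact chroma settings of the paper's evaluation are simplified here):

```python
# Hypothetical sketch of chroma cosine similarity between a reference melody
# and a generated clip (simplified relative to the paper's evaluation).
import librosa
import numpy as np

def chroma_cosine_similarity(ref_path, gen_path, sr=32000):
    ref, _ = librosa.load(ref_path, sr=sr)
    gen, _ = librosa.load(gen_path, sr=sr)
    # 12-bin chroma features, shape (12, num_frames)
    ref_chroma = librosa.feature.chroma_stft(y=ref, sr=sr)
    gen_chroma = librosa.feature.chroma_stft(y=gen, sr=sr)
    # Truncate to the shorter clip, then average frame-wise cosine similarity.
    n = min(ref_chroma.shape[1], gen_chroma.shape[1])
    ref_chroma, gen_chroma = ref_chroma[:, :n], gen_chroma[:, :n]
    num = (ref_chroma * gen_chroma).sum(axis=0)
    den = np.linalg.norm(ref_chroma, axis=0) * np.linalg.norm(gen_chroma, axis=0) + 1e-8
    return float((num / den).mean())
```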
Limitations and biases
- Data: The data sources are created by music professionals and covered by legal agreements. Scaling the model on larger datasets may improve performance.
- Mitigations: Vocals have been removed using tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
- Limitations:
- Unable to generate realistic vocals.
- Performs better with English descriptions.
- Uneven performance across different music styles and cultures.
- Sometimes generates end-of-song silence.
- Difficult to determine optimal text descriptions, may require prompt engineering.
- Biases: The data source may lack diversity, and the model may not perform equally well on all music genres. Generated samples may reflect training data biases.
- Risks and harms: Biases and limitations may lead to the generation of inappropriate or offensive samples.
⚠️ Important Note
Users must be aware of the biases, limitations and risks of the model. MusicGen is developed for artificial intelligence research on controllable music generation and should not be used for downstream applications without further investigation and risk mitigation.
MusicGen was published in [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.