🚀 MusicGen - Melody - 1.5B
Audiocraft provides the code and models for MusicGen, a simple and controllable model for music generation. MusicGen streamlines the music-generation process while giving users fine-grained control over the output.
MusicGen is a single-stage auto-regressive Transformer model. It is trained using a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike other methods such as MusicLM, MusicGen doesn't need a self-supervised semantic representation and can generate all 4 codebooks in one pass. By introducing a small delay between the codebooks, it can predict them in parallel, resulting in only 50 auto-regressive steps per second of audio.
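As a rough, hedged illustration of this token arithmetic, the sketch below shows how a per-codebook delay lets all 4 codebooks be predicted in parallel while keeping roughly 50 auto-regressive steps per second. It is not MusicGen's actual implementation; `delay_pattern`, `FRAME_RATE`, and `NUM_CODEBOOKS` are illustrative names.

```python
# Sketch of the "delay" interleaving: 4 EnCodec codebooks at 50 Hz are offset by
# one step each, so all 4 can be predicted in parallel while keeping roughly
# 50 auto-regressive steps per second of audio.
FRAME_RATE = 50       # EnCodec frames per second of audio
NUM_CODEBOOKS = 4     # codebooks predicted per frame
PAD = -1              # placeholder for positions outside the pattern

def delay_pattern(num_frames: int) -> list[list[int]]:
    """Arrange frame indices so codebook k at frame t is emitted at step t + k."""
    num_steps = num_frames + NUM_CODEBOOKS - 1
    pattern = [[PAD] * num_steps for _ in range(NUM_CODEBOOKS)]
    for k in range(NUM_CODEBOOKS):
        for t in range(num_frames):
            pattern[k][t + k] = t
    return pattern

seconds = 8
frames = seconds * FRAME_RATE        # 400 EnCodec frames
steps = frames + NUM_CODEBOOKS - 1   # 403 auto-regressive steps (~50 per second)
print(f"{frames} frames -> {steps} steps (~{steps / seconds:.1f} steps/s)")
```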
MusicGen was introduced in the paper Simple and Controllable Music Generation by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez.
Four checkpoints have been released:
- small: 300M parameters, text-to-music only
- medium: 1.5B parameters, text-to-music only
- large: 3.3B parameters, text-to-music only
- melody: 1.5B parameters, text-to-music and melody-guided generation (this checkpoint)
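The snippet below is a minimal sketch of switching between these checkpoints. It assumes the short names above are the ones accepted by `MusicGen.get_pretrained`; newer Audiocraft releases may instead expect full Hugging Face model IDs such as `facebook/musicgen-melody`.

```python
from audiocraft.models import MusicGen

# Load one of the released checkpoints by name (names assumed to match the list above).
model = MusicGen.get_pretrained('melody')  # or 'small', 'medium', 'large'

# Note: only the melody checkpoint supports melody conditioning (generate_with_chroma);
# the other checkpoints are text-to-music only.
```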
🚀 Quick Start
Try it Online
- A hosted demo is linked from the Audiocraft repository on GitHub.
Run Locally
- First, install the `audiocraft` library:

```bash
pip install git+https://github.com/facebookresearch/audiocraft.git
```
- Ensure that `ffmpeg` is installed:

```bash
apt-get install ffmpeg
```
- Execute the following Python code:
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody checkpoint and generate 8 seconds of audio per prompt.
model = MusicGen.get_pretrained('melody')
model.set_generation_params(duration=8)

descriptions = ['happy rock', 'energetic EDM', 'sad jazz']

# Load a reference melody and condition generation on both text and chroma.
melody, sr = torchaudio.load('./assets/bach.mp3')
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Saves each sample as {idx}.wav with loudness normalization.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
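The defaults can be tuned further. The following is a hedged sketch that continues from the snippet above; the argument names follow Audiocraft's `set_generation_params`, and the exact set accepted may vary between versions.

```python
# Optional sampling controls (a sketch; exact arguments may differ across versions).
model.set_generation_params(
    use_sampling=True,  # sample instead of greedy decoding
    top_k=250,          # restrict sampling to the 250 most likely tokens
    temperature=1.0,    # softmax temperature
    cfg_coef=3.0,       # classifier-free guidance strength
    duration=8,         # seconds of audio to generate
)

# Text-only generation (no melody conditioning) also works with this checkpoint.
wav = model.generate(['lofi hip hop beat with soft piano'])
audio_write('text_only', wav[0].cpu(), model.sample_rate, strategy="loudness")
```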
✨ Features
- Single-stage generation: MusicGen generates all 4 codebooks in one pass, eliminating the need for a self-supervised semantic representation.
- Parallel prediction: By introducing a small delay between codebooks, it predicts them in parallel, reducing the number of auto-regressive steps.
- Multiple checkpoints: Four checkpoints (small, medium, large, and melody) are available for different use cases.
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Organization developing the model | The FAIR team of Meta AI |
| Model date | Trained between April 2023 and May 2023 |
| Model version | Version 1 |
| Model type | Consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture. Comes in different sizes (300M, 1.5B, 3.3B parameters) and two variants (text-to-music and melody-guided). |
| Paper or resources for more information | Simple and Controllable Music Generation |
| Citation details | @misc{copet2023simple, title={Simple and Controllable Music Generation}, author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez}, year={2023}, eprint={2306.05284}, archivePrefix={arXiv}, primaryClass={cs.SD}} |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Via the GitHub repository of the project, or by opening an issue |
Intended Use
- Primary intended use: Research on AI-based music generation, including probing the limitations of generative models and generating music guided by text or melody for learning purposes.
- Primary intended users: Researchers in audio, machine learning, and artificial intelligence, as well as amateurs seeking to understand generative AI models.
- Out-of-scope use cases: The model should not be used in downstream applications without further risk evaluation, and should not be used to generate music that creates hostile or alienating environments or propagates stereotypes.
Metrics
- Model performance measures:
  - Fréchet Audio Distance computed on features from a pre-trained audio classifier (VGGish); a sketch of this computation follows the list below.
  - Kullback-Leibler divergence on label distributions from a pre-trained audio classifier (PaSST).
  - CLAP score between audio and text embeddings from a pre-trained CLAP model.
  - Qualitative studies with human participants covering overall quality, text relevance, and melody adherence.
- Decision thresholds: Not applicable.
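As referenced above, here is a minimal sketch of the two automatic metrics that operate directly on embeddings. It assumes the embeddings have already been extracted with the respective pre-trained models (VGGish for the distance, CLAP for the similarity) and is not the evaluation code used in the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between two sets of embeddings, each of shape (num_clips, dim)."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Mean cosine similarity between paired audio and text embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float((a * t).sum(axis=1).mean())
```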
Evaluation Datasets
The model was evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training Datasets
The model was trained on licensed data from the Meta Music Initiative Sound Collection, the Shutterstock music collection, and the Pond5 music collection.
Evaluation Results
| Model | Fréchet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|-------|------------------------|-----|------------------|--------------------------|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
Limitations and Biases
- Limitations:
  - The model cannot generate realistic vocals.
  - It performs better with English descriptions.
  - Performance is uneven across music styles and cultures.
  - It sometimes generates silent endings.
  - Prompt engineering may be needed to obtain satisfying results (see the short example after this section).
- Biases: The training data may lack diversity, and the model may reflect these biases in its output.
- Risks and harms: Biased or otherwise inappropriate samples may be generated.
- Use cases: Users should be aware of these biases, limitations, and risks, and should not use the model in downstream applications without further investigation.
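As noted in the limitations above, prompt engineering can matter. The following is a small, hedged illustration that reuses the `model` from the Quick Start snippet; the prompts are arbitrary examples, not ones used in the paper.

```python
# Vague prompts tend to produce generic results; adding genre, instrumentation,
# tempo, and mood usually steers the model more reliably.
vague_prompts = ['rock']
detailed_prompts = ['90s rock song with loud distorted guitars, heavy drums and an energetic chorus']

model.set_generation_params(duration=8)
wav = model.generate(detailed_prompts)
audio_write('detailed_prompt', wav[0].cpu(), model.sample_rate, strategy="loudness")
```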
📄 License
The code is released under the MIT license, and the model weights are released under the CC-BY-NC 4.0 license.