# 🚀 MusicGen Stereo Melody Large 3.3B
MusicGen is a text-to-music model that generates high-quality music samples from text descriptions or audio prompts. It simplifies the music generation process and offers multiple pre-trained models for different use cases. This release includes stereophonic-capable models fine-tuned from the mono models; they share similar capabilities and limitations with the base models.
## 🚀 Quick Start
You can try out MusicGen in multiple ways:

- **Online demos**: interactive demos are linked from the [audiocraft repository](https://github.com/facebookresearch/audiocraft).
- **Local run**:
  1. Install the `audiocraft` library:
     ```bash
     pip install git+https://github.com/facebookresearch/audiocraft.git
     ```
  2. Make sure `ffmpeg` is installed:
     ```bash
     apt-get install ffmpeg
     ```
  3. Run the Python example shown under 💻 Usage Examples below.
## ✨ Features

- **Stereophonic models**: a set of stereophonic-capable models, fine-tuned from the mono models for 200k updates.
- **Single-stage generation**: unlike some existing methods, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass.
- **Parallel prediction**: by introducing a small delay between the codebooks, MusicGen can predict them in parallel, with only 50 auto-regressive steps per second of audio (see the sketch after this list).
- **Multiple pre-trained models**: 10 pre-trained models are available, covering different sizes (300M, 1.5B, and 3.3B parameters) and two variants (text-to-music and melody-guided music generation).
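To make the delay pattern concrete, here is a minimal, self-contained sketch of the idea (a toy illustration, not audiocraft's actual implementation): with `K` codebooks and a one-step delay per codebook, `T` frames of audio need only `T + K - 1` auto-regressive steps, since each step advances all codebook streams at once.

```python
# Toy sketch of the codebook delay pattern (illustrative, not audiocraft's code).
K = 4          # number of EnCodec codebooks
T = 6          # number of audio frames to generate
PAD = "."      # placeholder before a codebook's stream has started

# tokens[k][t] would come from the language model; here we use dummy labels.
tokens = [[f"c{k}t{t}" for t in range(T)] for k in range(K)]

# Build the delayed grid: T + K - 1 auto-regressive steps cover all T frames,
# because codebook k at step s holds the token for frame s - k.
steps = T + K - 1
grid = [[PAD] * steps for _ in range(K)]
for k in range(K):
    for t in range(T):
        grid[k][t + k] = tokens[k][t]

for k, row in enumerate(grid):
    print(f"codebook {k}:", " ".join(f"{x:>4}" for x in row))
```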
## 💻 Usage Examples

### Basic Usage
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-guided model and generate 8-second clips.
model = MusicGen.get_pretrained('melody')
model.set_generation_params(duration=8)

descriptions = ['happy rock', 'energetic EDM', 'sad jazz']

# Load a melody prompt and repeat it once per description.
melody, sr = torchaudio.load('./assets/bach.mp3')
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Save each clip as {idx}.wav with loudness normalization.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
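For purely text-conditioned generation (no melody prompt), the model also exposes a `generate` method. A short sketch, reusing `model`, `descriptions`, and `audio_write` from the example above:

```python
# Text-only generation: descriptions condition the model, no audio prompt.
wav = model.generate(descriptions)

for idx, one_wav in enumerate(wav):
    audio_write(f'text_only_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```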
## 📚 Documentation

### Model details

| Property | Details |
|------|------|
| Organization developing the model | The FAIR team of Meta AI. |
| Model date | MusicGen was trained between April 2023 and May 2023. |
| Model version | This is version 1 of the model. |
| Model type | MusicGen consists of an EnCodec model for audio tokenization and an auto-regressive, transformer-based language model for music modeling. It comes in different sizes (300M, 1.5B, and 3.3B parameters) and two variants (text-to-music and melody-guided music generation). |
| Paper or resources for more information | More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284). |
| Citation details | `@misc{copet2023simple, title={Simple and Controllable Music Generation}, author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez}, year={2023}, eprint={2306.05284}, archivePrefix={arXiv}, primaryClass={cs.SD}}` |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0. |
| Where to send questions or comments about the model | Via the [GitHub repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue. |
### Intended use

- **Primary intended use**: research on AI-based music generation, including probing model limitations and generating music guided by text or melody.
- **Primary intended users**: researchers in audio, machine learning, and artificial intelligence, as well as amateurs interested in understanding these models.
- **Out-of-scope use cases**: the model should not be used in downstream applications without further risk evaluation and mitigation. It should not be used to create or disseminate offensive music.
### Metrics

- **Model performance measures**:
  - Fréchet Audio Distance computed on features from a pre-trained audio classifier (VGGish).
  - Kullback-Leibler divergence on label distributions from a pre-trained audio classifier (PaSST); a sketch of this measure follows the list.
  - CLAP score between the audio embedding and the text embedding from a pre-trained CLAP model.
  - Qualitative studies with human participants on music quality, text relevance, and melody adherence.
- **Decision thresholds**: not applicable.
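As a rough illustration of the KL measure (a sketch, not the paper's evaluation code), compare the label distributions a classifier assigns to a reference track and a generated track; the 527-class AudioSet label space below matches PaSST's output, while the random inputs are placeholders:

```python
import torch

def kl_divergence(p_ref: torch.Tensor, p_gen: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL(p_ref || p_gen) between two label distributions."""
    p_ref = p_ref.clamp_min(eps)
    p_gen = p_gen.clamp_min(eps)
    return (p_ref * (p_ref / p_gen).log()).sum(dim=-1)

# Placeholder distributions standing in for classifier outputs on two tracks.
p_ref = torch.softmax(torch.randn(527), dim=-1)  # labels for the reference track
p_gen = torch.softmax(torch.randn(527), dim=-1)  # labels for the generated track
print(kl_divergence(p_ref, p_gen))  # lower = closer label distributions
```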
### Evaluation datasets

The model was evaluated on the MusicCaps benchmark and on an in-domain held-out evaluation set with no artist overlap with the training set.
### Training datasets

The model was trained on licensed data from the Meta Music Initiative Sound Collection, the Shutterstock music collection, and the Pond5 music collection.
### Evaluation results

| Model | Fréchet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|------|------|------|------|------|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
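The Chroma Cosine Similarity column measures how closely generated audio follows the melody prompt. A rough sketch of the idea using librosa (an illustration under assumed file paths and MusicGen's 32 kHz sample rate, not the paper's exact evaluation code):

```python
import librosa
import numpy as np

def chroma_cosine_similarity(ref_path: str, gen_path: str) -> float:
    """Mean frame-wise cosine similarity between two tracks' chromagrams."""
    ref, sr = librosa.load(ref_path, sr=32000)
    gen, _ = librosa.load(gen_path, sr=32000)
    n = min(len(ref), len(gen))  # compare only the overlapping duration
    c_ref = librosa.feature.chroma_stft(y=ref[:n], sr=sr)  # shape (12, frames)
    c_gen = librosa.feature.chroma_stft(y=gen[:n], sr=sr)
    num = (c_ref * c_gen).sum(axis=0)
    den = np.linalg.norm(c_ref, axis=0) * np.linalg.norm(c_gen, axis=0) + 1e-8
    return float((num / den).mean())

# Hypothetical file names; any melody prompt / generated clip pair works.
# print(chroma_cosine_similarity('melody_prompt.wav', 'generated.wav'))
```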
### Limitations and biases

- **Data**: the model was trained on 20K hours of data from music professionals. Scaling to larger datasets may further improve performance.
- **Mitigations**: vocals were removed from the data using corresponding tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
- **Limitations**:
  - Unable to generate realistic vocals.
  - Performs better with English descriptions.
  - Uneven performance across different music styles and cultures.
  - Sometimes generates silent endings.
  - It is hard to know which text descriptions will work best.
- **Biases**: the data source may lack diversity, and generated samples may reflect biases in the training data.
- **Risks and harms**: biased or inappropriate samples may be generated.
- **Use cases**: users should be aware of the biases, limitations, and risks. The model should not be used in downstream applications without further investigation.
## 📄 License

Code is released under MIT; model weights are released under CC-BY-NC 4.0.