🚀 MusicGen - Small - 300M
MusicGen is a text-to-music model that generates high-quality music samples from text descriptions or audio prompts. Its single-stage design simplifies the generation process and offers an efficient way to create music from text; this checkpoint is the small, 300M-parameter text-to-music variant.
🚀 Quick Start
You can quickly start using MusicGen through any of the following:
- Audiocraft Colab
- Hugging Face Colab
- Hugging Face Demo
✨ Features
- Single-stage Generation: Unlike some existing methods, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass.
- Parallel Prediction: By introducing a small delay between the codebooks, it can predict them in parallel, with only 50 auto-regressive steps per second of audio (see the sketch below).
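To make the parallel-prediction idea concrete, here is a minimal, illustrative sketch of the codebook delay pattern. It is not the official Audiocraft implementation, and the `PAD` placeholder is hypothetical: with 4 codebooks each shifted by one step, every auto-regressive step predicts one token from each codebook, so a 50 Hz token stream needs only 50 steps per second of audio.

```python
# Illustrative sketch of the codebook delay pattern (not the official
# Audiocraft code). Codebook k is shifted right by k steps, so at each
# auto-regressive step the model predicts one token per codebook.
K, T = 4, 10   # 4 codebooks, 10 EnCodec frames (~0.2 s at 50 Hz)
PAD = "."      # hypothetical placeholder for shifted-in positions

rows = [[PAD] * k + [f"t{t}" for t in range(T)] + [PAD] * (K - 1 - k) for k in range(K)]
for k, row in enumerate(rows):
    print(f"codebook {k}: " + " ".join(f"{x:>3}" for x in row))
# Each printed column is one decoding step; frame t of codebook k is
# generated at step t + k, giving T + K - 1 steps for T frames.
```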
📦 Installation
Using 🤗 Transformers Library
- Install the 🤗 Transformers library and scipy:

```bash
pip install --upgrade pip
pip install --upgrade transformers scipy
```
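As an optional sanity check that the installation worked (this snippet is illustrative, not part of the original instructions):

```python
# Optional: verify both packages import and print their versions
import transformers
import scipy

print("transformers", transformers.__version__)
print("scipy", scipy.__version__)
```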
Using Audiocraft Library
- Install the `audiocraft` library:

```bash
pip install git+https://github.com/facebookresearch/audiocraft.git
```

- Install `ffmpeg`:

```bash
apt-get install ffmpeg
```
💻 Usage Examples
🤗 Transformers Library
Basic Usage
```python
from transformers import pipeline
import scipy.io.wavfile  # explicit submodule import so scipy.io.wavfile is available

# Load the text-to-audio pipeline with the MusicGen small checkpoint
synthesiser = pipeline("text-to-audio", "facebook/musicgen-small")

# Generate audio from a text prompt (sampling gives more varied results)
music = synthesiser("lo-fi music with a soothing melody", forward_params={"do_sample": True})

scipy.io.wavfile.write("musicgen_out.wav", rate=music["sampling_rate"], data=music["audio"])
```
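The pipeline returns a dictionary containing the generated waveform and its sampling rate. If you want to inspect it before saving, something like the following works; treat it as a sketch, since the exact array shape can vary across transformers versions:

```python
# Inspect the pipeline output before writing it to disk
print(music["sampling_rate"])                      # MusicGen generates 32 kHz audio
print(type(music["audio"]), music["audio"].shape)  # numpy waveform array
```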
Advanced Usage
```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Tokenize a batch of text prompts (padding makes them the same length)
inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs, max_new_tokens=256)
```
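Generation length is controlled by `max_new_tokens`: MusicGen produces tokens at a 50 Hz frame rate, so 256 tokens corresponds to roughly 5 seconds of audio. A small sketch, assuming the EnCodec config exposes `frame_rate` as in recent transformers versions:

```python
# Convert a token budget into an approximate audio duration
frame_rate = model.config.audio_encoder.frame_rate  # 50 Hz for MusicGen
print(f"max_new_tokens=256 -> ~{256 / frame_rate:.1f} s of audio")  # ~5.1 s
```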
Listening to Audio Samples
```python
from IPython.display import Audio

# Play the first generated sample in a notebook
sampling_rate = model.config.audio_encoder.sampling_rate
Audio(audio_values[0].numpy(), rate=sampling_rate)
```
Saving as .wav File
```python
import scipy.io.wavfile

# Write the first batch item, first channel, to disk
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```
Audiocraft Library
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # generate 8 seconds of audio

descriptions = ["happy rock", "energetic EDM"]
wav = model.generate(descriptions)  # one sample per description

for idx, one_wav in enumerate(wav):
    # Saves under {idx}.wav, with loudness normalization
    audio_write(f"{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```
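The Audiocraft API also supports generation without text conditioning. A short sketch extending the example above; check `MusicGen.generate_unconditional` in your installed version, as the API may change:

```python
# Generate two samples without any text conditioning
unconditional_wav = model.generate_unconditional(2)
for idx, one_wav in enumerate(unconditional_wav):
    audio_write(f"unconditional_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```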
📚 Documentation
Model Details
| Property | Details |
| --- | --- |
| Organization developing the model | The FAIR team of Meta AI |
| Model date | Trained between April 2023 and May 2023 |
| Model version | Version 1 |
| Model type | Consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for music modeling. Comes in different sizes (300M, 1.5B, and 3.3B parameters) and two variants (text-to-music generation and melody-guided music generation) |
| Paper or resources for more information | [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) |
| Citation details | See the BibTeX entry below |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Via the GitHub repository of the project, or by opening an issue |

```bibtex
@misc{copet2023simple,
      title={Simple and Controllable Music Generation},
      author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
      year={2023},
      eprint={2306.05284},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```
Intended Use
- Primary intended use: Research on AI-based music generation, including research efforts such as probing the limitations of generative models, and music generation guided by text or melody for amateurs.
- Primary intended users: Researchers in audio, machine learning, and artificial intelligence, as well as amateurs.
- Out-of-scope use cases: The model should not be used in downstream applications without further risk evaluation and mitigation, and should not be used to create or disseminate harmful music.
Metrics
- Model performance measures: Frechet Audio Distance, Kullback-Leibler Divergence, CLAP Score, and qualitative studies on overall quality, text relevance, and melody adherence.
- Decision thresholds: Not applicable.
Evaluation Datasets
Evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set.
Training Datasets
Trained on licensed data from Meta Music Initiative Sound Collection, Shutterstock music collection and Pond5 music collection.
Evaluation Results
| Model | Frechet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
| --- | --- | --- | --- | --- |
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
Limitations and Biases
- Data: The data source may lack diversity, and not all music cultures are equally represented.
- Mitigations: Vocals are removed using tags and a music source separation method.
- Limitations: Can't generate realistic vocals, performs better with English descriptions, has uneven performance across music styles, may generate silent endings, and prompt engineering may be needed.
- Biases: The model reflects biases from the training data.
- Risks and harms: May generate biased, inappropriate or offensive samples.
- Use cases: Users should be aware of risks and not use it for downstream applications without further investigation.