🚀 MusicGen - Medium - 1.5B
MusicGen is a text-to-music model that generates high-quality music samples from text descriptions or audio prompts. It streamlines the generation pipeline, offering more efficient and controllable music creation.
🚀 Quick Start
Try Online
You can try out MusicGen yourself through the following links:
- Audiocraft Colab
- Hugging Face Colab
- Hugging Face Demo
Run Locally
You can run MusicGen locally using either the 🤗 Transformers library or the Audiocraft library.
🤗 Transformers Usage
- Installation
First, install the 🤗 Transformers library and scipy:
```bash
pip install --upgrade pip
pip install --upgrade transformers scipy
```
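The Transformers integration also needs a PyTorch backend. If it is not already installed (pick the build matching your CUDA setup; the plain CPU wheel is shown here as an assumption):
```bash
pip install torch
```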
- Inference via the Text-to-Audio (TTA) pipeline
```python
from transformers import pipeline
import scipy

# Load the text-to-audio pipeline with the medium checkpoint
synthesiser = pipeline("text-to-audio", "facebook/musicgen-medium")

# Sampling (do_sample=True) gives more varied, higher-quality outputs
music = synthesiser("lo-fi music with a soothing melody", forward_params={"do_sample": True})

# Save the generated waveform to disk as a .wav file
scipy.io.wavfile.write("musicgen_out.wav", rate=music["sampling_rate"], data=music["audio"])
```
- Inference via the Transformers modelling code
```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load the processor (text tokenizer + audio feature extractor) and the model
processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Tokenize a batch of text prompts
inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

# Generate audio tokens and decode them to waveforms
audio_values = model.generate(**inputs, max_new_tokens=256)
```
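The `max_new_tokens` budget controls clip length: each generated token corresponds to one codec frame. A minimal sketch of the arithmetic, assuming the frame rate is exposed on the audio encoder config (read it rather than hard-coding 50 Hz, in case it differs by version):
```python
# Token budget -> audio duration: tokens / codec frame rate
frame_rate = model.config.audio_encoder.frame_rate  # ~50 Hz for musicgen-medium
duration_s = 256 / frame_rate                       # 256 tokens is roughly 5 seconds
print(f"~{duration_s:.1f} s of audio")
```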
- Listen to the audio samples
In a Jupyter (ipynb) notebook:
```python
from IPython.display import Audio

# Play the first generated sample inline
sampling_rate = model.config.audio_encoder.sampling_rate
Audio(audio_values[0].numpy(), rate=sampling_rate)
```
Or save them as a .wav file using scipy:
```python
import scipy

# Write the first sample (batch 0, channel 0) to disk
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```
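MusicGen can also continue an audio prompt. A minimal sketch, assuming the processor accepts an `audio` argument alongside `text`; the sine tone here is only a placeholder prompt, not meaningful input:
```python
import numpy as np

# Placeholder prompt: 2 seconds of a 220 Hz sine tone at the model's rate.
# Replace with real audio for meaningful continuations.
sampling_rate = model.config.audio_encoder.sampling_rate
prompt = np.sin(2 * np.pi * 220 * np.arange(2 * sampling_rate) / sampling_rate)

inputs = processor(
    audio=prompt,
    sampling_rate=sampling_rate,
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt",
)
audio_values = model.generate(**inputs, do_sample=True, max_new_tokens=256)
```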
Audiocraft Usage
- Installation
First, install the audiocraft library:
```bash
pip install git+https://github.com/facebookresearch/audiocraft.git
```
- Install ffmpeg:
```bash
apt-get install ffmpeg
```
- Run the Python code
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the pretrained medium (1.5B) checkpoint
model = MusicGen.get_pretrained("medium")
model.set_generation_params(duration=8)  # generate 8-second clips

descriptions = ["happy rock", "energetic EDM"]
wav = model.generate(descriptions)  # one waveform per description

# Save each sample with loudness normalization
for idx, one_wav in enumerate(wav):
    audio_write(f"{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```
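Audiocraft also supports audio-prompted continuation. A hedged sketch using `generate_continuation` (check the installed version's API; the sine-wave prompt is just a placeholder):
```python
import torch

# Placeholder prompt: 2 seconds of a 220 Hz sine tone at the model's rate
sr = model.sample_rate
t = torch.arange(2 * sr) / sr
prompt = torch.sin(2 * torch.pi * 220 * t).unsqueeze(0).unsqueeze(0)  # (batch, channels, samples)

# Continue the prompt, optionally steering with a text description
continued = model.generate_continuation(prompt, sr, descriptions=["happy rock"])
audio_write("continuation", continued[0].cpu(), model.sample_rate, strategy="loudness")
```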
✨ Features
- Single-stage Generation: Unlike some existing methods, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass.
- Parallel Prediction: By introducing a small delay between the codebooks, it can predict them in parallel, reducing the number of auto-regressive steps per second of audio (a toy illustration of this delay pattern follows below).
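For intuition, here is a toy illustration of the codebook delay pattern, not Audiocraft's actual implementation: shifting codebook k right by k steps lets all four codebooks be predicted in the same auto-regressive pass.
```python
import numpy as np

T, K = 6, 4                              # timesteps, codebooks
tokens = np.arange(K * T).reshape(K, T)  # stand-in token ids per codebook

# Apply a one-step delay per codebook: row k is shifted right by k
delayed = np.full((K, T + K - 1), -1)    # -1 marks "no token yet" padding
for k in range(K):
    delayed[k, k:k + T] = tokens[k]
print(delayed)
```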
📚 Documentation
Model details
| Property | Details |
|----------|---------|
| Organization developing the model | The FAIR team of Meta AI |
| Model date | Trained between April 2023 and May 2023 |
| Model version | Version 1 of the model |
| Model type | Consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for music modeling. Comes in different sizes (300M, 1.5B and 3.3B parameters) and two variants (text-to-music generation and melody-guided music generation) |
| Paper or resources for more information | Simple and Controllable Music Generation |
| Citation details | @misc{copet2023simple, title={Simple and Controllable Music Generation}, author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez}, year={2023}, eprint={2306.05284}, archivePrefix={arXiv}, primaryClass={cs.SD}} |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Via the GitHub repository of the project, or by opening an issue |
Intended use
- Primary intended use: Research on AI-based music generation, including exploring model limitations and generating music guided by text or melody.
- Primary intended users: Researchers in audio, machine learning, and artificial intelligence, as well as amateurs interested in understanding generative models.
- Out-of-scope use cases: Should not be used in downstream applications without further risk evaluation and mitigation. Avoid using it to create or disseminate offensive or biased music.
Metrics
- Model performance measures:
  - Objective measures: Fréchet Audio Distance, Kullback-Leibler Divergence (KLD), CLAP Score; a conceptual sketch of the KLD idea follows below.
  - Qualitative studies: Evaluated on overall music quality, text relevance, and adherence to melody.
- Decision thresholds: Not applicable.
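As a conceptual sketch only (the actual metric compares label distributions produced by an audio classifier on generated vs. reference audio), the KLD computation boils down to:
```python
import numpy as np

def kl_divergence(p, q, eps=1e-10):
    """KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Lower is better: generated audio whose classifier labels closely match
# the reference distribution yields a small divergence
print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))
```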
Evaluation datasets
Evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training datasets
Trained on licensed data from Meta Music Initiative Sound Collection, Shutterstock music collection, and Pond5 music collection.
Evaluation results
| Model | Fréchet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|-------|------------------------|-----|------------------|--------------------------|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
Limitations and biases
- Data: The training data may lack diversity, and not all music cultures are equally represented.
- Limitations: The model cannot generate realistic vocals, performs better with English descriptions, may not work equally well across all music styles and cultures, sometimes produces silent endings, and often requires prompt engineering to obtain satisfying results.
- Biases: Generated samples may reflect biases in the training data.
- Risks and harms: Biased or inappropriate samples may be generated.
- Use cases: Users should be aware of the risks and not use the model in downstream applications without further investigation.
⚠️ Important Note
Biases and limitations of the model may lead to generation of samples that are considered biased, inappropriate, or offensive. Users must be aware of these issues and should not use MusicGen in downstream applications without further investigation and mitigation of risks.
💡 Usage Tip
Prompt engineering may be required to obtain satisfying results with MusicGen. Try several phrasings of the same idea, including details such as genre, instrumentation, mood, and tempo; a sketch of this workflow follows below.
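As a hedged sketch, one practical workflow is to batch several phrasings of the same idea and keep the result you prefer (the prompt wording below is purely illustrative):
```python
from transformers import pipeline
import scipy

synthesiser = pipeline("text-to-audio", "facebook/musicgen-medium")

# Several phrasings of the same musical idea, from terse to detailed
prompts = [
    "lo-fi hip hop beat",
    "lo-fi hip hop beat with mellow piano and vinyl crackle, 80 bpm",
    "chill instrumental lo-fi track, warm bass, soft drums, relaxing mood",
]
for i, prompt in enumerate(prompts):
    out = synthesiser(prompt, forward_params={"do_sample": True})
    scipy.io.wavfile.write(f"prompt_{i}.wav", rate=out["sampling_rate"], data=out["audio"])
```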