🚀 MusicGen - Melody - 1.5B
Audiocraft provides the code and models for MusicGen, a simple and controllable model for music generation. MusicGen streamlines the music-generation process while giving users fine-grained control over the output.
MusicGen is a single-stage auto-regressive Transformer model. It is trained using a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike other methods such as MusicLM, MusicGen doesn't need a self-supervised semantic representation and can generate all 4 codebooks in one pass. By introducing a small delay between the codebooks, it can predict them in parallel, resulting in only 50 auto-regressive steps per second of audio.
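As a rough, hedged illustration of this token arithmetic, the sketch below shows how a per-codebook delay lets all 4 codebooks be predicted in parallel while keeping roughly 50 auto-regressive steps per second. It is not MusicGen's actual implementation; `delay_pattern`, `FRAME_RATE`, and `NUM_CODEBOOKS` are illustrative names.

```python
# Sketch of the "delay" interleaving: 4 EnCodec codebooks at 50 Hz are offset by
# one step each, so all 4 can be predicted in parallel while keeping roughly
# 50 auto-regressive steps per second of audio.
FRAME_RATE = 50       # EnCodec frames per second of audio
NUM_CODEBOOKS = 4     # codebooks predicted per frame
PAD = -1              # placeholder for positions outside the pattern

def delay_pattern(num_frames: int) -> list[list[int]]:
    """Arrange frame indices so codebook k at frame t is emitted at step t + k."""
    num_steps = num_frames + NUM_CODEBOOKS - 1
    pattern = [[PAD] * num_steps for _ in range(NUM_CODEBOOKS)]
    for k in range(NUM_CODEBOOKS):
        for t in range(num_frames):
            pattern[k][t + k] = t
    return pattern

seconds = 8
frames = seconds * FRAME_RATE        # 400 EnCodec frames
steps = frames + NUM_CODEBOOKS - 1   # 403 auto-regressive steps (~50 per second)
print(f"{frames} frames -> {steps} steps (~{steps / seconds:.1f} steps/s)")
```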
MusicGen was introduced in the paper Simple and Controllable Music Generation by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez.
Four checkpoints have been released:
- small: 300M parameters, text-to-music only
- medium: 1.5B parameters, text-to-music only
- large: 3.3B parameters, text-to-music only
- melody: 1.5B parameters, text-to-music and melody-guided generation (this checkpoint)
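The snippet below is a minimal sketch of switching between these checkpoints. It assumes the short names above are the ones accepted by `MusicGen.get_pretrained`; newer Audiocraft releases may instead expect full Hugging Face model IDs such as `facebook/musicgen-melody`.

```python
from audiocraft.models import MusicGen

# Load one of the released checkpoints by name (names assumed to match the list above).
model = MusicGen.get_pretrained('melody')  # or 'small', 'medium', 'large'

# Note: only the melody checkpoint supports melody conditioning (generate_with_chroma);
# the other checkpoints are text-to-music only.
```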
🚀 Quick Start
Try it Online
- A hosted demo is linked from the Audiocraft repository on GitHub.
Run Locally
- First, install the `audiocraft` library:

```bash
pip install git+https://github.com/facebookresearch/audiocraft.git
```
- Ensure that `ffmpeg` is installed:

```bash
apt-get install ffmpeg
```
- Execute the following Python code:
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody checkpoint and generate 8 seconds of audio per prompt.
model = MusicGen.get_pretrained('melody')
model.set_generation_params(duration=8)

descriptions = ['happy rock', 'energetic EDM', 'sad jazz']

# Load a reference melody and condition generation on both text and chroma.
melody, sr = torchaudio.load('./assets/bach.mp3')
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Saves each sample as {idx}.wav with loudness normalization.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
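The defaults can be tuned further. The following is a hedged sketch that continues from the snippet above; the argument names follow Audiocraft's `set_generation_params`, and the exact set accepted may vary between versions.

```python
# Optional sampling controls (a sketch; exact arguments may differ across versions).
model.set_generation_params(
    use_sampling=True,  # sample instead of greedy decoding
    top_k=250,          # restrict sampling to the 250 most likely tokens
    temperature=1.0,    # softmax temperature
    cfg_coef=3.0,       # classifier-free guidance strength
    duration=8,         # seconds of audio to generate
)

# Text-only generation (no melody conditioning) also works with this checkpoint.
wav = model.generate(['lofi hip hop beat with soft piano'])
audio_write('text_only', wav[0].cpu(), model.sample_rate, strategy="loudness")
```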
✨ Features
- Single-stage generation: MusicGen generates all 4 codebooks in one pass, eliminating the need for a self-supervised semantic representation.
- Parallel prediction: By introducing a small delay between codebooks, it predicts them in parallel, reducing the number of auto-regressive steps.
- Multiple checkpoints: Four checkpoints (small, medium, large, and melody) are available for different use cases.
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Organization developing the model | The FAIR team of Meta AI |
| Model date | Trained between April 2023 and May 2023 |
| Model version | Version 1 |
| Model type | Consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture. Comes in different sizes (300M, 1.5B, 3.3B parameters) and two variants (text-to-music and melody-guided). |
| Paper or resources for more information | Simple and Controllable Music Generation |
| Citation details | @misc{copet2023simple, title={Simple and Controllable Music Generation}, author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez}, year={2023}, eprint={2306.05284}, archivePrefix={arXiv}, primaryClass={cs.SD}} |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Via the GitHub repository of the project, or by opening an issue |
Intended Use
- Primary intended use: Research on AI-based music generation, including probing the limitations of generative models and generating music guided by text or melody for learning purposes.
- Primary intended users: Researchers in audio, machine learning, and artificial intelligence, as well as amateurs seeking to understand generative AI models.
- Out-of-scope use cases: The model should not be used in downstream applications without further risk evaluation, and should not be used to generate music that creates hostile or alienating environments or propagates stereotypes.
Metrics
- Model performance measures:
  - Fréchet Audio Distance computed on features from a pre-trained audio classifier (VGGish); a sketch of this computation follows the list below.
  - Kullback-Leibler divergence on label distributions from a pre-trained audio classifier (PaSST).
  - CLAP score between audio and text embeddings from a pre-trained CLAP model.
  - Qualitative studies with human participants covering overall quality, text relevance, and melody adherence.
- Decision thresholds: Not applicable.
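As referenced above, here is a minimal sketch of the two automatic metrics that operate directly on embeddings. It assumes the embeddings have already been extracted with the respective pre-trained models (VGGish for the distance, CLAP for the similarity) and is not the evaluation code used in the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between two sets of embeddings, each of shape (num_clips, dim)."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Mean cosine similarity between paired audio and text embeddings."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float((a * t).sum(axis=1).mean())
```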
Evaluation Datasets
The model was evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training Datasets
The model was trained on licensed data from the Meta Music Initiative Sound Collection, the Shutterstock music collection, and the Pond5 music collection.
Evaluation Results
| Model | Fréchet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|-------|------------------------|-----|------------------|--------------------------|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
Limitations and Biases
- Limitations:
  - The model cannot generate realistic vocals.
  - It performs better with English descriptions.
  - Performance is uneven across music styles and cultures.
  - It sometimes generates silent endings.
  - Prompt engineering may be needed to obtain satisfying results (see the short example after this section).
- Biases: The training data may lack diversity, and the model may reflect these biases in its output.
- Risks and harms: Biased or otherwise inappropriate samples may be generated.
- Use cases: Users should be aware of these biases, limitations, and risks, and should not use the model in downstream applications without further investigation.
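As noted in the limitations above, prompt engineering can matter. The following is a small, hedged illustration that reuses the `model` from the Quick Start snippet; the prompts are arbitrary examples, not ones used in the paper.

```python
# Vague prompts tend to produce generic results; adding genre, instrumentation,
# tempo, and mood usually steers the model more reliably.
vague_prompts = ['rock']
detailed_prompts = ['90s rock song with loud distorted guitars, heavy drums and an energetic chorus']

model.set_generation_params(duration=8)
wav = model.generate(detailed_prompts)
audio_write('detailed_prompt', wav[0].cpu(), model.sample_rate, strategy="loudness")
```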
📄 License
The code is released under the MIT license, and the model weights are released under the CC-BY-NC 4.0 license.