🚀 MusicGen - Stereo - Melody - 1.5B
MusicGen is a text-to-music model that can generate high-quality music samples based on text descriptions or audio prompts. It simplifies the music generation process and offers multiple pre-trained models for different needs.
🚀 Quick Start
You can run MusicGen locally as follows:
1. First, install the `audiocraft` library:
   ```bash
   pip install git+https://github.com/facebookresearch/audiocraft.git
   ```
2. Ensure that `ffmpeg` is installed:
   ```bash
   apt-get install ffmpeg
   ```
3. Run the following Python code:
   ```python
   import torchaudio
   from audiocraft.models import MusicGen
   from audiocraft.data.audio import audio_write

   # Load the melody-guided model and set the clip length to 8 seconds.
   model = MusicGen.get_pretrained('melody')
   model.set_generation_params(duration=8)

   descriptions = ['happy rock', 'energetic EDM', 'sad jazz']

   # Load a melody prompt and repeat it for each text description.
   melody, sr = torchaudio.load('./assets/bach.mp3')
   wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

   for idx, one_wav in enumerate(wav):
       # Save each sample as {idx}.wav with loudness normalization.
       audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
   ```
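If you only want text-to-music generation without a melody prompt, the same model exposes a plain `generate` call. A minimal sketch, reusing `model` and `audio_write` from the snippet above:

```python
# Text-only generation: no melody conditioning required.
wav = model.generate(['happy rock', 'energetic EDM', 'sad jazz'])

for idx, one_wav in enumerate(wav):
    # Save each sample with loudness normalization, as above.
    audio_write(f'text_only_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```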
✨ Features
- Stereo Capability: We've released a set of stereophonic-capable models, fine-tuned from the mono models. They work by getting two streams of tokens from the EnCodec model and interleaving them using the delay pattern.
- Single-stage Generation: Unlike existing methods such as MusicLM, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass.
- Parallel Prediction: By introducing a small delay between the codebooks, the model can predict them in parallel, with only 50 auto-regressive steps per second of audio (see the sketch after this list).
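To make the delay pattern concrete, here is an illustrative sketch (not the audiocraft implementation) of how shifting codebook k right by k frames produces a token grid in which all K codebooks can be predicted in parallel at each auto-regressive step:

```python
import torch

def apply_delay_pattern(codes: torch.Tensor, pad_token: int) -> torch.Tensor:
    """Shift codebook k right by k frames so one auto-regressive step
    can emit a token for every codebook in parallel.

    codes: (K, T) integer token ids for K codebooks over T frames.
    Returns a (K, T + K - 1) grid padded with pad_token.
    """
    K, T = codes.shape
    delayed = torch.full((K, T + K - 1), pad_token, dtype=codes.dtype)
    for k in range(K):
        delayed[k, k:k + T] = codes[k]
    return delayed

# 4 codebooks over 5 frames -> a staircase layout of valid tokens
codes = torch.arange(20).reshape(4, 5)
print(apply_delay_pattern(codes, pad_token=-1))
```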
📚 Documentation
Model details
| Property | Details |
|----------|---------|
| Organization developing the model | The FAIR team of Meta AI |
| Model date | Trained between April 2023 and May 2023 |
| Model version | Version 1 of the model |
| Model type | Consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for music modeling. Comes in 300M, 1.5B and 3.3B parameter sizes, with text-to-music and melody-guided music generation variants |
| Paper or resources for more information | [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) |
| Citation details | @misc{copet2023simple, title={Simple and Controllable Music Generation}, author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez}, year={2023}, eprint={2306.05284}, archivePrefix={arXiv}, primaryClass={cs.SD}} |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Via the GitHub repository of the project or by opening an issue |
Intended use
- Primary intended use: Research on AI-based music generation, including probing model limitations and generating music guided by text or melody.
- Primary intended users: Researchers in audio, machine learning, and artificial intelligence, as well as machine learning amateurs.
- Out-of-scope use cases: Should not be used in downstream applications without risk evaluation and mitigation. Should not be used to create or disseminate music that creates a hostile or alienating environment.
Metrics
- Model performance measures:
  - Objective measures: Fréchet Audio Distance on VGGish features, Kullback-Leibler Divergence on PaSST label distributions (a minimal sketch follows this list), and CLAP score between audio and text embeddings.
  - Qualitative studies: Evaluated on overall music quality, text relevance, and melody adherence.
- Decision thresholds: Not applicable.
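As a concrete illustration of the KLD measure, the sketch below computes a Kullback-Leibler divergence between two categorical label distributions; in the actual evaluation these would be PaSST classifier outputs for reference and generated audio, which are stood in for here by hypothetical probability vectors:

```python
import torch

def kl_divergence(p_ref: torch.Tensor, q_gen: torch.Tensor) -> torch.Tensor:
    """KL(p_ref || q_gen) between two label probability distributions."""
    eps = 1e-8  # guard against log(0)
    p = p_ref.clamp_min(eps)
    q = q_gen.clamp_min(eps)
    return torch.sum(p * (p / q).log())

# Hypothetical 3-label distributions for a reference and a generated clip
p = torch.tensor([0.7, 0.2, 0.1])
q = torch.tensor([0.5, 0.3, 0.2])
print(kl_divergence(p, q))  # ≈ 0.085
```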
Evaluation datasets
Evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training datasets
Trained on licensed data from Meta Music Initiative Sound Collection, Shutterstock music collection, and Pond5 music collection.
Evaluation results
| Model | Fréchet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|-------|------------------------|-----|------------------|--------------------------|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
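The Chroma Cosine Similarity column reflects how closely the generated audio tracks the melody prompt. A minimal sketch of one way such a score can be computed, assuming librosa for chroma extraction (an illustration, not the exact evaluation pipeline):

```python
import librosa
import numpy as np

def chroma_cosine_similarity(ref_wav: np.ndarray, gen_wav: np.ndarray, sr: int) -> float:
    """Mean frame-wise cosine similarity between the chroma features
    of a reference melody and the generated audio."""
    ref = librosa.feature.chroma_stft(y=ref_wav, sr=sr)  # shape (12, T)
    gen = librosa.feature.chroma_stft(y=gen_wav, sr=sr)
    T = min(ref.shape[1], gen.shape[1])  # compare over the shared duration
    ref, gen = ref[:, :T], gen[:, :T]
    num = np.sum(ref * gen, axis=0)
    denom = np.linalg.norm(ref, axis=0) * np.linalg.norm(gen, axis=0) + 1e-8
    return float(np.mean(num / denom))
```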
Limitations and biases
- Data: Trained on 20K hours of data from music professionals. Scaling on larger datasets may improve performance.
- Mitigations: Vocals removed using tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
- Limitations: Can't generate realistic vocals, performs better with English descriptions, varies in performance across music styles and cultures, may generate silent endings, and prompt engineering may be needed.
- Biases: Training data may lack diversity, and generated samples may reflect training data biases.
- Risks and harms: May generate biased, inappropriate, or offensive samples.
- Use cases: Users should be aware of biases, limitations, and risks. Not for downstream applications without further investigation.
📄 License
The code is released under the MIT license, and the model weights are released under the CC-BY-NC 4.0 license.