# 🚀 MusicGen Stereo Melody Large 3.3B
MusicGen is a text-to-music model that generates high-quality music samples from text descriptions or audio prompts. It simplifies the music generation process and offers multiple pre-trained models for different use cases. This release includes stereophonic-capable models fine-tuned from the mono models; they share similar capabilities and limitations with the base models.
## 🚀 Quick Start
You can try out MusicGen in multiple ways:

- **Online demos**: interactive demos are linked from the [audiocraft repository](https://github.com/facebookresearch/audiocraft).
- **Local run**:
  1. Install the `audiocraft` library:
     ```bash
     pip install git+https://github.com/facebookresearch/audiocraft.git
     ```
  2. Make sure `ffmpeg` is installed:
     ```bash
     apt-get install ffmpeg
     ```
  3. Run the Python example shown under 💻 Usage Examples below.
## ✨ Features

- **Stereophonic models**: a set of stereophonic-capable models, fine-tuned from the mono models for 200k updates.
- **Single-stage generation**: unlike some existing methods, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass.
- **Parallel prediction**: by introducing a small delay between the codebooks, MusicGen can predict them in parallel, with only 50 auto-regressive steps per second of audio (see the sketch after this list).
- **Multiple pre-trained models**: 10 pre-trained models are available, covering different sizes (300M, 1.5B, and 3.3B parameters) and two variants (text-to-music and melody-guided music generation).
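To make the delay pattern concrete, here is a minimal, self-contained sketch of the idea (a toy illustration, not audiocraft's actual implementation): with `K` codebooks and a one-step delay per codebook, `T` frames of audio need only `T + K - 1` auto-regressive steps, since each step advances all codebook streams at once.

```python
# Toy sketch of the codebook delay pattern (illustrative, not audiocraft's code).
K = 4          # number of EnCodec codebooks
T = 6          # number of audio frames to generate
PAD = "."      # placeholder before a codebook's stream has started

# tokens[k][t] would come from the language model; here we use dummy labels.
tokens = [[f"c{k}t{t}" for t in range(T)] for k in range(K)]

# Build the delayed grid: T + K - 1 auto-regressive steps cover all T frames,
# because codebook k at step s holds the token for frame s - k.
steps = T + K - 1
grid = [[PAD] * steps for _ in range(K)]
for k in range(K):
    for t in range(T):
        grid[k][t + k] = tokens[k][t]

for k, row in enumerate(grid):
    print(f"codebook {k}:", " ".join(f"{x:>4}" for x in row))
```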
## 💻 Usage Examples

### Basic Usage
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-guided model and generate 8-second clips.
model = MusicGen.get_pretrained('melody')
model.set_generation_params(duration=8)

descriptions = ['happy rock', 'energetic EDM', 'sad jazz']

# Load a melody prompt and repeat it once per description.
melody, sr = torchaudio.load('./assets/bach.mp3')
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Save each clip as {idx}.wav with loudness normalization.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
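For purely text-conditioned generation (no melody prompt), the model also exposes a `generate` method. A short sketch, reusing `model`, `descriptions`, and `audio_write` from the example above:

```python
# Text-only generation: descriptions condition the model, no audio prompt.
wav = model.generate(descriptions)

for idx, one_wav in enumerate(wav):
    audio_write(f'text_only_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```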
## 📚 Documentation

### Model details

| Property | Details |
|------|------|
| Organization developing the model | The FAIR team of Meta AI. |
| Model date | MusicGen was trained between April 2023 and May 2023. |
| Model version | This is version 1 of the model. |
| Model type | MusicGen consists of an EnCodec model for audio tokenization and an auto-regressive, transformer-based language model for music modeling. It comes in different sizes (300M, 1.5B, and 3.3B parameters) and two variants (text-to-music and melody-guided music generation). |
| Paper or resources for more information | More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284). |
| Citation details | `@misc{copet2023simple, title={Simple and Controllable Music Generation}, author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez}, year={2023}, eprint={2306.05284}, archivePrefix={arXiv}, primaryClass={cs.SD}}` |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0. |
| Where to send questions or comments about the model | Via the [GitHub repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue. |
### Intended use

- **Primary intended use**: research on AI-based music generation, including probing model limitations and generating music guided by text or melody.
- **Primary intended users**: researchers in audio, machine learning, and artificial intelligence, as well as amateurs interested in understanding these models.
- **Out-of-scope use cases**: the model should not be used in downstream applications without further risk evaluation and mitigation. It should not be used to create or disseminate offensive music.
### Metrics

- **Model performance measures**:
  - Fréchet Audio Distance computed on features from a pre-trained audio classifier (VGGish).
  - Kullback-Leibler divergence on label distributions from a pre-trained audio classifier (PaSST); a sketch of this measure follows the list.
  - CLAP score between the audio embedding and the text embedding from a pre-trained CLAP model.
  - Qualitative studies with human participants on music quality, text relevance, and melody adherence.
- **Decision thresholds**: not applicable.
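As a rough illustration of the KL measure (a sketch, not the paper's evaluation code), compare the label distributions a classifier assigns to a reference track and a generated track; the 527-class AudioSet label space below matches PaSST's output, while the random inputs are placeholders:

```python
import torch

def kl_divergence(p_ref: torch.Tensor, p_gen: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KL(p_ref || p_gen) between two label distributions."""
    p_ref = p_ref.clamp_min(eps)
    p_gen = p_gen.clamp_min(eps)
    return (p_ref * (p_ref / p_gen).log()).sum(dim=-1)

# Placeholder distributions standing in for classifier outputs on two tracks.
p_ref = torch.softmax(torch.randn(527), dim=-1)  # labels for the reference track
p_gen = torch.softmax(torch.randn(527), dim=-1)  # labels for the generated track
print(kl_divergence(p_ref, p_gen))  # lower = closer label distributions
```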
### Evaluation datasets

The model was evaluated on the MusicCaps benchmark and on an in-domain held-out evaluation set with no artist overlap with the training set.
### Training datasets

The model was trained on licensed data from the Meta Music Initiative Sound Collection, the Shutterstock music collection, and the Pond5 music collection.
### Evaluation results

| Model | Fréchet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|------|------|------|------|------|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
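The Chroma Cosine Similarity column measures how closely generated audio follows the melody prompt. A rough sketch of the idea using librosa (an illustration under assumed file paths and MusicGen's 32 kHz sample rate, not the paper's exact evaluation code):

```python
import librosa
import numpy as np

def chroma_cosine_similarity(ref_path: str, gen_path: str) -> float:
    """Mean frame-wise cosine similarity between two tracks' chromagrams."""
    ref, sr = librosa.load(ref_path, sr=32000)
    gen, _ = librosa.load(gen_path, sr=32000)
    n = min(len(ref), len(gen))  # compare only the overlapping duration
    c_ref = librosa.feature.chroma_stft(y=ref[:n], sr=sr)  # shape (12, frames)
    c_gen = librosa.feature.chroma_stft(y=gen[:n], sr=sr)
    num = (c_ref * c_gen).sum(axis=0)
    den = np.linalg.norm(c_ref, axis=0) * np.linalg.norm(c_gen, axis=0) + 1e-8
    return float((num / den).mean())

# Hypothetical file names; any melody prompt / generated clip pair works.
# print(chroma_cosine_similarity('melody_prompt.wav', 'generated.wav'))
```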
### Limitations and biases

- **Data**: the model was trained on 20K hours of data from music professionals. Scaling to larger datasets may further improve performance.
- **Mitigations**: vocals were removed from the data using corresponding tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
- **Limitations**:
  - Unable to generate realistic vocals.
  - Performs better with English descriptions.
  - Uneven performance across different music styles and cultures.
  - Sometimes generates silent endings.
  - It is hard to know which text descriptions will work best.
- **Biases**: the data source may lack diversity, and generated samples may reflect biases in the training data.
- **Risks and harms**: biased or inappropriate samples may be generated.
- **Use cases**: users should be aware of the biases, limitations, and risks. The model should not be used in downstream applications without further investigation.
## 📄 License

Code is released under MIT; model weights are released under CC-BY-NC 4.0.