🚀 MusicGen - Melody - Large 3.3B
MusicGen is a text-to-music model that can generate high-quality music samples based on text descriptions or audio prompts. It simplifies the music generation process and offers a more efficient way to create music with specific characteristics.
🚀 Quick Start
You can try out MusicGen in several ways. To run it locally, follow the steps below.
📦 Installation
- First, install the `audiocraft` library:

```bash
pip install git+https://github.com/facebookresearch/audiocraft.git
```

- Then ensure that `ffmpeg` is installed:

```bash
apt-get install ffmpeg
```
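As a quick sanity check (a minimal sketch; `audiocraft.__version__` is assumed to be exposed by the installed package), you can confirm that the package imports and whether a GPU is visible:

```python
# Sanity check: confirm audiocraft imports and report the compute device.
import torch
import audiocraft

print("audiocraft:", audiocraft.__version__)          # installed version
print("CUDA available:", torch.cuda.is_available())   # generation is much faster on a GPU
```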
💻 Usage Examples
Basic Usage
```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-capable checkpoint described by this card.
model = MusicGen.get_pretrained('facebook/musicgen-melody-large')
model.set_generation_params(duration=8)  # generate 8-second clips

descriptions = ['happy rock', 'energetic EDM', 'sad jazz']

# Load a reference melody and condition generation on its chroma,
# broadcasting the single melody across all three descriptions.
melody, sr = torchaudio.load('./assets/bach.mp3')
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Save each sample as {idx}.wav with loudness normalization.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
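Melody conditioning is optional: the same checkpoint also does plain text-to-music generation via `MusicGen.generate`, which takes only the text descriptions (a minimal sketch reusing the `model` loaded above):

```python
# Text-only generation: no reference melody required.
wav = model.generate(['lo-fi hip hop with mellow piano'])
audio_write('lofi', wav[0].cpu(), model.sample_rate, strategy="loudness")
```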
✨ Features
- Single-stage Generation: Unlike some existing methods such as MusicLM, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass.
- Parallel Prediction: By introducing a small delay between the codebooks, MusicGen can predict them in parallel, requiring only 50 auto-regressive steps per second of audio (a toy illustration of the delay pattern follows the model list below).
- Multiple Pre-trained Models: We offer 10 pre-trained models, including different sizes and variants for text-to-music and text+melody-to-music generation.
The pre-trained models are:

- `facebook/musicgen-small`: 300M model, text to music only
- `facebook/musicgen-medium`: 1.5B model, text to music only
- `facebook/musicgen-melody`: 1.5B model, text to music and text+melody to music
- `facebook/musicgen-large`: 3.3B model, text to music only
- `facebook/musicgen-melody-large`: 3.3B model, text to music and text+melody to music
- `facebook/musicgen-stereo-*`: all of the above fine-tuned for stereo generation (small, medium, large, melody, melody large)

All of these are available on the 🤗 Hub.
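The delay pattern behind parallel prediction can be pictured with a toy sketch (illustrative only, not the actual audiocraft implementation): each of the 4 codebooks is shifted one step relative to the previous one, so at decoding step t the model emits codebook k's token for audio frame t - k, and all codebooks are sampled in the same forward pass.

```python
# Toy illustration of the codebook delay pattern (not the real implementation).
# With 4 codebooks and a delay of 1 step between consecutive codebooks, the
# token emitted at step t for codebook k corresponds to audio frame t - k.
NUM_CODEBOOKS = 4

def delayed_schedule(num_steps):
    """Yield, per decoding step, the audio frame each codebook predicts."""
    for t in range(num_steps):
        # None means this codebook has nothing to predict yet.
        yield [t - k if t - k >= 0 else None for k in range(NUM_CODEBOOKS)]

for t, frames in enumerate(delayed_schedule(5)):
    print(f"step {t}: frames per codebook {frames}")
# step 0: frames per codebook [0, None, None, None]
# step 1: frames per codebook [1, 0, None, None]
# step 2: frames per codebook [2, 1, 0, None]
# step 3: frames per codebook [3, 2, 1, 0]
# step 4: frames per codebook [4, 3, 2, 1]
```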
📚 Documentation
Model details
| Property | Details |
|------|------|
| Organization developing the model | The FAIR team of Meta AI |
| Model date | MusicGen was trained between April 2023 and May 2023 |
| Model version | This is version 1 of the model |
| Model type | MusicGen consists of an EnCodec model for audio tokenization and an auto-regressive, transformer-based language model for music modeling. The model comes in different sizes (300M, 1.5B and 3.3B parameters) and two variants: a model trained for the text-to-music generation task and a model trained for melody-guided music generation |
| Paper or resources for more information | More information can be found in the paper [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) |
| Citation details | `@misc{copet2023simple, title={Simple and Controllable Music Generation}, author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez}, year={2023}, eprint={2306.05284}, archivePrefix={arXiv}, primaryClass={cs.SD}}` |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Questions and comments about MusicGen can be sent via the [GitHub repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue |
Intended use
- Primary intended use: The primary use of MusicGen is research on AI-based music generation, including efforts to understand the limitations of generative models and generation of music guided by text or melody, so that machine-learning amateurs can explore the current abilities of such models.
- Primary intended users: Researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models.
- Out-of-scope use cases: The model should not be used on downstream applications without further risk evaluation and mitigation. It should not be used to create or disseminate music that creates hostile or alienating environments for people.
Metrics
- Models performance measures:
- Objective measures: Fréchet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish), Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST), and CLAP score between audio and text embeddings extracted from a pre-trained CLAP model.
- Qualitative studies: Overall quality of the music samples, text relevance to the provided text input, adherence to the melody for melody-guided music generation.
- Decision thresholds: Not applicable.
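The CLAP score in particular is just a cosine similarity between paired audio and text embeddings, averaged over the evaluation set. A minimal sketch of the computation (the tensors here are random stand-ins for embeddings extracted with a pre-trained CLAP model):

```python
# Sketch of the CLAP score: mean cosine similarity between paired audio and
# text embeddings. Random tensors stand in for real CLAP embeddings.
import torch
import torch.nn.functional as F

audio_emb = torch.randn(8, 512)  # stand-in: CLAP embeddings of 8 generated clips
text_emb = torch.randn(8, 512)   # stand-in: CLAP embeddings of the 8 prompts

clap_score = F.cosine_similarity(audio_emb, text_emb, dim=-1).mean()
print(f"CLAP score: {clap_score:.3f}")
```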
Evaluation datasets
The model was evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training datasets
The model was trained on licensed data from the Meta Music Initiative Sound Collection, Shutterstock music collection and the Pond5 music collection.
Evaluation results
| Model | Fréchet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|------|------|------|------|------|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
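Chroma cosine similarity measures how closely the generated audio tracks the reference melody's chroma (pitch-class energy over time). A minimal sketch of the idea, assuming `librosa` is available (frame alignment and the exact chroma settings of the paper's evaluation are simplified here):

```python
# Hypothetical sketch of chroma cosine similarity between a reference melody
# and a generated clip (simplified relative to the paper's evaluation).
import librosa
import numpy as np

def chroma_cosine_similarity(ref_path, gen_path, sr=32000):
    ref, _ = librosa.load(ref_path, sr=sr)
    gen, _ = librosa.load(gen_path, sr=sr)
    # 12-bin chroma features, shape (12, num_frames)
    ref_chroma = librosa.feature.chroma_stft(y=ref, sr=sr)
    gen_chroma = librosa.feature.chroma_stft(y=gen, sr=sr)
    # Truncate to the shorter clip, then average frame-wise cosine similarity.
    n = min(ref_chroma.shape[1], gen_chroma.shape[1])
    ref_chroma, gen_chroma = ref_chroma[:, :n], gen_chroma[:, :n]
    num = (ref_chroma * gen_chroma).sum(axis=0)
    den = np.linalg.norm(ref_chroma, axis=0) * np.linalg.norm(gen_chroma, axis=0) + 1e-8
    return float((num / den).mean())
```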
Limitations and biases
- Data: The data sources are created by music professionals and covered by legal agreements. Scaling the model on larger datasets may improve performance.
- Mitigations: Vocals have been removed using tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
- Limitations:
- Unable to generate realistic vocals.
- Performs better with English descriptions.
- Uneven performance across different music styles and cultures.
- Sometimes generates end-of-song silence.
- Difficult to determine optimal text descriptions, may require prompt engineering.
- Biases: The data source may lack diversity, and the model may not perform equally well on all music genres. Generated samples may reflect training data biases.
- Risks and harms: Biases and limitations may lead to the generation of inappropriate or offensive samples.
⚠️ Important Note
Users must be aware of the biases, limitations and risks of the model. MusicGen is developed for artificial intelligence research on controllable music generation and should not be used for downstream applications without further investigation and risk mitigation.
MusicGen was published in [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.