🚀 MusicGen - Small - 300M
MusicGen is a text-to-music model that generates high-quality music samples from text descriptions or audio prompts. Its single-stage design simplifies the generation process and offers an efficient way to create music from text; this checkpoint is the small, 300M-parameter text-to-music variant.
🚀 Quick Start
You can quickly start using MusicGen through any of the following:
- Audiocraft Colab
- Hugging Face Colab
- Hugging Face Demo
✨ Features
- Single-stage Generation: Unlike some existing methods, MusicGen doesn't require a self-supervised semantic representation and generates all 4 codebooks in one pass.
- Parallel Prediction: By introducing a small delay between the codebooks, it can predict them in parallel, with only 50 auto-regressive steps per second of audio (see the sketch below).
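To make the parallel-prediction idea concrete, here is a minimal, illustrative sketch of the codebook delay pattern. It is not the official Audiocraft implementation, and the `PAD` placeholder is hypothetical: with 4 codebooks each shifted by one step, every auto-regressive step predicts one token from each codebook, so a 50 Hz token stream needs only 50 steps per second of audio.

```python
# Illustrative sketch of the codebook delay pattern (not the official
# Audiocraft code). Codebook k is shifted right by k steps, so at each
# auto-regressive step the model predicts one token per codebook.
K, T = 4, 10   # 4 codebooks, 10 EnCodec frames (~0.2 s at 50 Hz)
PAD = "."      # hypothetical placeholder for shifted-in positions

rows = [[PAD] * k + [f"t{t}" for t in range(T)] + [PAD] * (K - 1 - k) for k in range(K)]
for k, row in enumerate(rows):
    print(f"codebook {k}: " + " ".join(f"{x:>3}" for x in row))
# Each printed column is one decoding step; frame t of codebook k is
# generated at step t + k, giving T + K - 1 steps for T frames.
```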
📦 Installation
Using 🤗 Transformers Library
- Install the 🤗 Transformers library and scipy:

```bash
pip install --upgrade pip
pip install --upgrade transformers scipy
```
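As an optional sanity check that the installation worked (this snippet is illustrative, not part of the original instructions):

```python
# Optional: verify both packages import and print their versions
import transformers
import scipy

print("transformers", transformers.__version__)
print("scipy", scipy.__version__)
```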
Using Audiocraft Library
- Install the `audiocraft` library:

```bash
pip install git+https://github.com/facebookresearch/audiocraft.git
```

- Install `ffmpeg`:

```bash
apt-get install ffmpeg
```
💻 Usage Examples
🤗 Transformers Library
Basic Usage
```python
from transformers import pipeline
import scipy.io.wavfile  # explicit submodule import so scipy.io.wavfile is available

# Load the text-to-audio pipeline with the MusicGen small checkpoint
synthesiser = pipeline("text-to-audio", "facebook/musicgen-small")

# Generate audio from a text prompt (sampling gives more varied results)
music = synthesiser("lo-fi music with a soothing melody", forward_params={"do_sample": True})

scipy.io.wavfile.write("musicgen_out.wav", rate=music["sampling_rate"], data=music["audio"])
```
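The pipeline returns a dictionary containing the generated waveform and its sampling rate. If you want to inspect it before saving, something like the following works; treat it as a sketch, since the exact array shape can vary across transformers versions:

```python
# Inspect the pipeline output before writing it to disk
print(music["sampling_rate"])                      # MusicGen generates 32 kHz audio
print(type(music["audio"]), music["audio"].shape)  # numpy waveform array
```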
Advanced Usage
```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Tokenize a batch of text prompts (padding makes them the same length)
inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
)

audio_values = model.generate(**inputs, max_new_tokens=256)
```
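Generation length is controlled by `max_new_tokens`: MusicGen produces tokens at a 50 Hz frame rate, so 256 tokens corresponds to roughly 5 seconds of audio. A small sketch, assuming the EnCodec config exposes `frame_rate` as in recent transformers versions:

```python
# Convert a token budget into an approximate audio duration
frame_rate = model.config.audio_encoder.frame_rate  # 50 Hz for MusicGen
print(f"max_new_tokens=256 -> ~{256 / frame_rate:.1f} s of audio")  # ~5.1 s
```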
Listening to Audio Samples
```python
from IPython.display import Audio

# Play the first generated sample in a notebook
sampling_rate = model.config.audio_encoder.sampling_rate
Audio(audio_values[0].numpy(), rate=sampling_rate)
```
Saving as .wav File
```python
import scipy.io.wavfile

# Write the first batch item, first channel, to disk
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("musicgen_out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())
```
Audiocraft Library
```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")
model.set_generation_params(duration=8)  # generate 8 seconds of audio

descriptions = ["happy rock", "energetic EDM"]
wav = model.generate(descriptions)  # one sample per description

for idx, one_wav in enumerate(wav):
    # Saves under {idx}.wav, with loudness normalization
    audio_write(f"{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```
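The Audiocraft API also supports generation without text conditioning. A short sketch extending the example above; check `MusicGen.generate_unconditional` in your installed version, as the API may change:

```python
# Generate two samples without any text conditioning
unconditional_wav = model.generate_unconditional(2)
for idx, one_wav in enumerate(unconditional_wav):
    audio_write(f"unconditional_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```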
📚 Documentation
Model Details
| Property | Details |
| --- | --- |
| Organization developing the model | The FAIR team of Meta AI |
| Model date | Trained between April 2023 and May 2023 |
| Model version | Version 1 |
| Model type | Consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for music modeling. Comes in different sizes (300M, 1.5B, and 3.3B parameters) and two variants (text-to-music generation and melody-guided music generation) |
| Paper or resources for more information | [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) |
| Citation details | See the BibTeX entry below |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Via the GitHub repository of the project, or by opening an issue |

```bibtex
@misc{copet2023simple,
      title={Simple and Controllable Music Generation},
      author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
      year={2023},
      eprint={2306.05284},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```
Intended Use
- Primary intended use: Research on AI-based music generation, including research efforts such as probing the limitations of generative models, and music generation guided by text or melody for amateurs.
- Primary intended users: Researchers in audio, machine learning, and artificial intelligence, as well as amateurs.
- Out-of-scope use cases: The model should not be used in downstream applications without further risk evaluation and mitigation, and should not be used to create or disseminate harmful music.
Metrics
- Model performance measures: Frechet Audio Distance, Kullback-Leibler Divergence, CLAP Score, and qualitative studies on overall quality, text relevance, and melody adherence.
- Decision thresholds: Not applicable.
Evaluation Datasets
Evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set.
Training Datasets
Trained on licensed data from Meta Music Initiative Sound Collection, Shutterstock music collection and Pond5 music collection.
Evaluation Results
| Model | Frechet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
| --- | --- | --- | --- | --- |
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
Limitations and Biases
- Data: The data source may lack diversity, and not all music cultures are equally represented.
- Mitigations: Vocals are removed using tags and a music source separation method.
- Limitations: Can't generate realistic vocals, performs better with English descriptions, has uneven performance across music styles, may generate silent endings, and prompt engineering may be needed.
- Biases: The model reflects biases from the training data.
- Risks and harms: May generate biased, inappropriate or offensive samples.
- Use cases: Users should be aware of risks and not use it for downstream applications without further investigation.