🚀 MusicGen - Stereo - Medium - 1.5B
MusicGen is a text-to-music model that can generate high-quality music samples based on text descriptions or audio prompts. This release focuses on stereophonic capable models, which offer a more immersive audio experience.
🚀 Quick Start
MusicGen is a powerful text-to-music model. The stereophonic models are fine-tuned from the mono models, sharing capabilities and limitations with the base models. They work by getting 2 streams of tokens from the EnCodec model and interleaving them using a delay pattern.
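The delay pattern is what lets a single auto-regressive pass cover every codebook. As a rough illustration only (not the exact implementation of the released models), the sketch below shifts codebook k by k steps before interleaving, so codebook 0 at frame t is predicted alongside codebook 1 at frame t-1, and so on; the PAD value and function name are hypothetical.

import numpy as np

PAD = -1  # hypothetical placeholder id for the shifted-in positions

def apply_delay_pattern(codes):
    """codes: (num_codebooks, num_frames) array of EnCodec token ids."""
    num_codebooks, num_frames = codes.shape
    # each codebook k is delayed by k steps, so one generation step can emit
    # one token per codebook while respecting their dependency order
    delayed = np.full((num_codebooks, num_frames + num_codebooks - 1), PAD, dtype=codes.dtype)
    for k in range(num_codebooks):
        delayed[k, k:k + num_frames] = codes[k]
    return delayed

# 4 codebooks (as in MusicGen) and 6 frames of dummy token ids
print(apply_delay_pattern(np.arange(24).reshape(4, 6)))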
✨ Features
- Stereophonic Sound: Reproduces sound with depth and direction, creating a more immersive audio experience.
- Single-stage Auto-regressive: Generates all 4 codebooks in one pass without the need for a self-supervised semantic representation.
- Multiple Pre-trained Models: Offers 10 pre-trained models, including small, medium, large, and melody-guided variants.
📦 Installation
You can run MusicGen Stereo models locally with the 🤗 Transformers library from main onward.
- First install the 🤗 Transformers library and scipy:
pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git scipy
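Optionally, you can check that the installed build exposes the MusicGen classes before running the examples below (this sanity check is our suggestion, not part of the official instructions):

python -c "from transformers import MusicgenForConditionalGeneration; import transformers; print(transformers.__version__)"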
💻 Usage Examples
Basic Usage
Run inference via the Text-to-Audio (TTA) pipeline:
import torch
import soundfile as sf
from transformers import pipeline

# load the stereo checkpoint on the GPU in half precision
synthesiser = pipeline("text-to-audio", "facebook/musicgen-stereo-medium", device="cuda:0", torch_dtype=torch.float16)
music = synthesiser("lo-fi music with a soothing melody", forward_params={"max_new_tokens": 256})

# transpose so soundfile receives (samples, channels)
sf.write("musicgen_out.wav", music["audio"][0].T, music["sampling_rate"])
Advanced Usage
Run inference via the Transformers modelling code for more fine-grained control:
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-stereo-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-stereo-medium").to("cuda")

# tokenize the text prompts and move them to the same device as the model
inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
).to("cuda")

# 256 new tokens corresponds to roughly five seconds of audio
audio_values = model.generate(**inputs, max_new_tokens=256)
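To listen to the result, the generated batch can be written to disk. A minimal sketch using soundfile, assuming audio_values has the stereo shape (batch, channels, samples) and that the relevant sampling rate is the one stored in the audio encoder config:

import soundfile as sf

# the EnCodec decoder's sampling rate is stored in the model config
sampling_rate = model.config.audio_encoder.sampling_rate
# move to CPU and transpose to (samples, channels) for soundfile
sf.write("musicgen_stereo_out.wav", audio_values[0].cpu().numpy().T, sampling_rate)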
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Organization developing the model | The FAIR team of Meta AI |
| Model date | Trained between April 2023 and May 2023 |
| Model version | Version 1 |
| Model Type | Consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture. Comes in different sizes (300M, 1.5B, 3.3B parameters) and two variants (text-to-music and melody-guided) |
| Paper or resources for more information | Simple and Controllable Music Generation |
| Citation details | See the BibTeX entry below |
| License | Code is released under MIT, model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Via the GitHub repository of the project or by opening an issue |

@misc{copet2023simple,
      title={Simple and Controllable Music Generation},
      author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
      year={2023},
      eprint={2306.05284},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
Intended Use
Primary intended use:
- Research on AI-based music generation, such as probing and understanding the limitations of generative models.
- Generation of music guided by text or melody for machine learning amateurs to understand generative AI models.
Primary intended users: Researchers in audio, machine learning, and artificial intelligence, as well as amateurs interested in these models.
Out-of-scope use cases: The model should not be used in downstream applications without further risk evaluation and mitigation. It should not be used to create or disseminate music that creates hostile or offensive environments.
Metrics
Models performance measures:
- Frechet Audio Distance computed on features from a pre-trained audio classifier (VGGish); a computation sketch is given below.
- Kullback-Leibler Divergence on label distributions from a pre-trained audio classifier (PaSST).
- CLAP Score between audio and text embeddings from a pre-trained CLAP model.
Additionally, qualitative studies with human participants evaluate the model on overall quality, text relevance, and adherence to melody.
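For reference, the Frechet Audio Distance above compares Gaussian statistics of embeddings computed on reference and generated audio. A minimal sketch of the distance itself, assuming VGGish-style embeddings have already been extracted elsewhere (the function name is ours):

import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb, gen_emb):
    """ref_emb, gen_emb: (num_clips, embedding_dim) arrays of audio embeddings."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(ref_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    # matrix square root of the covariance product; tiny imaginary parts are numerical noise
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))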
Evaluation Datasets
The model was evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training Datasets
The model was trained on licensed data from the Meta Music Initiative Sound Collection, Shutterstock music collection, and Pond5 music collection.
Evaluation Results
| Model | Frechet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity |
|-------|------------------------|-----|------------------|--------------------------|
| facebook/musicgen-small | 4.88 | 1.42 | 0.27 | - |
| facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | - |
| facebook/musicgen-large | 5.48 | 1.37 | 0.28 | - |
| facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44 |
Limitations and Biases
Data: The model is trained on 20K hours of data from professional music sources. Scaling on larger datasets may improve performance.
Mitigations: Vocals are removed using tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
Limitations:
- Unable to generate realistic vocals.
- Performs better with English descriptions.
- Uneven performance across different music styles and cultures.
- Sometimes generates song endings that turn to silence.
- Prompt engineering may be needed for satisfying results.
Biases: The data source may lack diversity.
📄 License
Code is released under MIT, model weights are released under CC-BY-NC 4.0.