🚀 MusicGen - Stereo - Small - 300M
MusicGen is a text-to-music model that can generate high-quality music samples based on text descriptions or audio prompts. This stereo-capable model is fine-tuned from the mono models, offering a new dimension to music generation.
🚀 Quick Start
MusicGen is a powerful text-to-music model. You can quickly start using it through the 🤗 Transformers library or the Audiocraft library.
✨ Features
- Stereo Capability: Fine-tuned from the mono models, it can generate stereophonic music, adding depth and direction to the sound.
- Single-stage Generation: Unlike some existing methods, it doesn't require a self-supervised semantic representation and can generate all 4 codebooks in one pass.
- Multiple Pre-trained Models: Offers 10 pre-trained models of different sizes and capabilities, suitable for various scenarios.
📦 Installation
You can run MusicGen Stereo models locally with the 🤗 Transformers library from main onward.
- First install the 🤗 Transformers library and scipy:
pip install --upgrade pip
pip install --upgrade git+https://github.com/huggingface/transformers.git scipy
- If using the Audiocraft library, first install the audiocraft package:
pip install git+https://github.com/facebookresearch/audiocraft.git
- Make sure to have ffmpeg installed:
apt-get install ffmpeg
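As a quick, optional sanity check (just a sketch), you can confirm that the installed Transformers build exposes the MusicGen classes used in the examples below:

import transformers
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Should import without errors and print a recent Transformers version
print(transformers.__version__)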
💻 Usage Examples
Basic Usage
Run inference via the Text-to-Audio (TTA) pipeline:
import torch
import soundfile as sf
from transformers import pipeline

# Load the text-to-audio pipeline with the stereo-small checkpoint on GPU in half precision
synthesiser = pipeline("text-to-audio", "facebook/musicgen-stereo-small", device="cuda:0", torch_dtype=torch.float16)

# 256 new tokens at the 50 Hz frame rate corresponds to roughly 5 seconds of audio
music = synthesiser("lo-fi music with a soothing melody", forward_params={"max_new_tokens": 256})

# music["audio"][0] is (channels, samples); soundfile expects (samples, channels), hence the transpose
sf.write("musicgen_out.wav", music["audio"][0].T, music["sampling_rate"])
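If you are working in a notebook, one optional way to audition the clip without writing a file is IPython's audio widget (a minimal sketch reusing the music dict returned by the pipeline call above):

from IPython.display import Audio

# Play the generated stereo clip inline in a Jupyter notebook
Audio(music["audio"][0], rate=music["sampling_rate"])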
Advanced Usage
Run inference via the Transformers modelling code for more fine-grained control:
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load the text processor and the stereo-small model, then move the model to GPU
processor = AutoProcessor.from_pretrained("facebook/musicgen-stereo-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-stereo-small").to("cuda")

# Tokenize a batch of two text prompts
inputs = processor(
    text=["80s pop track with bassy drums and synth", "90s rock song with loud guitars and heavy drums"],
    padding=True,
    return_tensors="pt",
).to("cuda")

# Generate roughly 5 seconds of audio per prompt (256 new tokens at 50 Hz)
audio_values = model.generate(**inputs, max_new_tokens=256)
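The generated audio_values tensor has shape (batch_size, num_channels, num_samples). As a minimal sketch of writing the two clips to disk (assuming the variables from the block above and the soundfile package; the sampling rate is read from the model's audio-encoder config):

import soundfile as sf

sampling_rate = model.config.audio_encoder.sampling_rate  # 32 kHz for MusicGen
for i, audio in enumerate(audio_values.cpu().float().numpy()):
    # audio is (num_channels, num_samples); soundfile expects (num_samples, num_channels)
    sf.write(f"musicgen_out_{i}.wav", audio.T, sampling_rate)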
📚 Documentation
Model Details
Property | Details
---|---
Organization developing the model | The FAIR team of Meta AI
Model date | Trained between April 2023 and May 2023
Model version | Version 1
Model type | Consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture. Comes in different sizes (300M, 1.5B, 3.3B parameters) and two variants (text-to-music and melody-guided music generation)
Paper or resources for more information | Simple and Controllable Music Generation
Citation details | @misc{copet2023simple, title={Simple and Controllable Music Generation}, author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez}, year={2023}, eprint={2306.05284}, archivePrefix={arXiv}, primaryClass={cs.SD}}
License | Code is released under MIT; model weights are released under CC-BY-NC 4.0
Where to send questions or comments about the model | Via the GitHub repository of the project, or by opening an issue
Intended Use
- Primary intended use: Research on AI-based music generation, including probing limitations of generative models and generating music guided by text or melody.
- Primary intended users: Researchers in audio, machine learning, and artificial intelligence, as well as amateurs interested in understanding these models.
- Out-of-scope use cases: Should not be used in downstream applications without risk evaluation and mitigation. Should not be used to create or disseminate music that creates hostile or alienating environments.
Metrics
- Model performance measures:
- Frechet Audio Distance computed on features from a pre-trained audio classifier (VGGish).
- Kullback-Leibler Divergence on label distributions from a pre-trained audio classifier (PaSST); see the sketch after this list.
- CLAP Score between audio embedding and text embedding from a pre-trained CLAP model.
- Qualitative studies with human participants on overall quality, text relevance, and melody adherence.
- Decision thresholds: Not applicable.
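To make the KL-divergence metric concrete, here is a minimal sketch with made-up label distributions; in the actual evaluation the distributions come from a pre-trained PaSST classifier run on reference and generated audio, which is not reproduced here:

import torch
import torch.nn.functional as F

# Hypothetical label distributions from an audio classifier:
# p for a reference clip, q for a generated clip
p = torch.tensor([0.7, 0.2, 0.1])
q = torch.tensor([0.5, 0.3, 0.2])

# F.kl_div takes log-probabilities as input and probabilities as target, giving KL(p || q)
kld = F.kl_div(q.log(), p, reduction="sum")
print(kld.item())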
Evaluation Datasets
Evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training Datasets
Trained on licensed data from Meta Music Initiative Sound Collection, Shutterstock music collection, and Pond5 music collection.
Evaluation Results
Model | Frechet Audio Distance | KLD | Text Consistency | Chroma Cosine Similarity
---|---|---|---|---
facebook/musicgen-small | 4.88 | 1.42 | 0.27 | -
facebook/musicgen-medium | 5.14 | 1.38 | 0.28 | -
facebook/musicgen-large | 5.48 | 1.37 | 0.28 | -
facebook/musicgen-melody | 4.93 | 1.41 | 0.27 | 0.44
Limitations and Biases
- Data: The data sources may lack diversity. The model was trained on 20K hours of data, and scaling on larger datasets may improve performance.
- Mitigations: Vocals were removed from the data source using tags and a state-of-the-art music source separation method (Hybrid Transformer for Music Source Separation, HT-Demucs).
- Limitations: Can't generate realistic vocals, performs better with English descriptions, may not work equally well for all music styles and cultures, sometimes generates silent endings, and prompt engineering may be required for satisfying results.
🔧 Technical Details
MusicGen is a single-stage auto-regressive Transformer model trained over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. The stereo models work by getting 2 streams of tokens from the EnCodec model and interleaving them using the delay pattern.
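As a rough illustration of the delay pattern (a minimal sketch with made-up token IDs; the actual interleaving used by the released checkpoints lives in the Audiocraft codebase): codebook k is shifted by k steps, so all codebooks can be predicted jointly in a single autoregressive pass. For the stereo checkpoints, the left and right EnCodec token streams are interleaved before a pattern of this kind is applied.

import torch

num_codebooks, seq_len, pad_id = 4, 6, -1
# Hypothetical EnCodec token IDs for one stream: shape (num_codebooks, seq_len)
codes = torch.arange(num_codebooks * seq_len).reshape(num_codebooks, seq_len)

# Delay codebook k by k steps; padded positions are filled with a placeholder ID
delayed = torch.full((num_codebooks, seq_len + num_codebooks - 1), pad_id)
for k in range(num_codebooks):
    delayed[k, k : k + seq_len] = codes[k]

print(delayed)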
📄 License
Code is released under MIT, model weights are released under CC-BY-NC 4.0.