🚀 MAGNeT - Medium - 1.5B - 10secs
MAGNeT is a text-to-music and text-to-sound model that generates high-quality audio samples from text descriptions, offering a new approach to music creation and sound generation.
🚀 Quick Start
MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples conditioned on text descriptions.
It is a masked generative non-autoregressive Transformer trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz.
Unlike prior work, MAGNeT requires neither semantic token conditioning nor model cascading, and it generates all 4 codebooks using a single non-autoregressive Transformer.
MAGNeT was published in [Masked Audio Generation using a Single Non-Autoregressive Transformer](https://arxiv.org/abs/2401.04577) by Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve and Yossi Adi.
Six checkpoints are released:
- facebook/magnet-small-10secs
- facebook/magnet-medium-10secs (this checkpoint)
- facebook/magnet-small-30secs
- facebook/magnet-medium-30secs
- facebook/audio-magnet-small
- facebook/audio-magnet-medium
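To put the tokenizer setup above in concrete terms, a 10-second sample corresponds to a 4 × 500 grid of EnCodec tokens that the single non-autoregressive Transformer fills in without cascading. A minimal back-of-the-envelope sketch (variable names are illustrative, not part of the Audiocraft API):
```python
# Token-grid size implied by the EnCodec tokenizer settings described above
frame_rate_hz = 50     # token frames per second per codebook
num_codebooks = 4      # parallel EnCodec codebooks
duration_s = 10        # this checkpoint generates 10-second samples

frames = frame_rate_hz * duration_s      # 500 timesteps
total_tokens = frames * num_codebooks    # 2000 tokens in the 4 x 500 grid
print(frames, total_tokens)              # 500 2000
```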
✨ Features
- Text-Based Audio Generation: Generate high-quality audio samples based on text descriptions.
- Single Non-Autoregressive Transformer: Generates all 4 codebooks without semantic token conditioning or model cascading.
- Multiple Checkpoints: Six checkpoints are available, covering different tasks, model sizes and sample durations.
📦 Installation
You can run MAGNeT locally through the original Audiocraft library:
- First install the `audiocraft` library:
```bash
pip install git+https://github.com/facebookresearch/audiocraft.git
```
- Make sure to have `ffmpeg` installed:
```bash
apt-get install ffmpeg
```
💻 Usage Examples
Basic Usage
You can run the following Python code after installation:
```python
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

# Load the pre-trained 1.5B, 10-second text-to-music checkpoint
model = MAGNeT.get_pretrained("facebook/magnet-medium-10secs")

# Generate one audio sample per text description
descriptions = ["happy rock", "energetic EDM"]
wav = model.generate(descriptions)

# Write each sample to {idx}.wav with loudness normalization
for idx, one_wav in enumerate(wav):
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
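The same workflow can be pointed at the sound-effect checkpoints listed further below. This is a minimal sketch assuming the audio-magnet variants are loaded and sampled exactly like the music models; the prompt strings are only illustrative:
```python
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

# Assumption: the sound-effect (audio-magnet) checkpoints expose the same interface
model = MAGNeT.get_pretrained("facebook/audio-magnet-medium")

descriptions = ["dog barking in the distance", "rain falling on a tin roof"]  # illustrative prompts
wav = model.generate(descriptions)

for idx, one_wav in enumerate(wav):
    audio_write(f'sfx_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```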
📚 Documentation
Model details
| Property | Details |
|----------|---------|
| Organization developing the model | The FAIR team of Meta AI. |
| Model date | MAGNeT was trained between November 2023 and January 2024. |
| Model version | This is version 1 of the model. |
| Model type | MAGNeT consists of an EnCodec model for audio tokenization and a non-autoregressive language model based on the Transformer architecture for music modeling. The model comes in different sizes (300M and 1.5B) and two variants: a model trained for the text-to-music generation task and a model trained for text-to-audio generation. |
| Paper or resources for more information | More information can be found in the paper [Masked Audio Generation using a Single Non-Autoregressive Transformer](https://arxiv.org/abs/2401.04577). |
| Citation details | @misc{ziv2024masked, title={Masked Audio Generation using a Single Non-Autoregressive Transformer}, author={Alon Ziv and Itai Gat and Gael Le Lan and Tal Remez and Felix Kreuk and Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi}, year={2024}, eprint={2401.04577}, archivePrefix={arXiv}, primaryClass={cs.SD}} |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0. |
| Where to send questions or comments about the model | Questions and comments about MAGNeT can be sent via the GitHub repository of the project, or by opening an issue. |
Intended use
- Primary intended use: The primary use of MAGNeT is research on AI-based music generation, including research efforts to probe and understand the limitations of generative models, and generation of music guided by text for machine learning amateurs.
- Primary intended users: Researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models.
- Out-of-scope use cases: The model should not be used on downstream applications without further risk evaluation and mitigation. It should not be used to create or disseminate music that creates hostile or alienating environments, such as disturbing, distressing or offensive music, or content that propagates stereotypes.
Metrics
- Model performance measures:
  - Objective measures: Frechet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish), Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST), and CLAP Score between the audio embedding and the text embedding extracted from a pre-trained CLAP model (see the sketch after this list).
  - Qualitative studies: Overall quality of the music samples and text relevance to the provided text input.
- Decision thresholds: Not applicable.
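As an illustration of how the CLAP-based text consistency measure works, the sketch below scores audio/text pairs as the cosine similarity of their embeddings. The embedding inputs are stand-ins for the outputs of a pre-trained CLAP model; this is not the exact evaluation pipeline behind the numbers reported below.
```python
import torch
import torch.nn.functional as F

def clap_style_score(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between paired audio and text embeddings of shape (batch, dim)."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return (audio_emb * text_emb).sum(dim=-1)  # one score per (audio, text) pair

# Toy usage with random tensors standing in for real CLAP embeddings
scores = clap_style_score(torch.randn(2, 512), torch.randn(2, 512))
print(scores)
```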
Evaluation datasets
The model was evaluated on the MusicCaps benchmark and on an in-domain held-out evaluation set, with no artist overlap with the training set.
Training datasets
The model was trained on licensed data using the following sources: the Meta Music Initiative Sound Collection, Shutterstock music collection and the Pond5 music collection.
Evaluation results
| Model | Frechet Audio Distance | KLD | Text Consistency |
|-------|------------------------|-----|------------------|
| facebook/magnet-small-10secs | 4.22 | 1.11 | 0.28 |
| facebook/magnet-medium-10secs | 4.61 | 1.14 | 0.28 |
| facebook/magnet-small-30secs | 4.35 | 1.17 | 0.28 |
| facebook/magnet-medium-30secs | 4.63 | 1.20 | 0.28 |
Audio-MAGNeT - Sound-effect generation models
Training datasets
The audio-magnet models were trained on the following data sources: a subset of AudioSet (Gemmeke et al., 2017), [BBC sound effects](https://sound-effects.bbcrewind.co.uk/), AudioCaps (Kim et al., 2019), Clotho v2 (Drossos et al., 2020), VGG-Sound (Chen et al., 2020), FSD50K (Fonseca et al., 2021), [Free To Use Sounds](https://www.freetousesounds.com/all-in-one-bundle/), Sonniss Game Effects, [WeSoundEffects](https://wesoundeffects.com/we-sound-effects-bundle-2020/), [Paramount Motion - Odeon Cinematic Sound Effects](https://www.paramountmotion.com/odeon-sound-effects).
Evaluation datasets
The audio-magnet models (sound effect generation) were evaluated on the AudioCaps benchmark.
Evaluation results
| Model | Frechet Audio Distance | KLD |
|-------|------------------------|-----|
| facebook/audio-magnet-small | 3.21 | 1.42 |
| facebook/audio-magnet-medium | 2.32 | 1.64 |
🔧 Technical Details
Limitations and biases
- Data: The data sources are created by music professionals and covered by legal agreements. Scaling the model on larger datasets may improve performance.
- Mitigations: Tracks with vocals were removed using tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
- Limitations:
- Unable to generate realistic vocals.
- Performs better with English descriptions.
- Uneven performance across music styles and cultures.
- Sometimes generates end-of-song silence (see the trimming sketch after this list).
- Difficult to determine optimal text descriptions.
- Biases: The data source may lack diversity, and the model may reflect biases from the training data.
- Risks and harms: Biases and limitations may lead to generation of inappropriate samples.
- Use cases: Users should be aware of biases, limitations and risks, and avoid using the model for downstream applications without further investigation.
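Because generations can end with a stretch of silence (see the limitation noted above), a lightweight post-processing step can trim it. The helper below is a hypothetical sketch using plain PyTorch operations on a waveform returned by `model.generate`; it is not part of the Audiocraft API, and the threshold and window size are illustrative defaults.
```python
import torch

def trim_trailing_silence(wav: torch.Tensor, sample_rate: int,
                          threshold: float = 1e-3, window_s: float = 0.1) -> torch.Tensor:
    """Drop trailing near-silent audio from a (channels, samples) waveform.

    Scans fixed-size windows from the end and cuts once the window RMS
    rises above `threshold`. Hypothetical helper, not part of Audiocraft.
    """
    window = max(1, int(window_s * sample_rate))
    end = wav.shape[-1]
    while end > window:
        rms = wav[..., end - window:end].pow(2).mean().sqrt()
        if rms > threshold:
            break
        end -= window
    return wav[..., :end]

# Example: trimmed = trim_trailing_silence(one_wav.cpu(), model.sample_rate)
```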
📄 License
Code is released under MIT; model weights are released under CC-BY-NC 4.0.