🚀 MAGNeT - Small - 300M - 10secs
MAGNeT is a text-to-music and text-to-sound model that generates high-quality audio samples from text descriptions. It addresses the need for efficient, high-quality audio generation, offering a new approach to music and sound creation.
🚀 Quick Start
🤗 Transformers Usage
Coming soon...
Audiocraft Usage
You can run MAGNeT locally through the original Audiocraft library:
- First, install the `audiocraft` library:
```
pip install git+https://github.com/facebookresearch/audiocraft.git
```
- Ensure that `ffmpeg` is installed:
```
apt-get install ffmpeg
```
- Run the following Python code:
```python
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

# Load the pre-trained checkpoint from the Hugging Face Hub.
model = MAGNeT.get_pretrained("facebook/magnet-small-10secs")

# One audio sample is generated per text description.
descriptions = ["happy rock", "energetic EDM"]
wav = model.generate(descriptions)

# Save each sample as {idx}.wav with loudness normalization.
for idx, one_wav in enumerate(wav):
    audio_write(f"{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```
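Generation can optionally be tuned before calling `generate`. The sketch below assumes audiocraft exposes a `set_generation_params` method on MAGNeT models; argument names and defaults may differ across library versions, so treat it as illustrative rather than authoritative:
```python
# Illustrative only: assumes audiocraft's MAGNeT exposes set_generation_params
# with these sampling arguments; check your installed version's signature.
model.set_generation_params(
    use_sampling=True,  # sample from the token distribution
    top_k=0,            # disable top-k filtering
    top_p=0.9,          # nucleus sampling threshold
)
wav = model.generate(descriptions)
```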
✨ Features
MAGNeT is a masked generative non-autoregressive Transformer trained over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike prior work, it doesn't require semantic token conditioning or model cascading, and it generates all 4 codebooks using a single non-autoregressive Transformer.
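As a back-of-the-envelope illustration of this representation (a sketch assuming only the 50 Hz frame rate and 4 codebooks stated above), a 10-second clip corresponds to a grid of 4 × 500 discrete tokens:
```python
# Token accounting for MAGNeT's EnCodec representation:
# 4 parallel codebooks, each sampled at 50 frames per second.
FRAME_RATE_HZ = 50
NUM_CODEBOOKS = 4

def token_grid(duration_secs: float) -> tuple[int, int]:
    """Return (timesteps, total tokens) for a clip of the given duration."""
    steps = int(duration_secs * FRAME_RATE_HZ)
    return steps, steps * NUM_CODEBOOKS

print(token_grid(10.0))  # (500, 2000): a 10-second clip is a 4 x 500 token grid
```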
Six checkpoints are released:
- facebook/magnet-small-10secs (this checkpoint)
- facebook/magnet-medium-10secs
- facebook/magnet-small-30secs
- facebook/magnet-medium-30secs
- facebook/audio-magnet-small
- facebook/audio-magnet-medium
📚 Documentation
Model details
| Property | Details |
| --- | --- |
| Organization developing the model | The FAIR team of Meta AI |
| Model date | MAGNeT was trained between November 2023 and January 2024 |
| Model version | This is version 1 of the model |
| Model type | MAGNeT consists of an EnCodec model for audio tokenization and a non-autoregressive, transformer-based language model for music modeling. The model comes in different sizes (300M, 1.5B) and two variants: a model trained for the text-to-music generation task and a model trained for the text-to-audio generation task |
| Paper or resources for more information | More information can be found in the paper [Masked Audio Generation using a Single Non-Autoregressive Transformer](https://arxiv.org/abs/2401.04577) |
| Citation details | @misc{ziv2024masked, title={Masked Audio Generation using a Single Non-Autoregressive Transformer}, author={Alon Ziv and Itai Gat and Gael Le Lan and Tal Remez and Felix Kreuk and Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi}, year={2024}, eprint={2401.04577}, archivePrefix={arXiv}, primaryClass={cs.SD}} |
| License | Code is released under MIT, model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Questions and comments about MAGNeT can be sent via the [GitHub repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue |
Intended use
Primary intended use: The primary use of MAGNeT is research on AI-based music generation, including research efforts to probe and understand the limitations of generative models, and generation of music guided by text for machine learning amateurs.
Primary intended users: The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models.
Out-of-scope use cases: The model should not be used in downstream applications without further risk evaluation and mitigation. It should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people.
Metrics
Model performance measures:
- Frechet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish)
- Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST)
- CLAP Score between audio embeddings and text embeddings extracted from a pre-trained CLAP model (illustrated below)
Additionally, qualitative studies with human participants were run, evaluating the model on the overall quality of the music samples and their relevance to the provided text input.
Decision thresholds: Not applicable.
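For intuition, the CLAP score is essentially the cosine similarity between an audio embedding and a text embedding. A minimal, hypothetical sketch follows; the random vectors are stand-ins for real CLAP model outputs:
```python
import numpy as np

# Hypothetical illustration of the CLAP score: cosine similarity between an
# audio embedding and a text embedding. Random vectors stand in for the
# embeddings a pre-trained CLAP model would produce.
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(512)  # stand-in for a CLAP audio embedding
text_emb = rng.standard_normal(512)   # stand-in for a CLAP text embedding

def clap_score(a: np.ndarray, t: np.ndarray) -> float:
    """Cosine similarity between L2-normalized embeddings."""
    a = a / np.linalg.norm(a)
    t = t / np.linalg.norm(t)
    return float(a @ t)

print(clap_score(audio_emb, text_emb))
```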
Evaluation datasets
The model was evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training datasets
The model was trained on licensed data from the Meta Music Initiative Sound Collection, Shutterstock music collection and the Pond5 music collection.
Evaluation results
| Model | Frechet Audio Distance | KLD | Text Consistency |
| --- | --- | --- | --- |
| facebook/magnet-small-10secs | 4.22 | 1.11 | 0.28 |
| facebook/magnet-medium-10secs | 4.61 | 1.14 | 0.28 |
| facebook/magnet-small-30secs | 4.35 | 1.17 | 0.28 |
| facebook/magnet-medium-30secs | 4.63 | 1.20 | 0.28 |
Audio-MAGNeT - Sound-effect generation models
Training datasets
The audio-magnet models were trained on a subset of AudioSet (Gemmeke et al., 2017), [BBC sound effects](https://sound-effects.bbcrewind.co.uk/), AudioCaps (Kim et al., 2019), Clotho v2 (Drossos et al., 2020), VGG-Sound (Chen et al., 2020), FSD50K (Fonseca et al., 2021), [Free To Use Sounds](https://www.freetousesounds.com/all-in-one-bundle/), Sonniss Game Effects, [WeSoundEffects](https://wesoundeffects.com/we-sound-effects-bundle-2020/), [Paramount Motion - Odeon Cinematic Sound Effects](https://www.paramountmotion.com/odeon-sound-effects).
Evaluation datasets
The audio-magnet models (sound effect generation) were evaluated on the AudioCaps benchmark.
Evaluation results
| Model | Frechet Audio Distance | KLD |
| --- | --- | --- |
| facebook/audio-magnet-small | 3.21 | 1.42 |
| facebook/audio-magnet-medium | 2.32 | 1.64 |
🔧 Technical Details
Limitations and biases
Data: The data sources are created by music professionals and covered by legal agreements. Scaling the model on larger datasets may improve performance.
Mitigations: Tracks with vocals were removed using tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
Limitations:
- The model can't generate realistic vocals.
- It performs better with English descriptions.
- It doesn't perform equally well for all music styles and cultures.
- It sometimes generates song endings with silence.
- Prompt engineering may be needed for satisfying results.
Biases: The data source may lack diversity, and the model may not perform equally well on all music genres. Generated samples may reflect training data biases.
Risks and harms: Biases and limitations may lead to generation of inappropriate samples. The model should not be used for downstream applications without risk investigation and mitigation.
Use cases: MAGNeT is for AI research on music generation and should not be used for downstream applications without further risk assessment.
📄 License
Code is released under MIT, model weights are released under CC-BY-NC 4.0.