🚀 MAGNeT - Medium - 1.5B - 10secs
MAGNeT is a text-to-music and text-to-sound model that generates high-quality audio samples from text descriptions, offering a new approach to music creation and sound generation.
🚀 Quick Start
MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples conditioned on text descriptions.
It is a masked generative non-autoregressive Transformer trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz.
Unlike prior work, MAGNeT requires neither semantic token conditioning nor model cascading, and it generates all 4 codebooks using a single non-autoregressive Transformer.
MAGNeT was published in [Masked Audio Generation using a Single Non-Autoregressive Transformer](https://arxiv.org/abs/2401.04577) by Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve and Yossi Adi.
Six checkpoints are released:
- facebook/magnet-small-10secs
- facebook/magnet-medium-10secs (this checkpoint)
- facebook/magnet-small-30secs
- facebook/magnet-medium-30secs
- facebook/audio-magnet-small
- facebook/audio-magnet-medium
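To put the tokenizer setup above in concrete terms, a 10-second sample corresponds to a 4 × 500 grid of EnCodec tokens that the single non-autoregressive Transformer fills in without cascading. A minimal back-of-the-envelope sketch (variable names are illustrative, not part of the Audiocraft API):
```python
# Token-grid size implied by the EnCodec tokenizer settings described above
frame_rate_hz = 50     # token frames per second per codebook
num_codebooks = 4      # parallel EnCodec codebooks
duration_s = 10        # this checkpoint generates 10-second samples

frames = frame_rate_hz * duration_s      # 500 timesteps
total_tokens = frames * num_codebooks    # 2000 tokens in the 4 x 500 grid
print(frames, total_tokens)              # 500 2000
```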
✨ Features
- Text-Based Audio Generation: Generate high-quality audio samples based on text descriptions.
- Single Non-Autoregressive Transformer: Generates all 4 codebooks without semantic token conditioning or model cascading.
- Multiple Checkpoints: Six checkpoints are available, covering different tasks, model sizes and sample durations.
📦 Installation
You can run MAGNeT locally through the original Audiocraft library:
- First install the `audiocraft` library:
```bash
pip install git+https://github.com/facebookresearch/audiocraft.git
```
- Make sure to have `ffmpeg` installed:
```bash
apt-get install ffmpeg
```
💻 Usage Examples
Basic Usage
You can run the following Python code after installation:
```python
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

# Load the pre-trained 1.5B, 10-second text-to-music checkpoint
model = MAGNeT.get_pretrained("facebook/magnet-medium-10secs")

# Generate one audio sample per text description
descriptions = ["happy rock", "energetic EDM"]
wav = model.generate(descriptions)

# Write each sample to {idx}.wav with loudness normalization
for idx, one_wav in enumerate(wav):
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```
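The same workflow can be pointed at the sound-effect checkpoints listed further below. This is a minimal sketch assuming the audio-magnet variants are loaded and sampled exactly like the music models; the prompt strings are only illustrative:
```python
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

# Assumption: the sound-effect (audio-magnet) checkpoints expose the same interface
model = MAGNeT.get_pretrained("facebook/audio-magnet-medium")

descriptions = ["dog barking in the distance", "rain falling on a tin roof"]  # illustrative prompts
wav = model.generate(descriptions)

for idx, one_wav in enumerate(wav):
    audio_write(f'sfx_{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
```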
📚 Documentation
Model details
| Property | Details |
|----------|---------|
| Organization developing the model | The FAIR team of Meta AI. |
| Model date | MAGNeT was trained between November 2023 and January 2024. |
| Model version | This is version 1 of the model. |
| Model type | MAGNeT consists of an EnCodec model for audio tokenization and a non-autoregressive language model based on the Transformer architecture for music modeling. The model comes in different sizes (300M and 1.5B) and two variants: a model trained for the text-to-music generation task and a model trained for text-to-audio generation. |
| Paper or resources for more information | More information can be found in the paper [Masked Audio Generation using a Single Non-Autoregressive Transformer](https://arxiv.org/abs/2401.04577). |
| Citation details | @misc{ziv2024masked, title={Masked Audio Generation using a Single Non-Autoregressive Transformer}, author={Alon Ziv and Itai Gat and Gael Le Lan and Tal Remez and Felix Kreuk and Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi}, year={2024}, eprint={2401.04577}, archivePrefix={arXiv}, primaryClass={cs.SD}} |
| License | Code is released under MIT; model weights are released under CC-BY-NC 4.0. |
| Where to send questions or comments about the model | Questions and comments about MAGNeT can be sent via the GitHub repository of the project, or by opening an issue. |
Intended use
- Primary intended use: The primary use of MAGNeT is research on AI-based music generation, including research efforts to probe and understand the limitations of generative models, and generation of music guided by text for machine learning amateurs.
- Primary intended users: Researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models.
- Out-of-scope use cases: The model should not be used on downstream applications without further risk evaluation and mitigation. It should not be used to create or disseminate music that creates hostile or alienating environments, such as disturbing, distressing or offensive music, or content that propagates stereotypes.
Metrics
- Model performance measures:
  - Objective measures: Frechet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish), Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST), and CLAP Score between the audio embedding and the text embedding extracted from a pre-trained CLAP model (see the sketch after this list).
  - Qualitative studies: Overall quality of the music samples and text relevance to the provided text input.
- Decision thresholds: Not applicable.
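As an illustration of how the CLAP-based text consistency measure works, the sketch below scores audio/text pairs as the cosine similarity of their embeddings. The embedding inputs are stand-ins for the outputs of a pre-trained CLAP model; this is not the exact evaluation pipeline behind the numbers reported below.
```python
import torch
import torch.nn.functional as F

def clap_style_score(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between paired audio and text embeddings of shape (batch, dim)."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    return (audio_emb * text_emb).sum(dim=-1)  # one score per (audio, text) pair

# Toy usage with random tensors standing in for real CLAP embeddings
scores = clap_style_score(torch.randn(2, 512), torch.randn(2, 512))
print(scores)
```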
Evaluation datasets
The model was evaluated on the MusicCaps benchmark and on an in-domain held-out evaluation set, with no artist overlap with the training set.
Training datasets
The model was trained on licensed data using the following sources: the Meta Music Initiative Sound Collection, Shutterstock music collection and the Pond5 music collection.
Evaluation results
| Model | Frechet Audio Distance | KLD | Text Consistency |
|-------|------------------------|-----|------------------|
| facebook/magnet-small-10secs | 4.22 | 1.11 | 0.28 |
| facebook/magnet-medium-10secs | 4.61 | 1.14 | 0.28 |
| facebook/magnet-small-30secs | 4.35 | 1.17 | 0.28 |
| facebook/magnet-medium-30secs | 4.63 | 1.20 | 0.28 |
Audio-MAGNeT - Sound-effect generation models
Training datasets
The audio-magnet models were trained on the following data sources: a subset of AudioSet (Gemmeke et al., 2017), [BBC sound effects](https://sound-effects.bbcrewind.co.uk/), AudioCaps (Kim et al., 2019), Clotho v2 (Drossos et al., 2020), VGG-Sound (Chen et al., 2020), FSD50K (Fonseca et al., 2021), [Free To Use Sounds](https://www.freetousesounds.com/all-in-one-bundle/), Sonniss Game Effects, [WeSoundEffects](https://wesoundeffects.com/we-sound-effects-bundle-2020/), [Paramount Motion - Odeon Cinematic Sound Effects](https://www.paramountmotion.com/odeon-sound-effects).
Evaluation datasets
The audio-magnet models (sound effect generation) were evaluated on the AudioCaps benchmark.
Evaluation results
| Model | Frechet Audio Distance | KLD |
|-------|------------------------|-----|
| facebook/audio-magnet-small | 3.21 | 1.42 |
| facebook/audio-magnet-medium | 2.32 | 1.64 |
🔧 Technical Details
Limitations and biases
- Data: The data sources are created by music professionals and covered by legal agreements. Scaling the model on larger datasets may improve performance.
- Mitigations: Tracks with vocals were removed using tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
- Limitations:
- Unable to generate realistic vocals.
- Performs better with English descriptions.
- Uneven performance across music styles and cultures.
- Sometimes generates end-of-song silence (see the trimming sketch after this list).
- Difficult to determine optimal text descriptions.
- Biases: The data source may lack diversity, and the model may reflect biases from the training data.
- Risks and harms: Biases and limitations may lead to generation of inappropriate samples.
- Use cases: Users should be aware of biases, limitations and risks, and avoid using the model for downstream applications without further investigation.
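Because generations can end with a stretch of silence (see the limitation noted above), a lightweight post-processing step can trim it. The helper below is a hypothetical sketch using plain PyTorch operations on a waveform returned by `model.generate`; it is not part of the Audiocraft API, and the threshold and window size are illustrative defaults.
```python
import torch

def trim_trailing_silence(wav: torch.Tensor, sample_rate: int,
                          threshold: float = 1e-3, window_s: float = 0.1) -> torch.Tensor:
    """Drop trailing near-silent audio from a (channels, samples) waveform.

    Scans fixed-size windows from the end and cuts once the window RMS
    rises above `threshold`. Hypothetical helper, not part of Audiocraft.
    """
    window = max(1, int(window_s * sample_rate))
    end = wav.shape[-1]
    while end > window:
        rms = wav[..., end - window:end].pow(2).mean().sqrt()
        if rms > threshold:
            break
        end -= window
    return wav[..., :end]

# Example: trimmed = trim_trailing_silence(one_wav.cpu(), model.sample_rate)
```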
📄 License
Code is released under MIT; model weights are released under CC-BY-NC 4.0.