🚀 MAGNeT - Small - 300M - 10secs
MAGNeT is a text-to-music and text-to-sound model that generates high-quality audio samples from text descriptions. It addresses the need for efficient, high-quality audio generation, offering a new approach to music and sound creation.
🚀 Quick Start
🤗 Transformers Usage
Coming soon...
Audiocraft Usage
You can run MAGNeT locally through the original Audiocraft library:
- First, install the `audiocraft` library:
```
pip install git+https://github.com/facebookresearch/audiocraft.git
```
- Ensure that `ffmpeg` is installed:
```
apt-get install ffmpeg
```
- Run the following Python code:
```python
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

# Load the pre-trained checkpoint from the Hugging Face Hub.
model = MAGNeT.get_pretrained("facebook/magnet-small-10secs")

# One audio sample is generated per text description.
descriptions = ["happy rock", "energetic EDM"]
wav = model.generate(descriptions)

# Save each sample as {idx}.wav with loudness normalization.
for idx, one_wav in enumerate(wav):
    audio_write(f"{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```
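Generation can optionally be tuned before calling `generate`. The sketch below assumes audiocraft exposes a `set_generation_params` method on MAGNeT models; argument names and defaults may differ across library versions, so treat it as illustrative rather than authoritative:
```python
# Illustrative only: assumes audiocraft's MAGNeT exposes set_generation_params
# with these sampling arguments; check your installed version's signature.
model.set_generation_params(
    use_sampling=True,  # sample from the token distribution
    top_k=0,            # disable top-k filtering
    top_p=0.9,          # nucleus sampling threshold
)
wav = model.generate(descriptions)
```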
✨ Features
MAGNeT is a masked generative non-autoregressive Transformer trained over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz. Unlike prior work, it doesn't require semantic token conditioning or model cascading, and it generates all 4 codebooks using a single non-autoregressive Transformer.
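As a back-of-the-envelope illustration of this representation (a sketch assuming only the 50 Hz frame rate and 4 codebooks stated above), a 10-second clip corresponds to a grid of 4 × 500 discrete tokens:
```python
# Token accounting for MAGNeT's EnCodec representation:
# 4 parallel codebooks, each sampled at 50 frames per second.
FRAME_RATE_HZ = 50
NUM_CODEBOOKS = 4

def token_grid(duration_secs: float) -> tuple[int, int]:
    """Return (timesteps, total tokens) for a clip of the given duration."""
    steps = int(duration_secs * FRAME_RATE_HZ)
    return steps, steps * NUM_CODEBOOKS

print(token_grid(10.0))  # (500, 2000): a 10-second clip is a 4 x 500 token grid
```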
Six checkpoints are released:
- facebook/magnet-small-10secs (this checkpoint)
- facebook/magnet-medium-10secs
- facebook/magnet-small-30secs
- facebook/magnet-medium-30secs
- facebook/audio-magnet-small
- facebook/audio-magnet-medium
📚 Documentation
Model details
| Property | Details |
| --- | --- |
| Organization developing the model | The FAIR team of Meta AI |
| Model date | MAGNeT was trained between November 2023 and January 2024 |
| Model version | This is version 1 of the model |
| Model type | MAGNeT consists of an EnCodec model for audio tokenization and a non-autoregressive, transformer-based language model for music modeling. The model comes in different sizes (300M, 1.5B) and two variants: a model trained for the text-to-music generation task and a model trained for the text-to-audio generation task |
| Paper or resources for more information | More information can be found in the paper [Masked Audio Generation using a Single Non-Autoregressive Transformer](https://arxiv.org/abs/2401.04577) |
| Citation details | @misc{ziv2024masked, title={Masked Audio Generation using a Single Non-Autoregressive Transformer}, author={Alon Ziv and Itai Gat and Gael Le Lan and Tal Remez and Felix Kreuk and Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi}, year={2024}, eprint={2401.04577}, archivePrefix={arXiv}, primaryClass={cs.SD}} |
| License | Code is released under MIT, model weights are released under CC-BY-NC 4.0 |
| Where to send questions or comments about the model | Questions and comments about MAGNeT can be sent via the [GitHub repository](https://github.com/facebookresearch/audiocraft) of the project, or by opening an issue |
Intended use
Primary intended use: The primary use of MAGNeT is research on AI-based music generation, including research efforts to probe and understand the limitations of generative models, and generation of music guided by text for machine learning amateurs.
Primary intended users: The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models.
Out-of-scope use cases: The model should not be used in downstream applications without further risk evaluation and mitigation. It should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people.
Metrics
Model performance measures:
- Frechet Audio Distance computed on features extracted from a pre-trained audio classifier (VGGish)
- Kullback-Leibler Divergence on label distributions extracted from a pre-trained audio classifier (PaSST)
- CLAP Score between audio embeddings and text embeddings extracted from a pre-trained CLAP model (illustrated below)
Additionally, qualitative studies with human participants were run, evaluating the model on the overall quality of the music samples and their relevance to the provided text input.
Decision thresholds: Not applicable.
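For intuition, the CLAP score is essentially the cosine similarity between an audio embedding and a text embedding. A minimal, hypothetical sketch follows; the random vectors are stand-ins for real CLAP model outputs:
```python
import numpy as np

# Hypothetical illustration of the CLAP score: cosine similarity between an
# audio embedding and a text embedding. Random vectors stand in for the
# embeddings a pre-trained CLAP model would produce.
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal(512)  # stand-in for a CLAP audio embedding
text_emb = rng.standard_normal(512)   # stand-in for a CLAP text embedding

def clap_score(a: np.ndarray, t: np.ndarray) -> float:
    """Cosine similarity between L2-normalized embeddings."""
    a = a / np.linalg.norm(a)
    t = t / np.linalg.norm(t)
    return float(a @ t)

print(clap_score(audio_emb, text_emb))
```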
Evaluation datasets
The model was evaluated on the MusicCaps benchmark and an in-domain held-out evaluation set with no artist overlap with the training set.
Training datasets
The model was trained on licensed data from the Meta Music Initiative Sound Collection, Shutterstock music collection and the Pond5 music collection.
Evaluation results
| Model | Frechet Audio Distance | KLD | Text Consistency |
| --- | --- | --- | --- |
| facebook/magnet-small-10secs | 4.22 | 1.11 | 0.28 |
| facebook/magnet-medium-10secs | 4.61 | 1.14 | 0.28 |
| facebook/magnet-small-30secs | 4.35 | 1.17 | 0.28 |
| facebook/magnet-medium-30secs | 4.63 | 1.20 | 0.28 |
Audio-MAGNeT - Sound-effect generation models
Training datasets
The audio-magnet models were trained on a subset of AudioSet (Gemmeke et al., 2017), [BBC sound effects](https://sound-effects.bbcrewind.co.uk/), AudioCaps (Kim et al., 2019), Clotho v2 (Drossos et al., 2020), VGG-Sound (Chen et al., 2020), FSD50K (Fonseca et al., 2021), [Free To Use Sounds](https://www.freetousesounds.com/all-in-one-bundle/), Sonniss Game Effects, [WeSoundEffects](https://wesoundeffects.com/we-sound-effects-bundle-2020/), [Paramount Motion - Odeon Cinematic Sound Effects](https://www.paramountmotion.com/odeon-sound-effects).
Evaluation datasets
The audio-magnet models (sound effect generation) were evaluated on the AudioCaps benchmark.
Evaluation results
| Model | Frechet Audio Distance | KLD |
| --- | --- | --- |
| facebook/audio-magnet-small | 3.21 | 1.42 |
| facebook/audio-magnet-medium | 2.32 | 1.64 |
🔧 Technical Details
Limitations and biases
Data: The data sources are created by music professionals and covered by legal agreements. Scaling the model on larger datasets may improve performance.
Mitigations: Tracks with vocals were removed using tags and the Hybrid Transformer for Music Source Separation (HT-Demucs).
Limitations:
- The model can't generate realistic vocals.
- It performs better with English descriptions.
- It doesn't perform equally well for all music styles and cultures.
- It sometimes generates song endings with silence.
- Prompt engineering may be needed for satisfying results.
Biases: The data source may lack diversity, and the model may not perform equally well on all music genres. Generated samples may reflect training data biases.
Risks and harms: Biases and limitations may lead to generation of inappropriate samples. The model should not be used for downstream applications without risk investigation and mitigation.
Use cases: MAGNeT is for AI research on music generation and should not be used for downstream applications without further risk assessment.
📄 License
Code is released under MIT, model weights are released under CC-BY-NC 4.0.