🚀 Model Card for discogs-maest-30s-pw-129e
MAEST is a family of Transformer models based on PaSST that specializes in music analysis. It provides pre-trained models for music style classification and performs well on various downstream music analysis tasks.
🚀 Quick Start
The MAEST models can be used with the audio-classification pipeline of the transformers library. Here is a basic example:
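import numpy as np
from transformers import pipeline

# 30 seconds of random noise (30 * 16000 samples) as a stand-in for real audio
audio = np.random.randn(30 * 16000)
# MAEST relies on custom code, so trust_remote_code=True is required
pipe = pipeline("audio-classification", model="mtg-upf/discogs-maest-30s-pw-129e", trust_remote_code=True)
pipe(audio)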
[{'score': 0.6158794164657593, 'label': 'Electronic---Noise'},
{'score': 0.08825448155403137, 'label': 'Electronic---Experimental'},
{'score': 0.08772594481706619, 'label': 'Electronic---Abstract'},
{'score': 0.03644488751888275, 'label': 'Rock---Noise'},
{'score': 0.03272806480526924, 'label': 'Electronic---Musique Concrète'}]
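The example above feeds the pipeline exactly `30 * 16000` samples. For real recordings of other lengths, a minimal sketch of a helper that crops or pads a mono waveform to that size might look like the following. The 16 kHz rate and 30 s duration are assumptions read off the example and the model name, and `fit_to_clip` is a hypothetical helper, not part of transformers or MAEST. The pipeline may well handle other lengths on its own.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumption: inferred from `30 * 16000` in the example
CLIP_SECONDS = 30    # assumption: inferred from "30s" in the model name


def fit_to_clip(audio, sample_rate=SAMPLE_RATE, seconds=CLIP_SECONDS):
    """Center-crop or zero-pad a mono waveform to exactly `seconds` of audio.

    Hypothetical helper for illustration only.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    target = sample_rate * seconds
    if len(audio) >= target:
        # Keep the middle of the track.
        start = (len(audio) - target) // 2
        return audio[start:start + target]
    # Too short: append trailing silence.
    return np.pad(audio, (0, target - len(audio)))


# Stand-in signals: one longer and one shorter than the target length.
long_track = np.random.randn(45 * SAMPLE_RATE)
short_track = np.random.randn(10 * SAMPLE_RATE)
clip_a = fit_to_clip(long_track)
clip_b = fit_to_clip(short_track)
```

Center-cropping is only one possible choice; averaging predictions over several windows of a long track is another.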
✨ Features
Model Details
MAEST is a family of Transformer models based on PaSST and focused on music analysis applications. The models are available for inference in the Essentia library and for inference and training in the official repository. You can try the MAEST interactive demo on Replicate.
⚠️ Important Note
This model is available under the CC BY-NC-SA 4.0 license for non-commercial applications and under a proprietary license upon request. Contact us for more information.
⚠️ Important Note
The MAEST models rely on custom code. Set trust_remote_code=True to use them within the 🤗 Transformers audio-classification pipeline.
Model Description
- Developed by: Pablo Alonso
- Shared by: Pablo Alonso
- Model type: Transformer
- License: cc-by-nc-sa-4.0
- Finetuned from model: PaSST
Model Sources
Uses
MAEST is a music audio representation model pre-trained on the task of music style classification. According to the original paper, it shows good performance in several downstream music analysis tasks.
Direct Use
The MAEST models can make predictions for a taxonomy of 400 music styles derived from the public metadata of Discogs.
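The labels in the Quick Start output follow a `Genre---Style` pattern. A minimal sketch of splitting those labels and summing scores per parent genre is shown below. It assumes that `---` always separates genre from style, which is inferred from the example output rather than documented here. `split_label` and `scores_by_genre` are hypothetical helpers.

```python
from collections import defaultdict

# Predictions copied from the Quick Start example output.
predictions = [
    {"score": 0.6158794164657593, "label": "Electronic---Noise"},
    {"score": 0.08825448155403137, "label": "Electronic---Experimental"},
    {"score": 0.08772594481706619, "label": "Electronic---Abstract"},
    {"score": 0.03644488751888275, "label": "Rock---Noise"},
    {"score": 0.03272806480526924, "label": "Electronic---Musique Concrète"},
]


def split_label(label):
    """Split 'Genre---Style' into (genre, style); assumes '---' is the separator."""
    genre, _, style = label.partition("---")
    return genre, style


def scores_by_genre(preds):
    """Sum style-level scores per parent genre, highest total first."""
    totals = defaultdict(float)
    for pred in preds:
        genre, _ = split_label(pred["label"])
        totals[genre] += pred["score"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


genre_scores = scores_by_genre(predictions)
```

Because the example output lists only five styles, the per-genre totals here are partial sums. Covering more of the taxonomy would require asking the pipeline for more predictions, for instance through its `top_k` argument.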
Downstream Use
The MAEST models have reported good performance in downstream applications related to music genre recognition, music emotion recognition, and instrument detection. Specifically, the original paper reports that the best performance is obtained from representations extracted from intermediate layers of the model.
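To illustrate what using an intermediate-layer representation can look like, here is a minimal sketch that averages one layer's token embeddings into a single vector per clip. It assumes the per-layer hidden states are already available as a sequence of `[batch, tokens, dim]` arrays. Many 🤗 Transformers models expose these via `output_hidden_states=True`, but whether the MAEST custom code does is an assumption. The shapes, the layer index, and the `pool_layer` helper are all hypothetical; the paper should be consulted for the layer that actually performs best.

```python
import numpy as np


def pool_layer(hidden_states, layer):
    """Mean-pool the token axis of one layer's hidden states.

    hidden_states: sequence of arrays shaped [batch, tokens, dim], one per layer.
    Returns an array shaped [batch, dim].
    """
    states = np.asarray(hidden_states[layer])
    if states.ndim != 3:
        raise ValueError("expected hidden states shaped [batch, tokens, dim]")
    return states.mean(axis=1)


# Stand-in data with hypothetical shapes: 13 layers, 1 clip, 50 tokens, 768 dims.
rng = np.random.default_rng(0)
fake_hidden_states = [rng.standard_normal((1, 50, 768)) for _ in range(13)]

# Pick a layer from the middle of the stack; the index is arbitrary here.
middle = len(fake_hidden_states) // 2
embedding = pool_layer(fake_hidden_states, layer=middle)
```

The resulting vector could then serve as input features for a small downstream classifier.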
Out-of-Scope Use
The model has not been evaluated outside the context of music understanding applications, so its performance outside the intended domain is unknown. Although it is used through the audio-classification pipeline, MAEST is NOT a general-purpose audio classification model (such as [AST](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)), so it should not be expected to perform well on tasks such as AudioSet.
Bias, Risks, and Limitations
The MAEST models were trained using Discogs20, an in-house MTG dataset. Western (particularly electronic) music is over-represented, although efforts were made to maximize diversity across the 400 music styles in the dataset.
📚 Documentation
Training Details
Training Data
Our models were trained using Discogs20, an MTG in-house dataset featuring 3.3M music tracks matched to Discogs' metadata.
Training Procedure
Most training details are described in the paper and the official implementation of the model.
Preprocessing
MAEST models rely on mel-spectrograms originally extracted with the Essentia library and used in several previous publications. In Transformers, this mel-spectrogram signature is replicated to a certain extent using audio_utils, which has a very small (but not negligible) impact on the predictions.
Evaluation, Metrics, and Results
The MAEST models were pre-trained on the task of music style classification, and their internal representations were evaluated via downstream MLP probes on several benchmark music understanding tasks. Check the original paper for details.
Environmental Impact
- Hardware Type: 4 x Nvidia RTX 2080 Ti
- Hours used: approx. 32
- Carbon Emitted: approx. 3.46 kg CO2 eq.
Carbon emissions estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
Technical Specifications
Model Architecture and Objective
[Audio Spectrogram Transformer (AST)](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)
Compute Infrastructure
- Hardware: 4 x Nvidia RTX 2080 Ti
- Software: PyTorch
Citation
BibTeX
@inproceedings{alonso2023music,
title={Efficient supervised training of audio transformers for music representation learning},
author={Alonso-Jim{\'e}nez, Pablo and Serra, Xavier and Bogdanov, Dmitry},
booktitle={Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023)},
year={2023},
organization={International Society for Music Information Retrieval (ISMIR)}
}
APA
Alonso-Jiménez, P., Serra, X., & Bogdanov, D. (2023). Efficient Supervised Training of Audio Transformers for Music Representation Learning. In Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023).
Model Card Authors
Pablo Alonso
Model Card Contact
📄 License
This model is available under the CC BY-NC-SA 4.0 license for non-commercial applications and under a proprietary license upon request.