# Model Card: Pre-trained Audio Representation Models on AudioSet
This model card provides information about pre-trained audio representation models released by ALM. These models, pre-trained on the full AudioSet dataset, are designed for general-purpose Audio Representation Learning (ARL) tasks.
## 🚀 Quick Start
This section gives a brief overview of the pre-trained audio representation models and how they can be used.
## ✨ Features
- Multiple Architectures: The models are based on different architectures, including HuBERT and Wav2Vec 2.0, providing diverse approaches to audio representation learning.
- General-Purpose: They are pre-trained on the full AudioSet dataset, making them suitable for a wide range of ARL tasks.
## 📦 Installation
No model-specific installation is required. Assuming the standard Hugging Face workflow, the checkpoints can be loaded with the `transformers` library (e.g. `pip install transformers torch torchaudio`), as illustrated in the usage example below.
## 💻 Usage Examples
The original card does not include code examples; the snippet below is a minimal sketch that assumes the checkpoints are loaded through the Hugging Face `transformers` Auto classes and expect 16 kHz mono input, as is standard for HuBERT and Wav2Vec 2.0 models. The audio file name is a placeholder.
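```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModel

# Any of the four checkpoints listed under "Models" below can be used here.
model_id = "ALM/hubert-base-audioset"

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

# "example.wav" is a placeholder; the 16 kHz target rate is an assumption
# based on the usual HuBERT/Wav2Vec 2.0 setup (check the feature extractor's
# `sampling_rate` attribute to confirm).
waveform, sample_rate = torchaudio.load("example.wav")
waveform = waveform.mean(dim=0)  # mix down to mono
waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)

inputs = feature_extractor(
    waveform.numpy(),
    sampling_rate=16_000,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level representations; average over time for a clip-level embedding.
frame_embeddings = outputs.last_hidden_state   # (batch, frames, hidden_size)
clip_embedding = frame_embeddings.mean(dim=1)  # (batch, hidden_size)
print(clip_embedding.shape)
```
Both the HuBERT and Wav2Vec 2.0 checkpoints expose the same `last_hidden_state` interface, so swapping `model_id` is enough to compare their representations.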
## 📚 Documentation
### 📋 Models
| Model | Description |
|-------|-------------|
| [ALM/hubert-base-audioset](https://huggingface.co/ALM/hubert-base-audioset) | HuBERT (Hubert-Base) transformer-based model, pre-trained on the full AudioSet dataset. |
| [ALM/hubert-large-audioset](https://huggingface.co/ALM/hubert-large-audioset) | HuBERT (Hubert-Large) transformer-based model. Larger than the base model, with increased capacity for capturing audio representations from the full AudioSet dataset. |
| [ALM/wav2vec2-base-audioset](https://huggingface.co/ALM/wav2vec2-base-audioset) | Wav2Vec 2.0 (Wav2Vec2-Base) transformer-based model, trained on the full AudioSet dataset with self-supervised learning (SSL) using contrastive predictive coding (CPC); it offers a different approach to audio representation learning than the HuBERT models. |
| [ALM/wav2vec2-large-audioset](https://huggingface.co/ALM/wav2vec2-large-audioset) | Wav2Vec 2.0 (Wav2Vec2-Large) transformer-based model. Similar to the base model but larger, with enhanced capacity for learning audio representations from the full AudioSet dataset. |
### 🎯 Intended Use
These pre-trained models are suitable for a wide range of ARL tasks, such as speech recognition, music classification, and acoustic event detection. They can be used for feature extraction or fine-tuned on task-specific datasets for downstream applications.
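As a minimal sketch of the fine-tuning route (not taken from the original card), the example below assumes the generic `transformers` audio-classification head; the checkpoint choice and label count are placeholders.
```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# Placeholder downstream task: a 10-class acoustic event classifier.
model_id = "ALM/wav2vec2-base-audioset"

feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id, num_labels=10)

# The pre-trained encoder is reused as-is; the randomly initialised
# classification head (and, optionally, the encoder itself) is then trained
# on task-specific (audio, label) pairs, e.g. with the Trainer API.
```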
### ⚠️ Important Note
While these models are versatile across various audio domains, their performance on speech-related tasks may be lower than that of specialized models such as the original Wav2Vec and HuBERT models. This is because the AudioSet dataset used for pre-training includes a wide range of audio sources beyond speech.
### ⚠️ Limitations and Considerations
- The models are pre-trained on the full AudioSet dataset, which may not comprehensively cover all possible audio domains.
- Fine-tuning on domain-specific data may be necessary to achieve optimal performance for certain tasks.
- Deploying and fine-tuning these models, especially the larger variants, can require substantial computational resources.
### 📚 Citation
If you use these pre-trained models in your work, please cite the following:

    @INPROCEEDINGS{ARCH,
      author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Cagliero, Luca and Garza, Paolo and Siniscalchi, Sabato Marco},
      booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
      title={Benchmarking Representations for Speech, Music, and Acoustic Events},
      year={2024},
      pages={505-509},
      keywords={Representation learning; Systematics; Conferences; Benchmark testing; Signal processing; Acoustics; Data models; Audio Representation Learning; Benchmark; Pre-trained Models; Self-Supervised Learning},
      doi={10.1109/ICASSPW62465.2024.10625960}
    }

arXiv version: [arXiv:2405.00934](https://arxiv.org/abs/2405.00934)
## 📄 License
The models are licensed under CC BY-NC-SA 4.0.