# Model Card: Pre-trained Audio Representation Models on AudioSet
This model card provides information about pre-trained audio representation models released by ALM. These models are pre-trained on the full AudioSet dataset and are suitable for general-purpose Audio Representation Learning (ARL) tasks.
## Quick Start

The pre-trained models presented in this card are ready to use for a variety of ARL tasks. Start by accessing the models through the provided Hugging Face links, then fine-tune them for your specific requirements.
## Features
- Multiple Architectures: The models are based on different transformer architectures, HuBERT and Wav2Vec 2.0, offering diverse approaches to audio representation learning.
- General-Purpose: Trained on the full AudioSet dataset, these models can be applied to a wide range of ARL tasks.
## Installation

No dedicated installation is required: the models load directly from the Hugging Face Hub via the `transformers` library (`pip install transformers`).
## Usage Examples

The models follow the standard Hugging Face `transformers` loading conventions.
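As a minimal sketch, embeddings can be extracted as follows. This assumes the checkpoints expose the standard `transformers` auto classes and a 16 kHz feature extractor; verify against the linked model pages before relying on it.

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Assumed: the checkpoint supports the generic auto classes (standard for
# HuBERT-style models on the Hub).
model_id = "ALM/hubert-base-audioset"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

# One second of silence at 16 kHz stands in for a real waveform.
waveform = torch.zeros(16000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

frame_embeddings = outputs.last_hidden_state   # (batch, frames, hidden_size)
clip_embedding = frame_embeddings.mean(dim=1)  # mean-pool frames into one clip vector
print(frame_embeddings.shape, clip_embedding.shape)
```

Mean pooling over frames is one common way to obtain a single clip-level vector; frame-level features can be used directly for sequence tasks.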
## Documentation

### Models
- [ALM/hubert-base-audioset](https://huggingface.co/ALM/hubert-base-audioset)
  - Architecture: HuBERT (Hubert-Base) transformer-based model
  - Description: This model is based on the HuBERT architecture and pre-trained on the full AudioSet dataset.
- [ALM/hubert-large-audioset](https://huggingface.co/ALM/hubert-large-audioset)
  - Architecture: HuBERT (Hubert-Large) transformer-based model
  - Description: A larger variant of hubert-base-audioset, providing increased capacity for capturing audio representations from the full AudioSet dataset.
- [ALM/wav2vec2-base-audioset](https://huggingface.co/ALM/wav2vec2-base-audioset)
  - Architecture: Wav2Vec 2.0 (Wav2Vec2-Base) transformer-based model
  - Description: This model is based on the Wav2Vec 2.0 architecture and trained on the full AudioSet dataset with self-supervised learning (SSL) using contrastive predictive coding (CPC), offering a different approach to audio representation learning than the HuBERT models.
- [ALM/wav2vec2-large-audioset](https://huggingface.co/ALM/wav2vec2-large-audioset)
  - Architecture: Wav2Vec 2.0 (Wav2Vec2-Large) transformer-based model
  - Description: A larger variant of wav2vec2-base-audioset, providing enhanced capacity for learning audio representations from the full AudioSet dataset.
### Intended Use

These pre-trained models are intended for a wide range of ARL tasks, including but not limited to speech recognition, music classification, and acoustic event detection. They serve as powerful tools for feature extraction and can be fine-tuned on task-specific datasets for downstream applications.
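For downstream fine-tuning, one option is to attach a classification head via the standard `transformers` API. The sketch below assumes a hypothetical 10-class task; the head weights are freshly initialised and must be trained on your data.

```python
# Sketch: wiring a pre-trained checkpoint into an audio classification head.
# The 10-class setup is hypothetical and stands in for any downstream task.
from transformers import AutoModelForAudioClassification

model = AutoModelForAudioClassification.from_pretrained(
    "ALM/wav2vec2-base-audioset",
    num_labels=10,  # hypothetical number of downstream classes
)

# Freeze the convolutional feature encoder so only the transformer layers
# and the new classification head receive gradient updates.
model.freeze_feature_encoder()
print(model.config.num_labels)
```

Freezing the feature encoder is a common choice when the downstream dataset is small; for larger datasets, full fine-tuning may perform better.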
### Important Note

While these models offer versatility across various audio domains, their performance on speech-related tasks may be lower than that of specialized models such as the original Wav2Vec and HuBERT models. This is due to the diverse nature of the AudioSet pre-training data, which spans a wide range of audio sources beyond speech.
### Limitations and Considerations

- The models are pre-trained on the full AudioSet dataset, which may not cover all audio domains comprehensively.
- Fine-tuning on domain-specific data may be necessary to achieve optimal performance on certain tasks.
- Deploying and fine-tuning these models, especially the larger variants, may require substantial computational resources.
## Technical Details

All models are pre-trained on the full AudioSet dataset. Two architectures (HuBERT and Wav2Vec 2.0) are used, each with its own approach to audio representation learning. The larger variants have increased capacity for capturing audio representations but require more computational resources for deployment and fine-tuning.
## License

The models are released under the CC-BY-NC-SA 4.0 license.
## Citation

If you use these pre-trained models in your work, please cite the following paper:
```bibtex
@INPROCEEDINGS{ARCH,
  author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Cagliero, Luca and Garza, Paolo and Siniscalchi, Sabato Marco},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
  title={Benchmarking Representations for Speech, Music, and Acoustic Events},
  year={2024},
  pages={505-509},
  keywords={Representation learning; Systematics; Conferences; Benchmark testing; Signal processing; Acoustics; Data models; Audio Representation Learning; Benchmark; Pre-trained Models; Self-Supervised Learning},
  doi={10.1109/ICASSPW62465.2024.10625960}
}
```
arXiv version: [arXiv:2405.00934](https://arxiv.org/abs/2405.00934)