# Model Card: Pre-trained Audio Representation Models on AudioSet
This model card describes the pre-trained audio representation models released by ALM. These models are pre-trained on the full AudioSet dataset and are designed for general-purpose Audio Representation Learning (ARL) tasks.
## Quick Start
The pre-trained models introduced in this card can be used directly for feature extraction in various ARL tasks, or fine-tuned on task-specific datasets to suit your needs.
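As a minimal sketch, assuming these checkpoints follow the standard Hugging Face `transformers` API, any of them can be loaded by repository id (a fuller feature-extraction example appears under Usage Examples below):

```python
from transformers import AutoModel

# Load any of the four checkpoints by its Hugging Face repository id.
model = AutoModel.from_pretrained("ALM/hubert-base-audioset")
```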
## Features
- Multiple Model Architectures: Includes HuBERT- and Wav2Vec 2.0-based models, offering different options for audio representation learning.
- Pre-trained on AudioSet: The models are pre-trained on the full AudioSet dataset, making them suitable for a wide range of ARL tasks.
## Installation
The original card lists no specific installation steps. In practice these checkpoints load through the Hugging Face `transformers` library, so a typical setup is `pip install torch transformers`.
## Usage Examples
The original card includes no official code examples.
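The following is a minimal feature-extraction sketch, assuming the standard `transformers` audio API and 16 kHz mono input (the usual rate for HuBERT and Wav2Vec 2.0 checkpoints); the dummy waveform is a placeholder for real audio:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "ALM/hubert-base-audioset"  # any of the four checkpoints works here
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# One second of dummy 16 kHz mono audio; replace with a real waveform.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level features of shape (batch, frames, hidden_size).
frame_features = outputs.last_hidden_state
# One fixed-size clip embedding via mean pooling over the time axis.
clip_embedding = frame_features.mean(dim=1)
```

Mean pooling over time is one common way to turn frame-level features into a fixed-size clip embedding for downstream classifiers; other pooling strategies work as well.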
## Documentation
### Models
| Property | Details |
|---|---|
| Model Type | Four checkpoints: [ALM/hubert-base-audioset](https://huggingface.co/ALM/hubert-base-audioset), [ALM/hubert-large-audioset](https://huggingface.co/ALM/hubert-large-audioset), [ALM/wav2vec2-base-audioset](https://huggingface.co/ALM/wav2vec2-base-audioset), [ALM/wav2vec2-large-audioset](https://huggingface.co/ALM/wav2vec2-large-audioset) |
| Architecture | HuBERT (HuBERT-Base, HuBERT-Large) and Wav2Vec 2.0 (Wav2Vec2-Base, Wav2Vec2-Large) transformer-based models |
| Description | All models are pre-trained on the full AudioSet dataset. The large variants have greater capacity for capturing audio representations. The Wav2Vec 2.0 models are trained with self-supervised learning (SSL) using a contrastive predictive coding (CPC) objective, a different training approach from the HuBERT models. |
### Intended Use
These pre-trained models are suitable for a wide range of ARL tasks, such as speech recognition, music classification, and acoustic event detection. They can be used as feature extractors or fine-tuned on task-specific datasets for downstream applications. Because AudioSet is highly diverse, however, their performance on speech-specific tasks may be lower than that of specialized speech models.
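As an illustration of the fine-tuning path, the checkpoints should plug into the standard `transformers` audio-classification head; the checkpoint choice, `num_labels=10`, and the dummy batch below are placeholders, not details from the original card:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "ALM/wav2vec2-base-audioset"  # placeholder choice among the four
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
# num_labels is task-specific; 10 is an arbitrary example value. The new
# classification head is randomly initialized and learned during fine-tuning.
model = AutoModelForAudioClassification.from_pretrained(model_id, num_labels=10)

# One dummy 16 kHz clip with an integer class label; replace with real data.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([3])

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one gradient step of a typical fine-tuning loop
```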
### Limitations and Considerations
- The models are pre-trained on the full AudioSet dataset, which may not comprehensively cover every audio domain.
- Fine-tuning on domain-specific data may be necessary to achieve optimal performance on certain tasks.
- Deploying and fine-tuning these models, especially the larger variants, can require substantial computational resources.
## Technical Details
The models are pre-trained on the full AudioSet dataset using two architectures, HuBERT and Wav2Vec 2.0; the Wav2Vec 2.0 models are trained with SSL using a CPC objective. The large variants have more capacity to capture audio representations from the dataset.
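Because different transformer layers capture different aspects of the audio, representation analysis often inspects every layer rather than only the last. A minimal sketch, assuming the standard `transformers` `output_hidden_states` flag (the checkpoint choice and dummy waveform are illustrative):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "ALM/hubert-large-audioset"  # large variant: deeper and wider
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# One second of dummy 16 kHz mono audio; replace with a real waveform.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states holds the feature-encoder output plus one tensor per
# transformer layer, each shaped (batch, frames, hidden_size).
for i, layer in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(layer.shape)}")
```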
## License
These models are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license (cc-by-nc-sa-4.0).
## Citation
If you use these pre-trained models in your work, please cite the following:
@INPROCEEDINGS{ARCH,
  author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Cagliero, Luca and Garza, Paolo and Siniscalchi, Sabato Marco},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
  title={Benchmarking Representations for Speech, Music, and Acoustic Events},
  year={2024},
  pages={505-509},
  keywords={Representation learning; Systematics; Conferences; Benchmark testing; Signal processing; Acoustics; Data models; Audio Representation Learning; Benchmark; Pre-trained Models; Self-Supervised Learning},
  doi={10.1109/ICASSPW62465.2024.10625960}
}
arXiv version: https://arxiv.org/abs/2405.00934