wav2vec2-base-audioset Open-source Audio Model - Completing AudioSet Pretraining to Boost Audio Representation Learning

Wav2vec2 Base Audioset

Developed by ALM

Audio representation learning model based on the Wav2Vec 2.0 architecture, pre-trained on the full AudioSet dataset

Audio Classification

Transformers

#General Audio Representation #Multimodal Pre-training #Self-supervised Learning

Downloads 2,191

Release Time: 9/5/2023

Model Overview

This model adopts the Wav2Vec 2.0 architecture and learns general-purpose audio features from the AudioSet dataset through self-supervised learning, making it suitable for a range of audio processing tasks.

Model Features

General Audio Representation

Capable of learning general feature representations from diverse audio content

Self-supervised Pre-training

Utilizes self-supervised learning for pre-training on the AudioSet dataset

Transformer Architecture

Built on the Wav2Vec 2.0 (Wav2Vec2-Base) Transformer architecture, used here as a feature extractor

Model Capabilities

Audio Feature Extraction

Music Classification

Acoustic Event Detection

Speech Recognition Assistance

Use Cases

Audio Analysis

Music Classification

Classify music clips by genre or mood

Environmental Sound Detection

Identify specific sound events in the environment (e.g., alarms, animal sounds)

Speech Processing

Speech Recognition Assistance

Serve as a front-end feature extractor for speech recognition systems

Performance may be lower than that of dedicated speech models

🚀 Model Card: Pre-trained Audio Representation Models on AudioSet

This model card describes the audio representation models released by ALM, which are pre-trained on the full AudioSet dataset for general-purpose Audio Representation Learning (ARL) tasks.

🚀 Quick Start

The models described in this card can be used for a variety of ARL tasks. You can access them through the Hugging Face links listed under Models and use them for feature extraction or for fine-tuning on specific datasets.
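As a rough illustration of the feature-extraction path, the sketch below loads one checkpoint with the Hugging Face Transformers library and mean-pools its last hidden states into a single embedding per clip. It assumes that `torch` and `transformers` are installed, that the repository ships a preprocessor configuration loadable with `AutoFeatureExtractor`, and that the input is mono audio at 16 kHz (the usual setting for Wav2Vec 2.0 and HuBERT); none of these points is stated in the model card. The `CHECKPOINTS` table and the `embed_clip` helper are hypothetical names introduced for this example.

```python
# Hub repository IDs taken from the model list in this card.
CHECKPOINTS = {
    "hubert-base": "ALM/hubert-base-audioset",
    "hubert-large": "ALM/hubert-large-audioset",
    "wav2vec2-base": "ALM/wav2vec2-base-audioset",
    "wav2vec2-large": "ALM/wav2vec2-large-audioset",
}


def embed_clip(waveform, variant="wav2vec2-base", sampling_rate=16000):
    """Return one mean-pooled embedding (1-D tensor) for a mono waveform.

    `waveform` is a 1-D sequence of float samples. The 16 kHz default is an
    assumption, not something the model card specifies.
    """
    # Heavy dependencies are imported lazily so the table above stays usable
    # without torch/transformers installed.
    import torch
    from transformers import AutoFeatureExtractor, AutoModel

    model_id = CHECKPOINTS[variant]
    # Assumption: the repository provides a preprocessor config for this call.
    feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    inputs = feature_extractor(
        waveform, sampling_rate=sampling_rate, return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, frames, hidden_size);
    # averaging over frames gives a single clip-level vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)
```

Calling `embed_clip(samples)` with a list or NumPy array of samples downloads the checkpoint on first use. Mean pooling is only one simple way to summarize a clip; frame-level features can be kept instead when a task needs temporal detail.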

✨ Features

These models are pre-trained on the full AudioSet dataset and are suitable for general-purpose ARL tasks.
They cover two architectures (HuBERT and Wav2Vec 2.0), offering multiple options for audio representation learning.
They can be fine-tuned on task-specific datasets for downstream applications.

📚 Documentation

Models

1. [ALM/hubert-base-audioset](https://huggingface.co/ALM/hubert-base-audioset)

Architecture: HuBERT (Hubert-Base) transformer-based model
Description: This model is based on the HuBERT architecture, pre-trained on the full AudioSet dataset.

2. [ALM/hubert-large-audioset](https://huggingface.co/ALM/hubert-large-audioset)

Architecture: HuBERT (Hubert-Large) transformer-based model
Description: Similar to the hubert-base-audioset model, this variant is larger in size, providing increased capacity for capturing audio representations from the full AudioSet dataset.

3. [ALM/wav2vec2-base-audioset](https://huggingface.co/ALM/wav2vec2-base-audioset)

Architecture: Wav2Vec 2.0 (Wav2Vec2-Base) transformer-based model
Description: This model is based on the Wav2Vec 2.0 architecture, trained on the full AudioSet dataset using self-supervised learning (SSL) with contrastive predictive coding (CPC). It offers a different approach to audio representation learning from that of the HuBERT models.

4. [ALM/wav2vec2-large-audioset](https://huggingface.co/ALM/wav2vec2-large-audioset)

Architecture: Wav2Vec 2.0 (Wav2Vec2-Large) transformer-based model
Description: Similar to the wav2vec2-base-audioset model, this variant is larger in size, providing enhanced capacity for learning audio representations from the full AudioSet dataset.

Intended Use

These pre-trained models are intended for a wide range of ARL tasks, including but not limited to speech recognition, music classification, and acoustic event detection. They can be used directly as feature extractors or fine-tuned on task-specific datasets for downstream applications.

Although these models are versatile across audio domains, their performance on speech-related tasks may be lower than that of specialized models such as the original Wav2Vec and HuBERT models. This is because the AudioSet dataset used for pre-training covers a wide range of audio sources beyond speech.
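The fine-tuning path mentioned above could look roughly like the following sketch, which attaches a freshly initialized classification head to one checkpoint through `AutoModelForAudioClassification` from the Transformers library. This is one plausible setup rather than a documented recipe from the model authors: the helper names `build_label_maps` and `load_classifier` and the genre labels are hypothetical, and freezing the convolutional feature encoder is a common but optional choice.

```python
def build_label_maps(labels):
    """Map class names to integer ids and back, in sorted order."""
    names = sorted(set(labels))
    label2id = {name: index for index, name in enumerate(names)}
    id2label = {index: name for name, index in label2id.items()}
    return label2id, id2label


def load_classifier(labels, model_id="ALM/wav2vec2-base-audioset"):
    """Load a checkpoint with a randomly initialized classification head."""
    # Imported lazily; requires the transformers package (and torch).
    from transformers import AutoModelForAudioClassification

    label2id, id2label = build_label_maps(labels)
    model = AutoModelForAudioClassification.from_pretrained(
        model_id,
        num_labels=len(label2id),
        label2id=label2id,
        id2label=id2label,
    )
    # Optional: keep the convolutional waveform encoder frozen and train
    # only the Transformer layers and the new head.
    model.freeze_feature_encoder()
    return model


# Hypothetical label set for a music-genre task.
GENRES = ["rock", "jazz", "rock", "classical"]
LABEL2ID, ID2LABEL = build_label_maps(GENRES)
```

The returned model can then be trained with an ordinary PyTorch loop or the Transformers `Trainer` on a labeled, task-specific dataset.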

Limitations and Considerations

The models are pre-trained on the full AudioSet dataset, which may not cover all possible audio domains comprehensively.
Fine-tuning on domain-specific data may be necessary to achieve optimal performance for certain tasks.
Deploying and fine-tuning these models, especially the larger variants, may require substantial computational resources.

🔧 Technical Details

All models are pre-trained on the AudioSet dataset, using two architectures (HuBERT and Wav2Vec 2.0) to learn audio representations. The HuBERT models follow the HuBERT architecture, and the Wav2Vec 2.0 models are trained on AudioSet using self-supervised learning (SSL) with contrastive predictive coding (CPC).
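To make the contrastive idea behind CPC-style pre-training more concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss: a context vector is scored against the true target and a set of distractors by cosine similarity, and the loss is the negative log-probability of picking the true target. It is a generic illustration, not the authors' training code. The function names and the temperature value are assumptions, and the actual Wav2Vec 2.0 objective also involves masking, quantized targets, and a diversity term that are omitted here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss: -log softmax score of the true target.

    The temperature of 0.1 is an illustrative value, not a documented one.
    """
    # Index 0 holds the true target; the rest are distractors.
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]

    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# The loss is small when the context matches the target and large when it
# matches a distractor instead.
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this quantity pushes the context representation toward its true target and away from the distractors, which is the sense in which such training is self-supervised: the targets come from the audio itself rather than from labels.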

📄 License

The models are released under the CC-BY-NC-SA-4.0 license.

📄 Citation

If you use these pre-trained models in your work, please cite the following:

@INPROCEEDINGS{ARCH,
  author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Cagliero, Luca and Garza, Paolo and Siniscalchi, Sabato Marco},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)}, 
  title={Benchmarking Representations for Speech, Music, and Acoustic Events}, 
  year={2024},
  pages={505-509},
  keywords={Representation learning; Systematics; Conferences; Benchmark testing; Signal processing; Acoustics; Data models; Audio Representation Learning; Benchmark; Pre-trained Models; Self-Supervised Learning},
  doi={10.1109/ICASSPW62465.2024.10625960}
}

arXiv version: arxiv.org/abs/2405.00934
