# Model Card: Pre-trained Audio Representation Models on AudioSet
This model card describes the pre-trained audio representation models released by ALM. These models are pre-trained on the full AudioSet dataset and are designed for general-purpose Audio Representation Learning (ARL) tasks.
## Quick Start
The pre-trained models introduced in this card can be used directly for feature extraction in various ARL tasks, or fine-tuned on task-specific datasets to suit your needs.
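As a minimal sketch, assuming these checkpoints follow the standard Hugging Face `transformers` API, any of them can be loaded by repository id (a fuller feature-extraction example appears under Usage Examples below):

```python
from transformers import AutoModel

# Load any of the four checkpoints by its Hugging Face repository id.
model = AutoModel.from_pretrained("ALM/hubert-base-audioset")
```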
## Features
- Multiple Model Architectures: Includes HuBERT- and Wav2Vec 2.0-based models, offering different options for audio representation learning.
- Pre-trained on AudioSet: The models are pre-trained on the full AudioSet dataset, making them suitable for a wide range of ARL tasks.
## Installation
The original card lists no specific installation steps. In practice these checkpoints load through the Hugging Face `transformers` library, so a typical setup is `pip install torch transformers`.
## Usage Examples
The original card includes no official code examples.
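The following is a minimal feature-extraction sketch, assuming the standard `transformers` audio API and 16 kHz mono input (the usual rate for HuBERT and Wav2Vec 2.0 checkpoints); the dummy waveform is a placeholder for real audio:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "ALM/hubert-base-audioset"  # any of the four checkpoints works here
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# One second of dummy 16 kHz mono audio; replace with a real waveform.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level features of shape (batch, frames, hidden_size).
frame_features = outputs.last_hidden_state
# One fixed-size clip embedding via mean pooling over the time axis.
clip_embedding = frame_features.mean(dim=1)
```

Mean pooling over time is one common way to turn frame-level features into a fixed-size clip embedding for downstream classifiers; other pooling strategies work as well.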
## Documentation
### Models
| Property | Details |
|---|---|
| Model Type | Four checkpoints: [ALM/hubert-base-audioset](https://huggingface.co/ALM/hubert-base-audioset), [ALM/hubert-large-audioset](https://huggingface.co/ALM/hubert-large-audioset), [ALM/wav2vec2-base-audioset](https://huggingface.co/ALM/wav2vec2-base-audioset), [ALM/wav2vec2-large-audioset](https://huggingface.co/ALM/wav2vec2-large-audioset) |
| Architecture | HuBERT (HuBERT-Base, HuBERT-Large) and Wav2Vec 2.0 (Wav2Vec2-Base, Wav2Vec2-Large) transformer-based models |
| Description | All models are pre-trained on the full AudioSet dataset. The large variants have greater capacity for capturing audio representations. The Wav2Vec 2.0 models are trained with self-supervised learning (SSL) using a contrastive predictive coding (CPC) objective, a different training approach from the HuBERT models. |
### Intended Use
These pre-trained models are suitable for a wide range of ARL tasks, such as speech recognition, music classification, and acoustic event detection. They can be used as feature extractors or fine-tuned on task-specific datasets for downstream applications. Because AudioSet is highly diverse, however, their performance on speech-specific tasks may be lower than that of specialized speech models.
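As an illustration of the fine-tuning path, the checkpoints should plug into the standard `transformers` audio-classification head; the checkpoint choice, `num_labels=10`, and the dummy batch below are placeholders, not details from the original card:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "ALM/wav2vec2-base-audioset"  # placeholder choice among the four
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
# num_labels is task-specific; 10 is an arbitrary example value. The new
# classification head is randomly initialized and learned during fine-tuning.
model = AutoModelForAudioClassification.from_pretrained(model_id, num_labels=10)

# One dummy 16 kHz clip with an integer class label; replace with real data.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([3])

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # one gradient step of a typical fine-tuning loop
```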
### Limitations and Considerations
- The models are pre-trained on the full AudioSet dataset, which may not comprehensively cover every audio domain.
- Fine-tuning on domain-specific data may be necessary to achieve optimal performance on certain tasks.
- Deploying and fine-tuning these models, especially the larger variants, can require substantial computational resources.
## Technical Details
The models are pre-trained on the full AudioSet dataset using two architectures, HuBERT and Wav2Vec 2.0; the Wav2Vec 2.0 models are trained with SSL using a CPC objective. The large variants have more capacity to capture audio representations from the dataset.
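Because different transformer layers capture different aspects of the audio, representation analysis often inspects every layer rather than only the last. A minimal sketch, assuming the standard `transformers` `output_hidden_states` flag (the checkpoint choice and dummy waveform are illustrative):

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel

model_id = "ALM/hubert-large-audioset"  # large variant: deeper and wider
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# One second of dummy 16 kHz mono audio; replace with a real waveform.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states holds the feature-encoder output plus one tensor per
# transformer layer, each shaped (batch, frames, hidden_size).
for i, layer in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(layer.shape)}")
```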
## License
These models are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license (cc-by-nc-sa-4.0).
## Citation
If you use these pre-trained models in your work, please cite the following:
@INPROCEEDINGS{ARCH,
  author={La Quatra, Moreno and Koudounas, Alkis and Vaiani, Lorenzo and Baralis, Elena and Cagliero, Luca and Garza, Paolo and Siniscalchi, Sabato Marco},
  booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)},
  title={Benchmarking Representations for Speech, Music, and Acoustic Events},
  year={2024},
  pages={505-509},
  keywords={Representation learning; Systematics; Conferences; Benchmark testing; Signal processing; Acoustics; Data models; Audio Representation Learning; Benchmark; Pre-trained Models; Self-Supervised Learning},
  doi={10.1109/ICASSPW62465.2024.10625960}
}
arXiv version: https://arxiv.org/abs/2405.00934