🚀 voc2vec-hubert-ls-pt
voc2vec is a foundation model built specifically for non-verbal human vocalizations. It addresses the need for models that understand non-verbal cues by pre-training a HuBERT-like model on a collection of 10 datasets comprising approximately 125 hours of non-verbal audio.
🚀 Quick Start
You can use the model directly as follows:
Basic Usage
```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load the audio at the 16 kHz sampling rate the model expects.
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")

# Extract padded input features and run a forward pass without gradients.
inputs = feature_extractor(audio_array, sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

with torch.inference_mode():
    logits = model(**inputs).logits
```
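This checkpoint is a pre-trained backbone, so for downstream tasks you typically fine-tune it or use it as a feature extractor. Below is a minimal sketch of pulling clip-level embeddings with AutoModel, assuming the same checkpoint id and 16 kHz mono input:

```python
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModel

# Load the backbone without a classification head.
model = AutoModel.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")

audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)
inputs = feature_extractor(audio_array, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

with torch.inference_mode():
    outputs = model(**inputs)

# last_hidden_state: (batch, frames, hidden_size); mean-pool over time
# to obtain one fixed-size embedding per clip.
clip_embedding = outputs.last_hidden_state.mean(dim=1)
```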
✨ Features
Model description
voc2vec-hubert-ls-pt is built upon the HuBERT framework and adheres to its pre-training setup. The pre-training datasets include: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, and VocalSound. The model continues pre-training from a HuBERT model initially trained on the LibriSpeech dataset.
Task and datasets description
We evaluate voc2vec-hubert-ls-pt on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, and VIVAE. It is currently the best-performing released model in the voc2vec collection.
The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above.
| Model | Architecture | Pre-training Data | UAR | F1 Macro |
|:------|:-------------|:------------------|:----|:---------|
| voc2vec | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
| voc2vec-as-pt | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
| voc2vec-ls-pt | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
| voc2vec-hubert-ls-pt | HuBERT | LibriSpeech + Voc125 | .696±.189 | .678±.200 |
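For reference, UAR (Unweighted Average Recall) is recall averaged uniformly over classes, i.e., macro-averaged recall. A minimal sketch of both metrics with scikit-learn, using hypothetical label arrays:

```python
from sklearn.metrics import f1_score, recall_score

# Hypothetical gold labels and model predictions for a 3-class task.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

# UAR == recall averaged uniformly over classes (macro recall).
uar = recall_score(y_true, y_pred, average="macro")
f1_macro = f1_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.3f}, F1 Macro: {f1_macro:.3f}")
```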
Available Models
| Model | Description | Link |
|:------|:------------|:-----|
| voc2vec | Pre-trained model on 125 hours of non-verbal audio. | 🔗 Model |
| voc2vec-as-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the AudioSet dataset. | 🔗 Model |
| voc2vec-ls-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
| voc2vec-hubert-ls-pt | Continues pre-training from a HuBERT-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
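All checkpoints expose the same transformers interface, so swapping backbones is a one-line change. A quick sketch, assuming the sibling repositories live under the same alkiskoudounas/ namespace as this model:

```python
from transformers import AutoModel

# Repo ids assumed from the model names above; verify them on the Hub.
for repo_id in [
    "alkiskoudounas/voc2vec",
    "alkiskoudounas/voc2vec-as-pt",
    "alkiskoudounas/voc2vec-ls-pt",
    "alkiskoudounas/voc2vec-hubert-ls-pt",
]:
    model = AutoModel.from_pretrained(repo_id)
    print(repo_id, model.config.model_type)  # "wav2vec2" or "hubert"
```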
📄 License
This project is licensed under the Apache-2.0 license.
📚 Documentation
BibTeX entry and citation info
@INPROCEEDINGS{koudounas2025icassp,
author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
doi={10.1109/ICASSP49660.2025.10890672}}