# 🚀 voc2vec-as-pt
voc2vec is a foundation model designed specifically for non-verbal human audio. We pre-trained a wav2vec 2.0-like model on a collection of 10 datasets covering roughly 125 hours of non-verbal audio.
## ✨ Features

### Model description
voc2vec is built on the wav2vec 2.0 framework and follows its pre-training setup. The pre-training datasets include: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, and VocalSound. This checkpoint (voc2vec-as-pt) continues pre-training from a model that was initially trained on the AudioSet dataset.
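Since the model follows the standard wav2vec 2.0 architecture, its representations can be extracted with the regular Transformers API. A minimal sketch, using a dummy 1-second waveform as a stand-in for a real vocalization clip:

```python
import torch
from transformers import AutoModel, AutoFeatureExtractor

# Load the pre-trained backbone (no classification head) for feature extraction.
model = AutoModel.from_pretrained("alkiskoudounas/voc2vec-as-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-as-pt")

# Dummy 1-second clip at 16 kHz; replace with a real waveform.
waveform = torch.zeros(16000)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
print(hidden_states.shape)
```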
### Task and datasets description
We evaluate voc2vec-as-pt on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, and VIVAE.
The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above.
| Model | Architecture | Pre-training DS | UAR | F1 Macro |
|-------|--------------|-----------------|-----|----------|
| voc2vec | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
| voc2vec-as-pt | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
| voc2vec-ls-pt | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
| voc2vec-hubert-ls-pt | HuBERT | LibriSpeech + Voc125 | .696±.189 | .678±.200 |
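For reference, UAR is the unweighted (macro) average of per-class recall, so every class counts equally regardless of its frequency. A quick sketch of both reported metrics with scikit-learn on toy labels:

```python
from sklearn.metrics import f1_score, recall_score

# Toy ground-truth and predicted labels for a 3-class task.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# UAR is macro-averaged recall: per-class recall, averaged without class weighting.
uar = recall_score(y_true, y_pred, average="macro")
f1_macro = f1_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.3f}, F1 Macro: {f1_macro:.3f}")
```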
| Property | Details |
|----------|---------|
| Model Type | Audio Classification |
| Training Data | AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, VocalSound |

## Available Models
| Model | Description | Link |
|-------|-------------|------|
| voc2vec | Pre-trained model on 125 hours of non-verbal audio. | 🔗 Model |
| voc2vec-as-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the AudioSet dataset. | 🔗 Model |
| voc2vec-ls-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
| voc2vec-hubert-ls-pt | Continues pre-training from a HuBERT-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
## 💻 Usage Examples

### Basic Usage
```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load a 16 kHz waveform; the feature extractor expects this sampling rate.
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec-as-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-as-pt")

inputs = feature_extractor(audio_array, sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
```
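Note that this checkpoint is a pre-trained backbone: when loaded through `AutoModelForAudioClassification`, the classification head is freshly initialized and should be fine-tuned on labeled data before the logits are meaningful. A minimal single-step fine-tuning sketch, where the 4-class setup and the random audio are placeholders for a real labeled dataset:

```python
import torch
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Hypothetical 4-class downstream task; swap in your own label set.
model = AutoModelForAudioClassification.from_pretrained(
    "alkiskoudounas/voc2vec-as-pt",
    num_labels=4,
)
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-as-pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One dummy training step on random 1-second clips; replace with real data.
waveforms = [torch.randn(16000).numpy() for _ in range(2)]
labels = torch.tensor([0, 3])
inputs = feature_extractor(waveforms, sampling_rate=16000, padding=True, return_tensors="pt")

loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {loss.item():.4f}")
```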
## 📚 Documentation

### BibTeX entry and citation info
```bibtex
@INPROCEEDINGS{koudounas2025icassp,
  author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
  year={2025},
  pages={1-5},
  keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
  doi={10.1109/ICASSP49660.2025.10890672}
}
```
## 📄 License
This project is licensed under the Apache-2.0 license.