voc2vec: Open-Source Foundation Model for Non-Verbal Human Audio, Pre-trained on Approximately 125 Hours of Non-Verbal Audio

Developed by alkiskoudounas

voc2vec is a foundation model for non-verbal human audio. It is built on the wav2vec 2.0 framework and pre-trained on approximately 125 hours of non-verbal audio.

Task: Audio Classification

Library: Transformers

Language: English

License: Apache-2.0 (open source)

Tags: #Non-linguistic audio classification #Infant cry detection #Self-supervised pretraining

Downloads 223

Release Time : 2/6/2025

Model Overview

voc2vec is a foundation model for non-verbal human audio. It is used primarily for audio classification, in particular the classification and analysis of non-verbal vocalizations such as infant cries.

Model Features

Non-linguistic vocalization classification

Specifically designed for non-linguistic human audio data, such as infant cries, laughter, etc.

Multi-dataset pretraining

Pretrained using a collection of 10 different datasets, covering approximately 125 hours of non-linguistic audio.

Multiple model variants

Provides variants that continue pre-training from models initially trained on AudioSet or LibriSpeech, including one variant based on the HuBERT architecture.

Model Capabilities

Non-linguistic vocalization classification

Audio feature extraction

Infant cry recognition

Use Cases

Healthcare

Infant cry analysis

Used to analyze infant cries, helping to identify the needs or health status of infants.

Evaluated on the Donate a Cry dataset, one of the six evaluation datasets.

Speech research

Non-linguistic vocalization research

Used to study the characteristics and patterns of human non-linguistic vocalizations.

Evaluated on multiple non-linguistic vocalization datasets.

🚀 voc2vec

voc2vec is a foundation model tailored to non-verbal human data and intended for audio classification tasks.

🚀 Quick Start

voc2vec is a foundation model specifically designed for non-verbal human data.

We assembled a collection of 10 datasets covering around 125 hours of non-verbal audio and used it to pre-train a Wav2Vec2-like model.

✨ Features

Specifically designed for non-verbal human data.
Built upon the wav2vec 2.0 framework.
Evaluated on six datasets to demonstrate its performance.

📚 Documentation

Model description

voc2vec is built upon the wav2vec 2.0 framework and follows its pre-training setup. The pre-training datasets include: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, VocalSound.

Task and datasets description

We evaluate voc2vec on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, VIVAE.

The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above.

Model	Architecture	Pre-training dataset	UAR	F1 Macro
voc2vec	wav2vec 2.0	Voc125	.612±.212	.580±.230
voc2vec-as-pt	wav2vec 2.0	AudioSet + Voc125	.603±.183	.574±.194
voc2vec-ls-pt	wav2vec 2.0	LibriSpeech + Voc125	.661±.206	.636±.223
voc2vec-hubert-ls-pt	HuBERT	LibriSpeech + Voc125	.696±.189	.678±.200
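
The two metrics in the table can be illustrated with a small, dependency-free sketch. UAR is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has; macro F1 is the mean of the per-class F1 scores. The function names and the toy labels below are illustrative assumptions and are not taken from the voc2vec evaluation code.

```python
from collections import defaultdict
from statistics import mean


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fn[t] += 1  # the true class was missed
            fp[p] += 1  # the predicted class was wrongly assigned
    return tp, fp, fn


def uar(y_true, y_pred):
    """Unweighted Average Recall: mean recall over the classes present in y_true."""
    tp, _, fn = per_class_counts(y_true, y_pred)
    classes = sorted(set(y_true))
    return mean(tp[c] / (tp[c] + fn[c]) for c in classes)


def f1_macro(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 over all observed classes."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return mean(scores)


# Toy labels (hypothetical), chosen so that the classes are imbalanced.
y_true = ["cry", "cry", "cry", "laugh", "laugh", "cough"]
y_pred = ["cry", "cry", "laugh", "laugh", "laugh", "laugh"]

print(f"UAR:      {uar(y_true, y_pred):.3f}")
print(f"F1 macro: {f1_macro(y_true, y_pred):.3f}")
```

In this toy case the single "cough" sample is misclassified, which pulls both scores down sharply even though four of the six predictions are correct. That sensitivity to minority classes is why these metrics suit imbalanced vocalization datasets.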

Available Models

Model	Description
voc2vec	Pre-trained model on 125 hours of non-verbal audio.
voc2vec-as-pt	Continues pre-training from a wav2vec2-like model that was initially trained on the AudioSet dataset.
voc2vec-ls-pt	Continues pre-training from a wav2vec2-like model that was initially trained on the LibriSpeech dataset.
voc2vec-hubert-ls-pt	Continues pre-training from a hubert-like model that was initially trained on the LibriSpeech dataset.

💻 Usage Examples

Basic Usage

import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load an audio file, resampled to 16 kHz
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec")

# Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

# Compute logits (no gradients are needed for inference)
# Note: if the checkpoint ships without a fine-tuned classification head,
# the head is randomly initialized and the logits only become meaningful after fine-tuning.
with torch.no_grad():
    logits = model(**inputs).logits
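
As a possible follow-up to the example above, the logits can be converted into a predicted label and a confidence score. The sketch below uses plain Python lists so that it runs without downloading a model; `ID2LABEL` and the logit values are hypothetical. With a fine-tuned checkpoint, the inputs would presumably come from `logits[0].tolist()` and `model.config.id2label` instead.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, id2label):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical label map and logits for a single clip (assumptions for illustration).
ID2LABEL = {0: "cry", 1: "laugh", 2: "cough"}
example_logits = [0.2, 2.1, -1.0]

label, confidence = predict(example_logits, ID2LABEL)
print(f"{label} ({confidence:.2f})")
```

The predictions are only meaningful once the classification head has been fine-tuned on a labeled dataset.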

📄 License

This project is licensed under the Apache 2.0 license.

📚 BibTeX entry and citation info

@INPROCEEDINGS{koudounas2025icassp,
  author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
  year={2025},
  volume={},
  number={},
  pages={1-5},
  keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
  doi={10.1109/ICASSP49660.2025.10890672}}
