# 🚀 voc2vec-as-pt
voc2vec is a foundation model designed specifically for non-verbal human audio. We pre-trained a wav2vec 2.0-like model on a collection of 10 datasets covering roughly 125 hours of non-verbal audio.
## ✨ Features

### Model description
voc2vec is built on the wav2vec 2.0 framework and follows its pre-training setup. The pre-training datasets include: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, and VocalSound. This checkpoint (voc2vec-as-pt) continues pre-training from a model that was initially trained on the AudioSet dataset.
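Since the model follows the standard wav2vec 2.0 architecture, its representations can be extracted with the regular Transformers API. A minimal sketch, using a dummy 1-second waveform as a stand-in for a real vocalization clip:

```python
import torch
from transformers import AutoModel, AutoFeatureExtractor

# Load the pre-trained backbone (no classification head) for feature extraction.
model = AutoModel.from_pretrained("alkiskoudounas/voc2vec-as-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-as-pt")

# Dummy 1-second clip at 16 kHz; replace with a real waveform.
waveform = torch.zeros(16000)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, frames, hidden_size)
print(hidden_states.shape)
```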
### Task and datasets description
We evaluate voc2vec-as-pt on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, and VIVAE.
The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above.
| Model | Architecture | Pre-training DS | UAR | F1 Macro |
|-------|--------------|-----------------|-----|----------|
| voc2vec | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
| voc2vec-as-pt | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
| voc2vec-ls-pt | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
| voc2vec-hubert-ls-pt | HuBERT | LibriSpeech + Voc125 | .696±.189 | .678±.200 |
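For reference, UAR is the unweighted (macro) average of per-class recall, so every class counts equally regardless of its frequency. A quick sketch of both reported metrics with scikit-learn on toy labels:

```python
from sklearn.metrics import f1_score, recall_score

# Toy ground-truth and predicted labels for a 3-class task.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# UAR is macro-averaged recall: per-class recall, averaged without class weighting.
uar = recall_score(y_true, y_pred, average="macro")
f1_macro = f1_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.3f}, F1 Macro: {f1_macro:.3f}")
```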
| Property | Details |
|----------|---------|
| Model Type | Audio Classification |
| Training Data | AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, VocalSound |

## Available Models
| Model | Description | Link |
|-------|-------------|------|
| voc2vec | Pre-trained model on 125 hours of non-verbal audio. | 🔗 Model |
| voc2vec-as-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the AudioSet dataset. | 🔗 Model |
| voc2vec-ls-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
| voc2vec-hubert-ls-pt | Continues pre-training from a HuBERT-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
## 💻 Usage Examples

### Basic Usage
```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load a 16 kHz waveform; the feature extractor expects this sampling rate.
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec-as-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-as-pt")

inputs = feature_extractor(audio_array, sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
```
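Note that this checkpoint is a pre-trained backbone: when loaded through `AutoModelForAudioClassification`, the classification head is freshly initialized and should be fine-tuned on labeled data before the logits are meaningful. A minimal single-step fine-tuning sketch, where the 4-class setup and the random audio are placeholders for a real labeled dataset:

```python
import torch
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Hypothetical 4-class downstream task; swap in your own label set.
model = AutoModelForAudioClassification.from_pretrained(
    "alkiskoudounas/voc2vec-as-pt",
    num_labels=4,
)
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-as-pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One dummy training step on random 1-second clips; replace with real data.
waveforms = [torch.randn(16000).numpy() for _ in range(2)]
labels = torch.tensor([0, 3])
inputs = feature_extractor(waveforms, sampling_rate=16000, padding=True, return_tensors="pt")

loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {loss.item():.4f}")
```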
## 📚 Documentation

### BibTeX entry and citation info
```bibtex
@INPROCEEDINGS{koudounas2025icassp,
  author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
  booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
  year={2025},
  pages={1-5},
  keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
  doi={10.1109/ICASSP49660.2025.10890672}
}
```
## 📄 License
This project is licensed under the Apache-2.0 license.