🚀 voc2vec-hubert-ls-pt
voc2vec is a foundation model built specifically for non-verbal human vocalizations. It addresses the need for models that understand non-verbal cues by pre-training a HuBERT-like model on a collection of 10 datasets comprising approximately 125 hours of non-verbal audio.
🚀 Quick Start
You can use the model directly as follows:
Basic Usage
```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load the audio at the 16 kHz sampling rate the model expects.
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")

# Extract padded input features and run a forward pass without gradients.
inputs = feature_extractor(audio_array, sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

with torch.inference_mode():
    logits = model(**inputs).logits
```
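This checkpoint is a pre-trained backbone, so for downstream tasks you typically fine-tune it or use it as a feature extractor. Below is a minimal sketch of pulling clip-level embeddings with AutoModel, assuming the same checkpoint id and 16 kHz mono input:

```python
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModel

# Load the backbone without a classification head.
model = AutoModel.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")
feature_extractor = AutoFeatureExtractor.from_pretrained("alkiskoudounas/voc2vec-hubert-ls-pt")

audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)
inputs = feature_extractor(audio_array, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

with torch.inference_mode():
    outputs = model(**inputs)

# last_hidden_state: (batch, frames, hidden_size); mean-pool over time
# to obtain one fixed-size embedding per clip.
clip_embedding = outputs.last_hidden_state.mean(dim=1)
```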
✨ Features
Model description
voc2vec-hubert-ls-pt is built upon the HuBERT framework and adheres to its pre-training setup. The pre-training datasets include: AudioSet (vocalization), FreeSound (babies), HumanVoiceDataset, NNIME, NonSpeech7K, ReCANVo, SingingDatabase, TUT (babies), VocalSketch, and VocalSound. The model continues pre-training from a HuBERT model initially trained on the LibriSpeech dataset.
Task and datasets description
We evaluate voc2vec-hubert-ls-pt on six datasets: ASVP-ESD, ASVP-ESD (babies), CNVVE, NonVerbal Vocalization Dataset, Donate a Cry, and VIVAE. It is currently the best-performing released model in the voc2vec collection.
The following table reports the average performance in terms of Unweighted Average Recall (UAR) and F1 Macro across the six datasets described above.
| Model | Architecture | Pre-training Data | UAR | F1 Macro |
|:------|:-------------|:------------------|:----|:---------|
| voc2vec | wav2vec 2.0 | Voc125 | .612±.212 | .580±.230 |
| voc2vec-as-pt | wav2vec 2.0 | AudioSet + Voc125 | .603±.183 | .574±.194 |
| voc2vec-ls-pt | wav2vec 2.0 | LibriSpeech + Voc125 | .661±.206 | .636±.223 |
| voc2vec-hubert-ls-pt | HuBERT | LibriSpeech + Voc125 | .696±.189 | .678±.200 |
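For reference, UAR (Unweighted Average Recall) is recall averaged uniformly over classes, i.e., macro-averaged recall. A minimal sketch of both metrics with scikit-learn, using hypothetical label arrays:

```python
from sklearn.metrics import f1_score, recall_score

# Hypothetical gold labels and model predictions for a 3-class task.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 1, 1, 2, 2, 0]

# UAR == recall averaged uniformly over classes (macro recall).
uar = recall_score(y_true, y_pred, average="macro")
f1_macro = f1_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.3f}, F1 Macro: {f1_macro:.3f}")
```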
Available Models
| Model | Description | Link |
|:------|:------------|:-----|
| voc2vec | Pre-trained model on 125 hours of non-verbal audio. | 🔗 Model |
| voc2vec-as-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the AudioSet dataset. | 🔗 Model |
| voc2vec-ls-pt | Continues pre-training from a wav2vec2-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
| voc2vec-hubert-ls-pt | Continues pre-training from a HuBERT-like model that was initially trained on the LibriSpeech dataset. | 🔗 Model |
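All checkpoints expose the same transformers interface, so swapping backbones is a one-line change. A quick sketch, assuming the sibling repositories live under the same alkiskoudounas/ namespace as this model:

```python
from transformers import AutoModel

# Repo ids assumed from the model names above; verify them on the Hub.
for repo_id in [
    "alkiskoudounas/voc2vec",
    "alkiskoudounas/voc2vec-as-pt",
    "alkiskoudounas/voc2vec-ls-pt",
    "alkiskoudounas/voc2vec-hubert-ls-pt",
]:
    model = AutoModel.from_pretrained(repo_id)
    print(repo_id, model.config.model_type)  # "wav2vec2" or "hubert"
```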
📄 License
This project is licensed under the Apache-2.0 license.
📚 Documentation
BibTeX entry and citation info
@INPROCEEDINGS{koudounas2025icassp,
author={Koudounas, Alkis and La Quatra, Moreno and Siniscalchi, Sabato Marco and Baralis, Elena},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={voc2vec: A Foundation Model for Non-Verbal Vocalization},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Pediatrics;Accuracy;Foundation models;Benchmark testing;Signal processing;Data models;Acoustics;Speech processing;Nonverbal vocalization;Representation Learning;Self-Supervised Models;Pre-trained Models},
doi={10.1109/ICASSP49660.2025.10890672}}