Whisper-large-v3-voice-quality Open-source Voice Model - Freely Analyze Voice Features such as Pitch and Sound Quality

Whisper Large V3 Voice Quality

Developed by tiantiaf

A voice quality classification model based on Whisper Large v3, used to analyze features such as pitch, timbre, volume, clarity, and rhythm of speech.

Audio Classification

Safetensors

English #Voice Feature Analysis #Multi-label Classification #Speaker Attribute Recognition

Downloads 162

Release Time: 5/22/2025

Model Overview

This model implements the voice quality classification method described in 'Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits'. It is a multi-label classifier: each utterance can receive several voice quality labels at once.

Model Features

Multi-dimensional Voice Feature Analysis

Capable of analyzing multiple dimensions of speech features such as pitch, timbre, volume, clarity, and rhythm simultaneously.

Speaker-level Evaluation

Uses speaker-level macro-average F1 score for evaluation to ensure the representativeness of the results.

Efficient Audio Processing

Supports audio input up to 15 seconds in length, with a sampling rate of 16kHz and mono-channel processing.
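The input constraints above can be enforced before calling the model. The following is a minimal sketch, not part of the released package: `prepare_waveform` and `is_reliable_length` are hypothetical helper names, and the sketch assumes the audio has already been decoded and resampled to 16 kHz by whatever audio library you use. The 3-second lower bound comes from the training-data filter mentioned in the usage example.

```python
import numpy as np

SAMPLE_RATE = 16_000   # the model expects 16 kHz audio
MAX_SECONDS = 15       # longer audio is truncated
MIN_SECONDS = 3        # shorter audio was filtered out of the training data


def prepare_waveform(waveform, sample_rate):
    """Downmix to mono and truncate to 15 s; returns shape (1, num_samples)."""
    if sample_rate != SAMPLE_RATE:
        # Resampling is left to the caller's audio library (assumption).
        raise ValueError(f"expected {SAMPLE_RATE} Hz audio, got {sample_rate} Hz")
    audio = np.asarray(waveform, dtype=np.float32)
    if audio.ndim == 2:
        # Input shaped (channels, samples): average the channels to get mono.
        audio = audio.mean(axis=0)
    audio = audio[: MAX_SECONDS * SAMPLE_RATE]
    return audio[np.newaxis, :]


def is_reliable_length(num_samples):
    """True if the clip is at least 3 s long at 16 kHz."""
    return num_samples >= MIN_SECONDS * SAMPLE_RATE
```

The returned array can be converted with `torch.from_numpy(...)` and moved to the model's device before inference.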

Model Capabilities

Voice Quality Classification

Pitch Analysis

Timbre Analysis

Volume Analysis

Clarity Analysis

Rhythm Analysis

Use Cases

Speech Analysis

Speech Feature Labeling

Automatically labels speech samples with features such as pitch and timbre.

Provides detailed speech feature classification results

Speaker Feature Analysis

Analyzes the speech feature patterns of speakers.

Generates speaker-level speech feature reports

Speech Research

Speech Feature Research

Used for research on the correlation between speech features and speaker characteristics.

🚀 Whisper Large v3 for Voice (Sounding) Quality Classification

This model is designed for voice (sounding) quality classification. It builds on the Whisper Large v3 base model and implements the voice quality classification method from the Vox-Profile benchmark.

🚀 Quick Start

Download repo

git clone git@github.com:tiantiaf0627/vox-profile-release.git

Install the package

conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .

Load the model

# Load libraries
import torch
from src.model.voice_quality.whisper_voice_quality import WhisperWrapper
# Select the device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the model from Hugging Face
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-voice-quality").to(device)
model.eval()

✨ Features

Based on Whisper Large v3: Utilizes the openai/whisper-large-v3 base model.
Comprehensive Voice Quality Labels: Covers various aspects of voice quality including pitch, texture, volume, clarity, and rhythm.
Specific Metric Calculation: Reports speaker-level Macro-F1 scores with a specific sampling and averaging process.

💻 Usage Examples

Basic Usage

# Load libraries
import torch
from src.model.voice_quality.whisper_voice_quality import WhisperWrapper
# Select the device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the model from Hugging Face
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-voice-quality").to(device)
model.eval()

# Label List
voice_quality_label_list = [
    'shrill', 'nasal', 'deep',  # Pitch
    'silky', 'husky', 'raspy', 'guttural', 'vocal-fry', # Texture
    'booming', 'authoritative', 'loud', 'hushed', 'soft', # Volume
    'crisp', 'slurred', 'lisp', 'stammering', # Clarity
    'singsong', 'pitchy', 'flowing', 'monotone', 'staccato', 'punctuated', 'enunciated',  'hesitant', # Rhythm
]
    
# Load data; an all-zero waveform stands in for real audio in this example
# Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation)
# So prepare your audio to be at most 15 seconds long, 16 kHz, and mono channel
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits = model(
    data, return_feature=False
)

# Probability and output: a sigmoid per label, since labels are not mutually exclusive
# torch.as_tensor leaves logits unchanged if it is already a tensor
voice_quality_prob = torch.sigmoid(torch.as_tensor(logits))
    
# In practice, a larger threshold would remove some noise, but it is best to aggregate predictions per speaker
threshold = 0.7
predictions = (voice_quality_prob > threshold).int().detach().cpu().numpy()[0].tolist()
# Keep the labels whose probability exceeds the threshold
voice_label = [
    label for label, hit in zip(voice_quality_label_list, predictions) if hit == 1
]

# print the voice quality labels
print(voice_label)
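
The comment above recommends aggregating predictions per speaker rather than trusting a single utterance. The sketch below shows one way this could be done, assuming mean pooling of per-utterance probabilities followed by the same threshold. The helper `aggregate_by_speaker` and the mean-pooling choice are illustrative assumptions, not the authors' documented procedure.

```python
from collections import defaultdict


def aggregate_by_speaker(utterance_probs, label_list, threshold=0.7):
    """Average per-utterance probabilities for each speaker, then threshold.

    utterance_probs: iterable of (speaker_id, [probability per label]) pairs,
    e.g. built from voice_quality_prob[0].tolist() for each utterance.
    """
    per_speaker = defaultdict(list)
    for speaker_id, probs in utterance_probs:
        if len(probs) != len(label_list):
            raise ValueError("probability vector does not match the label list")
        per_speaker[speaker_id].append(probs)

    speaker_labels = {}
    for speaker_id, rows in per_speaker.items():
        # Column-wise mean across this speaker's utterances
        mean_probs = [sum(column) / len(rows) for column in zip(*rows)]
        speaker_labels[speaker_id] = [
            label for label, p in zip(label_list, mean_probs) if p > threshold
        ]
    return speaker_labels
```

With this kind of pooling, a label that fires strongly on one utterance but not on the others falls below the threshold, which is the noise reduction the tip refers to.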

📚 Documentation

Model Description

This model includes the implementation of voice quality classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits.

Metric

We report speaker-level Macro-F1 scores. Specifically, we randomly sampled five utterances for each speaker and repeated this stratification process 20 times. The speaker-level score is computed as the average Macro-F1 across speakers. We then report the unweighted average of speaker-level Macro-F1 scores between VoxCeleb and Expresso.
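
The description above leaves some details open, so the following is only one possible reading, sketched in plain Python. It assumes that the five sampled utterances per speaker are mean-pooled into one prediction per speaker, that Macro-F1 is then taken over labels across speakers, and that the result is averaged over the 20 repeats. The function names, the pooling step, and the 0.5 threshold are all assumptions and are not taken from the paper or the released code.

```python
import random
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over binary multi-label vectors."""
    scores = []
    for j in range(len(y_true[0])):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def speaker_level_macro_f1(utt_probs, speaker_truth, n_utts=5, n_repeats=20,
                           threshold=0.5, seed=0):
    """utt_probs: {speaker: [prob vectors]}; speaker_truth: {speaker: 0/1 vector}."""
    rng = random.Random(seed)
    speakers = sorted(utt_probs)
    repeat_scores = []
    for _ in range(n_repeats):
        y_true, y_pred = [], []
        for spk in speakers:
            utts = utt_probs[spk]
            sampled = rng.sample(utts, min(n_utts, len(utts)))
            pooled = [mean(column) for column in zip(*sampled)]  # assumed pooling
            y_pred.append([int(p > threshold) for p in pooled])
            y_true.append(speaker_truth[spk])
        repeat_scores.append(macro_f1(y_true, y_pred))
    return mean(repeat_scores)


def unweighted_average(score_a, score_b):
    """Simple mean of two dataset scores, e.g. VoxCeleb and Expresso."""
    return (score_a + score_b) / 2
```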

Special Note

We exclude EARS from ParaSpeechCaps due to its limited number of samples in the holdout set.

The included labels are:

[
    'shrill', 'nasal', 'deep',  # Pitch
    'silky', 'husky', 'raspy', 'guttural', 'vocal-fry', # Texture
    'booming', 'authoritative', 'loud', 'hushed', 'soft', # Volume
    'crisp', 'slurred', 'lisp', 'stammering', # Clarity
    'singsong', 'pitchy', 'flowing', 'monotone', 'staccato', 'punctuated', 'enunciated',  'hesitant', # Rhythm
]
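
Because the label list is flat, it can help to keep the five categories explicit when reporting results. The sketch below restates the same 25 labels as a category mapping and groups a list of predicted labels by category. The names `VOICE_QUALITY_CATEGORIES`, `LABEL_TO_CATEGORY`, and `group_by_category` are illustrative and do not come from the released package.

```python
# Same labels and order as the list above, keyed by category
VOICE_QUALITY_CATEGORIES = {
    "pitch": ["shrill", "nasal", "deep"],
    "texture": ["silky", "husky", "raspy", "guttural", "vocal-fry"],
    "volume": ["booming", "authoritative", "loud", "hushed", "soft"],
    "clarity": ["crisp", "slurred", "lisp", "stammering"],
    "rhythm": ["singsong", "pitchy", "flowing", "monotone", "staccato",
               "punctuated", "enunciated", "hesitant"],
}

# Flattening keeps the model's label order (dicts preserve insertion order)
voice_quality_label_list = [
    label for labels in VOICE_QUALITY_CATEGORIES.values() for label in labels
]

LABEL_TO_CATEGORY = {
    label: category
    for category, labels in VOICE_QUALITY_CATEGORIES.items()
    for label in labels
}


def group_by_category(predicted_labels):
    """Group predicted label names under their category."""
    grouped = {}
    for label in predicted_labels:
        grouped.setdefault(LABEL_TO_CATEGORY[label], []).append(label)
    return grouped
```

For example, `group_by_category(["deep", "husky", "monotone"])` places one label each under pitch, texture, and rhythm.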

Library: https://github.com/tiantiaf0627/vox-profile-release

Kindly cite our paper if you use our model or find it useful in your work:

@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}

📄 License

This model is released under the BSD-2-Clause license.

💡 Usage Tip

In practice, a larger threshold would remove some noise, but it is best to aggregate predictions per speaker.
