whisper-large-v3-voice-quality: a free, open-source speech model for analyzing pitch, texture, and other voice characteristics

Whisper Large V3 Voice Quality

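Developed by tiantiaf

A voice quality classification model based on Whisper Large v3 that analyzes speech characteristics such as pitch, texture, volume, clarity, and rhythm.

Audio Classification

Safetensors

English #voice-characteristic-analysis #multi-label-classification #speaker-attribute-recognition

Downloads: 162

Released: 5/22/2025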

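Model Overview

This model implements the voice quality classification approach described in "Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits" and classifies speech along several dimensions.

Model Features

Multi-dimensional voice characteristic analysis

Analyzes pitch, texture, volume, clarity, and rhythm in a single pass.

Speaker-level evaluation

Evaluated with speaker-level macro-averaged F1, so that speakers with many utterances do not dominate the score.

Bounded audio input

Accepts audio of up to 15 seconds, sampled at 16 kHz, mono.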
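The input constraints above can be enforced before the model is called. The following is a minimal sketch in plain Python, assuming the audio has already been decoded into one list of float samples per channel. The function name `prepare_clip` is hypothetical and not part of the released code, and resampling is left to an audio library.

```python
TARGET_SAMPLE_RATE = 16_000  # Hz, as stated in the model card
MAX_SECONDS = 15             # maximum input length stated in the model card


def prepare_clip(channels, sample_rate):
    """Downmix to mono and truncate to the model's maximum length.

    `channels` is a list of equal-length lists of float samples, one per channel.
    Resampling is deliberately not attempted here.
    """
    if sample_rate != TARGET_SAMPLE_RATE:
        raise ValueError(f"expected {TARGET_SAMPLE_RATE} Hz audio, got {sample_rate} Hz")
    if not channels or any(len(ch) != len(channels[0]) for ch in channels):
        raise ValueError("channels must be non-empty and of equal length")

    # Average across channels to obtain a mono signal.
    mono = [sum(frame) / len(channels) for frame in zip(*channels)]

    # Keep at most 15 seconds of audio.
    return mono[: MAX_SECONDS * TARGET_SAMPLE_RATE]


# 16 seconds of synthetic stereo audio, used only to exercise the function.
stereo = [[0.5] * (16 * TARGET_SAMPLE_RATE), [-0.5] * (16 * TARGET_SAMPLE_RATE)]
clip = prepare_clip(stereo, 16_000)
print(len(clip) / TARGET_SAMPLE_RATE)  # duration in seconds after truncation
```

The resulting list can be wrapped in a tensor of shape [1, num_samples] before it is passed to the model.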

Model Capabilities

Voice quality classification

Pitch analysis

Texture analysis

Volume analysis

Clarity analysis

Rhythm analysis
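The five dimensions listed above correspond to the groups marked by comments in the card's 25-entry label list. As a small sketch, predicted labels can be organized by dimension as follows. The names `LABEL_GROUPS` and `group_labels` are illustrative and do not come from the released code.

```python
# Label groups copied from the comments in the model card's label list.
LABEL_GROUPS = {
    "pitch": ["shrill", "nasal", "deep"],
    "texture": ["silky", "husky", "raspy", "guttural", "vocal-fry"],
    "volume": ["booming", "authoritative", "loud", "hushed", "soft"],
    "clarity": ["crisp", "slurred", "lisp", "stammering"],
    "rhythm": ["singsong", "pitchy", "flowing", "monotone", "staccato",
               "punctuated", "enunciated", "hesitant"],
}

# Reverse lookup: label -> dimension.
LABEL_TO_GROUP = {
    label: group for group, labels in LABEL_GROUPS.items() for label in labels
}


def group_labels(predicted_labels):
    """Return {dimension: [labels]} for the dimensions that have a prediction."""
    grouped = {}
    for label in predicted_labels:
        grouped.setdefault(LABEL_TO_GROUP[label], []).append(label)
    return grouped


print(group_labels(["deep", "husky", "monotone", "hesitant"]))
```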

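Use Cases

Speech analysis

Voice characteristic labeling

Automatically labels speech samples with characteristics such as pitch and texture.

Produces detailed voice characteristic classification results

Speaker profiling

Analyzes a speaker's voice characteristic patterns.

Produces speaker-level voice characteristic reports

Speech research

Voice characteristic research

Supports studies of how voice characteristics correlate with speaker traits.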

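🚀 Whisper Large v3 for Voice (Phonation) Quality Classification

This model is based on OpenAI's Whisper Large v3 and performs voice quality classification. It predicts labels for a range of voice characteristics and is intended for speech research and applications.

🚀 Quick Start

Clone the repository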

git clone git@github.com:tiantiaf0627/vox-profile-release.git

Install dependencies

conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .

Load the model

# Load libraries
import torch
import torch.nn.functional as F
from src.model.voice_quality.whisper_voice_quality import WhisperWrapper
# Find device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load model from Huggingface
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-voice-quality").to(device)
model.eval()

Run prediction

# Label List
voice_quality_label_list = [
    'shrill', 'nasal', 'deep',  # Pitch
    'silky', 'husky', 'raspy', 'guttural', 'vocal-fry', # Texture
    'booming', 'authoritative', 'loud', 'hushed', 'soft', # Volume
    'crisp', 'slurred', 'lisp', 'stammering', # Clarity
    'singsong', 'pitchy', 'flowing', 'monotone', 'staccato', 'punctuated', 'enunciated',  'hesitant', # Rhythm
]
    
# Load data; zeros are used here only as a placeholder input
# The training data excludes audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limits)
# Prepare your audio as 16 kHz mono and truncate it to at most 15 seconds
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
with torch.no_grad():
    logits = model(data, return_feature=False)

# Per-label probabilities: each label gets its own sigmoid because several labels can apply at once
voice_quality_prob = torch.sigmoid(torch.as_tensor(logits))

# In practice, a larger threshold would remove some noise, but it is best to aggregate predictions per speaker
threshold = 0.7
predictions = (voice_quality_prob > threshold).int().detach().cpu().numpy()[0].tolist()
# Keep the labels whose probability exceeds the threshold
voice_label = [
    label for label, flag in zip(voice_quality_label_list, predictions) if flag == 1
]

# print the voice quality labels
print(voice_label)
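The comment in the example recommends aggregating predictions per speaker but does not say how. One simple option, shown here as a sketch rather than the authors' method, is to average each label's probability over a speaker's utterances and threshold the mean. The function name `aggregate_by_speaker` and the input layout are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_by_speaker(utterance_probs, labels, threshold=0.7):
    """Average per-label probabilities over each speaker's utterances, then threshold.

    `utterance_probs` is a list of (speaker_id, probabilities) pairs, where
    `probabilities` holds one value per entry in `labels`.
    """
    per_speaker = defaultdict(list)
    for speaker_id, probs in utterance_probs:
        if len(probs) != len(labels):
            raise ValueError("each probability vector must match the label list")
        per_speaker[speaker_id].append(probs)

    result = {}
    for speaker_id, rows in per_speaker.items():
        # Column-wise mean across this speaker's utterances.
        mean_probs = [sum(col) / len(rows) for col in zip(*rows)]
        result[speaker_id] = [lab for lab, p in zip(labels, mean_probs) if p > threshold]
    return result


labels = ["deep", "husky", "loud"]  # shortened label list for the example
utterances = [
    ("spk1", [0.9, 0.8, 0.1]),
    ("spk1", [0.8, 0.5, 0.2]),
    ("spk2", [0.1, 0.2, 0.95]),
]
speaker_labels = aggregate_by_speaker(utterances, labels)
print(speaker_labels)
```

With real model output, each probability vector would be one row of `voice_quality_prob` converted to a Python list. Averaging damps labels that fire on only one utterance, for example "husky" for spk1 above.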

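✨ Key Features

Model implementation: This model implements the voice quality classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).
Evaluation metric: We report speaker-level Macro-F1 scores. Five utterances are randomly sampled for each speaker, and this stratified sampling is repeated 20 times. The speaker-level score is the Macro-F1 averaged over all speakers. The final reported number is the unweighted average of the speaker-level Macro-F1 scores on VoxCeleb and Expresso.
Note: The EARS held-out set in the ParaSpeechCaps dataset is excluded because it contains too few samples.
Labels: The labels cover pitch, texture, volume, clarity, and rhythm, as listed below: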

[
    'shrill', 'nasal', 'deep',  # Pitch
    'silky', 'husky', 'raspy', 'guttural', 'vocal-fry', # Texture
    'booming', 'authoritative', 'loud', 'hushed', 'soft', # Volume
    'crisp', 'slurred', 'lisp', 'stammering', # Clarity
    'singsong', 'pitchy', 'flowing', 'monotone', 'staccato', 'punctuated', 'enunciated',  'hesitant', # Rhythm
]

Repository: The code is available at https://github.com/tiantiaf0627/vox-profile-release.
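The evaluation protocol described above leaves some details open, so the following is only one plausible reading and not the authors' evaluation script. It samples up to five utterances per speaker, computes Macro-F1 over the labels for that sample, averages over speakers, and repeats the procedure 20 times. Counting an undefined F1 (a label with no true or predicted positives in the sample) as 0.0 is an assumption, as are all function names.

```python
import random
from collections import defaultdict


def f1(true_col, pred_col):
    """Binary F1 for one label; 0.0 when the label has no positives at all."""
    tp = sum(1 for t, p in zip(true_col, pred_col) if t and p)
    fp = sum(1 for t, p in zip(true_col, pred_col) if not t and p)
    fn = sum(1 for t, p in zip(true_col, pred_col) if t and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_rows, pred_rows):
    """Unweighted mean of per-label F1 over multi-hot rows."""
    scores = [f1(t, p) for t, p in zip(zip(*true_rows), zip(*pred_rows))]
    return sum(scores) / len(scores)


def speaker_level_macro_f1(samples, n_utts=5, n_repeats=20, seed=0):
    """`samples` is a list of (speaker_id, true_multi_hot, pred_multi_hot)."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for speaker_id, true_row, pred_row in samples:
        by_speaker[speaker_id].append((true_row, pred_row))

    repeat_scores = []
    for _ in range(n_repeats):
        speaker_scores = []
        for utts in by_speaker.values():
            # Sample without replacement; use every utterance if fewer than n_utts exist.
            chosen = rng.sample(utts, min(n_utts, len(utts)))
            true_rows, pred_rows = zip(*chosen)
            speaker_scores.append(macro_f1(true_rows, pred_rows))
        repeat_scores.append(sum(speaker_scores) / len(speaker_scores))
    return sum(repeat_scores) / len(repeat_scores)


# Toy data with two labels; real rows would have 25 entries.
samples = [
    ("spk1", [1, 0], [1, 0]),
    ("spk1", [1, 0], [1, 1]),
    ("spk2", [0, 1], [0, 1]),
]
score = speaker_level_macro_f1(samples)
print(round(score, 3))
```

Under this reading, the final number would be the plain mean of this score computed separately on VoxCeleb and on Expresso.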

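📚 Documentation

Model Description

This model is built on OpenAI's openai/whisper-large-v3 base model and trained on the ajd12342/paraspeechcaps dataset for audio classification.

Model Details

Property	Details
Model type	Whisper Large v3 for voice (phonation) quality classification
Base model	openai/whisper-large-v3
Training dataset	ajd12342/paraspeechcaps
Evaluation metrics	Accuracy, speaker-level Macro-F1
Pipeline tag	Audio classification

Citation

If you use our model or find it useful in your work, please cite our paper: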

@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}