whisper-large-v3-voice-quality: an open-source speech model for analyzing pitch, texture, and other voice characteristics

Whisper Large V3 Voice Quality

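Developed by tiantiaf

A voice quality classification model based on Whisper Large v3 that analyzes the pitch, texture, volume, clarity, and rhythm of speech.

Audio Classification

Safetensors

English #Voice feature analysis #Multi-label classification #Speaker attribute recognition

Downloads: 162

Released: 5/22/2025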

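Model Overview

This model implements the voice quality classification method described in "Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits" and classifies speech along several dimensions.

Model Features

Multi-dimensional voice analysis

Analyzes pitch, texture, volume, clarity, and rhythm in a single pass.

Speaker-level evaluation

Evaluated with speaker-level macro-averaged F1, so that the reported score reflects performance across speakers.

Audio input

Accepts audio of up to 15 seconds, sampled at 16 kHz, mono.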
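The input constraints above can be sketched as a small preprocessing helper. This is only an illustration, not part of the released code: `prepare_waveform` and the constants are hypothetical names, NumPy is used for convenience, and the input is assumed to be sampled at 16 kHz already (resampling is not shown).

```python
import numpy as np

SAMPLE_RATE = 16_000   # sampling rate the model expects
MAX_SECONDS = 15       # longer clips are truncated


def prepare_waveform(waveform: np.ndarray) -> np.ndarray:
    """Return a float32 array of shape (1, num_samples): mono, at most 15 s.

    Assumes `waveform` is already at 16 kHz, with shape (num_samples,)
    or (channels, num_samples).
    """
    audio = np.asarray(waveform, dtype=np.float32)
    if audio.ndim == 2:
        # Downmix multi-channel audio by averaging the channels.
        audio = audio.mean(axis=0)
    elif audio.ndim != 1:
        raise ValueError("expected a 1-D or 2-D array")
    # Keep only the first 15 seconds.
    audio = audio[: MAX_SECONDS * SAMPLE_RATE]
    # Add a leading batch dimension.
    return audio[np.newaxis, :]


# A 20-second stereo clip becomes a 15-second mono batch of size 1.
stereo_clip = np.zeros((2, 20 * SAMPLE_RATE))
prepared = prepare_waveform(stereo_clip)
print(prepared.shape)
```

The result can be converted with `torch.from_numpy` before being passed to the model.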

Model Capabilities

Voice quality classification

Pitch analysis

Texture analysis

Volume analysis

Clarity analysis

Rhythm analysis

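Use Cases

Voice analysis

Voice feature labeling

Automatically tags speech samples with labels for pitch, texture, and other features.

Provides detailed voice feature classification results

Speaker trait analysis

Analyzes patterns in a speaker's voice characteristics.

Generates speaker-level voice feature reports

Speech research

Voice feature research

Supports studies of how voice features correlate with speaker traits.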

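🚀 Whisper Large v3 for Voice (Vocal) Quality Classification

This model is built on OpenAI's Whisper Large v3 and performs voice quality classification. It recognizes a range of voice characteristics and is intended for speech research and applications.

🚀 Quick Start

Clone the repository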

git clone git@github.com:tiantiaf0627/vox-profile-release.git

Install dependencies

conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .

Load the model

# Load libraries
import torch
import torch.nn.functional as F
from src.model.voice_quality.whisper_voice_quality import WhisperWrapper
# Find device (always a torch.device, on GPU when available)
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
# Load model from Huggingface
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-voice-quality").to(device)
model.eval()

Run prediction

# Label list
voice_quality_label_list = [
    'shrill', 'nasal', 'deep',  # Pitch
    'silky', 'husky', 'raspy', 'guttural', 'vocal-fry',  # Texture
    'booming', 'authoritative', 'loud', 'hushed', 'soft',  # Volume
    'crisp', 'slurred', 'lisp', 'stammering',  # Clarity
    'singsong', 'pitchy', 'flowing', 'monotone', 'staccato', 'punctuated', 'enunciated', 'hesitant',  # Rhythm
]

# Load data; zeros are used here only as a placeholder
# The training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation)
# So you need to prepare your audio to a maximum of 15 seconds, 16kHz and mono channel
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
# Inference only, so gradients are not needed
with torch.no_grad():
    logits = model(data, return_feature=False)

# Multi-label probabilities: an independent sigmoid per label
voice_quality_prob = torch.sigmoid(torch.as_tensor(logits))

# In practice, a larger threshold would remove some noise, but it is best to aggregate predictions per speaker
threshold = 0.7
predictions = (voice_quality_prob > threshold).int().detach().cpu().numpy()[0].tolist()
# Keep the labels whose probability exceeds the threshold
voice_label = [label for label, flag in zip(voice_quality_label_list, predictions) if flag == 1]

# Print the voice quality labels
print(voice_label)
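The code comment above recommends aggregating predictions per speaker rather than trusting a single utterance. One simple way to do that, sketched here as an assumption and not taken from the released code, is to average the per-label probabilities over a speaker's utterances and threshold the mean. `aggregate_by_speaker` and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def aggregate_by_speaker(
    utterances: Sequence[Tuple[str, Sequence[float]]],
    labels: Sequence[str],
    threshold: float = 0.7,
) -> Dict[str, List[str]]:
    """Average per-utterance probabilities per speaker, then threshold the mean."""
    grouped: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for speaker_id, probs in utterances:
        if len(probs) != len(labels):
            raise ValueError("each probability vector needs one entry per label")
        grouped[speaker_id].append(probs)

    result: Dict[str, List[str]] = {}
    for speaker_id, rows in grouped.items():
        # Column-wise mean over this speaker's utterances.
        means = [sum(column) / len(rows) for column in zip(*rows)]
        result[speaker_id] = [name for name, p in zip(labels, means) if p > threshold]
    return result


# Toy example with three labels and two speakers (made-up probabilities).
toy_labels = ["shrill", "nasal", "deep"]
toy_utterances = [
    ("spk1", [0.1, 0.2, 0.9]),
    ("spk1", [0.2, 0.1, 0.8]),
    ("spk2", [0.9, 0.6, 0.1]),
    ("spk2", [0.8, 0.9, 0.2]),
]
speaker_labels = aggregate_by_speaker(toy_utterances, toy_labels)
print(speaker_labels)
```

In real use, each probability vector would be one row of `voice_quality_prob` converted to a list, and `labels` would be `voice_quality_label_list`. Averaging is only one choice; majority voting over thresholded predictions is another.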

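✨ Key Features

Implementation: this model contains the implementation of voice quality classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).
Evaluation metric: the reported metric is speaker-level Macro-F1. Five utterances are randomly sampled for each speaker, and this stratified sampling is repeated 20 times. The speaker-level score is the average Macro-F1 across all speakers. The final reported number is the unweighted average of the speaker-level Macro-F1 scores on VoxCeleb and Expresso.
Note: the EARS portion of the ParaSpeechCaps dataset is excluded because its held-out set contains too few samples.
Labels: the labels cover pitch, texture, volume, clarity, and rhythm, as listed below: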

[
    'shrill', 'nasal', 'deep',  # Pitch
    'silky', 'husky', 'raspy', 'guttural', 'vocal-fry', # Texture
    'booming', 'authoritative', 'loud', 'hushed', 'soft', # Volume
    'crisp', 'slurred', 'lisp', 'stammering', # Clarity
    'singsong', 'pitchy', 'flowing', 'monotone', 'staccato', 'punctuated', 'enunciated',  'hesitant', # Rhythm
]

Related repository: the code is available at https://github.com/tiantiaf0627/vox-profile-release.
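The evaluation protocol described above (five sampled utterances per speaker, 20 repetitions, averaged across speakers, then an unweighted average over datasets) can be sketched as follows. This is one possible reading of that description, not the authors' evaluation script: the function names are hypothetical, Macro-F1 is computed over label columns within each speaker's sample, and a label with no true or predicted positives is scored as 0.

```python
import random
from statistics import mean
from typing import Dict, Sequence, Tuple

# One utterance: (true multi-hot labels, predicted multi-hot labels).
Utterance = Tuple[Sequence[int], Sequence[int]]


def macro_f1(utterances: Sequence[Utterance]) -> float:
    """Macro-averaged F1 over label columns (assumption: empty labels score 0)."""
    num_labels = len(utterances[0][0])
    scores = []
    for j in range(num_labels):
        tp = sum(1 for t, p in utterances if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in utterances if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in utterances if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def speaker_level_macro_f1(
    by_speaker: Dict[str, Sequence[Utterance]],
    n_utts: int = 5,
    n_repeats: int = 20,
    seed: int = 0,
) -> float:
    """Sample n_utts utterances per speaker, average Macro-F1 over speakers, repeat."""
    rng = random.Random(seed)
    repeat_scores = []
    for _ in range(n_repeats):
        per_speaker = []
        for utts in by_speaker.values():
            sample = rng.sample(list(utts), min(n_utts, len(utts)))
            per_speaker.append(macro_f1(sample))
        repeat_scores.append(mean(per_speaker))
    return mean(repeat_scores)


def benchmark_score(dataset_scores: Sequence[float]) -> float:
    """Unweighted average over datasets (VoxCeleb and Expresso in the text above)."""
    return mean(dataset_scores)


# Toy data: two speakers, two labels, every prediction correct.
toy = {
    "spk_a": [([1, 1], [1, 1])] * 6,
    "spk_b": [([1, 1], [1, 1])] * 5,
}
toy_score = speaker_level_macro_f1(toy)
print(toy_score)
```

Fixing the random seed keeps the sampled subsets reproducible across runs.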

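📚 Documentation

Model description

This model is built on the openai/whisper-large-v3 base model and trained on the ajd12342/paraspeechcaps dataset for audio classification.

Model information

Property	Details
Model type	Whisper Large v3 for voice (vocal) quality classification
Base model	openai/whisper-large-v3
Training dataset	ajd12342/paraspeechcaps
Evaluation metrics	Accuracy, speaker-level Macro-F1
Pipeline tag	Audio classification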

Citation

If you use our model or find it useful in your work, please cite our paper:

@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}