whisper-large-v3-msp-podcast-emotion: an open-source speech emotion recognition model supporting 9 emotion categories

Whisper Large V3 Msp Podcast Emotion

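Developed by tiantiaf

A speech emotion recognition model based on Whisper-Large V3, optimized for the MSP-Podcast dataset, supporting 9 emotion categories

Audio classification

Safetensors

English #speech emotion recognition #speech-only system #optimized for short audio

Downloads: 282

Release date: 5/22/2025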

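Model Overview

The model performs speech emotion recognition. It is trained on the MSP-Podcast dataset, which makes it particularly suited to classifying emotion in online audio content.

Model Features

Efficient speech-only system

Uses no text transcripts; the emotion recognition system is built on speech alone, which keeps it simple and efficient

Diverse emotion categories

Recognizes 9 emotion categories, including anger, happiness, and sadness

Optimized for online content

Particularly suited to classifying emotion in online audio content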

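Model Capabilities

Speech emotion recognition

Audio classification

Speech feature extraction

Use Cases

Content analysis

Podcast emotion analysis

Analyzes the emotional tone of podcast content

Distinguishes 9 emotion categories

Social media monitoring

Monitors the emotional tone of audio content on social media

Helps identify content that may carry negative emotion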

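🚀 Whisper-Large V3 for Categorical Emotion Classification

This model implements categorical emotion classification on top of Whisper-Large V3. Given a speech clip, it predicts one of nine emotion categories.

🚀 Quick Start

The model can be used for speech emotion classification. The sections below describe how to install and use it.

✨ Key Features

Model implementation: the model implements the categorical emotion classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits (https://arxiv.org/pdf/2505.14648).
Training pipeline: the training pipeline is the top-performing solution (SAILER) from the INTERSPEECH 2025 Speech Emotion Challenge (https://lab-msp.com/MSP-Podcast_Competition/IS2025/).
Training data: the model is trained on MSP-Podcast data, so its emotion predictions may be sensitive to content information. This can be a useful property when classifying emotion in online content.
Supported emotion categories: Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, Surprise, and Other.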

📦 Installation

Download the repository

git clone git@github.com:tiantiaf0627/vox-profile-release.git

Install the package

conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .

💻 Usage Examples

Basic usage

# Load libraries
import torch
import torch.nn.functional as F
from src.model.emotion.whisper_emotion import WhisperWrapper
# Find device (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load model from Hugging Face
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
model.eval()
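
Input audio has to match what the model was trained on: mono, 16 kHz, and no longer than 15 seconds, with clips under 3 seconds described as giving unreliable predictions. The following is a minimal sketch of that preparation step. `prepare_waveform` and `is_long_enough` are hypothetical helper names, not part of the vox-profile-release package, and the resampling branch assumes torchaudio is installed.

```python
import torch

TARGET_SR = 16000   # sample rate the model expects
MAX_SECONDS = 15    # longer clips are truncated
MIN_SECONDS = 3     # shorter clips were excluded from the training data


def prepare_waveform(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """Return a float tensor of shape [1, num_samples]: mono, 16 kHz, at most 15 s.

    Hypothetical helper for illustration; not part of the released package.
    """
    if waveform.dim() == 1:
        waveform = waveform.unsqueeze(0)  # [num_samples] -> [1, num_samples]
    # Downmix multi-channel audio by averaging the channels.
    waveform = waveform.float().mean(dim=0, keepdim=True)
    if sample_rate != TARGET_SR:
        # Imported lazily so 16 kHz input works without torchaudio installed.
        import torchaudio.functional as AF
        waveform = AF.resample(waveform, orig_freq=sample_rate, new_freq=TARGET_SR)
    # Keep only the first 15 seconds.
    return waveform[:, : MAX_SECONDS * TARGET_SR]


def is_long_enough(waveform: torch.Tensor) -> bool:
    """True when the prepared clip is at least 3 seconds long."""
    return waveform.shape[-1] >= MIN_SECONDS * TARGET_SR


# 20 seconds of silent stereo audio at 16 kHz stands in for a real recording.
stereo = torch.zeros(2, 20 * TARGET_SR)
clip = prepare_waveform(stereo, TARGET_SR)
print(clip.shape, is_long_enough(clip))
```

The resulting `clip` can be moved to `device` and passed to the model in place of a placeholder tensor.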

Advanced usage

# Label List
emotion_label_list = [
    'Anger', 
    'Contempt', 
    'Disgust', 
    'Fear', 
    'Happiness', 
    'Neutral', 
    'Sadness', 
    'Surprise', 
    'Other'
]
    
# Load data, here just zeros as the example
# Our training data filters output audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation)
# So you need to prepare your audio to a maximum of 15 seconds, 16kHz and mono channel
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embedding, _, _, _, _ = model(
    data, return_feature=True
)
    
# Probability and output
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
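
Recordings longer than 15 seconds have to be cut down before inference. One option, shown here as a sketch under stated assumptions rather than the authors' method, is to split the recording into consecutive non-overlapping windows and classify each one. `window_bounds` is a hypothetical helper, and dropping a trailing window shorter than 3 seconds is an assumption based on the note that such short audio gives unreliable predictions.

```python
from typing import List, Tuple

SAMPLE_RATE = 16000  # the model expects 16 kHz audio


def window_bounds(
    num_samples: int,
    window_seconds: float = 15.0,
    min_seconds: float = 3.0,
    sample_rate: int = SAMPLE_RATE,
) -> List[Tuple[int, int]]:
    """Split [0, num_samples) into consecutive windows of at most window_seconds.

    A trailing window shorter than min_seconds is dropped.
    Hypothetical helper for illustration; not part of the released package.
    """
    window = int(window_seconds * sample_rate)
    minimum = int(min_seconds * sample_rate)
    bounds = []
    for start in range(0, num_samples, window):
        end = min(start + window, num_samples)
        if end - start >= minimum:
            bounds.append((start, end))
    return bounds


# A 32-second recording: two full 15 s windows; the 2 s remainder is dropped.
print(window_bounds(32 * SAMPLE_RATE))
# -> [(0, 240000), (240000, 480000)]
```

Each `(start, end)` pair can then be used to slice a `[1, num_samples]` tensor, for example `data[:, start:end]`, before passing the slice to the model.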

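📚 Documentation

Model description

The model implements the categorical emotion classification described in Vox-Profile. Its training pipeline is the top-performing solution (SAILER) from the INTERSPEECH 2025 Speech Emotion Challenge. Compared with the official challenge submission, this model does not use all of the augmentation methods and does not use transcripts. It is a speech-only system, which keeps the model simple while remaining effective.

Supported emotion categories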

[
    'Anger', 
    'Contempt', 
    'Disgust', 
    'Fear', 
    'Happiness', 
    'Neutral', 
    'Sadness', 
    'Surprise', 
    'Other'
]
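
Because the model outputs one logit per category, it can be useful to inspect the full ranking and not just the top label. The sketch below applies a softmax and returns the most likely categories in plain Python. `rank_emotions` is a hypothetical helper, the example logits are made up for illustration, and the label order is assumed to follow the list above.

```python
import math
from typing import List, Sequence, Tuple

EMOTION_LABELS = [
    "Anger", "Contempt", "Disgust", "Fear", "Happiness",
    "Neutral", "Sadness", "Surprise", "Other",
]


def rank_emotions(logits: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
    """Softmax the logits and return the top_k (label, probability) pairs.

    Hypothetical helper for illustration; not part of the released package.
    """
    if len(logits) != len(EMOTION_LABELS):
        raise ValueError("expected one logit per emotion label")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(EMOTION_LABELS, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up logits for a single clip; real values could come from logits[0].tolist().
example_logits = [0.2, -1.0, -0.5, -2.0, 3.1, 1.4, -0.3, 0.0, -1.2]
for label, prob in rank_emotions(example_logits):
    print(f"{label}: {prob:.3f}")
```

Looking at the top few probabilities helps flag clips where the leading category wins by only a small margin.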

📄 License

This model is released under the BSD 2-Clause license.

Citation

If you use this model or find it useful in your work, please cite our paper:

@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}