🚀 Whisper-Large V3 for Categorical Emotion Classification
This model performs categorical emotion classification, using Whisper-Large V3 as its backbone to identify emotions in speech.
🚀 Quick Start
Download the repo:
```bash
git clone git@github.com:tiantiaf0627/vox-profile-release.git
```
Install the package:
```bash
conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .
```
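To verify the install, a quick check (a minimal sketch, assuming it is run from the repo root) is to import the wrapper class used in the steps below:
```python
# If the editable install and repo layout are correct, this import resolves without error.
# The module path matches the one used in the Quick Start below.
from src.model.emotion.whisper_emotion import WhisperWrapper
print(WhisperWrapper.__name__)
```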
Load the model:
```python
import torch
import torch.nn.functional as F
from src.model.emotion.whisper_emotion import WhisperWrapper

# Select GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained emotion classification model from the Hugging Face Hub
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
model.eval()
```
Prediction:
```python
# Label list (the order matches the model's output logits)
emotion_label_list = [
    'Anger',
    'Contempt',
    'Disgust',
    'Fear',
    'Happiness',
    'Neutral',
    'Sadness',
    'Surprise',
    'Other'
]

# The model expects 16 kHz mono audio of at most 15 seconds;
# here a 1-second tensor of zeros is used as a placeholder input.
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]

# Forward pass: the first two outputs are the emotion logits and the utterance embedding
logits, embedding, _, _, _, _ = model(
    data, return_feature=True
)

# Convert logits to probabilities and print the predicted emotion
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
```
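To run the model on a real recording instead of the zero placeholder, one option (a minimal sketch, assuming torchaudio is available; the file path "speech.wav" is a hypothetical placeholder) is to resample to 16 kHz mono and truncate to 15 seconds before the forward pass:
```python
import torchaudio

# Load a recording (hypothetical path); waveform shape: [channels, samples]
waveform, sr = torchaudio.load("speech.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
if sr != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(waveform)

# Truncate to the 15-second maximum expected by the model
data = waveform.float().to(device)[:, :15 * 16000]

logits, embedding, _, _, _, _ = model(data, return_feature=True)
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
```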
✨ Features
- Emotion Classification: Classifies categorical emotions from speech: Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, Surprise, and Other (a short sketch after this list shows how to read the full per-category probabilities).
- Based on Whisper-Large V3: Built on the openai/whisper-large-v3 base model for high-performance emotion classification.
- Trained on MSP-Podcast Data: Trained on MSP-Podcast, so the model may be sensitive to spoken content as well as acoustics, which can be helpful when predicting emotion in online content.
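As a complement to the single-label prediction in the Quick Start, the minimal sketch below (reusing the `emotion_prob` and `emotion_label_list` variables defined there) prints the probability assigned to every category:
```python
# Reuses `emotion_prob` (shape [1, 9]) and `emotion_label_list` from the Quick Start snippet
probs = emotion_prob.squeeze(0).detach().cpu()
for label, prob in zip(emotion_label_list, probs):
    print(f"{label}: {prob.item():.3f}")
```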
📚 Documentation
Model Description
This model includes the implementation of categorical emotion classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits.
The training pipeline is also the top-performing solution (SAILER) in the INTERSPEECH 2025 Speech Emotion Challenge. Unlike our official challenge submission, this model does not use all of the augmentations or the transcripts; it is a speech-only system that keeps the model simple while remaining effective.
We train this model on MSP-Podcast data, so it may be sensitive to spoken content when making emotion predictions. This can be a useful property when classifying emotions in online content.
The included emotions are Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, Surprise, and Other.
📄 License
This model is released under the BSD 2-Clause license.
📋 Information Table
| Property | Details |
|---|---|
| Model Type | Audio Classification |
| Base Model | openai/whisper-large-v3 |
| Training Data | MSP-Podcast |
| Metrics | Accuracy |
| Pipeline Tag | audio-classification |
| Tags | model_hub_mixin, pytorch_model_hub_mixin, speech_emotion_recognition |
📧 Contact
If you have any questions, please contact: Tiantian Feng (tiantiaf@usc.edu)
📖 Citation
Kindly cite our paper if you use our model or find it useful in your work:
```bibtex
@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}
```