🚀 Whisper-Large V3 for Categorical Emotion Classification
This model performs categorical emotion classification, using Whisper-Large V3 as its backbone to identify emotions in speech.
🚀 Quick Start
Download the repo:
```bash
git clone git@github.com:tiantiaf0627/vox-profile-release.git
```
Install the package:
```bash
conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .
```
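To verify the install, a quick check (a minimal sketch, assuming it is run from the repo root) is to import the wrapper class used in the steps below:
```python
# If the editable install and repo layout are correct, this import resolves without error.
# The module path matches the one used in the Quick Start below.
from src.model.emotion.whisper_emotion import WhisperWrapper
print(WhisperWrapper.__name__)
```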
Load the model:
```python
import torch
import torch.nn.functional as F
from src.model.emotion.whisper_emotion import WhisperWrapper

# Select GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pretrained emotion classification model from the Hugging Face Hub
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-msp-podcast-emotion").to(device)
model.eval()
```
Prediction:
```python
# Label list (the order matches the model's output logits)
emotion_label_list = [
    'Anger',
    'Contempt',
    'Disgust',
    'Fear',
    'Happiness',
    'Neutral',
    'Sadness',
    'Surprise',
    'Other'
]

# The model expects 16 kHz mono audio of at most 15 seconds;
# here a 1-second tensor of zeros is used as a placeholder input.
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]

# Forward pass: the first two outputs are the emotion logits and the utterance embedding
logits, embedding, _, _, _, _ = model(
    data, return_feature=True
)

# Convert logits to probabilities and print the predicted emotion
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
```
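To run the model on a real recording instead of the zero placeholder, one option (a minimal sketch, assuming torchaudio is available; the file path "speech.wav" is a hypothetical placeholder) is to resample to 16 kHz mono and truncate to 15 seconds before the forward pass:
```python
import torchaudio

# Load a recording (hypothetical path); waveform shape: [channels, samples]
waveform, sr = torchaudio.load("speech.wav")
waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
if sr != 16000:
    waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)(waveform)

# Truncate to the 15-second maximum expected by the model
data = waveform.float().to(device)[:, :15 * 16000]

logits, embedding, _, _, _, _ = model(data, return_feature=True)
emotion_prob = F.softmax(logits, dim=1)
print(emotion_label_list[torch.argmax(emotion_prob).detach().cpu().item()])
```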
✨ Features
- Emotion Classification: Classifies categorical emotions from speech: Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, Surprise, and Other (a short sketch after this list shows how to read the full per-category probabilities).
- Based on Whisper-Large V3: Built on the openai/whisper-large-v3 base model for high-performance emotion classification.
- Trained on MSP-Podcast Data: Trained on MSP-Podcast, so the model may be sensitive to spoken content as well as acoustics, which can be helpful when predicting emotion in online content.
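As a complement to the single-label prediction in the Quick Start, the minimal sketch below (reusing the `emotion_prob` and `emotion_label_list` variables defined there) prints the probability assigned to every category:
```python
# Reuses `emotion_prob` (shape [1, 9]) and `emotion_label_list` from the Quick Start snippet
probs = emotion_prob.squeeze(0).detach().cpu()
for label, prob in zip(emotion_label_list, probs):
    print(f"{label}: {prob.item():.3f}")
```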
📚 Documentation
Model Description
This model includes the implementation of categorical emotion classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits.
The training pipeline is also the top-performing solution (SAILER) in the INTERSPEECH 2025 Speech Emotion Challenge. Unlike our official challenge submission, this model does not use all of the augmentations or the transcripts; it is a speech-only system that keeps the model simple while remaining effective.
We train this model on MSP-Podcast data, so it may be sensitive to spoken content when making emotion predictions. This can be a useful property when classifying emotions in online content.
The included emotions are Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness, Surprise, and Other.
📄 License
This model is released under the BSD 2-Clause license.
📋 Information Table
| Property | Details |
|---|---|
| Model Type | Audio Classification |
| Base Model | openai/whisper-large-v3 |
| Training Data | MSP-Podcast |
| Metrics | Accuracy |
| Pipeline Tag | audio-classification |
| Tags | model_hub_mixin, pytorch_model_hub_mixin, speech_emotion_recognition |
📧 Contact
If you have any questions, please contact: Tiantian Feng (tiantiaf@usc.edu)
📖 Citation
Kindly cite our paper if you use our model or find it useful in your work:
```bibtex
@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}
```