SER-Odyssey-Baseline-WavLM-Categorical

Developed by 3loi

A baseline model for speech emotion recognition based on the WavLM architecture, designed to predict 8 basic emotion categories

Task: Audio Classification
Library: Transformers
Language: English
License: MIT
Tags: Speech Emotion Recognition, WavLM Architecture, Multi-emotion Classification

Release Time : 3/7/2024

Model Overview

This model is a speech emotion recognition classifier trained on the MSP-Podcast dataset, serving as the baseline model for the Odyssey 2024 Emotion Recognition Challenge. It can predict 8 emotion categories including anger, sadness, happiness, etc.

Model Features

Multi-emotion Classification

Capable of identifying 8 basic emotion categories: anger, sadness, happiness, surprise, fear, disgust, contempt, and neutral

Standardized Audio Processing

Expects input waveforms to be normalized with the mean and standard deviation stored in the model config (model.config.mean and model.config.std)

Competition Baseline Model

Serves as the official baseline for the Odyssey 2024 Emotion Recognition Challenge, giving a reference point against which other systems can be compared

Model Capabilities

Speech Emotion Recognition

Audio Classification

Multi-class Emotion Classification

Use Cases

Human-Computer Interaction

Voice Assistant Emotion Response

Adjusts interaction strategies based on the emotion recognized in the user's speech

Enhances the naturalness and user experience of human-computer interaction

Mental Health

Emotional State Monitoring

Analyzes emotional changes in voice recordings

Assists in mental health assessment and intervention

🚀 Audio Classification Model for Emotion Recognition

This model performs audio classification for emotion recognition. It was trained on the MSP-Podcast dataset as the baseline for the Odyssey 2024 Emotion Recognition competition. It predicts eight emotion categories: "Angry", "Sad", "Happy", "Surprise", "Fear", "Disgust", "Contempt", and "Neutral".

✨ Features

Categorical Prediction: Predicts eight distinct emotional states from audio input.
Trained on MSP-Podcast: Uses the MSP-Podcast dataset for training, which is well suited to emotion recognition tasks.

📦 Installation

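The original model card lists no installation steps. The usage example below imports the transformers, torch, and librosa Python packages, so these need to be available in your environment.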

💻 Usage Examples

Basic Usage

from transformers import AutoModelForAudioClassification
import librosa
import torch

# Load the model (trust_remote_code=True allows the custom model code in the repository to run)
model = AutoModelForAudioClassification.from_pretrained("3loi/SER-Odyssey-Baseline-WavLM-Categorical-Attributes", trust_remote_code=True)

# Get the normalization statistics stored in the model config
mean = model.config.mean
std = model.config.std

# Load an audio file, resampled to the rate the model expects
audio_path = "/path/to/audio.wav"
raw_wav, _ = librosa.load(audio_path, sr=model.config.sampling_rate)

# Normalize the audio by mean/std (the small constant avoids division by zero)
norm_wav = (raw_wav - mean) / (std + 0.000001)

# Generate the mask (all ones: every sample of this single utterance is valid)
mask = torch.ones(1, len(norm_wav))

# Batch it (add a leading batch dimension)
wavs = torch.tensor(norm_wav).unsqueeze(0)

# Predict without tracking gradients
with torch.no_grad():
    pred = model(wavs, mask)

# Mapping from class index to emotion label
print(model.config.id2label)
# {0: 'Angry', 1: 'Sad', 2: 'Happy', 3: 'Surprise', 4: 'Fear', 5: 'Disgust', 6: 'Contempt', 7: 'Neutral'}

# Raw model output (logits)
print(pred)

# Convert logits to probabilities (each row sums to 1)
probabilities = torch.nn.functional.softmax(pred, dim=1)
print(probabilities)
# tensor([[0.0015, 0.3651, 0.0593, 0.0315, 0.0600, 0.0125, 0.0319, 0.4382]])
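
The probabilities printed above still need to be mapped back to emotion names. The following is a minimal sketch of that step in plain Python, assuming the label map and the example probabilities shown in the output above. The helper name `rank_emotions` is hypothetical and not part of the model or the Transformers library.

```python
# Label map copied from the id2label output shown above.
ID2LABEL = {0: "Angry", 1: "Sad", 2: "Happy", 3: "Surprise",
            4: "Fear", 5: "Disgust", 6: "Contempt", 7: "Neutral"}


def rank_emotions(probs, id2label=ID2LABEL):
    """Return (label, probability) pairs sorted from most to least likely.

    Hypothetical helper for illustration; `probs` is a flat sequence with
    one probability per class, in class-index order.
    """
    if len(probs) != len(id2label):
        raise ValueError("expected one probability per emotion class")
    pairs = [(id2label[i], float(p)) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Example probabilities taken from the printed output above.
example_probs = [0.0015, 0.3651, 0.0593, 0.0315, 0.0600, 0.0125, 0.0319, 0.4382]

ranking = rank_emotions(example_probs)
top_label, top_prob = ranking[0]
print(top_label, top_prob)  # Neutral 0.4382
```

With the tensor from the usage example, the same helper could be called as `rank_emotions(probabilities[0].tolist())`. Returning the full ranking rather than only the top class makes close calls visible, such as "Neutral" versus "Sad" in this example.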

📚 Documentation

Benchmarks

Property	Details
Model Type	Categorical model for emotion recognition
Training Data	MSP-Podcast

The following table shows the F1 scores, with precision and recall where listed, on the Test 3 and Development sets of the Odyssey competition for the categorical setup:

Test 3 F1-Micro	Test 3 F1-Macro	Test 3 Precision	Test 3 Recall	Development F1-Micro	Development F1-Macro
0.327	0.311	0.332	0.325	0.409	0.307
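
The micro- and macro-averaged F1 scores in the table weight classes differently. Micro-F1 pools all predictions before computing the score, so frequent classes dominate. Macro-F1 averages the per-class F1 scores, so every emotion counts equally. The sketch below uses the standard definitions on made-up toy labels. It is not the challenge's evaluation script, and the function name `f1_micro_macro` is hypothetical.

```python
from collections import Counter


def f1_micro_macro(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    labels = sorted(set(y_true) | set(y_pred))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Toy, imbalanced data (invented for illustration, not from MSP-Podcast).
y_true = ["Neutral"] * 6 + ["Sad", "Sad"] + ["Angry", "Angry"]
y_pred = ["Neutral"] * 6 + ["Neutral", "Sad"] + ["Neutral", "Angry"]

micro, macro = f1_micro_macro(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.8 0.73
```

In this toy case the majority class is always predicted correctly, which lifts micro-F1 above macro-F1. When every sample has exactly one label, micro-F1 equals plain accuracy.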

For more details, you can refer to the demo, paper, and GitHub.

Citation

@InProceedings{Goncalves_2024,
            author={L. Goncalves and A. N. Salman and A. {Reddy Naini} and L. Moro-Velazquez and T. Thebaud and L. {Paola Garcia} and N. Dehak and B. Sisman and C. Busso},
            title={Odyssey2024 - Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results},
            booktitle={Odyssey 2024: The Speaker and Language Recognition Workshop},
            volume={To appear},
            year={2024},
            month={June},
            address =  {Quebec, Canada},
}

📄 License

This project is licensed under the MIT license.
