SER-Odyssey-Baseline-WavLM-Multi-Attributes Open-Source Model - Accurately Predict Three-Dimensional Emotions in Speech

SER Odyssey Baseline WavLM Multi Attributes

Developed by 3loi

A multi-attribute speech emotion recognition baseline model based on WavLM architecture, predicting arousal, dominance, and valence dimensions

Audio Classification

Transformers

Language: English
License: MIT
Tags: Multidimensional Speech Emotion Prediction, WavLM Audio Encoding, Valence-Arousal-Dominance Recognition

Downloads 23.09k

Release Time : 3/5/2024

Model Overview

This is a speech emotion recognition model trained on the MSP-Podcast dataset and developed as a baseline for the Odyssey 2024 Emotion Recognition Competition. It simultaneously predicts three emotional dimensions in speech (arousal, dominance, and valence), with output values ranging from approximately 0 to 1.

Model Features

Multi-Attribute Emotion Prediction

Simultaneously predicts three emotional dimensions (arousal, dominance, and valence) from a single audio input

Trained on MSP-Podcast Dataset

Trained on MSP-Podcast, the emotional speech dataset used for the Odyssey 2024 Emotion Recognition Competition

Standardized Audio Processing

The model config stores the mean and standard deviation used to normalize input audio before inference

Model Capabilities

Speech Emotion Recognition

Arousal Prediction

Dominance Prediction

Valence Prediction

Audio Classification

Use Cases

Affective Computing

Speech Emotion Analysis

Analyzes emotional states in speech for psychological research or user experience evaluation

Reports three emotional dimensions for each audio input: arousal, dominance, and valence

Human-Computer Interaction

Intelligent Customer Service Emotion Recognition

Real-time identification of emotional states in user speech to optimize customer service response strategies

🚀 Speech Emotion Recognition Model for Odyssey 2024

This model is designed for the Odyssey 2024 Emotion Recognition competition. Trained on the MSP-Podcast dataset, it predicts arousal, dominance, and valence from audio input.

🚀 Quick Start

The model was trained on MSP-Podcast as the baseline for the Odyssey 2024 Emotion Recognition competition. This is the multi-attribute variant, which predicts arousal, dominance, and valence, each in a range of approximately 0 to 1.

✨ Features

Trained on the MSP-Podcast dataset.
Predicts arousal, dominance, and valence, each in a range of approximately 0 to 1.

📚 Documentation

Benchmarks

Property	Details
Model Type	Multi-attribute model for audio classification
Training Data	MSP-Podcast

CCC (concordance correlation coefficient) on the Test 3 and Development sets of the Odyssey competition, multi-task setup:

Set	Valence	Dominance	Arousal
Test 3	0.577	0.577	0.405
Development	0.652	0.688	0.579
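The benchmark scores above are CCC values, one per attribute. As a minimal sketch of how such a score can be computed from ground-truth and predicted values, the function below uses the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), with population variances. The function name `concordance_cc` is hypothetical, and this is not necessarily the exact evaluation script used by the competition.

```python
import math
from statistics import fmean


def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient of two equal-length sequences.

    Illustrative helper (hypothetical name); uses population variance
    and covariance, which is the usual convention for CCC.
    """
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    var_t = fmean((t - mean_t) ** 2 for t in y_true)
    var_p = fmean((p - mean_p) ** 2 for p in y_pred)
    cov = fmean((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    # CCC is undefined when both sequences are constant and equal.
    return 2.0 * cov / denom if denom else math.nan


# A constant offset lowers CCC even though the correlation is perfect.
print(round(concordance_cc([1, 2, 3], [2, 3, 4]), 3))  # 0.571
```

Unlike Pearson correlation, CCC penalizes shifts in mean and scale, so a prediction that tracks the labels but sits at a constant offset scores below 1.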

For more details: demo, paper, and GitHub.

Citation

@InProceedings{Goncalves_2024,
            author={L. Goncalves and A. N. Salman and A. {Reddy Naini} and L. Moro-Velazquez and T. Thebaud and L. {Paola Garcia} and N. Dehak and B. Sisman and C. Busso},
            title={Odyssey2024 - Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results},
            booktitle={Odyssey 2024: The Speaker and Language Recognition Workshop},
            volume={To appear},
            year={2024},
            month={June},
            address =  {Quebec, Canada},
}

💻 Usage Examples

Basic Usage

from transformers import AutoModelForAudioClassification
import librosa
import torch

# Load the model; trust_remote_code=True allows the custom model code in the repository to run
model = AutoModelForAudioClassification.from_pretrained(
    "3loi/SER-Odyssey-Baseline-WavLM-Multi-Attributes", trust_remote_code=True
)

# Normalization statistics stored in the model config
mean = model.config.mean
std = model.config.std

# Load an audio file, resampled to the rate the model expects
audio_path = "/path/to/audio.wav"
raw_wav, _ = librosa.load(audio_path, sr=model.config.sampling_rate)

# Normalize the waveform by mean/std (the small epsilon avoids division by zero)
norm_wav = (raw_wav - mean) / (std + 0.000001)

# Mask of ones covering every sample of the single, unpadded input
mask = torch.ones(1, len(norm_wav))

# Add a batch dimension: shape (1, num_samples)
wavs = torch.tensor(norm_wav).unsqueeze(0)

# Predict without tracking gradients
with torch.no_grad():
    pred = model(wavs, mask)

# Example output for one audio file
print(model.config.id2label)
# {0: 'arousal', 1: 'dominance', 2: 'valence'}
print(pred)
# tensor([[0.3670, 0.4553, 0.4240]])
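The prediction is a bare tensor whose column order is given by `id2label`. As a small sketch, the helper below pairs each score with its attribute name; `label_scores` is a hypothetical name, and the values are copied from the example output above so that the snippet runs without downloading the model.

```python
def label_scores(scores, id2label):
    """Pair each predicted score with its attribute name.

    Illustrative helper (hypothetical name): `scores` is one row of the
    prediction as a list of floats, `id2label` maps column index to name.
    """
    return {id2label[i]: float(s) for i, s in enumerate(scores)}


# Values copied from the example output above.
id2label = {0: "arousal", 1: "dominance", 2: "valence"}
scores = [0.3670, 0.4553, 0.4240]

named = label_scores(scores, id2label)
print(named)  # {'arousal': 0.367, 'dominance': 0.4553, 'valence': 0.424}
```

With the real model, the equivalent call would presumably be `label_scores(pred[0].tolist(), model.config.id2label)`, assuming the config's `id2label` uses integer keys as in the printed example.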

📄 License

This project is licensed under the MIT License.
