SER-Odyssey-Baseline-WavLM-Arousal Open-source Model - Accurately Predict Voice Arousal Values

SER Odyssey Baseline WavLM Arousal

Developed by 3loi

A speech emotion recognition baseline built on the WavLM architecture that predicts arousal in speech, with outputs in a range of approximately 0 to 1.

Audio Classification

Transformers

English

Open Source License: MIT

#Speech Arousal Prediction #Single-task Emotion Recognition #MSP-Podcast Dataset

Downloads 72

Release Time : 3/15/2024

Model Overview

This model serves as the baseline for the Odyssey 2024 Emotion Recognition Competition, trained on the MSP-Podcast dataset with a focus on single-task arousal prediction.

Model Features

High-precision Arousal Prediction

Achieves a concordance correlation coefficient (CCC) of 0.566 on Test3 and 0.651 on the development set

Single-task Focused Design

Specifically optimized for arousal prediction, avoiding multi-task interference

Standardized Audio Processing

The model config stores the mean and standard deviation used to normalize input audio, which keeps inputs consistent

Model Capabilities

Speech Emotion Analysis

Arousal Value Prediction

Audio Feature Extraction

Use Cases

Mental Health Monitoring

Speech Emotion State Assessment

Analyzes users' emotional arousal levels through speech

Outputs a quantitative arousal value in a range of approximately 0 to 1

Human-Computer Interaction

Intelligent Customer Service Emotion Response

Real-time detection of user speech emotion states to adjust response strategies
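
For near-real-time use, one common approach is to score short overlapping windows of the incoming audio rather than a whole recording. The sketch below only shows the windowing step. The helper name `sliding_windows` and the 3-second window with a 1-second hop are illustrative assumptions, not values taken from the model or the competition baseline.

```python
import numpy as np

def sliding_windows(wav, sr, win_s=3.0, hop_s=1.0):
    """Yield (start_time_in_seconds, chunk) pairs over a 1-D waveform.

    win_s and hop_s are hypothetical defaults chosen for illustration.
    A waveform shorter than one window is yielded once, unpadded.
    """
    win = int(win_s * sr)
    hop = int(hop_s * sr)
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must cover at least one sample")
    last_start = max(len(wav) - win, 0)
    for start in range(0, last_start + 1, hop):
        yield start / sr, wav[start:start + win]

# Example: 10 s of silence at 16 kHz, split into 3 s windows every 1 s.
wav = np.zeros(10 * 16000, dtype=np.float32)
chunks = list(sliding_windows(wav, sr=16000))
```

Each chunk can then be normalized and passed to the model in the same way as a full clip. Shorter windows react faster but give the model less context, so the window length is worth tuning on real data.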

🚀 WavLM-based Audio Emotion Arousal Recognition Model

This model was trained on MSP-Podcast as a baseline for the Odyssey 2024 Emotion Recognition competition and specializes in predicting emotional arousal from audio.

🚀 Quick Start

This particular model is the specialized single-task arousal model, which predicts arousal in a range of approximately 0 to 1.

📚 Documentation

Benchmarks

Concordance correlation coefficient (CCC) on the Test3 and Development sets of the Odyssey competition

Attribute	Test3	Development
Arousal	0.566	0.651

For more details: demo, paper, and GitHub.
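
The benchmark scores above are concordance correlation coefficients. As a reference for reading them, here is a minimal sketch of the standard CCC definition, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), written with NumPy. The function name `concordance_cc` is hypothetical, and this is not the competition's official scoring script.

```python
import numpy as np

def concordance_cc(pred, target):
    """Concordance correlation coefficient between two 1-D sequences.

    Uses population (biased) variance and covariance. Returns NaN if
    both inputs are constant and equal, since the denominator is zero.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mean_p, mean_t = pred.mean(), target.mean()
    cov = np.mean((pred - mean_p) * (target - mean_t))
    denom = pred.var() + target.var() + (mean_p - mean_t) ** 2
    return 2.0 * cov / denom

labels = [0.1, 0.5, 0.9]
perfect = concordance_cc(labels, labels)          # identical sequences
shifted = concordance_cc([0.2, 0.6, 1.0], labels) # same shape, constant offset
```

Unlike Pearson correlation, CCC penalizes a constant offset or a scale mismatch, so `shifted` comes out below `perfect` even though the two sequences are perfectly correlated.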

Citation

@InProceedings{Goncalves_2024,
            author={L. Goncalves and A. N. Salman and A. {Reddy Naini} and L. Moro-Velazquez and T. Thebaud and L. {Paola Garcia} and N. Dehak and B. Sisman and C. Busso},
            title={Odyssey2024 - Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results},
            booktitle={Odyssey 2024: The Speaker and Language Recognition Workshop},
            volume={To appear},
            year={2024},
            month={June},
            address={Quebec, Canada},
}

💻 Usage Examples

Basic Usage

from transformers import AutoModelForAudioClassification
import librosa
import torch

# Load the model (it ships custom code, hence trust_remote_code=True)
model = AutoModelForAudioClassification.from_pretrained("3loi/SER-Odyssey-Baseline-WavLM-Arousal", trust_remote_code=True)

# Get the mean and standard deviation used for input normalization
mean = model.config.mean
std = model.config.std

# Load an audio file, resampled to the model's sampling rate
audio_path = "/path/to/audio.wav"
raw_wav, _ = librosa.load(audio_path, sr=model.config.sampling_rate)

# Normalize the waveform; the small constant guards against division by zero
norm_wav = (raw_wav - mean) / (std + 0.000001)

# Generate the mask (all ones for a single, unpadded clip)
mask = torch.ones(1, len(norm_wav))

# Add a batch dimension
wavs = torch.tensor(norm_wav).unsqueeze(0)

# Predict without tracking gradients
with torch.no_grad():
    pred = model(wavs, mask)

print(model.config.id2label)
print(pred)
# {0: 'arousal'}
# tensor([[0.3670]])
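
The example above scores one clip with an all-ones mask. To score several clips of different lengths in one call, a plausible approach is to zero-pad them to a common length and mark the padding in the mask. The sketch below shows only that preparation step. The helper `pad_and_mask` is hypothetical, and it assumes the model reads mask value 1 as real audio and 0 as padding, which the source does not state.

```python
import numpy as np

def pad_and_mask(wavs, mean, std, eps=1e-6):
    """Normalize clips, zero-pad them to one length, and build a mask.

    Assumption (not confirmed by the model card): mask is 1.0 where
    a sample is real audio and 0.0 where it is padding.
    """
    max_len = max(len(w) for w in wavs)
    batch = np.zeros((len(wavs), max_len), dtype=np.float32)
    mask = np.zeros((len(wavs), max_len), dtype=np.float32)
    for i, w in enumerate(wavs):
        w = np.asarray(w, dtype=np.float32)
        # Normalize before padding so the padded region stays exactly zero
        batch[i, :len(w)] = (w - mean) / (std + eps)
        mask[i, :len(w)] = 1.0
    return batch, mask

# Two dummy clips of different lengths; mean/std here are placeholders
# for model.config.mean and model.config.std.
batch, mask = pad_and_mask([np.ones(4), np.ones(2)], mean=0.0, std=1.0)
```

With the real model, the arrays would be converted with `torch.from_numpy(batch)` and `torch.from_numpy(mask)` and passed as `model(wavs, mask)`, as in the single-clip example. It is worth checking that a padded clip gets the same prediction as the same clip scored alone.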

📄 License

This project is licensed under the MIT license.
