# Speech Emotion Recognition with WavLM-Base
This model is a fine-tuned version of microsoft/wavlm-base for speech emotion recognition, capable of classifying audio into 7 different emotions.
## Quick Start
The model classifies 16 kHz speech audio into 7 emotions: Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise. See the usage example below to get started.
## Features
- Classify audio into 7 distinct emotions.
- Trained on a diverse multi-dataset collection.
## Installation
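The usage example below relies on the `transformers`, `torch`, and `librosa` packages. A typical installation (exact versions are not specified in the original card) is:

```bash
pip install transformers torch librosa
```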
## Usage Examples
### Basic Usage
```python
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification
import torch
import librosa

# Load the fine-tuned model and its feature extractor
model_name = "jihedjabnoun/wavlm-base-emotion"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)

# Load audio as 16 kHz mono, the sampling rate used during training
audio_path = "path_to_your_audio.wav"
audio, sr = librosa.load(audio_path, sr=16000, mono=True)

# Extract features and run inference
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Map the predicted class index to its emotion label
predicted_id = torch.argmax(logits, dim=-1).item()
emotions = ['Anger', 'Disgust', 'Fear', 'Happiness', 'Neutral', 'Sadness', 'Surprise']
predicted_emotion = emotions[predicted_id]
print(f"Predicted emotion: {predicted_emotion}")

# Optional: per-class confidence scores
probabilities = torch.softmax(logits, dim=-1)
confidence_scores = {emotion: prob.item() for emotion, prob in zip(emotions, probabilities[0])}
print(f"Confidence scores: {confidence_scores}")
```
## Documentation
### Model Details
| Property | Details |
|----------|---------|
| Model Type | WavLM-Base for Sequence Classification |
| Base Model | microsoft/wavlm-base |
| Parameters | ~95M |
| Language | English |
| Task | Multi-class emotion classification from speech audio |
| Final Accuracy | 30.3% (validation set) |
### Training Data
The model was trained on a diverse multi-dataset collection totaling 18,687 training samples and validated on 4,672 samples:
- MELD: 8,906 samples
- CREMA-D: 5,950 samples
- TESS: 2,305 samples
- RAVDESS: 1,145 samples
- SAVEE: 381 samples
#### Emotion Distribution (Training Set)
- Neutral: 5,659 samples (30.3%)
- Happiness: 3,063 samples (16.4%)
- Anger: 2,548 samples (13.6%)
- Sadness: 2,173 samples (11.6%)
- Fear: 1,785 samples (9.6%)
- Disgust: 1,773 samples (9.5%)
- Surprise: 1,686 samples (9.0%)
#### Speaker Diversity
- Training: 380 unique speakers
- Validation: 283 unique speakers
- Top speakers: Ross, Joey, Rachel, Phoebe (from MELD dataset)
### Training Procedure
#### Training Hyperparameters
- Epochs: 5
- Batch Size: 4
- Learning Rate: 3e-5
- Optimizer: AdamW
- Scheduler: Linear with warmup
- Mixed Precision: FP16
- Gradient Checkpointing: Enabled for memory efficiency
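For reference, here is a minimal sketch of how these settings could be expressed with the Hugging Face `TrainingArguments` API. This is an assumption for illustration; the card does not state which training loop or warmup fraction was actually used:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameters listed above onto TrainingArguments
training_args = TrainingArguments(
    output_dir="wavlm-base-emotion",
    num_train_epochs=5,
    per_device_train_batch_size=4,
    learning_rate=3e-5,
    lr_scheduler_type="linear",   # linear schedule with warmup
    warmup_ratio=0.1,             # warmup fraction is an assumption, not stated in the card
    optim="adamw_torch",          # AdamW optimizer
    fp16=True,                    # mixed precision
    gradient_checkpointing=True,  # trades compute for memory
)
```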
#### Data Preprocessing
- Sampling Rate: 16kHz
- Audio Length: Padded/truncated to 10 seconds maximum
- Normalization: Peak normalization applied
- Feature Extraction: Using Wav2Vec2FeatureExtractor
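A minimal sketch of what this preprocessing might look like in code, assuming peak normalization and a 10-second cap at 16 kHz; the exact pipeline used for training is not published:

```python
import numpy as np
import librosa
from transformers import Wav2Vec2FeatureExtractor

MAX_SECONDS = 10
SAMPLING_RATE = 16000

def preprocess(audio_path: str) -> np.ndarray:
    # Load the file as 16 kHz mono
    audio, _ = librosa.load(audio_path, sr=SAMPLING_RATE, mono=True)
    # Peak normalization
    peak = np.abs(audio).max()
    if peak > 0:
        audio = audio / peak
    # Truncate to at most 10 seconds; shorter clips are padded by the feature extractor
    return audio[: MAX_SECONDS * SAMPLING_RATE]

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("jihedjabnoun/wavlm-base-emotion")
inputs = feature_extractor(
    preprocess("path_to_your_audio.wav"),
    sampling_rate=SAMPLING_RATE,
    padding="max_length",
    max_length=MAX_SECONDS * SAMPLING_RATE,
    return_tensors="pt",
)
```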
### Performance
The model was trained for 5 epochs and achieved a final accuracy of 30.3% on the validation set. This figure matches the proportion of Neutral samples in the data, which suggests the model is largely predicting the majority class.
#### Important Note
The relatively low accuracy suggests the model may need:
- More training epochs
- Different hyperparameters
- Additional data preprocessing
- Class balancing techniques
#### Training History

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | 1.875 | 1.848 | 30.29% |
| 2 | 1.877 | 1.847 | 30.29% |
| 3 | 1.799 | 1.848 | 30.29% |
| 4 | 1.827 | 1.846 | 30.29% |
| 5 | 1.877 | 1.846 | 30.29% |
### Datasets Used
- MELD (Multimodal EmotionLines Dataset): Emotion recognition in conversations from TV series
- CREMA-D: Crowdsourced Emotional Multimodal Actors Dataset
- TESS: Toronto Emotional Speech Set
- RAVDESS: Ryerson Audio-Visual Database of Emotional Speech and Song
- SAVEE: Surrey Audio-Visual Expressed Emotion Database
### Limitations
- Trained primarily on English speech
- Performance may vary on different accents or speaking styles not well represented in training data
- Audio quality and background noise can affect performance
- Validation accuracy plateaued at the majority-class rate across all epochs, suggesting the model defaults to Neutral rather than learning discriminative features
- May have bias towards neutral emotions due to class imbalance
### Recommendations for Improvement
- Longer Training: Try training for more epochs with early stopping
- Learning Rate Scheduling: Use cosine annealing or reduce LR on plateau
- Data Augmentation: Add noise, speed perturbation, or pitch shifting
- Class Balancing: Use weighted loss or oversampling techniques (see the sketch after this list)
- Regularization: Add dropout or weight decay
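To make the class-balancing recommendation concrete, here is a minimal sketch of a class-weighted cross-entropy loss in PyTorch. The class counts come from the training distribution listed above; the inverse-frequency weighting scheme itself is an assumption, not part of the original training recipe:

```python
import torch
import torch.nn as nn

# Training-set counts, in the label order used by the usage example:
# Anger, Disgust, Fear, Happiness, Neutral, Sadness, Surprise
counts = torch.tensor([2548.0, 1773.0, 1785.0, 3063.0, 5659.0, 2173.0, 1686.0])

# Inverse-frequency weights, normalized so they average to 1
weights = counts.sum() / (len(counts) * counts)

# Weighted cross-entropy penalizes mistakes on rare classes more heavily
criterion = nn.CrossEntropyLoss(weight=weights)

# Inside a custom training loop: loss = criterion(logits, labels)
# where logits has shape [batch, 7] and labels has shape [batch]
```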
### Ethical Considerations
This model should be used responsibly and not for:
- Unauthorized emotion detection or surveillance
- Making critical decisions about individuals without proper validation
- Applications that could harm user privacy or well-being
### Citation
If you use this model, please cite the original datasets and the base model:
```bibtex
@article{chen2022wavlm,
  title={WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing},
  author={Chen, Sanyuan and Wang, Chengyi and Chen, Zhengyang and Wu, Yu and Liu, Shujie and Chen, Zhuo and Li, Jinyu and Kanda, Naoyuki and Yoshioka, Takuya and Xiao, Xiong and others},
  journal={IEEE Journal of Selected Topics in Signal Processing},
  volume={16},
  number={6},
  pages={1505--1518},
  year={2022},
  publisher={IEEE}
}
```
### Model Card Authors
This model card was created as part of an emotion recognition research project.
## License
This model is released under the MIT license.