Speech Emotion Recognition by Fine-Tuning Wav2Vec 2.0
This model is a fine-tuned version of jonatasgrosman/wav2vec2-large-xlsr-53-english for the Speech Emotion Recognition (SER) task. It was fine-tuned on multiple emotional speech datasets and reaches 97.46% accuracy on the evaluation set.
Quick Start
Installation
pip install transformers librosa torch
Usage Example
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC
import librosa
import torch

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")

def predict_emotion(audio_path):
    # Load the clip and resample to 16 kHz, the rate the model expects
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(inputs.input_values)
    # Average the per-frame logits over time, then softmax into class probabilities
    predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)
    predicted_label = torch.argmax(predictions, dim=-1)
    emotion = model.config.id2label[predicted_label.item()]
    return emotion

emotion = predict_emotion("example_audio.wav")
print(f"Predicted emotion: {emotion}")
>> Predicted emotion: angry
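The feature extractor expects 16 kHz mono audio, which is why librosa.load is called with sr=16000 (clips recorded at other sampling rates are resampled on load). The head emits one logit vector per audio frame, so the logits are averaged over time before the softmax, mapping a clip of any length to a single emotion label.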
Features
- Multi-dataset fine-tuning: fine-tuned on the SAVEE, RAVDESS, and TESS datasets.
- High accuracy: reaches 0.97463 accuracy on the evaluation set.
- 7-emotion classification: distinguishes angry, disgust, fear, happy, neutral, sad, and surprise (a sketch that returns the full probability distribution follows this list).
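For cases where the full distribution over all seven emotions is more useful than the top label alone, here is a minimal sketch; the function name predict_emotion_probabilities and the example path are placeholders introduced for illustration, not part of the original card:
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")

def predict_emotion_probabilities(audio_path):
    # Same preprocessing as the Quick Start: load and resample to 16 kHz
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Average per-frame logits over time, softmax into a 7-way distribution
    probs = torch.nn.functional.softmax(logits.mean(dim=1), dim=-1).squeeze(0)
    # Map each class index to its emotion name via the model config
    return {model.config.id2label[i]: round(p.item(), 4) for i, p in enumerate(probs)}

print(predict_emotion_probabilities("example_audio.wav"))  # placeholder file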
Documentation
Datasets Used for Fine-Tuning
- Surrey Audio-Visual Expressed Emotion (SAVEE) - 480 audio files from 4 male actors
- Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) - 1440 audio files from 24 professional actors (12 female, 12 male)
- Toronto emotional speech set (TESS) - 2800 audio files from 2 female actors
Classification Labels
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
Evaluation Results
- Loss: 0.104075
- Accuracy: 0.97463
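To sanity-check these numbers on your own labeled clips, a rough sketch is shown below. The directory layout (one sub-folder per emotion) and the path my_eval_set are assumptions, and predict_emotion is the helper defined in the Quick Start; this is not the original evaluation script:
import os

EVAL_DIR = "my_eval_set"  # hypothetical layout: my_eval_set/<emotion>/<clip>.wav

correct = total = 0
for label in os.listdir(EVAL_DIR):
    label_dir = os.path.join(EVAL_DIR, label)
    if not os.path.isdir(label_dir):
        continue
    for name in os.listdir(label_dir):
        if name.endswith(".wav"):
            # predict_emotion is the helper from the Quick Start section above
            correct += int(predict_emotion(os.path.join(label_dir, name)) == label)
            total += 1

print(f"Accuracy: {correct / total:.5f}" if total else "No .wav files found")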
Technical Details
Training hyperparameters
The following hyperparameters were used during training (a sketch expressing them as TrainingArguments follows the list):
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- eval_steps: 500
- seed: 42
- gradient_accumulation_steps: 2
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- num_epochs: 4
- max_steps: 7500
- save_steps: 1500
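As a reference point only, these values might be expressed with the transformers TrainingArguments API roughly as follows; the output_dir and the step-based evaluation strategy are assumptions, since the original training script is not published here.
from transformers import TrainingArguments

# Minimal sketch mirroring the hyperparameters listed above (not the original script).
training_args = TrainingArguments(
    output_dir="wav2vec-english-speech-emotion-recognition",  # assumed output path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy="steps",  # named eval_strategy in recent transformers releases
    eval_steps=500,
    seed=42,
    gradient_accumulation_steps=2,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    num_train_epochs=4,
    max_steps=7500,  # max_steps takes precedence over num_train_epochs when both are set
    save_steps=1500,
)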
Training results
| Step | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 500 | 1.8124 | 1.365212 | 0.486258 |
| 1000 | 0.8872 | 0.773145 | 0.79704 |
| 1500 | 0.7035 | 0.574954 | 0.852008 |
| 2000 | 0.6879 | 1.286738 | 0.775899 |
| 2500 | 0.6498 | 0.697455 | 0.832981 |
| 3000 | 0.5696 | 0.33724 | 0.892178 |
| 3500 | 0.4218 | 0.307072 | 0.911205 |
| 4000 | 0.3088 | 0.374443 | 0.930233 |
| 4500 | 0.2688 | 0.260444 | 0.936575 |
| 5000 | 0.2973 | 0.302985 | 0.92389 |
| 5500 | 0.1765 | 0.165439 | 0.961945 |
| 6000 | 0.1475 | 0.170199 | 0.961945 |
| 6500 | 0.1274 | 0.15531 | 0.966173 |
| 7000 | 0.0699 | 0.103882 | 0.976744 |
| 7500 | 0.083 | 0.104075 | 0.97463 |
License
This project is licensed under the Apache-2.0 license.