# 🎤 Wav2Vec2 Speech Emotion Recognition for English
This model is fine-tuned for English speech emotion recognition using the Wav2Vec2 architecture. It detects six basic emotions in English speech with high accuracy.
## ✨ Features
- Emotion Recognition: Detects six emotions (sadness, anger, disgust, fear, happiness, and neutral).
- High Performance: Achieves 92.42% accuracy with a loss of 0.219.
## 🧠 Model Overview
🚀 Model name: `dihuzz/wav2vec2-ser-english-finetuned`

✨ This model uses the Wav2Vec2 architecture and is fine-tuned to recognize emotions in English speech. The emotions it can detect are:
- 😢 Sadness
- 😠 Anger
- 🤢 Disgust
- 😨 Fear
- 😊 Happiness
- 😐 Neutral
🔧 It was created by fine-tuning `r-f/wav2vec-english-speech-emotion-recognition` on several well-known speech emotion recognition datasets containing English emotional speech samples.
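You can verify the exact label order the checkpoint uses by reading its config (a minimal sketch; it assumes only that the `transformers` library is installed):

```python
from transformers import AutoConfig

# Print the id-to-label mapping stored with the checkpoint
config = AutoConfig.from_pretrained("dihuzz/wav2vec2-ser-english-finetuned")
print(config.id2label)
```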
📊 Performance Metrics:

- 🎯 Accuracy: 92.42%
- 📉 Loss: 0.219
## 🏋️ Training Procedure

### ⚙️ Training Details
| Property | Details |
|---|---|
| Base Model | r-f/wav2vec-english-speech-emotion-recognition |
| Hardware | P100 GPU on Kaggle |
| Training Duration | 10 epochs |
| Learning Rate | 5e-4 |
| Batch Size | 4 |
| Gradient Accumulation Steps | 8 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999) |
| Loss Function | Cross-Entropy Loss |
| Learning Rate Scheduler | None |
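Expressed as Hugging Face `TrainingArguments`, the hyperparameters above would look roughly like the sketch below. This is an illustrative reconstruction, not the author's actual training script: `output_dir` is a placeholder, and `lr_scheduler_type="constant"` is used to approximate the absence of a scheduler.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-ser-english-finetuned",  # placeholder path
    num_train_epochs=10,
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size: 4 * 8 = 32
    adam_beta1=0.9,
    adam_beta2=0.999,
    lr_scheduler_type="constant",   # approximates "no scheduler"
)
```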
### 📈 Training Results
| Epoch | Loss | Accuracy |
|---|---|---|
| 1 | 1.0257 | 61.20% |
| 2 | 0.7025 | 73.88% |
| 3 | 0.5901 | 78.25% |
| 4 | 0.4960 | 81.56% |
| 5 | 0.4105 | 85.04% |
| 6 | 0.3516 | 87.70% |
| 7 | 0.3140 | 88.87% |
| 8 | 0.2649 | 90.45% |
| 9 | 0.2178 | 92.42% |
| 10 | 0.2187 | 92.29% |
## 📦 Installation

```bash
pip install transformers torch torchaudio
```
## 💻 Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

model_name = "dihuzz/wav2vec2-ser-english-finetuned"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name).to(device)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model.eval()

def predict_emotion(audio_path):
    # Load the audio file (waveform shape: [channels, samples])
    waveform, sample_rate = torchaudio.load(audio_path)

    # Wav2Vec2 expects 16 kHz audio; resample if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)

    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    # Convert the waveform into model inputs
    inputs = feature_extractor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Run inference without gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)

    # Map the highest-scoring logit to its emotion label
    logits = outputs.logits
    predicted_class_id = torch.argmax(logits, dim=-1).item()
    return model.config.id2label[predicted_class_id]

audio_file = "/path/to/your/audio.wav"
predicted_emotion = predict_emotion(audio_file)
print(f"Predicted Emotion: {predicted_emotion}")
```
## 📋 Example Output

The model returns a string with the predicted emotion label:

```
Predicted Emotion: <emotion_label>
```
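If you want scores for all six emotions rather than a single label, the `logits` computed in the basic example can be converted into a probability distribution (a small extension to the snippet above, not part of the original card):

```python
# Inside predict_emotion, after computing `logits`: read out all class scores
probs = torch.softmax(logits, dim=-1).squeeze()
for class_id, score in enumerate(probs.tolist()):
    print(f"{model.config.id2label[class_id]}: {score:.3f}")
```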
## 🔧 Technical Details

This model is based on the Wav2Vec2 architecture and was fine-tuned on English speech emotion recognition datasets; fine-tuning adjusts the pretrained model's parameters to fit the task of classifying emotion in English speech.
## 📄 License

No license information is provided in the original document.
## ⚠️ Important Note

This model has several important limitations:

- 🌍 Language Specificity: Supports English only
- 🗣️ Dialect Sensitivity: Performance varies across accents and dialects
- 🎧 Audio Quality Needs: Requires clean, clear recordings
- ⚖️ Potential Biases: May reflect cultural biases in the training data
- 6️⃣ Limited Categories: Detects only the six basic emotions listed above
- 🧠 Context Unaware: Does not consider the semantic content of the speech