# 🎤 Wav2Vec2 Speech Emotion Recognition for English
This model is fine-tuned for English speech emotion recognition using the Wav2Vec2 architecture. It detects six basic emotions in English speech with high accuracy.
## ✨ Features
- Emotion Recognition: Detects six emotions (sadness, anger, disgust, fear, happiness, and neutral).
- High Performance: Achieves 92.42% accuracy with a loss of 0.219.
## 🧠 Model Overview
🚀 Model name: `dihuzz/wav2vec2-ser-english-finetuned`

✨ This model uses the Wav2Vec2 architecture and is fine-tuned to recognize emotions in English speech. The emotions it can detect are:
- 😢 Sadness
- 😠 Anger
- 🤢 Disgust
- 😨 Fear
- 😊 Happiness
- 😐 Neutral
🔧 It was created by fine-tuning `r-f/wav2vec-english-speech-emotion-recognition` on several well-known speech emotion recognition datasets containing English emotional speech samples.
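You can verify the exact label order the checkpoint uses by reading its config (a minimal sketch; it assumes only that the `transformers` library is installed):

```python
from transformers import AutoConfig

# Print the id-to-label mapping stored with the checkpoint
config = AutoConfig.from_pretrained("dihuzz/wav2vec2-ser-english-finetuned")
print(config.id2label)
```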
📊 Performance Metrics:

- 🎯 Accuracy: 92.42%
- 📉 Loss: 0.219
## 🏋️ Training Procedure

### ⚙️ Training Details
| Property | Details |
|---|---|
| Base Model | r-f/wav2vec-english-speech-emotion-recognition |
| Hardware | P100 GPU on Kaggle |
| Training Duration | 10 epochs |
| Learning Rate | 5e-4 |
| Batch Size | 4 |
| Gradient Accumulation Steps | 8 |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.999) |
| Loss Function | Cross-Entropy Loss |
| Learning Rate Scheduler | None |
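Expressed as Hugging Face `TrainingArguments`, the hyperparameters above would look roughly like the sketch below. This is an illustrative reconstruction, not the author's actual training script: `output_dir` is a placeholder, and `lr_scheduler_type="constant"` is used to approximate the absence of a scheduler.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-ser-english-finetuned",  # placeholder path
    num_train_epochs=10,
    learning_rate=5e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size: 4 * 8 = 32
    adam_beta1=0.9,
    adam_beta2=0.999,
    lr_scheduler_type="constant",   # approximates "no scheduler"
)
```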
### 📈 Training Results
| Epoch | Loss | Accuracy |
|---|---|---|
| 1 | 1.0257 | 61.20% |
| 2 | 0.7025 | 73.88% |
| 3 | 0.5901 | 78.25% |
| 4 | 0.4960 | 81.56% |
| 5 | 0.4105 | 85.04% |
| 6 | 0.3516 | 87.70% |
| 7 | 0.3140 | 88.87% |
| 8 | 0.2649 | 90.45% |
| 9 | 0.2178 | 92.42% |
| 10 | 0.2187 | 92.29% |
## 📦 Installation

```bash
pip install transformers torch torchaudio
```
## 💻 Usage Examples

### Basic Usage
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

model_name = "dihuzz/wav2vec2-ser-english-finetuned"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name).to(device)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model.eval()

def predict_emotion(audio_path):
    # Load the audio file (waveform shape: [channels, samples])
    waveform, sample_rate = torchaudio.load(audio_path)

    # Wav2Vec2 expects 16 kHz audio; resample if needed
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
        waveform = resampler(waveform)

    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        waveform = torch.mean(waveform, dim=0, keepdim=True)

    # Convert the waveform into model inputs
    inputs = feature_extractor(
        waveform.squeeze().numpy(),
        sampling_rate=16000,
        return_tensors="pt",
        padding=True,
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Run inference without gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)

    # Map the highest-scoring logit to its emotion label
    logits = outputs.logits
    predicted_class_id = torch.argmax(logits, dim=-1).item()
    return model.config.id2label[predicted_class_id]

audio_file = "/path/to/your/audio.wav"
predicted_emotion = predict_emotion(audio_file)
print(f"Predicted Emotion: {predicted_emotion}")
```
## 📋 Example Output

The model returns a string with the predicted emotion label:

```
Predicted Emotion: <emotion_label>
```
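If you want scores for all six emotions rather than a single label, the `logits` computed in the basic example can be converted into a probability distribution (a small extension to the snippet above, not part of the original card):

```python
# Inside predict_emotion, after computing `logits`: read out all class scores
probs = torch.softmax(logits, dim=-1).squeeze()
for class_id, score in enumerate(probs.tolist()):
    print(f"{model.config.id2label[class_id]}: {score:.3f}")
```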
## 🔧 Technical Details

This model is based on the Wav2Vec2 architecture and was fine-tuned on English speech emotion recognition datasets; fine-tuning adjusts the pretrained model's parameters to fit the task of classifying emotion in English speech.
## 📄 License

No license information is provided in the original document.
## ⚠️ Important Note

This model has several important limitations:

- 🌍 Language Specificity: Supports English only
- 🗣️ Dialect Sensitivity: Performance varies across accents and dialects
- 🎧 Audio Quality Needs: Requires clean, clear recordings
- ⚖️ Potential Biases: May reflect cultural biases in the training data
- 6️⃣ Limited Categories: Detects only the six basic emotions listed above
- 🧠 Context Unaware: Does not consider the semantic content of the speech