Speech Emotion Recognition by Fine-Tuning Wav2Vec 2.0
This model is a fine-tuned version of jonatasgrosman/wav2vec2-large-xlsr-53-english for the Speech Emotion Recognition (SER) task. It was fine-tuned on multiple emotional speech datasets and reaches 97.46% accuracy on the evaluation set.
Quick Start
Installation
pip install transformers librosa torch
Usage Example
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC
import librosa
import torch

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")

def predict_emotion(audio_path):
    # Load the clip and resample to 16 kHz, the rate the model expects
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(inputs.input_values)
    # Average the per-frame logits over time, then softmax into class probabilities
    predictions = torch.nn.functional.softmax(outputs.logits.mean(dim=1), dim=-1)
    predicted_label = torch.argmax(predictions, dim=-1)
    emotion = model.config.id2label[predicted_label.item()]
    return emotion

emotion = predict_emotion("example_audio.wav")
print(f"Predicted emotion: {emotion}")
>> Predicted emotion: angry
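The feature extractor expects 16 kHz mono audio, which is why librosa.load is called with sr=16000 (clips recorded at other sampling rates are resampled on load). The head emits one logit vector per audio frame, so the logits are averaged over time before the softmax, mapping a clip of any length to a single emotion label.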
Features
- Multi-dataset fine-tuning: fine-tuned on the SAVEE, RAVDESS, and TESS datasets.
- High accuracy: reaches 0.97463 accuracy on the evaluation set.
- 7-emotion classification: distinguishes angry, disgust, fear, happy, neutral, sad, and surprise (a sketch that returns the full probability distribution follows this list).
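For cases where the full distribution over all seven emotions is more useful than the top label alone, here is a minimal sketch; the function name predict_emotion_probabilities and the example path are placeholders introduced for illustration, not part of the original card:
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")
model = Wav2Vec2ForCTC.from_pretrained("r-f/wav2vec-english-speech-emotion-recognition")

def predict_emotion_probabilities(audio_path):
    # Same preprocessing as the Quick Start: load and resample to 16 kHz
    audio, rate = librosa.load(audio_path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=rate, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Average per-frame logits over time, softmax into a 7-way distribution
    probs = torch.nn.functional.softmax(logits.mean(dim=1), dim=-1).squeeze(0)
    # Map each class index to its emotion name via the model config
    return {model.config.id2label[i]: round(p.item(), 4) for i, p in enumerate(probs)}

print(predict_emotion_probabilities("example_audio.wav"))  # placeholder file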
Documentation
Datasets Used for Fine-Tuning
- Surrey Audio-Visual Expressed Emotion (SAVEE) - 480 audio files from 4 male actors
- Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) - 1440 audio files from 24 professional actors (12 female, 12 male)
- Toronto emotional speech set (TESS) - 2800 audio files from 2 female actors
Classification Labels
emotions = ['angry', 'disgust', 'fear', 'happy', 'neutral', 'sad', 'surprise']
Evaluation Results
- Loss: 0.104075
- Accuracy: 0.97463
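To sanity-check these numbers on your own labeled clips, a rough sketch is shown below. The directory layout (one sub-folder per emotion) and the path my_eval_set are assumptions, and predict_emotion is the helper defined in the Quick Start; this is not the original evaluation script:
import os

EVAL_DIR = "my_eval_set"  # hypothetical layout: my_eval_set/<emotion>/<clip>.wav

correct = total = 0
for label in os.listdir(EVAL_DIR):
    label_dir = os.path.join(EVAL_DIR, label)
    if not os.path.isdir(label_dir):
        continue
    for name in os.listdir(label_dir):
        if name.endswith(".wav"):
            # predict_emotion is the helper from the Quick Start section above
            correct += int(predict_emotion(os.path.join(label_dir, name)) == label)
            total += 1

print(f"Accuracy: {correct / total:.5f}" if total else "No .wav files found")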
Technical Details
Training hyperparameters
The following hyperparameters were used during training (a sketch expressing them as TrainingArguments follows the list):
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- eval_steps: 500
- seed: 42
- gradient_accumulation_steps: 2
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- num_epochs: 4
- max_steps: 7500
- save_steps: 1500
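As a reference point only, these values might be expressed with the transformers TrainingArguments API roughly as follows; the output_dir and the step-based evaluation strategy are assumptions, since the original training script is not published here.
from transformers import TrainingArguments

# Minimal sketch mirroring the hyperparameters listed above (not the original script).
training_args = TrainingArguments(
    output_dir="wav2vec-english-speech-emotion-recognition",  # assumed output path
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    evaluation_strategy="steps",  # named eval_strategy in recent transformers releases
    eval_steps=500,
    seed=42,
    gradient_accumulation_steps=2,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    num_train_epochs=4,
    max_steps=7500,  # max_steps takes precedence over num_train_epochs when both are set
    save_steps=1500,
)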
Training results
| Step | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 500 | 1.8124 | 1.365212 | 0.486258 |
| 1000 | 0.8872 | 0.773145 | 0.79704 |
| 1500 | 0.7035 | 0.574954 | 0.852008 |
| 2000 | 0.6879 | 1.286738 | 0.775899 |
| 2500 | 0.6498 | 0.697455 | 0.832981 |
| 3000 | 0.5696 | 0.33724 | 0.892178 |
| 3500 | 0.4218 | 0.307072 | 0.911205 |
| 4000 | 0.3088 | 0.374443 | 0.930233 |
| 4500 | 0.2688 | 0.260444 | 0.936575 |
| 5000 | 0.2973 | 0.302985 | 0.92389 |
| 5500 | 0.1765 | 0.165439 | 0.961945 |
| 6000 | 0.1475 | 0.170199 | 0.961945 |
| 6500 | 0.1274 | 0.15531 | 0.966173 |
| 7000 | 0.0699 | 0.103882 | 0.976744 |
| 7500 | 0.083 | 0.104075 | 0.97463 |
License
This project is licensed under the Apache-2.0 license.